Reasoning Outputs
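vLLM supports reasoning models such as DeepSeek R1, which are designed to generate outputs containing both reasoning steps and a final conclusion.
Reasoning models return an additional reasoning_content field in their outputs, which contains the reasoning steps that led to the final conclusion. This field is not present in the outputs of other models.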
Supported Models
vLLM currently supports the following reasoning models:
DeepSeek R1 series (parser name deepseek_r1, which looks for <think> ... </think>)
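To make the <think> ... </think> convention concrete, here is a minimal, text-only sketch of how such a tag-delimited output could be split into reasoning and final content. It is an illustration under stated assumptions, not vLLM's actual deepseek_r1 parser: the function name split_reasoning and the handling of a missing tag are hypothetical choices made for this example.

```python
from typing import Optional, Tuple

THINK_START = "<think>"
THINK_END = "</think>"


def split_reasoning(model_output: str) -> Tuple[Optional[str], Optional[str]]:
    """Split raw model text into (reasoning_content, content).

    Hypothetical helper for illustration only; not part of vLLM.
    """
    if THINK_END not in model_output:
        # Assumption of this sketch: without a closing tag, treat the
        # whole output as final content.
        return None, model_output or None

    reasoning, _, content = model_output.partition(THINK_END)
    # The opening tag may be missing (for example, if the chat template
    # already emitted it), so strip it only when present. Any text
    # before the opening tag is dropped in this sketch.
    if THINK_START in reasoning:
        reasoning = reasoning.split(THINK_START, 1)[1]
    return reasoning or None, content or None


if __name__ == "__main__":
    raw = "<think>9.8 is 9.80, and 9.80 > 9.11.</think>9.8 is greater."
    reasoning_content, content = split_reasoning(raw)
    print("reasoning_content:", reasoning_content)
    print("content:", content)
```

Returning None rather than an empty string keeps "no reasoning" distinguishable from "empty reasoning", which mirrors the optional nature of the reasoning_content field.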
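Quickstart
To use a reasoning model, pass the --enable-reasoning and --reasoning-parser flags when starting the vLLM server. The --reasoning-parser flag specifies which reasoning parser to use to extract the reasoning content from the model output.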
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --enable-reasoning --reasoning-parser deepseek_r1
Next, make a request to the model. The response should include the reasoning content.
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
# The local vLLM server is reached over plain HTTP here.
openai_api_key = "EMPTY"
openai_api_base = "http://127.0.0.1:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Use the first (and typically only) model served by this vLLM instance.
models = client.models.list()
model = models.data[0].id

# Round 1
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
response = client.chat.completions.create(model=model, messages=messages)

reasoning_content = response.choices[0].message.reasoning_content
content = response.choices[0].message.content

print("reasoning_content:", reasoning_content)
print("content:", content)
The reasoning_content field contains the reasoning steps that led to the final conclusion, while the content field contains the final conclusion.
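Because reasoning_content is absent from the outputs of non-reasoning models, client code that may talk to either kind of model can read the field defensively. The sketch below assumes the message object exposes fields as attributes, as the client in the quickstart does; read_message is a hypothetical helper, and SimpleNamespace objects stand in for real response messages so the example runs without a server.

```python
from types import SimpleNamespace
from typing import Any, Optional, Tuple


def read_message(message: Any) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning_content, content), tolerating a missing field.

    Hypothetical helper for illustration only.
    """
    # getattr with a default avoids an AttributeError when the server
    # did not include reasoning_content in the message.
    reasoning = getattr(message, "reasoning_content", None)
    content = getattr(message, "content", None)
    return reasoning, content


if __name__ == "__main__":
    # Stand-ins for response.choices[0].message from two kinds of models.
    from_reasoning_model = SimpleNamespace(
        reasoning_content="Compare 9.80 with 9.11.", content="9.8 is greater."
    )
    from_other_model = SimpleNamespace(content="9.8 is greater.")

    for message in (from_reasoning_model, from_other_model):
        reasoning, content = read_message(message)
        print("reasoning_content:", reasoning)
        print("content:", content)
```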
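Streaming Chat Completions
Streaming chat completions are also supported for reasoning models. The reasoning_content field is available in the delta field of each chat completion response chunk.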
{
    "id": "chatcmpl-123",
    "object": "chat.completion.chunk",
    "created": 1694268190,
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "system_fingerprint": "fp_44709d6fcb",
    "choices": [
        {
            "index": 0,
            "delta": {
                "role": "assistant",
                "reasoning_content": "is"
            },
            "logprobs": null,
            "finish_reason": null
        }
    ]
}
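Note that this streaming output is not compatible with the OpenAI Python client library. You can use the requests library to make streaming requests instead.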
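A streaming request made with requests could look like the sketch below. It assumes the server emits OpenAI-style server-sent events (lines of the form "data: {...}" terminated by "data: [DONE]") and that the server runs at the quickstart address; the helper names iter_deltas, collect, and stream_chat are hypothetical. The parsing helpers are kept separate from the network call so they can be exercised on sample lines without a running server.

```python
import json
from typing import Dict, Iterable, Iterator, Tuple


def iter_deltas(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield the delta object of every choice in an SSE line stream."""
    for line in lines:
        # Skip keep-alive blanks and anything that is not a data line.
        if not line or not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            yield choice.get("delta", {})


def collect(deltas: Iterable[Dict]) -> Tuple[str, str]:
    """Concatenate streamed pieces into (reasoning_content, content)."""
    reasoning, content = [], []
    for delta in deltas:
        if delta.get("reasoning_content"):
            reasoning.append(delta["reasoning_content"])
        if delta.get("content"):
            content.append(delta["content"])
    return "".join(reasoning), "".join(content)


def stream_chat(base_url: str, model: str, prompt: str) -> Tuple[str, str]:
    """Send one streaming chat request and return the collected text."""
    # Third-party dependency, imported here so the helpers above can be
    # used without it.
    import requests

    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    with requests.post(
        f"{base_url}/chat/completions",
        json=body,
        headers={"Authorization": "Bearer EMPTY"},
        stream=True,
    ) as resp:
        resp.raise_for_status()
        # iter_lines() yields bytes; decode each line explicitly.
        lines = (raw.decode("utf-8") for raw in resp.iter_lines())
        return collect(iter_deltas(lines))


if __name__ == "__main__":
    reasoning_content, content = stream_chat(
        "http://127.0.0.1:8000/v1",
        "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
        "9.11 and 9.8, which is greater?",
    )
    print("reasoning_content:", reasoning_content)
    print("content:", content)
```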
How to Support a New Reasoning Model
You can add a new ReasoningParser similar to vllm/entrypoints/openai/reasoning_parsers/deepseek_r1_reasoning_parser.py.
# import the required packages
from typing import Optional, Sequence, Tuple, Union

from vllm.entrypoints.openai.protocol import (ChatCompletionRequest,
                                              DeltaMessage)
from vllm.entrypoints.openai.reasoning_parsers.abs_reasoning_parsers import (
    ReasoningParser, ReasoningParserManager)
from vllm.transformers_utils.tokenizer import AnyTokenizer


# define a reasoning parser and register it to vllm
# the name list in register_module can be used
# in --reasoning-parser.
@ReasoningParserManager.register_module(["example"])
class ExampleParser(ReasoningParser):

    def __init__(self, tokenizer: AnyTokenizer):
        super().__init__(tokenizer)

    def extract_reasoning_content_streaming(
        self,
        previous_text: str,
        current_text: str,
        delta_text: str,
        previous_token_ids: Sequence[int],
        current_token_ids: Sequence[int],
        delta_token_ids: Sequence[int],
    ) -> Union[DeltaMessage, None]:
        """
        Instance method that should be implemented for extracting reasoning
        from an incomplete response; for use when handling reasoning calls and
        streaming. Has to be an instance method because it requires state -
        the current tokens/diffs, but also the information about what has
        previously been parsed and extracted (see constructor)
        """

    def extract_reasoning_content(
        self, model_output: str, request: ChatCompletionRequest
    ) -> Tuple[Optional[str], Optional[str]]:
        """
        Extract reasoning content from a complete model-generated string.

        Used for non-streaming responses where we have the entire model
        response available before sending to the client.

        Parameters:
        model_output: str
            The model-generated string to extract reasoning content from.
        request: ChatCompletionRequest
            The request object that was used to generate the model_output.

        Returns:
        Tuple[Optional[str], Optional[str]]
            A tuple containing the reasoning content and the content.
        """
After defining the reasoning parser, you can use it by passing its registered name to the --reasoning-parser flag when starting the server.
vllm serve <model_tag> \
    --enable-reasoning --reasoning-parser example
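The streaming method is the harder of the two to implement, because each call sees only one delta plus the text that came before it. The following standalone sketch shows one way the decision could be made using text alone. It is a simplified illustration, not the deepseek_r1 implementation: split_delta is a hypothetical function, it ignores the token-id arguments, and it assumes each tag arrives whole within a single delta.

```python
from typing import Optional, Tuple

THINK_START = "<think>"
THINK_END = "</think>"


def split_delta(
    previous_text: str, delta_text: str
) -> Tuple[Optional[str], Optional[str]]:
    """Classify one streamed delta as (reasoning_delta, content_delta).

    Hypothetical, text-only sketch. Assumes a tag is never split
    across two deltas.
    """
    if THINK_END in previous_text:
        # Reasoning was already closed: everything new is final content.
        return None, delta_text or None

    if THINK_END in delta_text:
        # The closing tag arrives in this delta: text before it is still
        # reasoning, text after it starts the final content.
        reasoning, _, content = delta_text.partition(THINK_END)
        reasoning = reasoning.replace(THINK_START, "")
        return reasoning or None, content or None

    # Still inside the reasoning section; drop the opening tag itself.
    reasoning = delta_text.replace(THINK_START, "")
    return reasoning or None, None


if __name__ == "__main__":
    previous = ""
    for delta in ["<think>", "9.8 is", " larger", "</think>", "9.8"]:
        reasoning_delta, content_delta = split_delta(previous, delta)
        print(repr(delta), "->", reasoning_delta, content_delta)
        previous += delta
```

In a real parser, the two returned pieces would be wrapped in a DeltaMessage, and a delta that yields neither piece would return None so that no empty chunk is sent to the client.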
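Limitations
The reasoning content is only available for the chat completion endpoint of online serving (/v1/chat/completions).
It is not compatible with the structured_outputs and tool_calling features.
The reasoning content is not available for all models. Check the model's documentation to see whether it supports reasoning.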