DeepSeek-V3.2 Usage Guide¶
Introduction¶
DeepSeek-V3.2 is a model that balances computational efficiency with strong reasoning and agent capabilities through three technical innovations:

- DeepSeek Sparse Attention (DSA): an efficient attention mechanism that reduces computational complexity while preserving performance, optimized for long-context scenarios.
- A scalable reinforcement learning framework: the model reaches GPT-5-level performance through a robust RL protocol and scalable post-training compute. The high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and matches Gemini-3.0-Pro on reasoning, achieving gold-medal-level results at the 2025 IMO and IOI competitions.
- A large-scale agentic task synthesis pipeline: a novel data synthesis pipeline that generates training data at scale, integrates reasoning into tool-use scenarios, and improves the model's compliance and generalization in complex interactive environments.
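DSA's exact indexer design is specified in DeepSeek's technical report and is not reproduced here. Purely as intuition for what "sparse attention" means, the toy NumPy sketch below (all names are illustrative, not DeepSeek's implementation) has each query attend only to its top_k highest-scoring keys instead of the full sequence:

import numpy as np

# Toy illustration only: generic causal top-k sparse attention,
# NOT DeepSeek's actual DSA implementation.
def topk_sparse_attention(q, k, v, top_k=4):
    # q, k, v: (seq_len, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])             # (seq, seq) attention logits
    causal = np.tril(np.ones_like(scores, dtype=bool))  # causal mask
    scores = np.where(causal, scores, -np.inf)
    # Keep only the top_k logits per query row; mask out the rest.
    kth = np.sort(scores, axis=-1)[:, -top_k][:, None]
    scores = np.where(scores >= kth, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                                   # (seq, d)

rng = np.random.default_rng(0)
q = k = v = rng.standard_normal((8, 16))
print(topk_sparse_attention(q, k, v).shape)  # (8, 16)

Note that this toy version still computes all logits before masking; the point of a real sparse-attention design is to select the top-k keys without materializing the full score matrix.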
Installing vLLM¶
uv venv
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly
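As a quick sanity check (ours, not part of the upstream instructions), verify that the nightly wheel imports cleanly:

# Sanity check: the nightly wheel should import and report its version.
import vllm
print(vllm.__version__)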
Launching DeepSeek-V3.2¶
- The chat template of DeepSeek-V3.2 has changed significantly. vLLM accommodates this via --tokenizer-mode deepseek_v32.
vllm serve deepseek-ai/DeepSeek-V3.2 \
--tensor-parallel-size 8 \
--tokenizer-mode deepseek_v32 \
--tool-call-parser deepseek_v32 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v3
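Once the server reports it is ready, a minimal smoke test confirms the endpoint responds. This sketch assumes the default listen address http://localhost:8000/v1; any string works as the API key unless the server was started with --api-key.

from openai import OpenAI

# Assumes the default vLLM OpenAI-compatible endpoint; adjust base_url if needed.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.2",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)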
Accuracy Benchmarks¶
GSM8K¶
- Script
lm_eval --model local-completions --model_args "model=deepseek-ai/DeepSeek-V3.2,base_url=http://0.0.0.0:8000/v1/completions,max_length=8192,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" --tasks gsm8k --num_fewshot 5
- Results
local-completions (model=deepseek-ai/DeepSeek-V3.2,base_url=http://0.0.0.0:8000/v1/completions,max_length=8192,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9560|± |0.0056|
| | |strict-match | 5|exact_match|↑ |0.9553|± |0.0057|
AIME25¶
- Script
lm_eval --model local-chat-completions --model_args "model=deepseek-ai/DeepSeek-V3.2,base_url=http://0.0.0.0:8000/v1/chat/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=20,timeout=5000,max_length=72768" --tasks aime25 --apply_chat_template --gen_kwargs '{"temperature":1.0,"max_gen_toks":72768,"top_p":0.95,"chat_template_kwargs":{"thinking":true}}' --log_samples --output_path "aime25_ds32"
- Results
local-chat-completions (model=deepseek-ai/DeepSeek-V3.2,base_url=http://0.0.0.0:8000/v1/chat/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=20,timeout=5000,max_length=72768), gen_kwargs: ({'temperature': 1.0, 'max_gen_toks': 72768, 'top_p': 0.95, 'chat_template_kwargs': {'thinking': True}}), limit: None, num_fewshot: None, batch_size: 1
|Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------|------:|------|-----:|-----------|---|-----:|---|-----:|
|aime25| 0|none | 0|exact_match|↑ |0.9333|± |0.0463|
Performance Benchmarks¶
We benchmarked deepseek-ai/DeepSeek-V3.2 on 8xH20 using the following script.
vllm bench serve \
--model deepseek-ai/DeepSeek-V3.2 \
--dataset-name random \
--random-input-len 2048 \
--random-output-len 1024 \
--request-rate 10 \
--num-prompts 100 \
--trust-remote-code
TP8 Benchmark Output¶
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Request rate configured (RPS): 10.00
Benchmark duration (s): 129.34
Total input tokens: 204800
Total generated tokens: 102400
Request throughput (req/s): 0.77
Output token throughput (tok/s): 791.73
Peak output token throughput (tok/s): 1300.00
Peak concurrent requests: 100.00
Total Token throughput (tok/s): 2375.18
---------------Time to First Token----------------
Mean TTFT (ms): 21147.20
Median TTFT (ms): 21197.97
P99 TTFT (ms): 41133.00
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 99.71
Median TPOT (ms): 99.25
P99 TPOT (ms): 124.28
---------------Inter-token Latency----------------
Mean ITL (ms): 99.71
Median ITL (ms): 76.89
P99 ITL (ms): 2032.37
==================================================
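As a quick plausibility check on the report above, the aggregate figures can be recomputed from the raw totals (a sketch using the numbers printed by the benchmark; small deviations come from rounding of the duration):

# Recompute the aggregate throughput figures from the reported totals.
duration_s = 129.34
total_input_tokens = 204_800
total_generated_tokens = 102_400
num_requests = 100

print(num_requests / duration_s)                                   # ~0.77 req/s
print(total_generated_tokens / duration_s)                         # ~791.7 output tok/s
print((total_input_tokens + total_generated_tokens) / duration_s)  # ~2375 total tok/s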
Usage Tips¶
- You can refer to the DeepSeek-V3_2-Exp example and the data parallel deployment documentation for experiments and benchmarking, to choose a parallelism configuration suited to your scenario.
- For thinking mode and non-thinking mode, refer to the DeepSeek-V3_1 example; a minimal request-level toggle is sketched below.
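With vLLM the mode is chosen per request through chat_template_kwargs. A minimal sketch (it assumes the server from the launch command above, listening on http://localhost:8000/v1):

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# thinking=True enables the reasoning trace; thinking=False requests a direct answer.
for thinking in (True, False):
    resp = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V3.2",
        messages=[{"role": "user", "content": "What is 17 * 23?"}],
        extra_body={"chat_template_kwargs": {"thinking": thinking}},
    )
    print(f"thinking={thinking}: {resp.choices[0].message.content}")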
Tool Calling Example¶
DeepSeek 3.2's thinking mode now supports tool calling; see the DeepSeek API documentation. The model can run multiple rounds of reasoning and tool calls before emitting its final answer. The code example below is copied directly from the official DeepSeek example. For vLLM, the main modifications are:

- To enable thinking mode in vLLM, use extra_body = {"chat_template_kwargs": {"thinking": True}}. The official DeepSeek API enables thinking mode with extra_body = {"thinking": {"type": "enabled"}}.
- For the think field, vLLM recommends using reasoning, while the official DeepSeek API uses reasoning_content.
- In vLLM, when there are no tool calls, tool_calls is an empty list ([]). The official DeepSeek API instead returns None.
- Additionally, a request to vLLM must pass the served model name in the model field (deepseek-ai/DeepSeek-V3.2 with the launch command above) instead of deepseek-chat.
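If the same client code has to run against both backends, these differences can be absorbed in one place. The helper below is a hypothetical addition of ours, not part of the official example:

def normalize(message):
    # Return (reasoning, tool_calls) for either backend:
    # vLLM exposes `reasoning` and [] when there are no tool calls;
    # the official DeepSeek API exposes `reasoning_content` and None.
    reasoning = getattr(message, "reasoning", None) or getattr(
        message, "reasoning_content", None
    )
    return reasoning, (message.tool_calls or [])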
import os
import json
from openai import OpenAI
# The definition of the tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_date",
            "description": "Get the current date",
            "parameters": { "type": "object", "properties": {} },
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather of a location, the user should supply the location and date.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": { "type": "string", "description": "The city name" },
                    "date": { "type": "string", "description": "The date in format YYYY-mm-dd" },
                },
                "required": ["location", "date"]
            },
        }
    },
]
# The mocked version of the tool calls
def get_date_mock():
    return "2025-12-01"

def get_weather_mock(location, date):
    return "Cloudy 7~13°C"

TOOL_CALL_MAP = {
    "get_date": get_date_mock,
    "get_weather": get_weather_mock
}
def clear_reasoning_content(messages):
    for message in messages:
        # DeepSeek official API
        # if hasattr(message, 'reasoning_content'):
        #     message.reasoning_content = None
        # vLLM Server
        if hasattr(message, 'reasoning'):
            message.reasoning = None
def run_turn(turn, messages):
    sub_turn = 1
    while True:
        response = client.chat.completions.create(
            # model='deepseek-chat',  # DeepSeek official API
            model='deepseek-ai/DeepSeek-V3.2',  # vLLM Server: must be the served model name
            messages=messages,
            tools=tools,
            # extra_body={ "thinking": { "type": "enabled" } }  # DeepSeek official API
            extra_body={"chat_template_kwargs": {"thinking": True}}  # vLLM Server
        )
        messages.append(response.choices[0].message)
        # DeepSeek API
        # reasoning_content = response.choices[0].message.reasoning_content
        # vLLM Server
        reasoning_content = response.choices[0].message.reasoning
        content = response.choices[0].message.content
        tool_calls = response.choices[0].message.tool_calls
        print(f"Turn {turn}.{sub_turn}\n{reasoning_content=}\n{content=}\n{tool_calls=}")
        # If there are no tool calls, the model has produced its final answer and we stop the loop.
        # In the DeepSeek API, tool_calls is None when there are no tool calls:
        # if tool_calls is None:
        # In vLLM, tool_calls is [] when there are no tool calls:
        if not tool_calls:
            break
        for tool in tool_calls:
            tool_function = TOOL_CALL_MAP[tool.function.name]
            tool_result = tool_function(**json.loads(tool.function.arguments))
            print(f"tool result for {tool.function.name}: {tool_result}\n")
            messages.append({
                "role": "tool",
                "tool_call_id": tool.id,
                "content": tool_result,
            })
        sub_turn += 1
# You can launch the vLLM server with the following command:
# vllm serve deepseek-ai/DeepSeek-V3.2 \
#     --tensor-parallel-size 8 \
#     --tokenizer-mode deepseek_v32 \
#     --tool-call-parser deepseek_v32 \
#     --enable-auto-tool-choice \
#     --reasoning-parser deepseek_v3
client = OpenAI(
    api_key=os.environ.get('DEEPSEEK_API_KEY'),
    base_url=os.environ.get('DEEPSEEK_BASE_URL'),
)
# The user starts a question
turn = 1
messages = [{
    "role": "user",
    "content": "How's the weather in Hangzhou Tomorrow"
}]
run_turn(turn, messages)

# The user starts a new question
turn = 2
messages.append({
    "role": "user",
    "content": "How's the weather in Hangzhou Tomorrow"
})
# We recommend clearing the reasoning content in history messages to save network bandwidth.
clear_reasoning_content(messages)
run_turn(turn, messages)
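When pointing this script at a local vLLM server, set DEEPSEEK_BASE_URL to the server's OpenAI-compatible endpoint (for example http://localhost:8000/v1). vLLM does not check the API key unless it was started with --api-key, so DEEPSEEK_API_KEY can be any placeholder string.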
vLLM Server Output¶
Turn 1.1
reasoning_content="I need to help the user with weather in Hangzhou tomorrow. First, I need to get the current date to determine tomorrow's date. Then I can use the weather function. Let me start by getting the current date."
content=None
tool_calls=[ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-a2de4337498c482c', function=Function(arguments='{}', name='get_date'), type='function')]
tool result for get_date: 2025-12-01
Turn 1.2
reasoning_content='Today is December 1, 2025. Tomorrow would be December 2, 2025. Now I can get the weather for Hangzhou for that date.'
content=None
tool_calls=[ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-b11e7a47d3b689ea', function=Function(arguments='{"location": "Hangzhou", "date": "2025-12-02"}', name='get_weather'), type='function')]
tool result for get_weather: Cloudy 7~13°C
Turn 1.3
reasoning_content="I have the weather information: Cloudy with temperatures between 7°C and 13°C. I should provide this to the user in a clear and friendly manner. I'll mention that this is for tomorrow, December 2, 2025. Let me craft the response."
content='The weather in Hangzhou **tomorrow, Tuesday, December 2, 2025**, will be **Cloudy** with temperatures ranging from **7°C to 13°C**.'
tool_calls=[]
Turn 2.1
reasoning_content='The user is asking about the weather in Hangzhou tomorrow again. I already answered this question in the previous exchange, but I should check if "tomorrow" still refers to the same date or if there\'s a new context. The current date is December 1, 2025, so tomorrow would be December 2, 2025. I already provided that information. However, maybe the user is asking again because they want to confirm or maybe they didn\'t see the previous answer? Looking at the conversation, I provided the weather for tomorrow (December 2, 2025). The user\'s latest question is identical to the first one. I should probably respond with the same information, but perhaps acknowledge that I already provided this information. However, since the conversation continues, maybe they want additional details or something else? The weather tool only gives basic info: "Cloudy 7~13°C". I could present it again. But maybe the user expects a different format or more details? I could just repeat the answer. Let me respond politely with the same information.'
content="The weather in Hangzhou **tomorrow, Tuesday, December 2, 2025**, will be **Cloudy** with temperatures ranging from **7°C to 13°C**. \n\nThis is the same forecast I provided earlier - it looks like tomorrow's weather will be consistently cloudy with cool temperatures."
tool_calls=[]
DeepSeek Official API Output¶
Turn 1.1
reasoning_content="The user is asking about the weather in Hangzhou tomorrow. I need to get the current date to determine what tomorrow's date is, then use that to get the weather forecast. Let me first get the current date."
content=''
tool_calls=[ChatCompletionMessageFunctionToolCall(id='call_00_OOAEfTpXddWI9rgC75bfYQJY', function=Function(arguments='{}', name='get_date'), type='function', index=0)]
tool result for get_date: 2025-12-01
Turn 1.2
reasoning_content='Today is December 1, 2025. Tomorrow would be December 2, 2025. So I need to get the weather for Hangzhou on 2025-12-02. Now I can call get_weather with location Hangzhou and date 2025-12-02.'
content=''
tool_calls=[ChatCompletionMessageFunctionToolCall(id='call_00_3P0Xqw5MrVhklmQ4QSACbDq6', function=Function(arguments='{"location": "Hangzhou", "date": "2025-12-02"}', name='get_weather'), type='function', index=0)]
tool result for get_weather: Cloudy 7~13°C
Turn 1.3
reasoning_content='Now I have the weather information: Cloudy with temperatures between 7 and 13 degrees Celsius. I should provide this to the user in a friendly manner. I can mention that tomorrow is December 2nd, and give the forecast. Let me craft the response.'
content='Tomorrow (December 2, 2025) in Hangzhou, the weather will be **cloudy** with temperatures ranging from **7°C to 13°C**.'
tool_calls=None
Turn 2.1
reasoning_content='The user is asking about the weather in Hangzhou tomorrow. I already answered this question in the previous interaction. However, I should check if "tomorrow" is still the same date. The current date is 2025-12-01. Tomorrow would be 2025-12-02. I already provided the weather for that date: Cloudy 7~13°C. \n\nBut wait, the user might be asking again, perhaps not noticing the previous answer. Or maybe they want a different presentation. I should answer again, but maybe with a slightly different phrasing. Also, I should confirm that "tomorrow" is indeed 2025-12-02.\n\nI could just repeat the information. But perhaps I should check if the date has changed? The current date is still 2025-12-01. So tomorrow is still 2025-12-02. I already have the weather data.\n\nI\'ll respond with the weather information again.'
content='Based on the previous query, tomorrow (December 2, 2025) in Hangzhou will be **cloudy** with temperatures between **7°C and 13°C**.'
tool_calls=None