
DeepSeek-V3.2 Usage Guide

Introduction

DeepSeek-V3.2 is a model that balances computational efficiency with strong reasoning and agentic capabilities through three technical innovations:

  • DeepSeek Sparse Attention (DSA): an efficient attention mechanism that reduces computational complexity while preserving performance, optimized for long-context scenarios.

  • A scalable reinforcement learning framework: through a robust RL protocol and scalable post-training compute, the model reaches GPT-5-level performance. The high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5, matches Gemini-3.0-Pro on reasoning, and achieved gold-medal-level results at the 2025 IMO and IOI competitions.

  • A large-scale agentic task synthesis pipeline: a novel data synthesis pipeline that generates training data at scale, integrates reasoning into tool-use scenarios, and improves the model's compliance and generalization in complex interactive environments.

Install vLLM

uv venv
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly
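
To confirm the nightly wheel landed in the active environment, a quick sanity check:

python -c "import vllm; print(vllm.__version__)"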

Launch DeepSeek-V3.2

  • DeepSeek-V3.2 ships with a substantially changed chat template. vLLM accommodates this via --tokenizer-mode deepseek_v32.
vllm serve deepseek-ai/DeepSeek-V3.2 \
  --tensor-parallel-size 8 \
  --tokenizer-mode deepseek_v32 \
  --tool-call-parser deepseek_v32 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v3
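
Once the server is up, you can run a quick smoke test from Python. This is a minimal sketch; it assumes the server listens on the default port 8000 with no API key required, and serves the model under its full name:

from openai import OpenAI

# Assumes the vllm serve command above is running locally on the default port.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.2",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    # Enable thinking mode (see the usage tips section below).
    extra_body={"chat_template_kwargs": {"thinking": True}},
)

message = response.choices[0].message
# With --reasoning-parser deepseek_v3, the chain of thought is returned
# separately from the final answer.
print(getattr(message, "reasoning", None))
print(message.content)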

Accuracy Benchmarks

GSM8K

  • Script
lm_eval --model local-completions --model_args "model=deepseek-ai/DeepSeek-V3.2,base_url=http://0.0.0.0:8000/v1/completions,max_length=8192,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" --tasks gsm8k --num_fewshot 5
  • Results
local-completions (model=deepseek-ai/DeepSeek-V3.2,base_url=http://0.0.0.0:8000/v1/completions,max_length=8192,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|  |0.9560|±  |0.0056|
|     |       |strict-match    |     5|exact_match|  |0.9553|±  |0.0057|

AIME25

  • Script
lm_eval --model local-chat-completions --model_args "model=deepseek-ai/DeepSeek-V3.2,base_url=http://0.0.0.0:8000/v1/chat/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=20,timeout=5000,max_length=72768" --tasks aime25 --apply_chat_template --gen_kwargs '{"temperature":1.0,"max_gen_toks":72768,"top_p":0.95,"chat_template_kwargs":{"thinking":true}}' --log_samples --output_path "aime25_ds32"    
  • Results
local-chat-completions (model=deepseek-ai/DeepSeek-V3.2,base_url=http://0.0.0.0:8000/v1/chat/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=20,timeout=5000,max_length=72768), gen_kwargs: ({'temperature': 1.0, 'max_gen_toks': 72768, 'top_p': 0.95, 'chat_template_kwargs': {'thinking': True}}), limit: None, num_fewshot: None, batch_size: 1
|Tasks |Version|Filter|n-shot|  Metric   |   |Value |   |Stderr|
|------|------:|------|-----:|-----------|---|-----:|---|-----:|
|aime25|      0|none  |     0|exact_match|  |0.9333|±  |0.0463|

Performance Benchmarks

We benchmarked deepseek-ai/DeepSeek-V3.2 on 8xH20 GPUs using the following script.

vllm bench serve \
  --model deepseek-ai/DeepSeek-V3.2 \
  --dataset-name random \
  --random-input-len 2048 \
  --random-output-len 1024 \
  --request-rate 10 \
  --num-prompts 100 \
  --trust-remote-code

TP8 Benchmark Output

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  129.34    
Total input tokens:                      204800    
Total generated tokens:                  102400    
Request throughput (req/s):              0.77      
Output token throughput (tok/s):         791.73    
Peak output token throughput (tok/s):    1300.00   
Peak concurrent requests:                100.00    
Total Token throughput (tok/s):          2375.18   
---------------Time to First Token----------------
Mean TTFT (ms):                          21147.20  
Median TTFT (ms):                        21197.97  
P99 TTFT (ms):                           41133.00  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          99.71     
Median TPOT (ms):                        99.25     
P99 TPOT (ms):                           124.28    
---------------Inter-token Latency----------------
Mean ITL (ms):                           99.71     
Median ITL (ms):                         76.89     
P99 ITL (ms):                            2032.37   
==================================================

Usage Tips

Tool Calling Example

DeepSeek-V3.2's thinking mode now supports tool calls; see the DeepSeek API documentation. The model can run multiple rounds of reasoning and tool calls before emitting its final answer. The code example below is copied directly from the official DeepSeek example; for vLLM, the main modifications are as follows:

  • To enable thinking mode in vLLM, use extra_body = {"chat_template_kwargs": {"thinking": True}}. In the official DeepSeek API, thinking mode is enabled with extra_body = {"thinking": {"type": "enabled"}}.

  • For the think field, vLLM recommends reasoning, while the official DeepSeek API uses reasoning_content.

  • In vLLM, when there are no tool calls, tool_calls is an empty list ([]). The official DeepSeek API instead returns None. (A small helper sketched after this list smooths over both of these differences.)
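
Before the full example, note that client code can be made to work against both backends with a small helper. This is a hypothetical sketch, not part of either API:

def extract_message_fields(message):
    """Normalize a chat completion message across vLLM and the DeepSeek API."""
    # vLLM exposes the chain of thought as `reasoning`; the official
    # DeepSeek API calls it `reasoning_content`.
    reasoning = getattr(message, "reasoning", None) or getattr(message, "reasoning_content", None)
    # vLLM returns [] when there are no tool calls, while the official API
    # returns None; normalizing to a list lets callers iterate directly.
    tool_calls = message.tool_calls or []
    return reasoning, message.content, tool_calls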

import os
import json
from openai import OpenAI

# The definition of the tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_date",
            "description": "Get the current date",
            "parameters": { "type": "object", "properties": {} },
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather of a location, the user should supply the location and date.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": { "type": "string", "description": "The city name" },
                    "date": { "type": "string", "description": "The date in format YYYY-mm-dd" },
                },
                "required": ["location", "date"]
            },
        }
    },
]

# The mocked version of the tool calls
def get_date_mock():
    return "2025-12-01"

def get_weather_mock(location, date):
    return "Cloudy 7~13°C"

TOOL_CALL_MAP = {
    "get_date": get_date_mock,
    "get_weather": get_weather_mock
}

def clear_reasoning_content(messages):
    for message in messages:

        # DeepSeek official API
        # if hasattr(message, 'reasoning_content'):
        #     message.reasoning_content = None

        #  vLLM Server
        if hasattr(message, 'reasoning'):
            message.reasoning = None

def run_turn(turn, messages):
    sub_turn = 1
    while True:
        response = client.chat.completions.create(
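            # Note: 'deepseek-chat' is the model name on the official DeepSeek API.
            # When pointing at a local vLLM server, pass the served model name
            # instead (e.g. 'deepseek-ai/DeepSeek-V3.2').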
            model='deepseek-chat',
            messages=messages,
            tools=tools,
            # extra_body={ "thinking": { "type": "enabled" } } # DeepSeek official API
            extra_body = {"chat_template_kwargs": {"thinking": True}} # vLLM Server
        )
        messages.append(response.choices[0].message)
        #  DeepSeek API
        # reasoning_content = response.choices[0].message.reasoning_content
        # vLLM Server
        reasoning_content = response.choices[0].message.reasoning
        content = response.choices[0].message.content
        tool_calls = response.choices[0].message.tool_calls
        print(f"Turn {turn}.{sub_turn}\n{reasoning_content=}\n{content=}\n{tool_calls=}")
        # If there are no tool calls, the model has produced its final answer
        # and we can stop the loop.
        # In the DeepSeek official API, tool_calls is None when there are none:
        # if tool_calls is None:
        # In vLLM, tool_calls is [] when there are none:
        if not tool_calls:
            break
        for tool in tool_calls:
            tool_function = TOOL_CALL_MAP[tool.function.name]
            tool_result = tool_function(**json.loads(tool.function.arguments))
            print(f"tool result for {tool.function.name}: {tool_result}\n")
            messages.append({
                "role": "tool",
                "tool_call_id": tool.id,
                "content": tool_result,
            })
        sub_turn += 1

# You can run the vLLM server using the following command:
#   vllm serve deepseek-ai/DeepSeek-V3.2 \
#   --tensor-parallel-size 8 \
#   --tokenizer-mode deepseek_v32 \
#   --tool-call-parser deepseek_v32 \
#   --enable-auto-tool-choice \
#   --reasoning-parser deepseek_v3

client = OpenAI(
    api_key=os.environ.get('DEEPSEEK_API_KEY'),
    base_url=os.environ.get('DEEPSEEK_BASE_URL'),
)

# The user starts a question
turn = 1
messages = [{
    "role": "user",
    "content": "How's the weather in Hangzhou Tomorrow"
}]
run_turn(turn, messages)

# The user starts a new question
turn = 2
messages.append({
    "role": "user",
    "content": "How's the weather in Hangzhou Tomorrow"
})
# We recommend clearing the reasoning content from history messages to save network bandwidth
clear_reasoning_content(messages)
run_turn(turn, messages)

vLLM Server Output

Turn 1.1
reasoning_content="I need to help the user with weather in Hangzhou tomorrow. First, I need to get the current date to determine tomorrow's date. Then I can use the weather function. Let me start by getting the current date."
content=None
tool_calls=[ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-a2de4337498c482c', function=Function(arguments='{}', name='get_date'), type='function')]
tool result for get_date: 2025-12-01

Turn 1.2
reasoning_content='Today is December 1, 2025. Tomorrow would be December 2, 2025. Now I can get the weather for Hangzhou for that date.'
content=None
tool_calls=[ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-b11e7a47d3b689ea', function=Function(arguments='{"location": "Hangzhou", "date": "2025-12-02"}', name='get_weather'), type='function')]
tool result for get_weather: Cloudy 7~13°C

Turn 1.3
reasoning_content="I have the weather information: Cloudy with temperatures between 7°C and 13°C. I should provide this to the user in a clear and friendly manner. I'll mention that this is for tomorrow, December 2, 2025. Let me craft the response."
content='The weather in Hangzhou **tomorrow, Tuesday, December 2, 2025**, will be **Cloudy** with temperatures ranging from **7°C to 13°C**.'
tool_calls=[]
Turn 2.1
reasoning_content='The user is asking about the weather in Hangzhou tomorrow again. I already answered this question in the previous exchange, but I should check if "tomorrow" still refers to the same date or if there\'s a new context. The current date is December 1, 2025, so tomorrow would be December 2, 2025. I already provided that information. However, maybe the user is asking again because they want to confirm or maybe they didn\'t see the previous answer? Looking at the conversation, I provided the weather for tomorrow (December 2, 2025). The user\'s latest question is identical to the first one. I should probably respond with the same information, but perhaps acknowledge that I already provided this information. However, since the conversation continues, maybe they want additional details or something else? The weather tool only gives basic info: "Cloudy 7~13°C". I could present it again. But maybe the user expects a different format or more details? I could just repeat the answer. Let me respond politely with the same information.'
content="The weather in Hangzhou **tomorrow, Tuesday, December 2, 2025**, will be **Cloudy** with temperatures ranging from **7°C to 13°C**. \n\nThis is the same forecast I provided earlier - it looks like tomorrow's weather will be consistently cloudy with cool temperatures."
tool_calls=[]

DeepSeek Official API Output

Turn 1.1
reasoning_content="The user is asking about the weather in Hangzhou tomorrow. I need to get the current date to determine what tomorrow's date is, then use that to get the weather forecast. Let me first get the current date."
content=''
tool_calls=[ChatCompletionMessageFunctionToolCall(id='call_00_OOAEfTpXddWI9rgC75bfYQJY', function=Function(arguments='{}', name='get_date'), type='function', index=0)]
tool result for get_date: 2025-12-01

Turn 1.2
reasoning_content='Today is December 1, 2025. Tomorrow would be December 2, 2025. So I need to get the weather for Hangzhou on 2025-12-02. Now I can call get_weather with location Hangzhou and date 2025-12-02.'
content=''
tool_calls=[ChatCompletionMessageFunctionToolCall(id='call_00_3P0Xqw5MrVhklmQ4QSACbDq6', function=Function(arguments='{"location": "Hangzhou", "date": "2025-12-02"}', name='get_weather'), type='function', index=0)]
tool result for get_weather: Cloudy 7~13°C

Turn 1.3
reasoning_content='Now I have the weather information: Cloudy with temperatures between 7 and 13 degrees Celsius. I should provide this to the user in a friendly manner. I can mention that tomorrow is December 2nd, and give the forecast. Let me craft the response.'
content='Tomorrow (December 2, 2025) in Hangzhou, the weather will be **cloudy** with temperatures ranging from **7°C to 13°C**.'
tool_calls=None
Turn 2.1
reasoning_content='The user is asking about the weather in Hangzhou tomorrow. I already answered this question in the previous interaction. However, I should check if "tomorrow" is still the same date. The current date is 2025-12-01. Tomorrow would be 2025-12-02. I already provided the weather for that date: Cloudy 7~13°C. \n\nBut wait, the user might be asking again, perhaps not noticing the previous answer. Or maybe they want a different presentation. I should answer again, but maybe with a slightly different phrasing. Also, I should confirm that "tomorrow" is indeed 2025-12-02.\n\nI could just repeat the information. But perhaps I should check if the date has changed? The current date is still 2025-12-01. So tomorrow is still 2025-12-02. I already have the weather data.\n\nI\'ll respond with the weather information again.'
content='Based on the previous query, tomorrow (December 2, 2025) in Hangzhou will be **cloudy** with temperatures between **7°C and 13°C**.'
tool_calls=None