Qwen3-Coder 使用指南¶

Qwen3-Coder 是阿里巴巴达摩院 Qwen 团队开发的先进大型语言模型。vLLM 已支持 Qwen3-Coder，并且 tool-call 功能将在 vLLM v0.10.0 及更高版本中提供。您可以使用以下方法安装支持 tool-call 的 vLLM：

安装 vLLM¶

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

使用 vLLM 启动 Qwen3-Coder¶

在 8xH200 (或 H20) GPU 上提供服务 (141GB × 8)¶

BF16 模型

vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct \
  --max-model-len 32000 \
  --enable-expert-parallel \
  --tensor-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

FP8 模型

VLLM_USE_DEEP_GEMM=1 vllm serve Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
  --max-model-len 131072 \
  --enable-expert-parallel \
  --data-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

性能指标¶

评估¶

我们使用 vLLM 启动了 Qwen3-Coder-480B-A35B-Instruct-FP8，并使用 EvalPlus 对其性能进行了评估。结果显示如下：

数据集	测试类型	Pass@1 分数
HumanEval	基础测试	0.939
HumanEval+	基础 + 附加测试	0.902
MBPP	基础测试	0.918
MBPP+	基础 + 附加测试	0.794

基准测试¶

我们使用以下脚本对 Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 进行了基准测试：

vllm bench serve \
  --backend vllm \
  --model Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
  --endpoint /v1/completions \
  --dataset-name random \
  --random-input 2048 \
  --random-output 1024 \
  --max-concurrency 10 \
  --num-prompt 100 \

如果成功，您将看到以下输出。

============ Serving Benchmark Result ============
Successful requests:                     100
Benchmark duration (s):                  776.49
Total input tokens:                      204169
Total generated tokens:                  102400
Request throughput (req/s):              0.13
Output token throughput (tok/s):         131.88
Total Token throughput (tok/s):          394.81
---------------Time to First Token----------------
Mean TTFT (ms):                          7639.31
Median TTFT (ms):                        6935.71
P99 TTFT (ms):                           13766.68
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.43
Median TPOT (ms):                        67.23
P99 TPOT (ms):                           72.14
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.43
Median ITL (ms):                         66.34
P99 ITL (ms):                            69.38
==================================================

使用技巧¶

BF16 模型¶

上下文长度限制：单个 H20 节点无法处理原始上下文长度（262144）。您可以减小 max-model-len 或增加 gpu-memory-utilization 以在内存限制内工作。

FP8 模型¶

上下文长度限制：单个 H20 节点无法处理原始上下文长度（262144）。您可以减小 max-model-len 或增加 gpu-memory-utilization 以在内存限制内工作。
DeepGEMM 使用：要使用 DeepGEMM，请设置 VLLM_USE_DEEP_GEMM=1。请遵循设置说明进行安装。
张量并行问题：在使用 tensor-parallel-size 8 时，预期会出现以下失败。请切换到数据并行模式，使用 --data-parallel-size。
附加资源：有关更多并行组，请参阅数据并行部署文档。

ERROR [multiproc_executor.py:511]   File "/vllm/vllm/model_executor/models/qwen3_moe.py", line 336, in <lambda>
ERROR [multiproc_executor.py:511]     lambda prefix: Qwen3MoeDecoderLayer(config=config,
ERROR [multiproc_executor.py:511]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR [multiproc_executor.py:511]   File "/vllm/vllm/model_executor/models/qwen3_moe.py", line 278, in __init__
ERROR [multiproc_executor.py:511]     self.mlp = Qwen3MoeSparseMoeBlock(config=config,
ERROR [multiproc_executor.py:511]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR [multiproc_executor.py:511]   File "/vllm/vllm/model_executor/models/qwen3_moe.py", line 113, in __init__
ERROR [multiproc_executor.py:511]     self.experts = FusedMoE(num_experts=config.num_experts,
ERROR [multiproc_executor.py:511]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR [multiproc_executor.py:511]   File "/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 773, in __init__
ERROR [multiproc_executor.py:511]     self.quant_method.create_weights(layer=self, **moe_quant_params)
ERROR [multiproc_executor.py:511]   File "/vllm/vllm/model_executor/layers/quantization/fp8.py", line 573, in create_weights
ERROR [multiproc_executor.py:511]     raise ValueError(
ERROR [multiproc_executor.py:511] ValueError: The output_size of gate's and up's weight = 320 is not divisible by weight quantization block_n = 128.

工具调用¶

启用工具调用：添加 --tool-call-parser qwen3_coder 以启用工具调用解析功能，请参阅：tool_calling

路线图¶

[x] 添加基准测试结果