
Kimi-K2-Thinking Usage Guide

Kimi K2 Thinking is an advanced large language model created by moonshotai. Its highlights include:

  • Deep thinking & tool orchestration: trained end to end to interleave chain-of-thought reasoning with function calls, enabling autonomous research, coding, and writing workflows that sustain hundreds of steps without drifting.
  • Native INT4 quantization: quantization-aware training (QAT) is applied during the post-training stage, delivering a lossless 2x speedup in low-latency mode.
  • Stable long-horizon agentic behavior: maintains coherent, goal-directed behavior across as many as 200-300 consecutive tool calls, surpassing prior models that degrade after 30-50 steps.

Installing vLLM

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
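
To confirm the environment before downloading any model weights, a quick sanity check (nothing model-specific is assumed here):

python -c "import vllm; print(vllm.__version__)"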

Launching Kimi-K2-Thinking with vLLM

You can launch this model with 8x H200 or H20 GPUs. See the sections below for detailed launch parameters for the low-latency and high-throughput scenarios.

Low-latency scenario: run with tensor parallelism like this:
vllm serve moonshotai/Kimi-K2-Thinking \
  --tensor-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2  \
  --trust-remote-code
The `--reasoning-parser` flag specifies the parser used to extract reasoning content from the model's output.
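
Once the server is up, you can exercise tool calling and reasoning extraction together through the OpenAI-compatible API. A minimal sketch with curl (`get_weather` is a hypothetical tool defined only for illustration; with the parsers above enabled, the returned message should carry a reasoning_content field alongside any tool_calls):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [{"role": "user", "content": "What is the weather in Beijing?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'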
High-throughput scenario: vLLM supports [Decode Context Parallel](https://docs.vllm.com.cn/en/latest/serving/context_parallel_deployment.html#decode-context-parallel), which delivers significant benefits in high-throughput scenarios. You can enable DCP by adding `--decode-context-parallel-size <number>`, for example:
vllm serve moonshotai/Kimi-K2-Thinking \
  --tensor-parallel-size 8 \
  --decode-context-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2  \
  --trust-remote-code
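A quick way to confirm that DCP actually took effect is to look for the KV-cache multiplier message in the startup output. The message below is quoted verbatim from the logs shown later in this guide; the vllm.log path is just an assumed location for your server logs:

grep "Multiplying the GPU KV cache size by the dcp_world_size" vllm.log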

Metrics

We tested GSM8K accuracy with both launch scripts (TP8 vs. TP8+DCP8).
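
The run headers below indicate the evaluation used lm-evaluation-harness against the local OpenAI-compatible endpoint; a sketch of the invocation, with flags inferred from those headers:

lm_eval --model local-completions \
  --model_args model=moonshotai/Kimi-K2-Thinking,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=32,tokenized_requests=False,tokenizer_backend=None \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size 1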

  • TP8
local-completions (model=moonshotai/Kimi-K2-Thinking,base_url=http://0.0.0.0:8000/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|  |0.9416|±  |0.0065|
|     |       |strict-match    |     5|exact_match|  |0.9409|±  |0.0065|
  • TP8+DCP8
local-completions (model=moonshotai/Kimi-K2-Thinking,temperature=0.0,base_url=http://0.0.0.0:8000/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32,timeout=3000), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|  |0.9386|±  |0.0066|
|     |       |strict-match    |     5|exact_match|  |0.9371|±  |0.0067|

Benchmarking

We benchmarked moonshotai/Kimi-K2-Thinking on 8x H200 using the following script:

vllm bench serve \
  --model moonshotai/Kimi-K2-Thinking \
  --dataset-name random \
  --random-input 8000 \
  --random-output 4000 \
  --request-rate 100 \
  --num-prompt 1000  \
  --trust-remote-code

We benchmarked the performance of TP8 and TP8+DCP8 separately.

TP8 benchmark output

============ Serving Benchmark Result ============
Successful requests:                     998       
Failed requests:                         2         
Request rate configured (RPS):           100.00    
Benchmark duration (s):                  800.26    
Total input tokens:                      7984000   
Total generated tokens:                  388750    
Request throughput (req/s):              1.25      
Output token throughput (tok/s):         485.78    
Peak output token throughput (tok/s):    2100.00   
Peak concurrent requests:                988.00    
Total Token throughput (tok/s):          10462.57  
---------------Time to First Token----------------
Mean TTFT (ms):                          271196.67 
Median TTFT (ms):                        227389.87 
P99 TTFT (ms):                           686294.46 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          381.29    
Median TPOT (ms):                        473.04    
P99 TPOT (ms):                           578.64    
---------------Inter-token Latency----------------
Mean ITL (ms):                           111.22    
Median ITL (ms):                         38.04     
P99 ITL (ms):                            490.93    
==================================================

TP8+DCP8 benchmark output

============ Serving Benchmark Result ============
Successful requests:                     994       
Failed requests:                         6         
Request rate configured (RPS):           100.00    
Benchmark duration (s):                  631.35    
Total input tokens:                      7952000   
Total generated tokens:                  438872    
Request throughput (req/s):              1.57      
Output token throughput (tok/s):         695.13    
Peak output token throughput (tok/s):    2618.00   
Peak concurrent requests:                988.00    
Total Token throughput (tok/s):          13290.35  
---------------Time to First Token----------------
Mean TTFT (ms):                          227780.13 
Median TTFT (ms):                        227055.20 
P99 TTFT (ms):                           451255.55 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          398.60    
Median TPOT (ms):                        472.81    
P99 TPOT (ms):                           569.91    
---------------Inter-token Latency----------------
Mean ITL (ms):                           104.73    
Median ITL (ms):                         43.97     
P99 ITL (ms):                            483.37    
==================================================

DCP Gain Analysis

We also analyzed the gains delivered by DCP.

|Metric|TP8|TP8+DCP8|Delta|Improvement (%)|
|------|--:|-------:|----:|--------------:|
|Request throughput (req/s)|1.25|1.57|+0.32|+25.6%|
|Output token throughput (tok/s)|485.78|695.13|+209.35|+43.1%|
|Mean TTFT (s)|271.2|227.8|-43.4|+16.0%|
|Median TTFT (s)|227.4|227.1|-0.3|+0.1%|

The serving startup logs show that the number of KV cache tokens increases by 8x:

  • TP8
    (Worker_TP0 pid=591236) INFO 11-06 12:08:54 [gpu_worker.py:349] Available KV cache memory: 46.80 GiB
    (EngineCore_DP0 pid=591074) INFO 11-06 12:08:55 [kv_cache_utils.py:1229] GPU KV cache size: 715,072 tokens
    
  • TP8+DCP8
    (Worker_TP0 pid=666845) INFO 11-06 15:34:58 [gpu_worker.py:349] Available KV cache memory: 46.80 GiB
    (EngineCore_DP0 pid=666657) INFO 11-06 15:34:59 [kv_cache_utils.py:1224] Multiplying the GPU KV cache size by the dcp_world_size 8.
    (EngineCore_DP0 pid=666657) INFO 11-06 15:34:59 [kv_cache_utils.py:1229] GPU KV cache size: 5,721,088 tokens
    

Enabling DCP brings substantial benefits (43% faster token generation, 26% higher request throughput) with virtually no downside (median latency even improves slightly). We recommend reading our DCP documentation and trying DCP on your own LLM workloads.