# Kimi-K2-Thinking Usage Guide
Kimi K2 Thinking is an advanced large language model created by moonshotai. Its highlights include:

- Deep thinking & tool orchestration: trained end to end to interleave chain-of-thought reasoning with function calls, enabling autonomous research, coding, and writing workflows that sustain hundreds of steps without drifting.
- Native INT4 quantization: quantization-aware training (QAT) in the post-training stage delivers a lossless 2× speedup in low-latency mode.
- Stable long-horizon agency: maintains coherent, goal-directed behavior across 200-300 consecutive tool calls, surpassing earlier models whose performance degrades after 30-50 steps.
## Installing vLLM
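A minimal installation sketch is shown below; the exact command and version requirement are assumptions, since Kimi-K2-Thinking support may require a recent (or nightly) vLLM build. Check the official vLLM installation docs for the precise requirement.

```bash
# Installation sketch (assumption: a recent vLLM release is sufficient).
# If the latest release does not yet include Kimi-K2-Thinking support,
# a nightly build may be required; see the vLLM installation docs.
pip install -U vllm
```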
## Launching Kimi-K2-Thinking with vLLM

You can launch this model on 8× H200 or H20 GPUs. See the sections below for detailed launch parameters for the low-latency and high-throughput scenarios.
### Low-latency scenario

Run the model with tensor parallelism, as in the sketch below. The `--reasoning-parser` flag specifies the reasoning parser used to extract reasoning content from the model's output.
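A minimal launch sketch, assuming a `kimi_k2` reasoning parser name and defaults for everything not shown; consult the vLLM serving docs for the full recommended flag set:

```bash
# Low-latency launch sketch: tensor parallelism across 8 GPUs.
# The reasoning parser name "kimi_k2" is an assumption; check
# `vllm serve --help` for the parsers available in your vLLM version.
vllm serve moonshotai/Kimi-K2-Thinking \
    --tensor-parallel-size 8 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code
```

Once the server is up, a quick smoke test confirms that reasoning extraction works: with a reasoning parser configured, vLLM's OpenAI-compatible API typically returns the extracted chain of thought in a `reasoning_content` field alongside the final `content` (verify the field name against your vLLM version).

```bash
# Send one chat completion and inspect the response for reasoning_content.
curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "moonshotai/Kimi-K2-Thinking",
          "messages": [{"role": "user", "content": "What is 17 * 24?"}],
          "max_tokens": 512
        }'
```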
### High-throughput scenario

vLLM supports [Decode Context Parallel](https://docs.vllm.com.cn/en/latest/serving/context_parallel_deployment.html#decode-context-parallel), which offers significant benefits in high-throughput scenarios. Enable DCP by adding `--decode-context-parallel-size <number>` to the launch command, as in the sketch below. As in the low-latency case, the `--reasoning-parser` flag specifies the reasoning parser used to extract reasoning content from the model's output.
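A sketch of the high-throughput launch, under the same assumptions as above, with DCP enabled across all 8 ranks:

```bash
# High-throughput launch sketch: TP8 plus Decode Context Parallel (DCP).
# Flags other than --decode-context-parallel-size follow the low-latency
# sketch above and carry the same assumptions.
vllm serve moonshotai/Kimi-K2-Thinking \
    --tensor-parallel-size 8 \
    --decode-context-parallel-size 8 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code
```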
## Metrics

We tested GSM8K accuracy for the two launch configurations (TP8 vs. TP8+DCP8).
- TP8
local-completions (model=moonshotai/Kimi-K2-Thinking,base_url=http://0.0.0.0:8000/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9416|± |0.0065|
| | |strict-match | 5|exact_match|↑ |0.9409|± |0.0065|
- TP8+DCP8
local-completions (model=moonshotai/Kimi-K2-Thinking,temperature=0.0,base_url=http://0.0.0.0:8000/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32,timeout=3000), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9386|± |0.0066|
| | |strict-match | 5|exact_match|↑ |0.9371|± |0.0067|
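The configuration lines above are lm-evaluation-harness `local-completions` settings. A reproduction sketch against a locally running server might look like the following; the exact `lm_eval` invocation is an assumption reconstructed from those settings, so adjust it to your harness version.

```bash
# GSM8K (5-shot) against the OpenAI-compatible completions endpoint,
# mirroring the model_args shown above (32 concurrent requests).
lm_eval \
    --model local-completions \
    --model_args model=moonshotai/Kimi-K2-Thinking,base_url=http://0.0.0.0:8000/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32 \
    --tasks gsm8k \
    --num_fewshot 5 \
    --batch_size 1
```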
## Benchmarking

We benchmarked moonshotai/Kimi-K2-Thinking on 8× H200 GPUs with the following script.
```bash
vllm bench serve \
    --model moonshotai/Kimi-K2-Thinking \
    --dataset-name random \
    --random-input-len 8000 \
    --random-output-len 4000 \
    --request-rate 100 \
    --num-prompts 1000 \
    --trust-remote-code
```
We benchmarked the TP8 and TP8+DCP8 configurations separately.

### TP8 benchmark output
```text
============ Serving Benchmark Result ============
Successful requests: 998
Failed requests: 2
Request rate configured (RPS): 100.00
Benchmark duration (s): 800.26
Total input tokens: 7984000
Total generated tokens: 388750
Request throughput (req/s): 1.25
Output token throughput (tok/s): 485.78
Peak output token throughput (tok/s): 2100.00
Peak concurrent requests: 988.00
Total Token throughput (tok/s): 10462.57
---------------Time to First Token----------------
Mean TTFT (ms): 271196.67
Median TTFT (ms): 227389.87
P99 TTFT (ms): 686294.46
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 381.29
Median TPOT (ms): 473.04
P99 TPOT (ms): 578.64
---------------Inter-token Latency----------------
Mean ITL (ms): 111.22
Median ITL (ms): 38.04
P99 ITL (ms): 490.93
==================================================
```
### TP8+DCP8 benchmark output
```text
============ Serving Benchmark Result ============
Successful requests: 994
Failed requests: 6
Request rate configured (RPS): 100.00
Benchmark duration (s): 631.35
Total input tokens: 7952000
Total generated tokens: 438872
Request throughput (req/s): 1.57
Output token throughput (tok/s): 695.13
Peak output token throughput (tok/s): 2618.00
Peak concurrent requests: 988.00
Total Token throughput (tok/s): 13290.35
---------------Time to First Token----------------
Mean TTFT (ms): 227780.13
Median TTFT (ms): 227055.20
P99 TTFT (ms): 451255.55
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 398.60
Median TPOT (ms): 472.81
P99 TPOT (ms): 569.91
---------------Inter-token Latency----------------
Mean ITL (ms): 104.73
Median ITL (ms): 43.97
P99 ITL (ms): 483.37
==================================================
```
### DCP gain analysis

We also analyzed the gains brought by DCP.
| Metric | TP8 | TP8+DCP8 | Change | Improvement (%) |
|---|---|---|---|---|
| Request throughput (req/s) | 1.25 | 1.57 | +0.32 | +25.6% |
| Output token throughput (tok/s) | 485.78 | 695.13 | +209.35 | +43.1% |
| Mean TTFT (s) | 271.2 | 227.8 | -43.4 | +16.0% |
| Median TTFT (s) | 227.4 | 227.1 | -0.3 | +0.1% |

For the TTFT rows, the improvement percentage is the relative reduction in latency.
The server startup logs show that the GPU KV cache capacity (in tokens) grows by 8× when DCP is enabled.
TP8+DCP8 startup log:

```text
(Worker_TP0 pid=666845) INFO 11-06 15:34:58 [gpu_worker.py:349] Available KV cache memory: 46.80 GiB
(EngineCore_DP0 pid=666657) INFO 11-06 15:34:59 [kv_cache_utils.py:1224] Multiplying the GPU KV cache size by the dcp_world_size 8.
(EngineCore_DP0 pid=666657) INFO 11-06 15:34:59 [kv_cache_utils.py:1229] GPU KV cache size: 5,721,088 tokens
```
Enabling DCP yields substantial benefits (43% higher generated-token throughput and 26% higher request throughput) with minimal downside (median latency even improves slightly). We recommend reading our DCP documentation and trying DCP with your LLM workloads.