
moonshotai/Kimi-K2 Usage Guide

This guide describes how to run Kimi-K2 with native FP8.


Note: Parts of this guide are referenced and adapted from the official Kimi-K2-Instruct deployment guide provided by Moonshot AI. We thank the original authors.

Installing vLLM

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

Running Kimi-K2 in FP8 on 16xH800

On mainstream H800 platforms, the minimal deployment unit for Kimi-K2's FP8 weights is a 16-GPU cluster, which can be run with tensor parallelism (TP) or "data parallelism + expert parallelism" (DP+EP). Launch parameters for this environment are given below. You can scale out to more nodes and increase the expert-parallel degree to enlarge the inference batch size and overall throughput.
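The GPU-count arithmetic behind both layouts can be sketched quickly; `world_size` below is an illustrative helper, not a vLLM API:

```python
import math

# Quick arithmetic check that a parallelism layout covers all 16 GPUs;
# world_size is an illustrative helper, not part of vLLM.
def world_size(*degrees: int) -> int:
    """Total GPUs = product of all parallelism degrees."""
    return math.prod(degrees)

# Tensor parallel x pipeline parallel: TP=8 * PP=2 = 16 GPUs.
assert world_size(8, 2) == 16
# Data parallel + expert parallel: DP=16 total ranks, 8 local ranks
# per node -> 16 // 8 = 2 nodes.
assert 16 // 8 == 2
```

vLLM performs the same validation internally and refuses layouts whose degrees do not multiply to the available world size.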

Tensor Parallelism + Pipeline Parallelism

An example launch command:

# start ray on node 0 and node 1

# node 0:
vllm serve moonshotai/Kimi-K2-Instruct --trust-remote-code --tokenizer-mode auto --tensor-parallel-size 8 --pipeline-parallel-size 2 --dtype bfloat16 --quantization fp8 --max-model-len 2048 --max-num-seqs 1 --max-num-batched-tokens 1024 --enable-chunked-prefill --disable-log-requests --kv-cache-dtype fp8 -dcp 8

Key parameter notes

  • enable-auto-tool-choice: required when tool use is enabled.
  • tool-call-parser kimi_k2: required when tool use is enabled.
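With those two flags set, tool calls go through the standard OpenAI-compatible /v1/chat/completions endpoint. A hedged sketch of the request payload — the weather tool schema and the localhost URL in the comment are illustrative assumptions, not taken from this guide:

```python
import json

# Illustrative tool-calling request body for the OpenAI-compatible endpoint.
# The get_weather tool is a made-up example schema.
payload = {
    "model": "kimi-k2",  # matches --served-model-name
    "messages": [{"role": "user", "content": "What's the weather in Beijing?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",
}

body = json.dumps(payload)
# POST this body to http://<server>:8000/v1/chat/completions with
# Content-Type: application/json (e.g. via curl or any HTTP client).
```

Responses containing tool calls are parsed server-side by the kimi_k2 parser into the standard `tool_calls` field.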

Data Parallelism + Expert Parallelism

You can install libraries such as DeepEP and DeepGEMM as needed. Then run the following commands (example on H800):

# node 0
vllm serve moonshotai/Kimi-K2-Instruct --port 8000 --served-model-name kimi-k2 --trust-remote-code --data-parallel-size 16 --data-parallel-size-local 8 --data-parallel-address $MASTER_IP --data-parallel-rpc-port $PORT --enable-expert-parallel --max-num-batched-tokens 8192 --max-num-seqs 256 --gpu-memory-utilization 0.85 --enable-auto-tool-choice --tool-call-parser kimi_k2

# node 1
vllm serve moonshotai/Kimi-K2-Instruct --headless --data-parallel-start-rank 8 --port 8000 --served-model-name kimi-k2 --trust-remote-code --data-parallel-size 16 --data-parallel-size-local 8 --data-parallel-address $MASTER_IP --data-parallel-rpc-port $PORT --enable-expert-parallel --max-num-batched-tokens 8192 --max-num-seqs 256 --gpu-memory-utilization 0.85 --enable-auto-tool-choice --tool-call-parser kimi_k2

Additional flags

  • You can set --max-model-len to save memory; --max-model-len=65536 usually works for most scenarios.
  • You can set --max-num-batched-tokens to balance throughput and latency: higher values mean higher throughput but also higher latency. --max-num-batched-tokens=32768 generally works well for prompt-heavy workloads, but you can reduce it to 16k or 8k to cut activation memory usage and lower latency.
  • By default, vLLM conservatively uses 90% of GPU memory; set --gpu-memory-utilization=0.95 to maximize KV-cache space.
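The reason --max-model-len matters for memory is KV-cache growth per sequence. A rough sizing sketch — all model dimensions below are placeholder assumptions for illustration, not Kimi-K2's actual architecture:

```python
# Rough KV-cache sizing sketch. ALL dimensions below are placeholder
# assumptions, not taken from the Kimi-K2 model config.
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_elem: int) -> int:
    # Factor 2 accounts for both the K and the V tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Example: 61 layers, 1 KV head, 576-dim head, fp8 cache (1 byte/elem),
# matching --kv-cache-dtype fp8 in the TP+PP command above.
per_token = kv_cache_bytes_per_token(61, 1, 576, 1)
per_seq_64k = per_token * 65536  # at --max-model-len=65536
print(f"{per_token} B/token, {per_seq_64k / 2**30:.2f} GiB per 64k-token sequence")
```

Halving --max-model-len halves the worst-case per-sequence KV footprint, which is why shrinking it frees memory for more concurrent sequences.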

Benchmarks

FP8 benchmark on 16xH800

vllm bench serve \
  --model moonshotai/Kimi-K2-Instruct \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 512 \
  --request-rate 1.0 \
  --num-prompts 8 \
  --ignore-eos \
  --trust-remote-code

Benchmark configuration

Test different batch sizes by varying --num-prompts:

  • Batch sizes: 1, 16, 32, 64, 128, 256, 512
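The sweep can be scripted by assembling one invocation per batch size; this sketch only builds the command strings (each would then be run against a live server):

```python
# Build one `vllm bench serve` command per batch size, mirroring the
# H800 benchmark flags above; only --num-prompts varies.
BATCH_SIZES = [1, 16, 32, 64, 128, 256, 512]

def bench_command(num_prompts: int) -> str:
    return (
        "vllm bench serve"
        " --model moonshotai/Kimi-K2-Instruct"
        " --dataset-name random"
        " --random-input-len 1000"
        " --random-output-len 512"
        " --request-rate 1.0"
        f" --num-prompts {num_prompts}"
        " --ignore-eos"
        " --trust-remote-code"
    )

commands = [bench_command(n) for n in BATCH_SIZES]
print(len(commands), "commands")  # 7 commands
```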

Expected output

============ Serving Benchmark Result ============
Successful requests:                     8         
Request rate configured (RPS):           1.00      
Benchmark duration (s):                  132.79    
Total input tokens:                      8000      
Total generated tokens:                  4096      
Request throughput (req/s):              0.06      
Output token throughput (tok/s):         30.84     
Total Token throughput (tok/s):          91.09     
---------------Time to First Token----------------
Mean TTFT (ms):                          58282.92  
Median TTFT (ms):                        57827.30  
P99 TTFT (ms):                           110831.45 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          30.78     
Median TPOT (ms):                        31.49     
P99 TPOT (ms):                           33.76     
---------------Inter-token Latency----------------
Mean ITL (ms):                           30.78     
Median ITL (ms):                         22.37     
P99 ITL (ms):                            322.81    
==================================================
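When sweeping many configurations, it helps to parse these result blocks programmatically. A stdlib-only sketch (the sample text is abbreviated from the output above):

```python
import re

# Abbreviated sample of a `vllm bench serve` result block.
SAMPLE = """\
Successful requests:                     8
Benchmark duration (s):                  132.79
Output token throughput (tok/s):         30.84
Mean TTFT (ms):                          58282.92
"""

def parse_results(text: str) -> dict[str, float]:
    """Extract 'label: number' rows into a metric dict."""
    results = {}
    for line in text.splitlines():
        m = re.match(r"(.+?):\s+([\d.]+)\s*$", line)
        if m:
            results[m.group(1).strip()] = float(m.group(2))
    return results

metrics = parse_results(SAMPLE)
print(metrics["Mean TTFT (ms)"])  # 58282.92
```

The same parser works for the H200 blocks below, since the report layout is identical.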

FP8 benchmark on 16xH200

vllm bench serve \
  --model moonshotai/Kimi-K2-Instruct \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --request-rate 10000 \
  --num-prompts 16 \
  --ignore-eos \
  --trust-remote-code

Expected output

============ Serving Benchmark Result ============
Successful requests:                     16        
Benchmark duration (s):                  62.75     
Total input tokens:                      128000    
Total generated tokens:                  16000     
Request throughput (req/s):              0.25      
Output token throughput (tok/s):         254.99    
Total Token throughput (tok/s):          2294.88   
---------------Time to First Token----------------
Mean TTFT (ms):                          4278.46   
Median TTFT (ms):                        4285.54   
P99 TTFT (ms):                           7685.31   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          58.15     
Median TPOT (ms):                        58.16     
P99 TPOT (ms):                           61.35     
---------------Inter-token Latency----------------
Mean ITL (ms):                           58.15     
Median ITL (ms):                         54.59     
P99 ITL (ms):                            91.18     
==================================================

After adding '-dcp 8'

============ Serving Benchmark Result ============
Successful requests:                     16        
Request rate configured (RPS):           10000.00  
Benchmark duration (s):                  47.14     
Total input tokens:                      128000    
Total generated tokens:                  16000     
Request throughput (req/s):              0.34      
Output token throughput (tok/s):         339.38    
Peak output token throughput (tok/s):    384.00    
Peak concurrent requests:                16.00     
Total Token throughput (tok/s):          3054.46   
---------------Time to First Token----------------
Mean TTFT (ms):                          2007.87   
Median TTFT (ms):                        1932.03   
P99 TTFT (ms):                           4680.76   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.01     
Median TPOT (ms):                        45.10     
P99 TPOT (ms):                           46.51     
---------------Inter-token Latency----------------
Mean ITL (ms):                           45.01     
Median ITL (ms):                         42.01     
P99 ITL (ms):                            52.01     
==================================================
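The effect of -dcp 8 can be quantified directly from the two H200 runs above:

```python
# Deltas between the H200 runs without and with -dcp 8,
# using the totals reported in the two result blocks above.
base_tok_s, dcp_tok_s = 2294.88, 3054.46  # total token throughput, tok/s
base_ttft, dcp_ttft = 4278.46, 2007.87    # mean TTFT, ms

speedup = dcp_tok_s / base_tok_s
ttft_cut = 1 - dcp_ttft / base_ttft
print(f"{speedup:.2f}x total token throughput, {ttft_cut:.0%} lower mean TTFT")
# 1.33x total token throughput, 53% lower mean TTFT
```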