MiniMax-M2.1/M2 使用指南¶

MiniMax-M2.1 和 MiniMax-M2 是由 MiniMax 创建的先进大语言模型。它们具有以下亮点：

卓越的智能——在数学、科学、编码和工具使用方面，在全球开源模型中排名第一。
高级编码——擅长多文件编辑、编码-运行-修复循环以及经过测试验证的修复。在 SWE-Bench 和 Terminal-Bench 任务上表现强劲。
代理性能——在 Shell、浏览器和代码环境中规划和执行复杂工具链。维护可追溯的证据，并能优雅地从错误中恢复。
高效设计——10B 激活参数（总计 230B）可为交互式和批量工作负载提供更低的延迟、成本和更高的吞吐量。

支持的模型¶

本指南适用于以下模型。部署时只需更新模型名称即可。以下示例使用 MiniMax-M2

安装 vLLM¶

如果您在使用 vLLM 服务这些模型时遇到输出损坏的问题，可以升级到 nightly 版本（确保版本在提交 cf3eacfe58fa9e745c2854782ada884a9f992cf7 之后）。

uv venv
source .venv/bin/activate
uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly

使用 vLLM 启动 MiniMax-M2.1/M2¶

您可以使用 4x H200/H20 或 4x A100/A800 GPU 来启动此模型。

像这样运行张量并行

vllm serve MiniMaxAI/MiniMax-M2 \
  --tensor-parallel-size 4 \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think  \
  --enable-auto-tool-choice

请注意，不支持 TP8。要使用超过 4 个 GPU 运行模型，请使用 DP+EP。

vllm serve MiniMaxAI/MiniMax-M2 \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think  \
  --enable-auto-tool-choice

如果您遇到 torch.AcceleratorError: CUDA error: an illegal memory access was encountered，可以将 --compilation-config "{\"cudagraph_mode\": \"PIECEWISE\"}" 添加到启动参数中以解决此问题。

vllm serve MiniMaxAI/MiniMax-M2 \
  --tensor-parallel-size 4 \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think  \
  --enable-auto-tool-choice \
  --compilation-config "{\"cudagraph_mode\": \"PIECEWISE\"}"

要以原生支持思考的 responsesAPI 运行模型，请使用 minimax_m2 reasoning parser 来运行它。

vllm serve MiniMaxAI/MiniMax-M2 \
  --tensor-parallel-size 4 \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --enable-auto-tool-choice

性能指标¶

基准测试¶

我们使用以下脚本演示如何对 MiniMaxAI/MiniMax-M2 进行基准测试。

vllm bench serve \
  --backend vllm \
  --model MiniMaxAI/MiniMax-M2 \
  --endpoint /v1/completions \
  --dataset-name random \
  --random-input 2048 \
  --random-output 1024 \
  --max-concurrency 10 \
  --num-prompt 100

如果成功，您应该会看到类似以下的输出（TP 4 on NVIDIA_H20-3e *4）

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Maximum request concurrency:             10        
Benchmark duration (s):                  851.51    
Total input tokens:                      204800    
Total generated tokens:                  98734     
Request throughput (req/s):              0.12      
Output token throughput (tok/s):         115.95    
Peak output token throughput (tok/s):    130.00    
Peak concurrent requests:                20.00     
Total Token throughput (tok/s):          356.46    
---------------Time to First Token----------------
Mean TTFT (ms):                          520.98    
Median TTFT (ms):                        523.86    
P99 TTFT (ms):                           1086.48   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          82.82     
Median TPOT (ms):                        82.90     
P99 TPOT (ms):                           84.28     
---------------Inter-token Latency----------------
Mean ITL (ms):                           82.78     
Median ITL (ms):                         82.18     
P99 ITL (ms):                            83.48

使用技巧¶

DeepGEMM 用法¶

vLLM 默认启用了 DeepGEMM，请遵循设置说明进行安装。