Ring-1T-FP8 Usage Guide

This guide describes how to run Ring-1T-FP8.

Installing vLLM

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

Running Ring-1T-FP8 with FP8 KV Cache on 8x H200 GPUs

This section covers the simplest way to run the model: pure tensor parallelism across 8 GPUs.

# Start server with FP8 model on 8 GPUs
vllm serve inclusionAI/Ring-1T-FP8 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.97 \
  --max-num-seqs 32 \
  --kv-cache-dtype fp8 \
  --compilation-config '{"use_inductor": false}' \
  --served-model-name Ring-1T-FP8

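You can set --max-model-len to save memory. --max-model-len=65536 is usually sufficient for most scenarios.
You can set --max-num-batched-tokens to balance throughput and latency: higher values give higher throughput but also higher latency. --max-num-batched-tokens=32768 usually works well for prompt-heavy workloads, but you can lower it to 16384 or 8192 to reduce activation memory usage and latency.
In this example, 97% of total GPU memory is allocated to the model (--gpu-memory-utilization 0.97). If you hit out-of-memory (OOM) errors, reduce this value.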
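
The tuning options above can be combined into a single launch command. The following is a minimal sketch rather than a tested configuration: the values 65536 and 32768 come from the notes above, while GPU_MEM_UTIL=0.90 and the DRY_RUN switch are assumptions added for illustration. By default the script only prints the command so it can be reviewed before launching.

```shell
#!/bin/sh
# Tuning values from the notes above. GPU_MEM_UTIL=0.90 is an assumed fallback
# to try after an OOM error, not a measured recommendation.
MAX_MODEL_LEN="${MAX_MODEL_LEN:-65536}"
MAX_BATCHED_TOKENS="${MAX_BATCHED_TOKENS:-32768}"
GPU_MEM_UTIL="${GPU_MEM_UTIL:-0.90}"

# Build the argument list once so it can be inspected before launching.
set -- vllm serve inclusionAI/Ring-1T-FP8 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization "$GPU_MEM_UTIL" \
  --max-num-seqs 32 \
  --max-model-len "$MAX_MODEL_LEN" \
  --max-num-batched-tokens "$MAX_BATCHED_TOKENS" \
  --kv-cache-dtype fp8 \
  --compilation-config '{"use_inductor": false}' \
  --served-model-name Ring-1T-FP8

# DRY_RUN=1 (the default here) only prints the command; set DRY_RUN=0 to launch.
if [ "${DRY_RUN:-1}" = "1" ]; then
  printf '%s\n' "$*"
else
  exec "$@"
fi
```

If the server still runs out of memory, lowering MAX_BATCHED_TOKENS to 16384 or 8192 and reducing GPU_MEM_UTIL further are the first knobs to try.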

Sending a Sample Request

You can send a request like the following to quickly verify the deployment.

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Ring-1T-FP8",
        "messages": [
            {
                "role": "user",
                "content": "9.11 and 9.8, which is greater?"
            }
        ]
    }'
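
When verifying the deployment from a script, it can help to wait until the server is up and then print only the reply text instead of the full JSON response. The following is a minimal sketch, assuming the server listens on http://localhost:8000 and that python3 is on the PATH (it is inside the virtual environment created above). The helper names wait_for_server and extract_reply are hypothetical, and the 10-second polling interval is an arbitrary choice.

```shell
#!/bin/sh
# Assumed base URL; adjust the host and port for your deployment.
BASE_URL="${BASE_URL:-http://localhost:8000}"

# Hypothetical helper: block until the server's /health endpoint responds.
# Loading a model of this size can take a long time, so poll rather than fail fast.
wait_for_server() {
  until curl -sf "$BASE_URL/health" > /dev/null; do
    sleep 10
  done
}

# Hypothetical helper: read a chat-completions JSON response on stdin and
# print only the assistant's message content.
extract_reply() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Example usage (commented out so that sourcing this file does not contact a server):
# wait_for_server
# curl -s "$BASE_URL/v1/chat/completions" \
#     -H "Content-Type: application/json" \
#     -d '{"model": "Ring-1T-FP8", "messages": [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]}' \
#   | extract_reply
```

The model name in the request must match the value passed to --served-model-name when the server was started.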