# GLM-4V Usage Guide
This guide explains how to run the native FP8 builds of GLM-4.5V / GLM-4.6V. Within the GLM-4.5V / GLM-4.6V series, the FP8 models show almost no accuracy loss, so unless you need strict reproducibility for benchmarking or similar scenarios, we recommend FP8 to reduce cost.
## Installing vLLM
```bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto  # vllm>=0.12.0 is required
```
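To confirm the install picked up a new enough build, a quick version check (plain Python, no assumptions beyond the environment created above):

```bash
# Optional: verify that the installed version satisfies the vllm>=0.12.0 requirement.
python -c "import vllm; print(vllm.__version__)"
```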
## Running GLM-4.5V / GLM-4.6V with FP8 or BF16 on 4x H100
There are two ways to parallelize the model across multiple GPUs: (1) tensor parallelism or (2) data parallelism. Each has its own advantages: tensor parallelism generally favors low-latency/low-load scenarios, while data parallelism suits high-volume, heavy-load workloads (a combined tensor/data-parallel launch is sketched after the tensor-parallel command below).
Run with tensor parallelism like this:
```bash
# Start the server with the FP8 model on 4 GPUs; for BF16, change the model to zai-org/GLM-4.5V
vllm serve zai-org/GLM-4.5V-FP8 \
    --tensor-parallel-size 4 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --enable-expert-parallel \
    --allowed-local-media-path / \
    --mm-encoder-tp-mode data \
    --mm-processor-cache-type shm
```
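The data-parallel option mentioned earlier is sketched below (not part of the original guide). The FP8 weights are too large for a single 80 GB H100, so a pure 4-way data-parallel layout is not possible on this hardware; the sketch instead assumes two 2-GPU tensor-parallel replicas via vLLM's `--data-parallel-size` flag. The flag notes below apply to this launch as well.

```bash
# Hypothetical hybrid layout on 4 GPUs: 2 data-parallel replicas x TP=2.
vllm serve zai-org/GLM-4.5V-FP8 \
    --tensor-parallel-size 2 \
    --data-parallel-size 2 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --enable-expert-parallel \
    --allowed-local-media-path / \
    --mm-encoder-tp-mode data \
    --mm-processor-cache-type shm
```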
- You can set `--max-model-len` to save memory; `--max-model-len=65536` works well for most scenarios. Note that GLM-4.5V only supports a 64K context length, while GLM-4.6V supports 128K.
- You can set `--max-num-batched-tokens` to trade off throughput against latency: higher values mean higher throughput but also higher latency. `--max-num-batched-tokens=32768` usually works well for prompt-heavy workloads, but you can reduce it to 16k or 8k to cut activation memory usage and lower latency.
- vLLM conservatively uses 90% of GPU memory by default; you can set `--gpu-memory-utilization=0.95` to maximize the KV cache.
- Be sure to follow the command-line instructions above so that tool calling is correctly enabled.
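To verify that tool calling is actually enabled, you can send a request that carries a tool definition and check that the model emits a tool call. A minimal sketch, assuming the server above is listening on the default localhost:8000; the `get_weather` tool is a made-up example:

```bash
# Send a request with a tool schema; a working setup should return a
# tool_calls entry for get_weather rather than a plain-text answer.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-4.5V-FP8",
    "messages": [{"role": "user", "content": "What is the weather in Beijing?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto"
  }'
```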
## Benchmarking on the VisionArena-Chat Dataset
Once the server for zai-org/GLM-4.5V-FP8 is up and running, open another terminal and run the benchmark client:
```bash
vllm bench serve \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --model zai-org/GLM-4.5V-FP8 \
    --dataset-name hf \
    --dataset-path lmarena-ai/VisionArena-Chat \
    --num-prompts 1000 \
    --request-rate 20
```
## Results
- Request rate: 20, maximum concurrency not set
```
============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Request rate configured (RPS):           20.00
Benchmark duration (s):                  54.65
Total input tokens:                      90524
Total generated tokens:                  127152
Request throughput (req/s):              18.30
Output token throughput (tok/s):         2200.83
Peak output token throughput (tok/s):    8121.00
Peak concurrent requests:                283.00
Total Token throughput (tok/s):          3982.84
---------------Time to First Token----------------
Mean TTFT (ms):                          1678.96
Median TTFT (ms):                        1808.10
P99 TTFT (ms):                           2790.90
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          72.62
Median TPOT (ms):                        74.36
P99 TPOT (ms):                           91.52
---------------Inter-token Latency----------------
Mean ITL (ms):                           73.53
Median ITL (ms):                         24.74
P99 ITL (ms):                            434.72
==================================================
```
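The command used for the second run below is not shown in the original; presumably it replaces the fixed request rate with a concurrency cap, along the lines of:

```bash
# Presumed variant for the run below: cap in-flight requests at 1
# and omit --request-rate so requests are sent back to back.
vllm bench serve \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --model zai-org/GLM-4.5V-FP8 \
    --dataset-name hf \
    --dataset-path lmarena-ai/VisionArena-Chat \
    --num-prompts 1000 \
    --max-concurrency 1
```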
- Maximum concurrency: 1, request rate not set
```
============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Maximum request concurrency:             1
Benchmark duration (s):                  1560.34
Total input tokens:                      90524
Total generated tokens:                  127049
Request throughput (req/s):              0.64
Output token throughput (tok/s):         81.42
Peak output token throughput (tok/s):    128.00
Peak concurrent requests:                3.00
Total Token throughput (tok/s):          139.44
---------------Time to First Token----------------
Mean TTFT (ms):                          487.21
Median TTFT (ms):                        591.48
P99 TTFT (ms):                           1093.84
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          8.51
Median TPOT (ms):                        8.47
P99 TPOT (ms):                           9.16
---------------Inter-token Latency----------------
Mean ITL (ms):                           8.53
Median ITL (ms):                         8.45
P99 ITL (ms):                            12.14
==================================================
```
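Beyond the aggregate throughput numbers above, a single multimodal request is a quick way to sanity-check the deployment. A minimal sketch, assuming the server is on the default localhost:8000; the image URL is a placeholder to replace with your own:

```bash
# Send one image + text prompt through the OpenAI-compatible chat API.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-4.5V-FP8",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
        {"type": "text", "text": "Describe this image."}
      ]
    }]
  }'
```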