# GLM-4V Usage Guide
This guide explains how to run the native FP8 builds of GLM-4.5V / GLM-4.6V. Within the GLM-4.5V / GLM-4.6V series, the FP8 models show almost no accuracy loss, so unless you need strict reproducibility for benchmarking or similar scenarios, we recommend FP8 to reduce cost.
## Installing vLLM
```bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto  # vllm>=0.12.0 is required
```
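To confirm the install picked up a new enough build, a quick version check (plain Python, no assumptions beyond the environment created above):

```bash
# Optional: verify that the installed version satisfies the vllm>=0.12.0 requirement.
python -c "import vllm; print(vllm.__version__)"
```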
## Running GLM-4.5V / GLM-4.6V with FP8 or BF16 on 4x H100
There are two ways to parallelize the model across multiple GPUs: (1) tensor parallelism or (2) data parallelism. Each has its own advantages: tensor parallelism generally favors low-latency/low-load scenarios, while data parallelism suits high-volume, heavy-load workloads (a combined tensor/data-parallel launch is sketched after the tensor-parallel command below).
Run with tensor parallelism like this:
```bash
# Start the server with the FP8 model on 4 GPUs; for BF16, change the model to zai-org/GLM-4.5V
vllm serve zai-org/GLM-4.5V-FP8 \
    --tensor-parallel-size 4 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --enable-expert-parallel \
    --allowed-local-media-path / \
    --mm-encoder-tp-mode data \
    --mm-processor-cache-type shm
```
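The data-parallel option mentioned earlier is sketched below (not part of the original guide). The FP8 weights are too large for a single 80 GB H100, so a pure 4-way data-parallel layout is not possible on this hardware; the sketch instead assumes two 2-GPU tensor-parallel replicas via vLLM's `--data-parallel-size` flag. The flag notes below apply to this launch as well.

```bash
# Hypothetical hybrid layout on 4 GPUs: 2 data-parallel replicas x TP=2.
vllm serve zai-org/GLM-4.5V-FP8 \
    --tensor-parallel-size 2 \
    --data-parallel-size 2 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --enable-expert-parallel \
    --allowed-local-media-path / \
    --mm-encoder-tp-mode data \
    --mm-processor-cache-type shm
```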
- You can set `--max-model-len` to save memory; `--max-model-len=65536` works well for most scenarios. Note that GLM-4.5V only supports a 64K context length, while GLM-4.6V supports 128K.
- You can set `--max-num-batched-tokens` to trade off throughput against latency: higher values mean higher throughput but also higher latency. `--max-num-batched-tokens=32768` usually works well for prompt-heavy workloads, but you can reduce it to 16k or 8k to cut activation memory usage and lower latency.
- vLLM conservatively uses 90% of GPU memory by default; you can set `--gpu-memory-utilization=0.95` to maximize the KV cache.
- Be sure to follow the command-line instructions above so that tool calling is correctly enabled.
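To verify that tool calling is actually enabled, you can send a request that carries a tool definition and check that the model emits a tool call. A minimal sketch, assuming the server above is listening on the default localhost:8000; the `get_weather` tool is a made-up example:

```bash
# Send a request with a tool schema; a working setup should return a
# tool_calls entry for get_weather rather than a plain-text answer.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-4.5V-FP8",
    "messages": [{"role": "user", "content": "What is the weather in Beijing?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto"
  }'
```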
## Benchmarking on the VisionArena-Chat Dataset
Once the server for zai-org/GLM-4.5V-FP8 is up and running, open another terminal and run the benchmark client:
```bash
vllm bench serve \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --model zai-org/GLM-4.5V-FP8 \
    --dataset-name hf \
    --dataset-path lmarena-ai/VisionArena-Chat \
    --num-prompts 1000 \
    --request-rate 20
```
## Results
- Request rate: 20, maximum concurrency not set
```
============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Request rate configured (RPS):           20.00
Benchmark duration (s):                  54.65
Total input tokens:                      90524
Total generated tokens:                  127152
Request throughput (req/s):              18.30
Output token throughput (tok/s):         2200.83
Peak output token throughput (tok/s):    8121.00
Peak concurrent requests:                283.00
Total Token throughput (tok/s):          3982.84
---------------Time to First Token----------------
Mean TTFT (ms):                          1678.96
Median TTFT (ms):                        1808.10
P99 TTFT (ms):                           2790.90
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          72.62
Median TPOT (ms):                        74.36
P99 TPOT (ms):                           91.52
---------------Inter-token Latency----------------
Mean ITL (ms):                           73.53
Median ITL (ms):                         24.74
P99 ITL (ms):                            434.72
==================================================
```
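The command used for the second run below is not shown in the original; presumably it replaces the fixed request rate with a concurrency cap, along the lines of:

```bash
# Presumed variant for the run below: cap in-flight requests at 1
# and omit --request-rate so requests are sent back to back.
vllm bench serve \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --model zai-org/GLM-4.5V-FP8 \
    --dataset-name hf \
    --dataset-path lmarena-ai/VisionArena-Chat \
    --num-prompts 1000 \
    --max-concurrency 1
```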
- Maximum concurrency: 1, request rate not set
```
============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Maximum request concurrency:             1
Benchmark duration (s):                  1560.34
Total input tokens:                      90524
Total generated tokens:                  127049
Request throughput (req/s):              0.64
Output token throughput (tok/s):         81.42
Peak output token throughput (tok/s):    128.00
Peak concurrent requests:                3.00
Total Token throughput (tok/s):          139.44
---------------Time to First Token----------------
Mean TTFT (ms):                          487.21
Median TTFT (ms):                        591.48
P99 TTFT (ms):                           1093.84
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          8.51
Median TPOT (ms):                        8.47
P99 TPOT (ms):                           9.16
---------------Inter-token Latency----------------
Mean ITL (ms):                           8.53
Median ITL (ms):                         8.45
P99 ITL (ms):                            12.14
==================================================
```
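Beyond the aggregate throughput numbers above, a single multimodal request is a quick way to sanity-check the deployment. A minimal sketch, assuming the server is on the default localhost:8000; the image URL is a placeholder to replace with your own:

```bash
# Send one image + text prompt through the OpenAI-compatible chat API.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-4.5V-FP8",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
        {"type": "text", "text": "Describe this image."}
      ]
    }]
  }'
```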