Ernie4.5 Text Model Usage Guide
This guide describes how to run ERNIE-4.5-21B-A3B-PT and ERNIE-4.5-300B-A47B-PT with native BF16.
Installing vLLM
Note: requires transformers >= 4.54.0 and vllm >= 0.10.1.
Running Ernie4.5
# 300B model, 80G*8 GPUs, with vLLM FP8 online quantization
vllm serve baidu/ERNIE-4.5-300B-A47B-PT \
--tensor-parallel-size 8 \
--gpu-memory-utilization=0.95 \
--quantization fp8
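Once the server is up, it exposes vLLM's OpenAI-compatible API. A minimal request sketch using only the Python standard library — the host and port are assumptions (`vllm serve` listens on port 8000 by default; adjust if you passed `--port`):

```python
import json

# Build a chat completion request for vLLM's OpenAI-compatible server.
# Host/port are assumptions: `vllm serve` listens on port 8000 by default.
url = "http://127.0.0.1:8000/v1/chat/completions"
payload = {
    "model": "baidu/ERNIE-4.5-300B-A47B-PT",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128,
}
body = json.dumps(payload)

# To actually send it (requires the server from the command above):
#   import urllib.request
#   req = urllib.request.Request(
#       url, data=body.encode(), headers={"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read().decode())
print(body)
```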
If a single node does not have enough GPU memory, native BF16 deployment may require multiple nodes. For multi-node deployment, refer to the vLLM documentation on starting a Ray cluster, then run vllm on the head node.
# 300B model, 80G*16 GPUs, native BF16
vllm serve baidu/ERNIE-4.5-300B-A47B-PT \
--tensor-parallel-size 16
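A back-of-the-envelope sketch of why native BF16 uses 16×80G GPUs here. This counts weights only; KV cache, activations, and CUDA context overhead are ignored, which is exactly why 8×80G is too tight without FP8 quantization:

```python
# Rough weight-memory estimate for ERNIE-4.5-300B-A47B in BF16.
# Parameters only: KV cache, activations, and CUDA context overhead
# are not counted, so real deployments need extra headroom.
params = 300e9        # ~300B total parameters
bytes_per_param = 2   # BF16 is 2 bytes per parameter
weight_gb = params * bytes_per_param / 1e9
print(f"weights:  {weight_gb:.0f} GB")   # 600 GB
print(f"8 x 80G:  {8 * 80} GB")          # 640 GB total, too tight for BF16
print(f"16 x 80G: {16 * 80} GB")         # 1280 GB total
```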
Running Ernie4.5 MTP
# 21B MTP model, 80G*1 GPU
vllm serve baidu/ERNIE-4.5-21B-A3B-PT \
--speculative-config '{"method": "ernie_mtp","model": "baidu/ERNIE-4.5-21B-A3B-PT","num_speculative_tokens": 1}'
# 300B MTP model, 80G*8 GPUs, with vLLM FP8 online quantization
vllm serve baidu/ERNIE-4.5-300B-A47B-PT \
--tensor-parallel-size 8 \
--gpu-memory-utilization=0.95 \
--quantization fp8 \
--speculative-config '{"method": "ernie_mtp","model": "baidu/ERNIE-4.5-300B-A47B-PT","num_speculative_tokens": 1}'
If a single node does not have enough GPU memory, native BF16 deployment may require multiple nodes. For multi-node deployment, refer to the vLLM documentation on starting a Ray cluster, then run vllm on the head node.
# 300B MTP model, 80G*16 GPUs, native BF16
vllm serve baidu/ERNIE-4.5-300B-A47B-PT \
--tensor-parallel-size 16 \
--speculative-config '{"method": "ernie_mtp","model": "baidu/ERNIE-4.5-300B-A47B-PT","num_speculative_tokens": 1}'
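The value passed to `--speculative-config` must be valid JSON; a shell quoting mistake there is a common source of launch failures. A quick sanity-check sketch (field values copied from the commands above):

```python
import json

# Sanity-check the JSON passed to --speculative-config before launching.
# Field values are copied from the serve commands above.
spec = ('{"method": "ernie_mtp",'
        '"model": "baidu/ERNIE-4.5-300B-A47B-PT",'
        '"num_speculative_tokens": 1}')
cfg = json.loads(spec)  # raises json.JSONDecodeError if malformed
print(cfg)
```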
Benchmarking
For benchmarking, use only the first run of vllm bench serve after the server starts, so that results are not affected by prefix caching.
# Prompt-heavy benchmark (8k/1k)
vllm bench serve \
--model baidu/ERNIE-4.5-21B-A3B-PT \
--host 127.0.0.1 \
--port 8200 \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--request-rate 10 \
--num-prompts 16 \
--ignore-eos
Benchmark Configuration
Test different workloads by adjusting the input/output lengths:
- Prompt-heavy: 8000 input / 1000 output
- Decode-heavy: 1000 input / 8000 output
- Balanced: 1000 input / 1000 output
Test different batch sizes by changing --num-prompts, e.g. 1, 16, 32, 64, 128, 256, 512.
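The sweep above can be scripted. A sketch that generates the argument variations for each run — flag names match the `vllm bench serve` command shown earlier; host/port/model flags are omitted for brevity:

```python
# Generate benchmark argument sets: three workloads x seven batch sizes,
# using the same flags as the `vllm bench serve` command above.
workloads = {
    "prompt-heavy": (8000, 1000),
    "decode-heavy": (1000, 8000),
    "balanced":     (1000, 1000),
}
batch_sizes = [1, 16, 32, 64, 128, 256, 512]

commands = []
for name, (in_len, out_len) in workloads.items():
    for n in batch_sizes:
        commands.append(
            f"vllm bench serve --dataset-name random "
            f"--random-input-len {in_len} --random-output-len {out_len} "
            f"--num-prompts {n} --ignore-eos"
        )
print(f"{len(commands)} benchmark runs")  # 3 workloads x 7 batch sizes = 21
```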
Expected Output
============ Serving Benchmark Result ============
Successful requests: 16
Request rate configured (RPS): 10.00
Benchmark duration (s): 18.65
Total input tokens: 127952
Total generated tokens: 16000
Request throughput (req/s): 0.86
Output token throughput (tok/s): 857.78
Total Token throughput (tok/s): 7717.46
---------------Time to First Token----------------
Mean TTFT (ms): 876.28
Median TTFT (ms): 910.42
P99 TTFT (ms): 1596.48
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 16.84
Median TPOT (ms): 16.86
P99 TPOT (ms): 18.11
---------------Inter-token Latency----------------
Mean ITL (ms): 16.84
Median ITL (ms): 15.49
P99 ITL (ms): 20.69
==================================================
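The derived numbers in the report can be cross-checked from its own fields. A quick sketch using the sample output above (small differences come from rounding of the printed duration):

```python
# Cross-check the derived metrics in the sample benchmark report.
duration_s = 18.65
total_input_tokens = 127952
total_generated_tokens = 16000
successful_requests = 16

req_throughput = successful_requests / duration_s            # req/s
output_tok_throughput = total_generated_tokens / duration_s  # tok/s
total_tok_throughput = (total_input_tokens + total_generated_tokens) / duration_s

print(f"{req_throughput:.2f} req/s")         # ~0.86, matches the report
print(f"{output_tok_throughput:.1f} tok/s")  # ~857.9, report shows 857.78
print(f"{total_tok_throughput:.1f} tok/s")   # ~7718.1, report shows 7717.46
```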