Performance Benchmark#
This document describes the benchmarking methodology for vllm-ascend, which aims to evaluate its performance under various workloads. To stay consistent with vLLM, we use the benchmark scripts provided by the vllm project.
Benchmark coverage: we measure offline E2E latency and throughput, as well as fixed-QPS online serving benchmarks. For more details, see the vllm-ascend benchmark scripts.
1. Run the docker container#
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:v0.12.0rc1
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-it $IMAGE \
/bin/bash
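Once inside the container, you can confirm that the NPU is visible before proceeding; npu-smi is mounted in from the host:
npu-smi info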
2. Install dependencies#
cd /workspace/vllm-ascend
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install -r benchmarks/requirements-bench.txt
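Optionally, sanity-check that vLLM imports cleanly inside the container:
python -c "import vllm; print(vllm.__version__)"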
3. Run basic benchmarks#
This section describes how to run performance tests with vLLM's built-in benchmark suite.
3.1 Datasets#
vLLM supports a variety of [datasets](https://github.com/vllm-project/vllm/blob/main/vllm/benchmarks/datasets.py).
| Dataset | Online | Offline | Data Path |
|---|---|---|---|
| ShareGPT | ✅ | ✅ | |
| ShareGPT4V (image) | ✅ | ✅ | |
| ShareGPT4Video (video) | ✅ | ✅ | |
| BurstGPT | ✅ | ✅ | |
| Sonnet (deprecated) | ✅ | ✅ | Local file: |
| Random | ✅ | ✅ | |
| RandomMultiModal (image/video) | 🟡 | 🚧 | |
| RandomForReranking | ✅ | ✅ | |
| Prefix Repetition | ✅ | ✅ | |
| HuggingFace-VisionArena | ✅ | ✅ | |
| HuggingFace-MMVU | ✅ | ✅ | |
| HuggingFace-InstructCoder | ✅ | ✅ | |
| HuggingFace-AIMO | ✅ | ✅ | |
| HuggingFace-Other | ✅ | ✅ | |
| HuggingFace-MTBench | ✅ | ✅ | |
| HuggingFace-Blazedit | ✅ | ✅ | |
| Spec Bench | ✅ | ✅ | |
| Custom | ✅ | ✅ | Local file: |
Note
The HuggingFace-* datasets above are linked from Hugging Face; for them, set dataset-name to hf. When dataset-path points to a local copy, set hf-name to the dataset's Hugging Face ID, for example:
--dataset-path /datasets/VisionArena-Chat/ --hf-name lmarena-ai/VisionArena-Chat
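For instance, a serving benchmark against a locally downloaded copy of VisionArena might look like the following sketch (the local path is illustrative, and a serving instance of the model is assumed to be running):
vllm bench serve \
  --model Qwen/Qwen2.5-VL-7B-Instruct \
  --backend openai-chat \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path /datasets/VisionArena-Chat/ \
  --hf-name lmarena-ai/VisionArena-Chat \
  --num-prompts 10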
3.2 Run basic benchmarks#
3.2.1 Online serving#
First, start serving the model:
VLLM_USE_MODELSCOPE=True vllm serve Qwen/Qwen3-8B
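The serve command occupies the current terminal. From a second shell, you can confirm the server is ready before benchmarking (the default port is 8000):
curl http://localhost:8000/v1/models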
Then run the benchmark script:
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
export VLLM_USE_MODELSCOPE=True
vllm bench serve \
--backend vllm \
--model Qwen/Qwen3-8B \
--endpoint /v1/completions \
--dataset-name sharegpt \
--dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 10
If it succeeds, you will see the following output:
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 19.92
Total input tokens: 1374
Total generated tokens: 2663
Request throughput (req/s): 0.50
Output token throughput (tok/s): 133.67
Peak output token throughput (tok/s): 312.00
Peak concurrent requests: 10.00
Total Token throughput (tok/s): 202.64
---------------Time to First Token----------------
Mean TTFT (ms): 127.10
Median TTFT (ms): 136.29
P99 TTFT (ms): 137.83
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 25.85
Median TPOT (ms): 25.78
P99 TPOT (ms): 26.64
---------------Inter-token Latency----------------
Mean ITL (ms): 25.78
Median ITL (ms): 25.74
P99 ITL (ms): 28.85
==================================================
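By default the benchmark issues all requests at once. For the fixed-QPS serving scenario mentioned at the top of this document, a run can be sketched with the standard --request-rate flag (the rate, concurrency, and prompt count below are illustrative):
vllm bench serve \
  --backend vllm \
  --model Qwen/Qwen3-8B \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 100 \
  --request-rate 2 \
  --max-concurrency 16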
3.2.2 Offline throughput benchmark#
export VLLM_USE_MODELSCOPE=True
vllm bench throughput \
--model Qwen/Qwen3-8B \
--dataset-name random \
--input-len 128 \
--output-len 128 \
--num-prompts 10
If it succeeds, you will see the following output:
Processed prompts: 100%|█| 10/10 [00:03<00:00, 2.74it/s, est. speed input: 351.02 toks/s, output: 351.02 t
Throughput: 2.73 requests/s, 699.93 total tokens/s, 349.97 output tokens/s
Total num prompt tokens: 1280
Total num output tokens: 1280
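For the offline E2E latency side of the coverage, the vLLM suite also ships a latency benchmark; a minimal sketch, assuming the same Qwen3-8B model (batch size and iteration counts are illustrative):
export VLLM_USE_MODELSCOPE=True
vllm bench latency \
  --model Qwen/Qwen3-8B \
  --input-len 128 \
  --output-len 128 \
  --batch-size 8 \
  --num-iters-warmup 5 \
  --num-iters 15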
3.2.3 Multi-modal benchmark#
export VLLM_USE_MODELSCOPE=True
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
--dtype bfloat16 \
--limit-mm-per-prompt '{"image": 1}' \
--allowed-local-media-path /path/to/sharegpt4v/images
export HF_ENDPOINT="https://hf-mirror.com"
vllm bench serve --model Qwen/Qwen2.5-VL-7B-Instruct \
--backend "openai-chat" \
--dataset-name hf \
--hf-split train \
--endpoint "/v1/chat/completions" \
--dataset-path "lmarena-ai/vision-arena-bench-v0.1" \
--num-prompts 10 \
--no-stream
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 4.89
Total input tokens: 7191
Total generated tokens: 951
Request throughput (req/s): 2.05
Output token throughput (tok/s): 194.63
Peak output token throughput (tok/s): 290.00
Peak concurrent requests: 10.00
Total Token throughput (tok/s): 1666.35
---------------Time to First Token----------------
Mean TTFT (ms): 722.22
Median TTFT (ms): 589.81
P99 TTFT (ms): 1377.02
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 44.13
Median TPOT (ms): 34.58
P99 TPOT (ms): 124.72
---------------Inter-token Latency----------------
Mean ITL (ms): 33.14
Median ITL (ms): 28.01
P99 ITL (ms): 182.28
==================================================
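Any of these serving runs can persist their metrics as JSON for later comparison; a sketch reusing the multi-modal run above with the standard result-saving flags (the result directory is illustrative):
vllm bench serve --model Qwen/Qwen2.5-VL-7B-Instruct \
  --backend "openai-chat" \
  --dataset-name hf \
  --hf-split train \
  --endpoint "/v1/chat/completions" \
  --dataset-path "lmarena-ai/vision-arena-bench-v0.1" \
  --num-prompts 10 \
  --no-stream \
  --save-result \
  --result-dir ./benchmark_results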
3.2.4 Embedding benchmark#
vllm serve Qwen/Qwen3-Embedding-8B --trust-remote-code
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
export VLLM_USE_MODELSCOPE=true
vllm bench serve \
--model Qwen/Qwen3-Embedding-8B \
--backend openai-embeddings \
--endpoint /v1/embeddings \
--dataset-name sharegpt \
--num-prompts 10 \
--dataset-path <your dataset path>/datasets/ShareGPT_V3_unfiltered_cleaned_split.json
============ Serving Benchmark Result ============
Successful requests: 10
Failed requests: 0
Benchmark duration (s): 0.18
Total input tokens: 1372
Request throughput (req/s): 56.32
Total Token throughput (tok/s): 7726.76
----------------End-to-end Latency----------------
Mean E2EL (ms): 154.06
Median E2EL (ms): 165.57
P99 E2EL (ms): 166.66
==================================================
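Finally, the Custom row in the dataset table takes a local file; in the vLLM suite this is expected to be a .jsonl file with one JSON object per line carrying a prompt field. A minimal sketch (the file name and contents are illustrative):
# each line of custom.jsonl looks like: {"prompt": "What is the capital of France?"}
vllm bench serve \
  --backend vllm \
  --model Qwen/Qwen3-8B \
  --endpoint /v1/completions \
  --dataset-name custom \
  --dataset-path ./custom.jsonl \
  --num-prompts 10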