Using EvalScope

This document guides you through using EvalScope to run model inference stress tests and accuracy tests.

1. Online Server

You can start the vLLM server by running a docker container on a single NPU:

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.12.0rc1
docker run --rm \
--shm-size=1g \
--name vllm-ascend \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240

If the vLLM server starts successfully, you will see output like the following:

INFO:     Started server process [6873]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

Once the server is up, you can query the model with an input prompt from a new terminal:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "prompt": "The future of AI is",
        "max_tokens": 7,
        "temperature": 0
    }'
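
The same request can also be sent from Python. The sketch below is a minimal illustration that uses only the standard library and assumes the server from this step is listening on http://localhost:8000. The helper names build_completion_request and complete are hypothetical and not part of vLLM or EvalScope.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=7, temperature=0):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(request, timeout_s=60):
    """Send the request and return the generated text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Same model, prompt, and sampling parameters as the curl command above.
request = build_completion_request(
    "http://localhost:8000", "Qwen/Qwen2.5-7B-Instruct", "The future of AI is"
)
# Uncomment once the server is running:
# print(complete(request))
```

Building the request separately from sending it makes it easy to inspect the payload before the server is available.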

2. Install EvalScope Using pip

You can install EvalScope as follows:

python3 -m venv .venv-evalscope
source .venv-evalscope/bin/activate
pip install gradio plotly evalscope

3. Run GSM8K Using EvalScope for Accuracy Testing

You can use evalscope eval to run GSM8K for accuracy testing:

evalscope eval \
 --model Qwen/Qwen2.5-7B-Instruct \
 --api-url http://localhost:8000/v1 \
 --api-key EMPTY \
 --eval-type service \
 --datasets gsm8k \
 --limit 10

After 1 to 2 minutes, the output looks like this:

+---------------------+-----------+-----------------+----------+-------+---------+---------+
| Model               | Dataset   | Metric          | Subset   |   Num |   Score | Cat.0   |
+=====================+===========+=================+==========+=======+=========+=========+
| Qwen2.5-7B-Instruct | gsm8k     | AverageAccuracy | main     |    10 |     0.8 | default |
+---------------------+-----------+-----------------+----------+-------+---------+---------+
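
When reading this table, keep in mind that --limit 10 evaluates only 10 samples, so the score is a rough smoke-test number. The sketch below assumes AverageAccuracy is the fraction of correctly answered samples and applies the standard binomial standard-error formula. The normal approximation it uses is crude at this sample size, and the function name accuracy_summary is hypothetical.

```python
import math


def accuracy_summary(score, num_samples, z=1.96):
    """Return (correct count, standard error, approximate 95% interval)."""
    correct = round(score * num_samples)
    # Binomial standard error of a proportion: sqrt(p * (1 - p) / n).
    std_err = math.sqrt(score * (1 - score) / num_samples)
    # Normal-approximation interval, clipped to the valid range [0, 1].
    low = max(0.0, score - z * std_err)
    high = min(1.0, score + z * std_err)
    return correct, std_err, (low, high)


# Score and Num taken from the table above.
correct, std_err, interval = accuracy_summary(score=0.8, num_samples=10)
print(f"correct answers: {correct} of 10")
print(f"standard error: {std_err:.3f}")
print(f"approx. 95% interval: {interval[0]:.2f} to {interval[1]:.2f}")
```

The wide interval suggests raising --limit, or dropping it, before comparing accuracy across models or configurations.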

For more details, see the EvalScope documentation: Model API Service Evaluation.

4. Run Model Inference Stress Testing Using EvalScope

Install EvalScope[perf] Using pip

pip install evalscope[perf] -U

Basic Usage

You can use evalscope perf to run a performance test:

evalscope perf \
    --url "http://localhost:8000/v1/chat/completions" \
    --parallel 5 \
    --model Qwen/Qwen2.5-7B-Instruct \
    --number 20 \
    --api openai \
    --dataset openqa \
    --stream
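
A stress test is usually repeated at several concurrency levels. The sketch below is one possible way to script that: it only assembles the command line from the flags shown above and prints it. The helper name build_perf_command, the concurrency levels (1, 5, 10), and the choice of 4 requests per concurrent client (the same ratio as 20 requests at parallel 5) are assumptions made for illustration, as is a server listening on localhost:8000.

```python
import shlex


def build_perf_command(
    parallel,
    number,
    url="http://localhost:8000/v1/chat/completions",
    model="Qwen/Qwen2.5-7B-Instruct",
):
    """Assemble an evalscope perf command using only the flags from this section."""
    return [
        "evalscope", "perf",
        "--url", url,
        "--parallel", str(parallel),
        "--model", model,
        "--number", str(number),
        "--api", "openai",
        "--dataset", "openqa",
        "--stream",
    ]


for parallel in (1, 5, 10):
    command = build_perf_command(parallel=parallel, number=4 * parallel)
    # Print the command; to execute it, pass the list to
    # subprocess.run(command, check=True) with the server running.
    print(shlex.join(command))
```

Keeping the command as a list avoids shell-quoting problems when it is later handed to subprocess.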

Output Results

After 1 to 2 minutes, the output looks like this:

Benchmarking summary:
+-----------------------------------+---------------------------------------------------------------+
| Key                               | Value                                                         |
+===================================+===============================================================+
| Time taken for tests (s)          | 38.3744                                                       |
+-----------------------------------+---------------------------------------------------------------+
| Number of concurrency             | 5                                                             |
+-----------------------------------+---------------------------------------------------------------+
| Total requests                    | 20                                                            |
+-----------------------------------+---------------------------------------------------------------+
| Succeed requests                  | 20                                                            |
+-----------------------------------+---------------------------------------------------------------+
| Failed requests                   | 0                                                             |
+-----------------------------------+---------------------------------------------------------------+
| Output token throughput (tok/s)   | 132.6926                                                      |
+-----------------------------------+---------------------------------------------------------------+
| Total token throughput (tok/s)    | 158.8819                                                      |
+-----------------------------------+---------------------------------------------------------------+
| Request throughput (req/s)        | 0.5212                                                        |
+-----------------------------------+---------------------------------------------------------------+
| Average latency (s)               | 8.3612                                                        |
+-----------------------------------+---------------------------------------------------------------+
| Average time to first token (s)   | 0.1035                                                        |
+-----------------------------------+---------------------------------------------------------------+
| Average time per output token (s) | 0.0329                                                        |
+-----------------------------------+---------------------------------------------------------------+
| Average input tokens per request  | 50.25                                                         |
+-----------------------------------+---------------------------------------------------------------+
| Average output tokens per request | 254.6                                                         |
+-----------------------------------+---------------------------------------------------------------+
| Average package latency (s)       | 0.0324                                                        |
+-----------------------------------+---------------------------------------------------------------+
| Average package per request       | 254.6                                                         |
+-----------------------------------+---------------------------------------------------------------+
| Expected number of requests       | 20                                                            |
+-----------------------------------+---------------------------------------------------------------+
| Result DB path                    | outputs/20250423_002442/Qwen2.5-7B-Instruct/benchmark_data.db |
+-----------------------------------+---------------------------------------------------------------+
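
As a sanity check on how the throughput rows relate to the other rows, the sketch below recomputes them from the summary values. It assumes each throughput is a total count divided by the wall-clock test time. The reported numbers are consistent with that reading, but it is not taken from EvalScope's source.

```python
# Values copied from the summary table above.
elapsed_s = 38.3744          # Time taken for tests (s)
total_requests = 20          # Total requests
avg_input_tokens = 50.25     # Average input tokens per request
avg_output_tokens = 254.6    # Average output tokens per request

# Totals across all requests.
total_output_tokens = total_requests * avg_output_tokens
total_tokens = total_requests * (avg_input_tokens + avg_output_tokens)

# Throughput = total count / wall-clock time.
output_throughput = total_output_tokens / elapsed_s
total_throughput = total_tokens / elapsed_s
request_throughput = total_requests / elapsed_s

print(f"Output token throughput (tok/s): {output_throughput:.4f}")
print(f"Total token throughput (tok/s):  {total_throughput:.4f}")
print(f"Request throughput (req/s):      {request_throughput:.4f}")
```

The recomputed values agree with the table to within rounding, so the three throughput rows carry no information beyond the request count, the token averages, and the elapsed time.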

Percentile results:
+------------+----------+---------+-------------+--------------+---------------+----------------------+
| Percentile | TTFT (s) | ITL (s) | Latency (s) | Input tokens | Output tokens | Throughput(tokens/s) |
+------------+----------+---------+-------------+--------------+---------------+----------------------+
|    10%     |  0.0962  |  0.031  |   4.4571    |      42      |      135      |       29.9767        |
|    25%     |  0.0971  | 0.0318  |   6.3509    |      47      |      193      |       30.2157        |
|    50%     |  0.0987  | 0.0321  |   9.3387    |      49      |      285      |       30.3969        |
|    66%     |  0.1017  | 0.0324  |   9.8519    |      52      |      302      |       30.5182        |
|    75%     |  0.107   | 0.0328  |   10.2391   |      55      |      313      |       30.6124        |
|    80%     |  0.1221  | 0.0329  |   10.8257   |      58      |      330      |       30.6759        |
|    90%     |  0.1245  | 0.0333  |   13.0472   |      62      |      404      |       30.9644        |
|    95%     |  0.1247  | 0.0336  |   14.2936   |      66      |      432      |       31.6691        |
|    98%     |  0.1247  | 0.0353  |   14.2936   |      66      |      432      |       31.6691        |
|    99%     |  0.1247  | 0.0627  |   14.2936   |      66      |      432      |       31.6691        |
+------------+----------+---------+-------------+--------------+---------------+----------------------+

For more details, see the EvalScope documentation: Model Inference Stress Testing.