Qwen-VL-Dense (Qwen2.5-VL-3B/7B, Qwen3-VL-2B/4B/8B/32B)#

Introduction#

Alibaba Cloud's Qwen-VL (Vision-Language) series is a family of powerful large vision-language models (LVLMs) designed for comprehensive multimodal understanding. They accept images, text, and bounding boxes as input and output text and detection boxes, enabling advanced capabilities such as image detection, multimodal dialogue, and multi-image reasoning.

This document walks through the main verification steps for these models, including supported features, feature configuration, environment preparation, NPU deployment, and accuracy and performance evaluation.

This tutorial uses vLLM-Ascend v0.11.0rc3-a3 for the demonstration, taking Qwen3-VL-8B-Instruct as the single-NPU deployment example and Qwen2.5-VL-32B-Instruct as the multi-NPU deployment example.

Supported Features#

See Supported Features for the feature support matrix of these models.

See the Feature Guide for how to configure each feature.

Environment Preparation#

Model Weights#

Requires 1 Atlas 800I A2 (64G × 8) node or 1 Atlas 800 A3 (64G × 16) node.

An example Qwen2.5-VL quantization script can be found in the modelslim repository: Qwen2.5-VL quantization script example.

It is recommended to download the model weights to a directory shared across nodes, such as /root/.cache/.
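
For example, the weights can be pre-downloaded with the ModelScope CLI. This is a minimal sketch, assuming the modelscope package is installed; the local directory below is only an example and can be any shared path:

# Download the Qwen3-VL-8B-Instruct weights into a shared cache directory (example path)
modelscope download --model Qwen/Qwen3-VL-8B-Instruct --local_dir /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct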

Installation#

Run the docker container on a single NPU

# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.12.0rc1

docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
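
Once inside the container, you can check that the NPU device is visible (npu-smi is mounted from the host in the command above):

# Should list the Ascend NPU devices and their health status
npu-smi info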

Run the docker container on multiple NPUs

# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.12.0rc1

docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-v /data:/data \
-it $IMAGE bash

Set the environment variables

# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True

# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256

Note

max_split_size_mb prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. You can find more details here.

Deployment#

Offline Inference#

Run the following script to perform offline inference on a single NPU

pip install qwen_vl_utils --extra-index-url https://download.pytorch.org/whl/cpu/

from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = "Qwen/Qwen3-VL-8B-Instruct"

llm = LLM(
    model=MODEL_PATH,
    max_model_len=16384,
    limit_mm_per_prompt={"image": 10},
)

sampling_params = SamplingParams(
    max_tokens=512
)

image_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png",
                "min_pixels": 224 * 224,
                "max_pixels": 1280 * 28 * 28,
            },
            {"type": "text", "text": "Please provide a detailed description of this image"},
        ],
    },
]

messages = image_messages

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs, _, _ = process_vision_info(messages, return_video_kwargs=True)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text

print(generated_text)

If the script runs successfully, you will see output similar to the following

**Visual Components:**

1.  **Abstract Geometric Icon (Left Side):**
    *   The logo features a stylized, abstract icon on the left.
    *   It is composed of interconnected lines and angular shapes, forming a complex, hexagonal-like structure.
    *   The icon is rendered in a solid, thin blue line, giving it a modern, technological, and clean appearance.

2.  **Text (Right Side):**
    *   To the right of the icon, the name "TONGYI Qwen" is written.
    *   **"TONGYI"** is written in uppercase letters in a bold, modern sans-serif font. The color is a medium blue, matching the icon's color.
    *   **"Qwen"** is written below "TONGYI" in a slightly larger, bold, sans-serif font. The color of "Qwen" is a dark gray or black, creating a strong contrast with the blue text above it.
    *   The text is aligned and spaced neatly, with "Qwen" appearing slightly larger and bolder than "TONGYI," emphasizing the proper noun.

**Overall Design and Aesthetics:**

*   The logo has a clean, contemporary, and professional feel, suitable for a technology and AI product.
*   The use of blue conveys trust, innovation, and intelligence, while the dark gray adds stability and clarity.
*   The overall layout is balanced and symmetrical, with the icon and text arranged horizontally for easy recognition and memorability.
*   The design effectively communicates the product's high-tech nature while remaining brand-identifiable and straightforward.

The logo is designed to be easily recognizable across various media and scales, from digital screens to printed materials.
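
Because limit_mm_per_prompt is set to allow up to 10 images per prompt, the same setup also supports multi-image reasoning. Below is a minimal sketch that reuses the llm, processor, and sampling_params objects from the script above; the second image URL is only a placeholder for illustration:

# Multi-image request: both images are passed in a single user turn
multi_image_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"},
            {"type": "image", "image": "https://example.com/second_image.png"},  # placeholder URL
            {"type": "text", "text": "Compare the two images and describe their differences."},
        ],
    },
]

prompt = processor.apply_chat_template(multi_image_messages, tokenize=False, add_generation_prompt=True)
image_inputs, _, _ = process_vision_info(multi_image_messages, return_video_kwargs=True)

outputs = llm.generate(
    [{"prompt": prompt, "multi_modal_data": {"image": image_inputs}}],
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)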

Run the following script to perform offline inference on multiple NPUs

pip install qwen_vl_utils --extra-index-url https://download.pytorch.org/whl/cpu/

from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = "Qwen/Qwen2.5-VL-32B-Instruct"

llm = LLM(
    model=MODEL_PATH,
    tensor_parallel_size=2,
    max_model_len=16384,
    limit_mm_per_prompt={"image": 10},
)

sampling_params = SamplingParams(
    max_tokens=512
)

image_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png",
                "min_pixels": 224 * 224,
                "max_pixels": 1280 * 28 * 28,
            },
            {"type": "text", "text": "Please provide a detailed description of this image"},
        ],
    },
]

messages = image_messages

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs, _, _ = process_vision_info(messages, return_video_kwargs=True)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text

print(generated_text)

If the script runs successfully, you will see output similar to the following

The image displays a logo and text related to the Qwen model, which is an artificial intelligence (AI) language model developed by Alibaba Cloud. Here is a detailed description of the elements in the image:

### **1. Logo:**
- The logo on the left side of the image consists of a stylized, abstract geometric design.
- The logo is primarily composed of interconnected lines and shapes that resemble a combination of arrows, lines, and geometric forms.
- The lines are arranged in a triangular pattern, giving it a dynamic and modern appearance.
- The lines are rendered in a dark blue color, and they form a three-dimensional, arrow-like structure. This conveys a sense of movement, forward momentum, or direction, which is often symbolic of progress and integration.
- The design appears to be complex yet minimalistic, with clean and sharp lines.
- The triangular and square-like structure suggests precision, connectivity, and innovation, which are often associated with technology and advanced systems.
- This abstract, arrow-like design implies a sense of flow, direction, and connectivity, which aligns with themes of progress and technological advancement.

### **2. Text:**
- **"TONGYI" (on the top right side):
  - The text is in dark blue, which is a color often associated with technology, stability, and trustworthiness.
  - The name "Tongyi" is written in a bold, sans-serif font, giving it a modern and professional look.
- **"Qwen" (below "Tongyi"):
  - The font for "Qwen" is in a bold, uppercase format.
  - The style
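
qwen_vl_utils can also preprocess video inputs following the standard Qwen2.5-VL recipe for vLLM. Below is a minimal sketch reusing the llm, processor, and sampling_params objects above; the video URL is a placeholder, and limit_mm_per_prompt would additionally need a "video" entry when constructing the LLM (an assumption, adjust to your setup):

# Video request: frames are extracted by qwen_vl_utils and passed as multimodal data
video_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://example.com/sample_video.mp4",  # placeholder URL
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe what happens in this video."},
        ],
    },
]

prompt = processor.apply_chat_template(video_messages, tokenize=False, add_generation_prompt=True)
_, video_inputs, video_kwargs = process_vision_info(video_messages, return_video_kwargs=True)

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": {"video": video_inputs},
    # Forward fps and related options to the multimodal processor
    "mm_processor_kwargs": video_kwargs,
}
outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)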

Online Serving#

Inside the docker container, start the vLLM server on a single NPU

vllm serve Qwen/Qwen3-VL-8B-Instruct \
    --dtype bfloat16 \
    --max_model_len 16384 \
    --max-num-batched-tokens 16384

Note

Add the --max_model_len option to avoid a ValueError, because the max seq len of Qwen3-VL-8B-Instruct (256000) is larger than the maximum number of tokens that can be stored in the KV cache. This limit varies with the NPU series and HBM size, so adjust the value to one suitable for your NPU series.

If the service starts successfully, you will see output like the following

INFO:     Started server process [2736]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

Once the server is up, you can query the model with an input prompt

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "Qwen/Qwen3-VL-8B-Instruct",
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
        {"type": "text", "text": "What is the text in the illustrate?"}
    ]}
    ]
    }'

If the query succeeds, you will see output on the client like the following

{"id":"chatcmpl-d3270d4a16cb4b98936f71ee3016451f","object":"chat.completion","created":1764924127,"model":"Qwen/Qwen3-VL-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is: **TONGYI Qwen**","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":107,"total_tokens":123,"completion_tokens":16,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

Logs of the vLLM server

INFO 12-05 08:42:07 [chat_utils.py:560] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
Downloading Model from https://www.modelscope.cn to directory: /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct
INFO 12-05 08:42:11 [acl_graph.py:187] Replaying aclgraph
INFO:     127.0.0.1:60988 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 12-05 08:42:13 [loggers.py:127] Engine 000: Avg prompt throughput: 10.7 tokens/s, Avg generation throughput: 1.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 12-05 08:42:23 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
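
Because the server exposes an OpenAI-compatible API, you can also query it from Python. A minimal sketch, assuming the openai package is installed and the server listens on localhost:8000:

from openai import OpenAI

# vLLM ignores the API key, but the client requires one to be set
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
                {"type": "text", "text": "What is the text in the illustration?"},
            ],
        },
    ],
)
print(response.choices[0].message.content)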

Inside the docker container, start the vLLM server on multiple NPUs

#!/bin/sh
# if os is Ubuntu
apt update
apt install libjemalloc2 
# if os is openEuler
yum update
yum install jemalloc
# Add the LD_PRELOAD environment variable
if [ -f /usr/lib/aarch64-linux-gnu/libjemalloc.so.2 ]; then
    # On Ubuntu, first install with `apt install libjemalloc2`
    export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
elif [ -f /usr/lib64/libjemalloc.so.2 ]; then
    # On openEuler, first install with `yum install jemalloc`
    export LD_PRELOAD=/usr/lib64/libjemalloc.so.2:$LD_PRELOAD
fi
# Enable the AIVector core to directly schedule ROCE communication
export HCCL_OP_EXPANSION_MODE="AIV"
# Set vLLM to Engine V1
export VLLM_USE_V1=1

vllm serve Qwen/Qwen2.5-VL-32B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --async-scheduling \
    --tensor-parallel-size 2 \
    --max-model-len 30000 \
    --max-num-batched-tokens 50000 \
    --max-num-seqs 30 \
    --no-enable-prefix-caching \
    --trust-remote-code \
    --dtype bfloat16

Note

Add the --max_model_len option to avoid a ValueError, because the max_model_len of Qwen2.5-VL-32B-Instruct (128000) is larger than the maximum number of tokens that can be stored in the KV cache. This limit varies with the NPU series and HBM size, so adjust the value to one suitable for your NPU series.

If the service starts successfully, you will see output like the following

INFO:     Started server process [14431]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

Once the server is up, you can query the model with an input prompt

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "Qwen/Qwen2.5-VL-32B-Instruct",
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
        {"type": "text", "text": "What is the text in the illustrate?"}
    ]}
    ]
    }'

If the query succeeds, you will see output on the client like the following

{"id":"chatcmpl-c07088bf992a4b77a89d79480122a483","object":"chat.completion","created":1764905884,"model":"Qwen/Qwen2.5-VL-32B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is:\n\n**TONGYI Qwen**","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":73,"total_tokens":89,"completion_tokens":16,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

Logs of the vLLM server

The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
INFO 12-05 08:50:57 [chat_utils.py:560] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
Downloading Model from https://www.modelscope.cn to directory: /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-32B-Instruct
2025-12-05 08:50:58,913 - modelscope - INFO - Target directory already exists, skipping creation.
INFO 12-05 08:51:00 [acl_graph.py:187] Replaying aclgraph
INFO:     127.0.0.1:50720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 12-05 08:51:10 [loggers.py:127] Engine 000: Avg prompt throughput: 7.3 tokens/s, Avg generation throughput: 1.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 12-05 08:51:20 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
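
The endpoint also supports streaming via the standard OpenAI-compatible "stream" field, which returns the answer incrementally as server-sent events. A minimal sketch against the same server:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "Qwen/Qwen2.5-VL-32B-Instruct",
    "stream": true,
    "messages": [
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
        {"type": "text", "text": "Describe this image briefly."}
    ]}
    ]
    }'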

Accuracy Evaluation#

Using the Language Model Evaluation Harness#

The accuracy of some models is already covered by our CI monitoring, including

  • Qwen2.5-VL-7B-Instruct

  • Qwen3-VL-8B-Instruct

You can refer to the monitoring configuration.

Take the mmmu_val dataset as the test dataset and run the accuracy evaluation of Qwen3-VL-8B-Instruct in offline mode.

  1. For more details about installing lm_eval, see Using lm_eval.

pip install lm_eval

  2. Run lm_eval to perform the accuracy evaluation.

lm_eval \
    --model vllm-vlm \
    --model_args pretrained=Qwen/Qwen3-VL-8B-Instruct,max_model_len=8192,gpu_memory_utilization=0.7 \
    --tasks mmmu_val \
    --batch_size 32 \
    --apply_chat_template \
    --trust_remote_code \
    --output_path ./results
  3. After execution you can obtain the results. The following are the results of Qwen3-VL-8B-Instruct on vllm-ascend:0.11.0rc3, for reference only.

Tasks    | Version | Filter | n-shot | Metric | Value  |   | Stderr
---------|---------|--------|--------|--------|--------|---|-------
mmmu_val | 0       |        |        | acc    | 0.5389 | ± | 0.0159
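
The --output_path ./results directory contains JSON result files. Below is a minimal sketch for reading the mmmu_val accuracy back from the newest file; it assumes lm_eval's usual results layout (a top-level "results" dict with "acc,none" style metric keys), which may differ across lm_eval versions:

import glob
import json
import os

# Pick the most recently written results file under ./results
result_files = glob.glob("./results/**/*.json", recursive=True)
latest = max(result_files, key=os.path.getmtime)

with open(latest) as f:
    data = json.load(f)

# Metric keys follow lm_eval's "<metric>,<filter>" convention
mmmu = data["results"]["mmmu_val"]
print(f"mmmu_val acc: {mmmu.get('acc,none')}, stderr: {mmmu.get('acc_stderr,none')}")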

Take the mmmu_val dataset as the test dataset and run the accuracy evaluation of Qwen2.5-VL-32B-Instruct in offline mode.

  1. For more details about installing lm_eval, see Using lm_eval.

pip install lm_eval

  2. Run lm_eval to perform the accuracy evaluation.

lm_eval \
    --model vllm-vlm \
    --model_args pretrained=Qwen/Qwen2.5-VL-32B-Instruct,max_model_len=8192,tensor_parallel_size=2 \
    --tasks mmmu_val \
    --apply_chat_template \
    --trust_remote_code \
    --output_path ./results
  3. After execution you can obtain the results. The following are the results of Qwen2.5-VL-32B-Instruct on vllm-ascend:0.11.0rc3, for reference only.

Tasks    | Version | Filter | n-shot | Metric | Value  |   | Stderr
---------|---------|--------|--------|--------|--------|---|-------
mmmu_val | 0       |        |        | acc    | 0.5744 | ± | 0.0158

Performance#

Using vLLM Benchmark#

For more details, see vllm benchmark.

There are three vllm bench subcommands

  • latency: benchmark the latency of a single batch of requests.

  • serve: benchmark online serving throughput.

  • throughput: benchmark offline inference throughput.

The performance evaluation must be run in online mode. Taking serve as an example, run the following commands.

vllm bench serve --model Qwen/Qwen3-VL-8B-Instruct  --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
vllm bench serve --model Qwen/Qwen2.5-VL-32B-Instruct  --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./

After a few minutes, you can obtain the performance evaluation results.
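
With --save-result, vllm bench serve writes a JSON summary into --result-dir. The sketch below prints a few commonly reported fields from the newest result file; the exact key names (request_throughput, mean_ttft_ms, mean_tpot_ms) are assumptions based on typical vLLM benchmark output and may vary between versions:

import glob
import json
import os

# Locate the newest benchmark result JSON in the current directory
files = glob.glob("./*.json")
latest = max(files, key=os.path.getmtime)

with open(latest) as f:
    result = json.load(f)

# Print a few commonly reported serving metrics if present
for key in ("request_throughput", "output_throughput", "mean_ttft_ms", "mean_tpot_ms"):
    if key in result:
        print(f"{key}: {result[key]}")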