Qwen-VL-Dense (Qwen2.5-VL-3B/7B, Qwen3-VL-2B/4B/8B/32B)#
Introduction#
Alibaba Cloud's Qwen-VL (Vision-Language) series is a family of powerful large vision-language models (LVLMs) designed for comprehensive multimodal understanding. They accept images, text, and bounding boxes as input and produce text and detection boxes as output, enabling advanced capabilities such as visual grounding, multimodal dialogue, and multi-image reasoning.
This document walks through the main verification steps for these models, including supported features, feature configuration, environment preparation, NPU deployment, and accuracy and performance evaluation.
This tutorial uses vLLM-Ascend v0.11.0rc3-a3 for demonstration, taking Qwen3-VL-8B-Instruct as the single-NPU deployment example and Qwen2.5-VL-32B-Instruct as the multi-NPU deployment example.
Supported Features#
See Supported Features for the feature support matrix of these models.
See the Feature Guide for how to configure each feature.
Environment Preparation#
Model Weights#
1 Atlas 800I A2 (64G × 8) node or 1 Atlas 800 A3 (64G × 16) node is required.
Qwen2.5-VL-3B-Instruct: Download model weights
Qwen2.5-VL-7B-Instruct: Download model weights
Qwen2.5-VL-32B-Instruct: Download model weights
Qwen2.5-VL-72B-Instruct: Download model weights
Qwen3-VL-2B-Instruct: Download model weights
Qwen3-VL-4B-Instruct: Download model weights
Qwen3-VL-8B-Instruct: Download model weights
Qwen3-VL-32B-Instruct: Download model weights
An example Qwen2.5-VL quantization script can be found in the modelslim repository: Qwen2.5-VL quantization script example.
It is recommended to download the model weights to a directory shared across the nodes, such as /root/.cache/.
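For example, the weights can be pre-downloaded with the ModelScope command-line tool. The snippet below is a minimal sketch, assuming the default ModelScope cache under /root/.cache/modelscope, which matches the /root/.cache mount used in the docker commands below:
pip install modelscope
# Download Qwen3-VL-8B-Instruct into the default ModelScope cache (assumed location: /root/.cache/modelscope)
modelscope download --model Qwen/Qwen3-VL-8B-Instruct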
Installation#
Run the docker container (single-NPU deployment):
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.12.0rc1
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
Run the docker container (multi-NPU deployment):
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.12.0rc1
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-v /data:/data \
-it $IMAGE bash
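Once inside the container, you can optionally confirm that the mounted NPUs are visible before continuing, using the npu-smi tool mounted into the container above:
npu-smi info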
Set environment variables:
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
Note
max_split_size_mb prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. You can find more details here.
Deployment#
Offline Inference#
Run the following script to perform offline inference on a single NPU:
pip install qwen_vl_utils --extra-index-url https://download.pytorch.org/whl/cpu/
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info
MODEL_PATH = "Qwen/Qwen3-VL-8B-Instruct"
llm = LLM(
model=MODEL_PATH,
max_model_len=16384,
limit_mm_per_prompt={"image": 10},
)
sampling_params = SamplingParams(
max_tokens=512
)
image_messages = [
{"role": "system", "content": "You are a helpful assistant."},
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png",
"min_pixels": 224 * 224,
"max_pixels": 1280 * 28 * 28,
},
{"type": "text", "text": "Please provide a detailed description of this image"},
],
},
]
messages = image_messages
processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
image_inputs, _, _ = process_vision_info(messages, return_video_kwargs=True)
mm_data = {}
if image_inputs is not None:
mm_data["image"] = image_inputs
llm_inputs = {
"prompt": prompt,
"multi_modal_data": mm_data,
}
outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
If the script runs successfully, you will see output similar to the following:
**Visual Components:**
1. **Abstract Geometric Icon (Left Side):**
* The logo features a stylized, abstract icon on the left.
* It is composed of interconnected lines and angular shapes, forming a complex, hexagonal-like structure.
* The icon is rendered in a solid, thin blue line, giving it a modern, technological, and clean appearance.
2. **Text (Right Side):**
* To the right of the icon, the name "TONGYI Qwen" is written.
* **"TONGYI"** is written in uppercase letters in a bold, modern sans-serif font. The color is a medium blue, matching the icon's color.
* **"Qwen"** is written below "TONGYI" in a slightly larger, bold, sans-serif font. The color of "Qwen" is a dark gray or black, creating a strong contrast with the blue text above it.
* The text is aligned and spaced neatly, with "Qwen" appearing slightly larger and bolder than "TONGYI," emphasizing the proper noun.
**Overall Design and Aesthetics:**
* The logo has a clean, contemporary, and professional feel, suitable for a technology and AI product.
* The use of blue conveys trust, innovation, and intelligence, while the dark gray adds stability and clarity.
* The overall layout is balanced and symmetrical, with the icon and text arranged horizontally for easy recognition and memorability.
* The design effectively communicates the product's high-tech nature while remaining brand-identifiable and straightforward.
The logo is designed to be easily recognizable across various media and scales, from digital screens to printed materials.
Run the following script to perform offline inference on multiple NPUs:
pip install qwen_vl_utils --extra-index-url https://download.pytorch.org/whl/cpu/
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info
MODEL_PATH = "Qwen/Qwen2.5-VL-32B-Instruct"
llm = LLM(
model=MODEL_PATH,
tensor_parallel_size=2,
max_model_len=16384,
limit_mm_per_prompt={"image": 10},
)
sampling_params = SamplingParams(
max_tokens=512
)
image_messages = [
{"role": "system", "content": "You are a helpful assistant."},
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png",
"min_pixels": 224 * 224,
"max_pixels": 1280 * 28 * 28,
},
{"type": "text", "text": "Please provide a detailed description of this image"},
],
},
]
messages = image_messages
processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
image_inputs, _, _ = process_vision_info(messages, return_video_kwargs=True)
mm_data = {}
if image_inputs is not None:
mm_data["image"] = image_inputs
llm_inputs = {
"prompt": prompt,
"multi_modal_data": mm_data,
}
outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
If the script runs successfully, you will see output similar to the following:
The image displays a logo and text related to the Qwen model, which is an artificial intelligence (AI) language model developed by Alibaba Cloud. Here is a detailed description of the elements in the image:
### **1. Logo:**
- The logo on the left side of the image consists of a stylized, abstract geometric design.
- The logo is primarily composed of interconnected lines and shapes that resemble a combination of arrows, lines, and geometric forms.
- The lines are arranged in a triangular pattern, giving it a dynamic and modern appearance.
- The lines are rendered in a dark blue color, and they form a three-dimensional, arrow-like structure. This conveys a sense of movement, forward momentum, or direction, which is often symbolic of progress and integration.
- The design appears to be complex yet minimalistic, with clean and sharp lines.
- The triangular and square-like structure suggests precision, connectivity, and innovation, which are often associated with technology and advanced systems.
- This abstract, arrow-like design implies a sense of flow, direction, and connectivity, which aligns with themes of progress and technological advancement.
### **2. Text:**
- **"TONGYI" (on the top right side):
- The text is in dark blue, which is a color often associated with technology, stability, and trustworthiness.
- The name "Tongyi" is written in a bold, sans-serif font, giving it a modern and professional look.
- **"Qwen" (below "Tongyi"):
- The font for "Qwen" is in a bold, uppercase format.
- The style
Online Serving#
Run the docker container to start the vLLM server on a single NPU:
vllm serve Qwen/Qwen3-VL-8B-Instruct \
--dtype bfloat16 \
--max_model_len 16384 \
--max-num-batched-tokens 16384
Note
Add the --max_model_len option to avoid a ValueError, because the max seq len of the Qwen3-VL-8B-Instruct model (256000) is larger than the maximum number of tokens that can be stored in the KV cache. This varies with the NPU series and HBM size, so adjust the value to suit your NPU series.
If the service starts successfully, you will see output similar to the following:
INFO: Started server process [2736]
INFO: Waiting for application startup.
INFO: Application startup complete.
Once the server is up, you can query the model with an input prompt:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-VL-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
{"type": "text", "text": "What is the text in the illustrate?"}
]}
]
}'
If the query succeeds, you will see a response on the client similar to the following:
{"id":"chatcmpl-d3270d4a16cb4b98936f71ee3016451f","object":"chat.completion","created":1764924127,"model":"Qwen/Qwen3-VL-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is: **TONGYI Qwen**","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":107,"total_tokens":123,"completion_tokens":16,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
Logs of the vLLM server:
INFO 12-05 08:42:07 [chat_utils.py:560] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
Downloading Model from https://www.modelscope.cn to directory: /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct
INFO 12-05 08:42:11 [acl_graph.py:187] Replaying aclgraph
INFO: 127.0.0.1:60988 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 12-05 08:42:13 [loggers.py:127] Engine 000: Avg prompt throughput: 10.7 tokens/s, Avg generation throughput: 1.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 12-05 08:42:23 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
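You can also query the OpenAI-compatible endpoint from Python. The snippet below is a minimal sketch, assuming the openai Python package is installed and the server started above is reachable at localhost:8000:
from openai import OpenAI

# Point the OpenAI-compatible client at the local vLLM server (assumed address)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
            {"type": "text", "text": "What is the text in the illustrate?"},
        ]},
    ],
)
print(response.choices[0].message.content)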
Run the docker container to start the vLLM server on multiple NPUs:
#!/bin/sh
# if os is Ubuntu
apt update
apt install libjemalloc2
# if os is openEuler
yum update
yum install jemalloc
# Add the LD_PRELOAD environment variable
if [ -f /usr/lib/aarch64-linux-gnu/libjemalloc.so.2 ]; then
# On Ubuntu, first install with `apt install libjemalloc2`
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
elif [ -f /usr/lib64/libjemalloc.so.2 ]; then
# On openEuler, first install with `yum install jemalloc`
export LD_PRELOAD=/usr/lib64/libjemalloc.so.2:$LD_PRELOAD
fi
# Enable the AIVector core to directly schedule ROCE communication
export HCCL_OP_EXPANSION_MODE="AIV"
# Set vLLM to Engine V1
export VLLM_USE_V1=1
vllm serve Qwen/Qwen2.5-VL-32B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--async-scheduling \
--tensor-parallel-size 2 \
--max-model-len 30000 \
--max-num-batched-tokens 50000 \
--max-num-seqs 30 \
--no-enable-prefix-caching \
--trust-remote-code \
--dtype bfloat16
Note
Add the --max_model_len option to avoid a ValueError, because the max_model_len of the Qwen2.5-VL-32B-Instruct model (128000) is larger than the maximum number of tokens that can be stored in the KV cache. This varies with the NPU series and HBM size, so adjust the value to suit your NPU series.
If the service starts successfully, you will see output similar to the following:
INFO: Started server process [14431]
INFO: Waiting for application startup.
INFO: Application startup complete.
Once the server is up, you can query the model with an input prompt:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-VL-32B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
{"type": "text", "text": "What is the text in the illustrate?"}
]}
]
}'
If the query succeeds, you will see a response on the client similar to the following:
{"id":"chatcmpl-c07088bf992a4b77a89d79480122a483","object":"chat.completion","created":1764905884,"model":"Qwen/Qwen2.5-VL-32B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is:\n\n**TONGYI Qwen**","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":73,"total_tokens":89,"completion_tokens":16,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
Logs of the vLLM server:
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
INFO 12-05 08:50:57 [chat_utils.py:560] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
Downloading Model from https://www.modelscope.cn to directory: /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-32B-Instruct
2025-12-05 08:50:58,913 - modelscope - INFO - Target directory already exists, skipping creation.
INFO 12-05 08:51:00 [acl_graph.py:187] Replaying aclgraph
INFO: 127.0.0.1:50720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 12-05 08:51:10 [loggers.py:127] Engine 000: Avg prompt throughput: 7.3 tokens/s, Avg generation throughput: 1.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 12-05 08:51:20 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
Accuracy Evaluation#
Using the Language Model Evaluation Harness#
The accuracy of some models is already covered by our CI monitoring, including:
Qwen2.5-VL-7B-Instruct
Qwen3-VL-8B-Instruct
You can refer to the monitoring configuration.
Taking the mmmu_val dataset as the test dataset, run the accuracy evaluation of Qwen3-VL-8B-Instruct in offline mode.
For more details on installing lm_eval, see Using lm_eval.
pip install lm_eval
Run lm_eval to perform the accuracy evaluation.
lm_eval \
--model vllm-vlm \
--model_args pretrained=Qwen/Qwen3-VL-8B-Instruct,max_model_len=8192,gpu_memory_utilization=0.7 \
--tasks mmmu_val \
--batch_size 32 \
--apply_chat_template \
--trust_remote_code \
--output_path ./results
After the run finishes, you can obtain the results. The following results for Qwen3-VL-8B-Instruct on vllm-ascend:0.11.0rc3 are provided for reference only.
| Task | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmmu_val | 0 | none |  | acc | ↑ | 0.5389 | ± | 0.0159 |
Taking the mmmu_val dataset as the test dataset, run the accuracy evaluation of Qwen2.5-VL-32B-Instruct in offline mode.
For more details on installing lm_eval, see Using lm_eval.
pip install lm_eval
Run lm_eval to perform the accuracy evaluation.
lm_eval \
--model vllm-vlm \
--model_args pretrained=Qwen/Qwen2.5-VL-32B-Instruct,max_model_len=8192,tensor_parallel_size=2 \
--tasks mmmu_val \
--apply_chat_template \
--trust_remote_code \
--output_path ./results
After the run finishes, you can obtain the results. The following results for Qwen2.5-VL-32B-Instruct on vllm-ascend:0.11.0rc3 are provided for reference only.
| Task | Version | Filter | n-shot | Metric |  | Value |  | Stderr |
|---|---|---|---|---|---|---|---|---|
| mmmu_val | 0 | none |  | acc | ↑ | 0.5744 | ± | 0.0158 |
Performance#
Using vLLM Benchmark#
See vllm benchmark for more details.
There are three vllm bench subcommands:
latency: benchmarks the latency of a single batch of requests.
serve: benchmarks online serving throughput.
throughput: benchmarks offline inference throughput.
Performance evaluation must be performed in online mode. Taking serve as an example, run the following commands:
vllm bench serve --model Qwen/Qwen3-VL-8B-Instruct --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
vllm bench serve --model Qwen/Qwen2.5-VL-32B-Instruct --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
After a few minutes, you will get the performance evaluation results.
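The --save-result flag writes a JSON file into the directory given by --result-dir. The snippet below is a minimal sketch for inspecting it, assuming the result file is the most recent JSON file in the current directory; the exact field names depend on the vLLM version:
import glob
import json
import os

# Pick the most recently written benchmark result JSON from --result-dir (here: ./)
result_file = max(glob.glob("./*.json"), key=os.path.getmtime)
with open(result_file) as f:
    result = json.load(f)

# Print the scalar summary metrics (request throughput, latency percentiles, etc.)
for key, value in sorted(result.items()):
    if isinstance(value, (int, float, str)):
        print(f"{key}: {value}")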