Qwen3-Embedding#
Introduction#
The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, designed specifically for text embedding and ranking tasks. Built on the dense foundation models of the Qwen3 series, it provides a comprehensive range of text embedding and reranking models in a variety of sizes (0.6B, 4B, and 8B). This guide describes how to run these models with vLLM Ascend. Note that the models are only supported by vLLM Ascend 0.9.2rc1 and later.
Supported Features#
Refer to Supported Features for the feature support matrix of these models.
Environment Preparation#
Model Weights#
It is recommended to download the model weights to a directory shared across nodes, e.g. /root/.cache/
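One way to pre-download the weights is with the Hugging Face CLI; a minimal sketch, assuming the shared cache path suggested above (adjust the target directory to your environment):

# Download Qwen3-Embedding-8B into the shared cache directory
huggingface-cli download Qwen/Qwen3-Embedding-8B --local-dir /root/.cache/Qwen3-Embedding-8B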
Installation#
You can use our official Docker image to run the Qwen3-Embedding series of models.
To start the Docker image on your node, refer to Using Docker.
If you don't want to use the Docker image above, you can also build everything from source. To install vllm-ascend from source, refer to Installation.
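As a rough sketch, a source build typically follows the pattern below; the exact branch or tag and the Ascend toolkit prerequisites are environment-specific, so treat the Installation guide as authoritative:

# Clone vllm-ascend and install it in editable mode
git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -e .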
Deployment#
Take the Qwen3-Embedding-8B model as an example. First, start a Docker container on your node.
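A typical invocation looks like the following sketch, based on the standard vLLM Ascend quickstart; the image tag, device list, and mounted paths are assumptions that you should adapt to your node:

# Adjust the image tag and --device entries to match your environment
export IMAGE=quay.io/ascend/vllm-ascend:v0.9.2rc1
docker run --rm \
    --name vllm-ascend \
    --device /dev/davinci0 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -p 8888:8888 \
    -it $IMAGE bash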
Online Inference#
vllm serve Qwen/Qwen3-Embedding-8B --task embed --host 127.0.0.1 --port 8888
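Optionally, once the startup logs settle, you can confirm the server is ready by listing the served models through the standard OpenAI-compatible endpoint:

curl http://127.0.0.1:8888/v1/models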
Once the server is up, you can query the model with input prompts:
curl http://127.0.0.1:8888/v1/embeddings -H "Content-Type: application/json" -d '{
    "input": [
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
    ]
}'
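The same endpoint can also be consumed from Python with the official openai client. A minimal sketch, assuming the host and port from the serve command above (vLLM does not require a real API key by default, so a placeholder is used):

from openai import OpenAI

# Point the client at the local vLLM server; the API key is a dummy value.
client = OpenAI(base_url="http://127.0.0.1:8888/v1", api_key="EMPTY")

resp = client.embeddings.create(
    model="Qwen/Qwen3-Embedding-8B",
    input=[
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies towards each other.",
    ],
)
# Each item in resp.data carries one embedding vector.
print(len(resp.data), len(resp.data[0].embedding))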
Offline Inference#
import torch
from vllm import LLM


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'


if __name__ == "__main__":
    # Each query must come with a one-sentence instruction that describes the task
    task = 'Given a web search query, retrieve relevant passages that answer the query'
    queries = [
        get_detailed_instruct(task, 'What is the capital of China?'),
        get_detailed_instruct(task, 'Explain gravity')
    ]
    # No need to add instruction for retrieval documents
    documents = [
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
    ]
    input_texts = queries + documents

    model = LLM(model="Qwen/Qwen3-Embedding-8B",
                task="embed",
                distributed_executor_backend="mp")

    outputs = model.embed(input_texts)
    embeddings = torch.tensor([o.outputs.embedding for o in outputs])
    # Rows are queries, columns are documents; higher score means more relevant
    scores = (embeddings[:2] @ embeddings[2:].T)
    print(scores.tolist())
If the script runs successfully, you will see output like the following:
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 282.22it/s]
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s](VllmWorker rank=0 pid=4074750) ('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 31.95it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
[[0.7477798461914062, 0.07548339664936066], [0.0886271521449089, 0.6311039924621582]]
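In the score matrix, each row corresponds to a query and each column to a document, so documents can be ranked per query directly from it. As an illustrative continuation, the following lines could be appended inside the __main__ block of the script above:

    # Rank documents for each query by descending similarity score
    for qi, row in enumerate(scores.tolist()):
        ranking = sorted(range(len(row)), key=lambda di: row[di], reverse=True)
        print(f"Query {qi} -> document ranking: {ranking}")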
Performance#
Take the serving performance of Qwen3-Embedding-8B as an example. For more details, refer to vLLM benchmark.
Using serve as an example, run the following command:
vllm bench serve --model Qwen/Qwen3-Embedding-8B \
    --backend openai-embeddings \
    --dataset-name random \
    --host 127.0.0.1 --port 8888 \
    --endpoint /v1/embeddings \
    --tokenizer /root/.cache/Qwen3-Embedding-8B \
    --random-input-len 200 \
    --save-result --result-dir ./
After a few minutes, you will get the performance evaluation results. Following this tutorial, the results are as follows:
============ Serving Benchmark Result ============
Successful requests: 1000
Failed requests: 0
Benchmark duration (s): 6.78
Total input tokens: 108032
Request throughput (req/s): 31.11
Total Token throughput (tok/s): 15929.35
----------------End-to-end Latency----------------
Mean E2EL (ms): 4422.79
Median E2EL (ms): 4412.58
P99 E2EL (ms): 6294.52
==================================================