Qwen3-Reranker

Introduction

The Qwen3 Reranker model series is the latest proprietary model of the Qwen family, designed specifically for text embedding and ranking tasks. Built on the dense foundation models of the Qwen3 series, it provides text embedding and reranking models in a range of sizes (0.6B, 4B, and 8B). This guide describes how to run these models with vLLM Ascend. Note that they are only supported by vLLM Ascend v0.9.2rc1 and later.

Supported Features

Refer to Supported Features for the feature support matrix of this model.

Environment Preparation

Model Weights

It is recommended to download the model weights to a directory shared across nodes, e.g. /root/.cache/.
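For example, one way to fetch the weights is with huggingface-cli (a minimal sketch; the target directory below is illustrative):

huggingface-cli download Qwen/Qwen3-Reranker-8B --local-dir /root/.cache/Qwen3-Reranker-8B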

Installation

You can use our official Docker image to run the Qwen3-Reranker series models.

  • To start the Docker image on your node, refer to Using Docker.

If you prefer not to use the Docker image above, you can also build everything from source.

  • To install vllm-ascend from source, refer to Installation; a typical flow is sketched below.
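A minimal sketch of a source install, assuming the CANN toolkit and a compatible vllm are already in place (the Installation guide is authoritative):

git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
pip install -e .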

Deployment

Take the Qwen3-Reranker-8B model as an example. First, start a Docker container with a command along the lines shown below.
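The exact command depends on your environment; the following is a minimal sketch assuming the official vllm-ascend image and a single NPU, with an illustrative image tag, device IDs, and mounts (see Using Docker for the authoritative version):

docker run --rm -it --name vllm-ascend \
  --device /dev/davinci0 \
  --device /dev/davinci_manager \
  --device /dev/devmm_svm \
  --device /dev/hisi_hdc \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
  -v /root/.cache:/root/.cache \
  -p 8888:8888 \
  quay.io/ascend/vllm-ascend:v0.9.2rc1 bash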

Online Inference

vllm serve Qwen/Qwen3-Reranker-8B --task score --host 127.0.0.1 --port 8888 --hf_overrides '{"architectures": ["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'
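Once the server is up, you can poll vLLM's standard /health endpoint to confirm readiness (host and port match the serve command above):

curl -s http://127.0.0.1:8888/health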

After the server starts, you can send requests using the following example.

requests demo + formatted query & documents

import requests

url = "http://127.0.0.1:8888/v1/rerank"

# Please use the query_template and document_template to format the query and
# document for better reranker results.

prefix = '<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"

query_template = "{prefix}<Instruct>: {instruction}\n<Query>: {query}\n"
document_template = "<Document>: {doc}{suffix}"

instruction = (
    "Given a web search query, retrieve relevant passages that answer the query"
)

query = "What is the capital of China?"

documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

documents = [
    document_template.format(doc=doc, suffix=suffix) for doc in documents
]

response = requests.post(url,
                         json={
                             "query": query_template.format(prefix=prefix, instruction=instruction, query=query),
                             "documents": documents,
                         }).json()

print(response)

If the script runs successfully, you will see a list of scores in the console, similar to:

{'id': 'rerank-e856a17c954047a3a40f73d5ec43dbc6', 'model': 'Qwen/Qwen3-Reranker-8B', 'usage': {'total_tokens': 193}, 'results': [{'index': 0, 'document': {'text': '<Document>: The capital of China is Beijing.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n', 'multi_modal': None}, 'relevance_score': 0.9944348335266113}, {'index': 1, 'document': {'text': '<Document>: Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n', 'multi_modal': None}, 'relevance_score': 6.700084327349032e-07}]}
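To turn the response into a ranked list, a minimal sketch (reusing the response dict from the script above; field names follow the output shown) is:

# Sort results by relevance_score, highest first
ranked = sorted(response["results"], key=lambda r: r["relevance_score"], reverse=True)
for r in ranked:
    print(r["index"], r["relevance_score"])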

Offline Inference

from vllm import LLM

model_name = "Qwen/Qwen3-Reranker-8B"

# What is the difference between the official original version and one
# that has been converted into a sequence classification model?
# Qwen3-Reranker is a language model that performs reranking using the
# logits of the "no" and "yes" tokens.
# The original version has to compute logits over the full vocabulary of
# 151,669 tokens, which is extremely inefficient, not to mention
# incompatible with the vllm score API.
# A method for converting the original model into a sequence
# classification model was proposed. See:
# https://hugging-face.cn/Qwen/Qwen3-Reranker-0.6B/discussions/3
# Models converted offline with this method are not only more efficient
# and compatible with the vllm score API, but also have more concise init
# parameters, for example:
# model = LLM(model="Qwen/Qwen3-Reranker-8B", task="score")

# If you want to load the official original version, the init parameters are
# as follows.

model = LLM(
    model=model_name,
    task="score",
    hf_overrides={
        "architectures": ["Qwen3ForSequenceClassification"],
        "classifier_from_token": ["no", "yes"],
        "is_original_qwen3_reranker": True,
    },
)

# Why hf_overrides are needed for the official original version:
# vllm converts the model to Qwen3ForSequenceClassification when it is
# loaded, for better performance.
# - First, `"architectures": ["Qwen3ForSequenceClassification"]` is needed
#   to manually route the model to Qwen3ForSequenceClassification.
# - Then, `"classifier_from_token": ["no", "yes"]` extracts the vectors
#   corresponding to the classifier tokens from lm_head.
# - Third, the two vectors are merged into a single classifier vector;
#   this conversion logic is enabled by `"is_original_qwen3_reranker": True`.
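# A minimal sketch of that conversion (illustrative, not vllm's exact
# code): the merged classifier vector is the difference of the two token
# rows of lm_head, so the sigmoid of the resulting single logit equals
# the softmax probability of "yes" over {"no", "yes"}:
#   weight = lm_head.weight[yes_token_id] - lm_head.weight[no_token_id]
#   score = sigmoid(hidden_state @ weight)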

# Please use the query_template and document_template to format the query and
# document for better reranker results.

prefix = '<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"

query_template = "{prefix}<Instruct>: {instruction}\n<Query>: {query}\n"
document_template = "<Document>: {doc}{suffix}"

if __name__ == "__main__":
    instruction = (
        "Given a web search query, retrieve relevant passages that answer the query"
    )

    query = "What is the capital of China?"

    documents = [
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
    ]

    documents = [document_template.format(doc=doc, suffix=suffix) for doc in documents]

    outputs = model.score(query_template.format(prefix=prefix, instruction=instruction, query=query), documents)

    print([output.outputs[0].score for output in outputs])

If the script runs successfully, you will see a list of scores in the console, similar to:

[0.9943699240684509, 6.876250040477316e-07]

Performance

The following shows the serving performance of Qwen3-Reranker-8B as an example. For more details, refer to vllm benchmark.

Taking the serve benchmark as an example, run the following command:

vllm bench serve --model Qwen3-Reranker-8B --backend vllm-rerank --dataset-name random-rerank --host 127.0.0.1 --port 8888 --endpoint /v1/rerank --tokenizer /root/.cache/Qwen3-Reranker-8B  --random-input 200  --save-result --result-dir ./

After a few minutes, you will get the performance evaluation results. Following this tutorial, the results are as follows:

============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  6.78
Total input tokens:                      108032
Request throughput (req/s):              31.11
Total Token throughput (tok/s):          15929.35
----------------End-to-end Latency----------------
Mean E2EL (ms):                          4422.79
Median E2EL (ms):                        4412.58
P99 E2EL (ms):                           6294.52
==================================================