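OpenAI-Compatible Server
vLLM provides an HTTP server that implements OpenAI's Completions API, Chat API, and more. This lets you serve models and interact with them using an HTTP client.
In your terminal, you can install vLLM and then start the server with the vllm serve command. (You can also use our Docker image.)
To call the server, create a script in your preferred text editor that uses an HTTP client. Include any messages you want to send to the model, then run the script. Below is an example script that uses the official OpenAI Python client.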
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"},
    ],
)

print(completion.choices[0].message)
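Tip
vLLM supports some parameters that OpenAI does not, such as top_k. You can pass these to vLLM through the OpenAI client's extra_body request parameter, for example extra_body={"top_k": 50} to set top_k.
Warning
By default, the server applies the generation_config.json file from the Hugging Face model repository if it exists. This means the default values of some sampling parameters may be overridden by those recommended by the model creator.
To disable this behavior, pass --generation-config vllm when launching the server.
Supported APIs
We currently support the following OpenAI APIs: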
- Completions API (/v1/completions)
  - Only applicable to text generation models (--task generate).
  - Note: the suffix parameter is not supported.
- Chat Completions API (/v1/chat/completions)
- Embeddings API (/v1/embeddings)
  - Only applicable to embedding models (--task embed).
- Transcriptions API (/v1/audio/transcriptions)
  - Only applicable to automatic speech recognition (ASR) models (OpenAI Whisper) (--task generate).
In addition, we provide the following custom APIs:
- Tokenizer API (/tokenize, /detokenize)
  - Applicable to any model with a tokenizer.
- Pooling API (/pooling)
  - Applicable to all pooling models.
- Classification API (/classify)
  - Only applicable to classification models (--task classify).
- Score API (/score)
  - Applicable to embedding models and cross-encoder models (--task score).
- Re-rank API (/rerank, /v1/rerank, /v2/rerank)
  - Implements Jina AI's v1 re-rank API.
  - Also compatible with Cohere's v1 and v2 re-rank APIs.
  - Jina's and Cohere's APIs are very similar; Jina's rerank endpoint response includes extra information.
  - Only applicable to cross-encoder models (--task score).
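Chat Template
For a language model to support the chat protocol, vLLM requires the model to include a chat template in its tokenizer configuration. The chat template is a Jinja2 template that specifies how roles, messages, and other chat-specific tokens are encoded in the input.
An example chat template for NousResearch/Meta-Llama-3-8B-Instruct can be found here.
Some models do not provide a chat template even though they are instruction/chat fine-tuned. For these models, you can manually specify the chat template with the --chat-template parameter, given either as a file path or as the template itself in string form. Without a chat template, the server cannot process chat requests, and all chat requests will fail.
The vLLM community provides a set of chat templates for popular models. You can find them under the examples directory.
With the addition of the multimodal chat API, the OpenAI spec now accepts chat messages in a new format that specifies both a type and a text field. An example is shown below: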
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": [{"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"}]},
    ],
)
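Most chat templates for LLMs expect the content field to be a string, but some newer models, such as meta-llama/Llama-Guard-3-1B, expect the content to be formatted according to the OpenAI schema in the request. vLLM makes a best-effort attempt to detect this automatically and logs a string like "Detected the chat template content format to be...". It then internally converts incoming requests to match the detected format, which can be one of:
- "string": a string.
  - Example: "Hello world"
- "openai": a list of dictionaries, similar to the OpenAI schema.
  - Example: [{"type": "text", "text": "Hello world!"}]
If the result is not what you expect, you can set the --chat-template-content-format CLI argument to override which format is used.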
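To make the two content formats concrete, here is a minimal sketch of converting a message's content between the "string" and "openai" shapes. It is an illustration only, not vLLM's internal conversion code: the helper names `to_openai_format` and `to_string_format` are hypothetical, and joining multiple text parts with a newline is an assumption.

```python
from typing import Union

# A message's content is either a plain string or a list of typed parts.
Content = Union[str, list[dict]]


def to_openai_format(content: Content) -> list[dict]:
    """Wrap a plain string as a one-element list of text parts."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return content


def to_string_format(content: Content) -> str:
    """Collapse the text parts of an OpenAI-style list into one string.

    Assumption: parts are joined with a newline; non-text parts are skipped.
    """
    if isinstance(content, str):
        return content
    return "\n".join(
        part["text"] for part in content if part.get("type") == "text"
    )


print(to_openai_format("Hello world!"))
print(to_string_format([{"type": "text", "text": "Hello world"}]))
```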
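Extra Parameters
vLLM supports a set of parameters that are not part of the OpenAI API. To use them, you can pass them as extra parameters in the OpenAI client. If you are making HTTP calls directly, you can merge them straight into the JSON payload.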
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
    ],
    extra_body={
        "guided_choice": ["positive", "negative"],
    },
)
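For direct HTTP calls, the same request can be sketched as a plain JSON body in which the extra parameter sits at the top level next to model and messages, with no extra_body wrapper. The helper `build_chat_payload` is a hypothetical name used only for this illustration; the body would then be POSTed to /v1/chat/completions with any HTTP client.

```python
import json


def build_chat_payload(model, messages, **extra_params):
    """Merge vLLM-specific parameters directly into the request body."""
    payload = {"model": model, "messages": messages}
    payload.update(extra_params)
    return payload


payload = build_chat_payload(
    "NousResearch/Meta-Llama-3-8B-Instruct",
    [{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}],
    guided_choice=["positive", "negative"],
)

# Serialized body, ready to send as application/json.
body = json.dumps(payload)
print(body)
```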
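Extra HTTP Headers
Only the X-Request-Id HTTP request header is supported for now. It can be enabled with --enable-request-id-headers.
Note that enabling the headers can significantly affect performance at high QPS (queries per second) rates. We therefore recommend implementing HTTP headers at the router level (e.g. via Istio) rather than within the vLLM layer. See this PR for details.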
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
    ],
    extra_headers={
        "x-request-id": "sentiment-classification-00001",
    },
)
print(completion._request_id)

completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="A robot may not injure a human being",
    extra_headers={
        "x-request-id": "completion-test",
    },
)
print(completion._request_id)
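Without the OpenAI client, the header can be attached to a raw HTTP request. The sketch below uses only the Python standard library and builds the request without sending it. The helper name `build_completion_request` is hypothetical, and the base URL assumes the local server from the earlier examples.

```python
import json
import urllib.request


def build_completion_request(base_url, payload, request_id):
    """Build a POST to /v1/completions that carries an X-Request-Id header."""
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Request-Id": request_id,  # HTTP header names are case-insensitive
        },
        method="POST",
    )


req = build_completion_request(
    "http://localhost:8000",
    {
        "model": "NousResearch/Meta-Llama-3-8B-Instruct",
        "prompt": "A robot may not injure a human being",
    },
    "completion-test",
)

# To actually send it (requires a running server):
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```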
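API Reference
Completions API
Our Completions API is compatible with OpenAI's Completions API; you can use the official OpenAI Python client to interact with it.
Code example: examples/online_serving/openai_completion_client.py
Extra Parameters
The following sampling parameters are supported.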
use_beam_search: bool = False
top_k: Optional[int] = None
min_p: Optional[float] = None
repetition_penalty: Optional[float] = None
length_penalty: float = 1.0
stop_token_ids: Optional[list[int]] = []
include_stop_str_in_output: bool = False
ignore_eos: bool = False
min_tokens: int = 0
skip_special_tokens: bool = True
spaces_between_special_tokens: bool = True
truncate_prompt_tokens: Optional[Annotated[int, Field(ge=1)]] = None
allowed_token_ids: Optional[list[int]] = None
prompt_logprobs: Optional[int] = None
The following extra parameters are supported by default:
add_special_tokens: bool = Field(
default=True,
description=(
"If true (the default), special tokens (e.g. BOS) will be added to "
"the prompt."),
)
response_format: Optional[AnyResponseFormat] = Field(
default=None,
description=(
"Similar to chat completion, this parameter specifies the format "
"of output. Only {'type': 'json_object'}, {'type': 'json_schema'}"
", {'type': 'structural_tag'}, or {'type': 'text' } is supported."
),
)
guided_json: Optional[Union[str, dict, BaseModel]] = Field(
default=None,
description="If specified, the output will follow the JSON schema.",
)
guided_regex: Optional[str] = Field(
default=None,
description=(
"If specified, the output will follow the regex pattern."),
)
guided_choice: Optional[list[str]] = Field(
default=None,
description=(
"If specified, the output will be exactly one of the choices."),
)
guided_grammar: Optional[str] = Field(
default=None,
description=(
"If specified, the output will follow the context free grammar."),
)
guided_decoding_backend: Optional[str] = Field(
default=None,
description=(
"If specified, will override the default guided decoding backend "
"of the server for this specific request. If set, must be one of "
"'outlines' / 'lm-format-enforcer'"),
)
guided_whitespace_pattern: Optional[str] = Field(
default=None,
description=(
"If specified, will override the default whitespace pattern "
"for guided json decoding."),
)
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."),
)
logits_processors: Optional[LogitsProcessors] = Field(
default=None,
description=(
"A list of either qualified names of logits processors, or "
"constructor objects, to apply when sampling. A constructor is "
"a JSON object with a required 'qualname' field specifying the "
"qualified name of the processor class/factory, and optional "
"'args' and 'kwargs' fields containing positional and keyword "
"arguments. For example: {'qualname': "
"'my_module.MyLogitsProcessor', 'args': [1, 2], 'kwargs': "
"{'param': 'value'}}."))
return_tokens_as_token_ids: Optional[bool] = Field(
default=None,
description=(
"If specified with 'logprobs', tokens are represented "
" as strings of the form 'token_id:{token_id}' so that tokens "
"that are not JSON-encodable can be identified."))
kv_transfer_params: Optional[dict[str, Any]] = Field(
default=None,
description="KVTransfer parameters used for disaggregated serving.")
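The return_tokens_as_token_ids field above says tokens come back as strings of the form 'token_id:{token_id}'. A client can recover the integer ID with a small parser like the sketch below; `parse_token_id` is a hypothetical helper, not part of vLLM or the OpenAI client.

```python
from typing import Optional

PREFIX = "token_id:"


def parse_token_id(token: str) -> Optional[int]:
    """Return the integer ID from a 'token_id:<int>' string, else None."""
    if token.startswith(PREFIX):
        return int(token[len(PREFIX):])
    return None


print(parse_token_id("token_id:128000"))
print(parse_token_id("Hello"))
```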
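Chat API
Our Chat API is compatible with OpenAI's Chat Completions API; you can use the official OpenAI Python client to interact with it.
We support both Vision- and Audio-related parameters; see our Multimodal Inputs guide for details.
- Note: the image_url.detail parameter is not supported.
Code example: examples/online_serving/openai_chat_completion_client.py
Extra Parameters
The following sampling parameters are supported.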
best_of: Optional[int] = None
use_beam_search: bool = False
top_k: Optional[int] = None
min_p: Optional[float] = None
repetition_penalty: Optional[float] = None
length_penalty: float = 1.0
stop_token_ids: Optional[list[int]] = []
include_stop_str_in_output: bool = False
ignore_eos: bool = False
min_tokens: int = 0
skip_special_tokens: bool = True
spaces_between_special_tokens: bool = True
truncate_prompt_tokens: Optional[Annotated[int, Field(ge=1)]] = None
prompt_logprobs: Optional[int] = None
The following extra parameters are supported by default:
echo: bool = Field(
default=False,
description=(
"If true, the new message will be prepended with the last message "
"if they belong to the same role."),
)
add_generation_prompt: bool = Field(
default=True,
description=
("If true, the generation prompt will be added to the chat template. "
"This is a parameter used by chat template in tokenizer config of the "
"model."),
)
continue_final_message: bool = Field(
default=False,
description=
("If this is set, the chat will be formatted so that the final "
"message in the chat is open-ended, without any EOS tokens. The "
"model will continue this message rather than starting a new one. "
"This allows you to \"prefill\" part of the model's response for it. "
"Cannot be used at the same time as `add_generation_prompt`."),
)
add_special_tokens: bool = Field(
default=False,
description=(
"If true, special tokens (e.g. BOS) will be added to the prompt "
"on top of what is added by the chat template. "
"For most models, the chat template takes care of adding the "
"special tokens so this should be set to false (as is the "
"default)."),
)
documents: Optional[list[dict[str, str]]] = Field(
default=None,
description=
("A list of dicts representing documents that will be accessible to "
"the model if it is performing RAG (retrieval-augmented generation)."
" If the template does not support RAG, this argument will have no "
"effect. We recommend that each document should be a dict containing "
"\"title\" and \"text\" keys."),
)
chat_template: Optional[str] = Field(
default=None,
description=(
"A Jinja template to use for this conversion. "
"As of transformers v4.44, default chat template is no longer "
"allowed, so you must provide a chat template if the tokenizer "
"does not define one."),
)
chat_template_kwargs: Optional[dict[str, Any]] = Field(
default=None,
description=("Additional kwargs to pass to the template renderer. "
"Will be accessible by the chat template."),
)
mm_processor_kwargs: Optional[dict[str, Any]] = Field(
default=None,
description=("Additional kwargs to pass to the HF processor."),
)
guided_json: Optional[Union[str, dict, BaseModel]] = Field(
default=None,
description=("If specified, the output will follow the JSON schema."),
)
guided_regex: Optional[str] = Field(
default=None,
description=(
"If specified, the output will follow the regex pattern."),
)
guided_choice: Optional[list[str]] = Field(
default=None,
description=(
"If specified, the output will be exactly one of the choices."),
)
guided_grammar: Optional[str] = Field(
default=None,
description=(
"If specified, the output will follow the context free grammar."),
)
structural_tag: Optional[str] = Field(
default=None,
description=(
"If specified, the output will follow the structural tag schema."),
)
guided_decoding_backend: Optional[str] = Field(
default=None,
description=(
"If specified, will override the default guided decoding backend "
"of the server for this specific request. If set, must be either "
"'outlines' / 'lm-format-enforcer'"),
)
guided_whitespace_pattern: Optional[str] = Field(
default=None,
description=(
"If specified, will override the default whitespace pattern "
"for guided json decoding."),
)
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."),
)
request_id: str = Field(
default_factory=lambda: f"{random_uuid()}",
description=(
"The request_id related to this request. If the caller does "
"not set it, a random_uuid will be generated. This id is used "
"through out the inference process and return in response."),
)
logits_processors: Optional[LogitsProcessors] = Field(
default=None,
description=(
"A list of either qualified names of logits processors, or "
"constructor objects, to apply when sampling. A constructor is "
"a JSON object with a required 'qualname' field specifying the "
"qualified name of the processor class/factory, and optional "
"'args' and 'kwargs' fields containing positional and keyword "
"arguments. For example: {'qualname': "
"'my_module.MyLogitsProcessor', 'args': [1, 2], 'kwargs': "
"{'param': 'value'}}."))
return_tokens_as_token_ids: Optional[bool] = Field(
default=None,
description=(
"If specified with 'logprobs', tokens are represented "
" as strings of the form 'token_id:{token_id}' so that tokens "
"that are not JSON-encodable can be identified."))
cache_salt: Optional[str] = Field(
default=None,
description=(
"If specified, the prefix cache will be salted with the provided "
"string to prevent an attacker to guess prompts in multi-user "
"environments. The salt should be random, protected from "
"access by 3rd parties, and long enough to be "
"unpredictable (e.g., 43 characters base64-encoded, corresponding "
"to 256 bit). Not supported by vLLM engine V0."))
kv_transfer_params: Optional[dict[str, Any]] = Field(
default=None,
description="KVTransfer parameters used for disaggregated serving.")
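The cache_salt description above suggests a random value of about 43 base64 characters (256 bits). One way to produce such a value with the Python standard library is sketched below; `make_cache_salt` is a hypothetical helper, and using the URL-safe base64 alphabet is an assumption, since the field description does not say which alphabet is expected.

```python
import secrets


def make_cache_salt() -> str:
    """Return a random salt: 32 bytes (256 bits) as unpadded URL-safe base64."""
    return secrets.token_urlsafe(32)


salt = make_cache_salt()
print(len(salt))  # 32 bytes encode to 43 characters once padding is stripped

# The salt would then be passed as an extra parameter, for example
# extra_body={"cache_salt": salt} with the OpenAI client.
```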
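Embeddings API
Our Embeddings API is compatible with OpenAI's Embeddings API; you can use the official OpenAI Python client to interact with it.
If the model has a chat template, you can replace inputs with a list of messages (same schema as the Chat API), which will be treated as a single prompt to the model.
Code example: examples/online_serving/openai_embedding_client.py
Multi-modal Inputs
You can pass multi-modal inputs to embedding models by defining a custom chat template for the server and passing a list of messages in the request. Refer to the examples below for illustration.
To serve the model: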
vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
--trust-remote-code \
--max-model-len 4096 \
--chat-template examples/template_vlm2vec.jinja
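Warning
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass --task embed to run this model in embedding mode instead of text generation mode.
The custom chat template is completely different from the original one for this model, and can be found here: examples/template_vlm2vec.jinja
Since the request schema is not defined by the OpenAI client, we post a request to the server using the lower-level requests library: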
import requests

image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

response = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "TIGER-Lab/VLM2Vec-Full",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": "Represent the given image."},
            ],
        }],
        "encoding_format": "float",
    },
)
response.raise_for_status()
response_json = response.json()
print("Embedding output:", response_json["data"][0]["embedding"])
To serve the model:
vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
--trust-remote-code \
--max-model-len 8192 \
--chat-template examples/template_dse_qwen2_vl.jinja
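Warning
Like with VLM2Vec, we have to explicitly pass --task embed.
In addition, MrLight/dse-qwen2-2b-mrl-v1 requires an EOS token for embeddings, which is handled by a custom chat template: examples/template_dse_qwen2_vl.jinja
Warning
MrLight/dse-qwen2-2b-mrl-v1 requires a placeholder image of the minimum image size for text query embeddings. See the full code example below for details.
Full example: examples/online_serving/openai_chat_embedding_client_for_multimodal.py
Extra Parameters
The following pooling parameters are supported.
The following extra parameters are supported by default: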
add_special_tokens: bool = Field(
default=True,
description=(
"If true (the default), special tokens (e.g. BOS) will be added to "
"the prompt."),
)
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."),
)
For chat-like input (i.e. if messages is passed), the following extra parameters are supported:
add_special_tokens: bool = Field(
default=False,
description=(
"If true, special tokens (e.g. BOS) will be added to the prompt "
"on top of what is added by the chat template. "
"For most models, the chat template takes care of adding the "
"special tokens so this should be set to false (as is the "
"default)."),
)
chat_template: Optional[str] = Field(
default=None,
description=(
"A Jinja template to use for this conversion. "
"As of transformers v4.44, default chat template is no longer "
"allowed, so you must provide a chat template if the tokenizer "
"does not define one."),
)
chat_template_kwargs: Optional[dict[str, Any]] = Field(
default=None,
description=("Additional kwargs to pass to the template renderer. "
"Will be accessible by the chat template."),
)
mm_processor_kwargs: Optional[dict[str, Any]] = Field(
default=None,
description=("Additional kwargs to pass to the HF processor."),
)
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."),
)
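Transcriptions API
Our Transcriptions API is compatible with OpenAI's Transcriptions API; you can use the official OpenAI Python client to interact with it.
Note
To use the Transcriptions API, install the extra audio dependencies with pip install vllm[audio].
Code example: examples/online_serving/openai_transcription_client.py
Extra Parameters
The following sampling parameters are supported.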
temperature: float = Field(default=0.0)
"""The sampling temperature, between 0 and 1.
Higher values like 0.8 will make the output more random, while lower values
like 0.2 will make it more focused / deterministic. If set to 0, the model
will use [log probability](https://en.wikipedia.org/wiki/Log_probability)
to automatically increase the temperature until certain thresholds are hit.
"""
top_p: Optional[float] = None
"""Enables nucleus (top-p) sampling, where tokens are selected from the
smallest possible set whose cumulative probability exceeds `p`.
"""
top_k: Optional[int] = None
"""Limits sampling to the `k` most probable tokens at each step."""
min_p: Optional[float] = None
"""Filters out tokens with a probability lower than `min_p`, ensuring a
minimum likelihood threshold during sampling.
"""
seed: Optional[int] = Field(None, ge=_LONG_INFO.min, le=_LONG_INFO.max)
"""The seed to use for sampling."""
frequency_penalty: Optional[float] = 0.0
"""The frequency penalty to use for sampling."""
repetition_penalty: Optional[float] = None
"""The repetition penalty to use for sampling."""
presence_penalty: Optional[float] = 0.0
"""The presence penalty to use for sampling."""
The following extra parameters are supported by default:
stream: Optional[bool] = False
"""Custom field not present in the original OpenAI definition. When set,
it will enable output to be streamed in a similar fashion as the Chat
Completion endpoint.
"""
# Flattened stream option to simplify form data.
stream_include_usage: Optional[bool] = False
stream_continuous_usage_stats: Optional[bool] = False
Tokenizer API
Our Tokenizer API is a simple wrapper over HuggingFace-style tokenizers. It consists of two endpoints:
- /tokenize corresponds to calling tokenizer.encode().
- /detokenize corresponds to calling tokenizer.decode().
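The encode/decode relationship behind the two endpoints can be illustrated with a toy tokenizer. This is a hypothetical whitespace tokenizer written only for this sketch; real models use subword vocabularies, and the class below is not a vLLM or Hugging Face API.

```python
class ToyTokenizer:
    """Hypothetical whitespace tokenizer that assigns IDs in order of first use."""

    def __init__(self):
        self.token_to_id = {}
        self.id_to_token = []

    def encode(self, text):
        """Text -> list of integer token IDs (the role /tokenize plays)."""
        ids = []
        for word in text.split():
            if word not in self.token_to_id:
                self.token_to_id[word] = len(self.id_to_token)
                self.id_to_token.append(word)
            ids.append(self.token_to_id[word])
        return ids

    def decode(self, ids):
        """List of token IDs -> text (the role /detokenize plays)."""
        return " ".join(self.id_to_token[i] for i in ids)


tok = ToyTokenizer()
ids = tok.encode("a robot may not injure a human")
text = tok.decode(ids)
print(ids, text)
```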
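Pooling API
Our Pooling API encodes input prompts using a pooling model and returns the corresponding hidden states.
The input format is the same as for the Embeddings API, but the output data can contain an arbitrarily nested list, not just a 1-D list of floats.
Code example: examples/online_serving/openai_pooling_client.py
Classification API
Our Classification API directly supports Hugging Face sequence-classification models such as ai21labs/Jamba-tiny-reward-dev and jason9693/Qwen2.5-1.5B-apeach.
We automatically wrap any other transformer via as_classification_model(), which pools on the last token, attaches a RowParallelLinear head, and applies a softmax to produce per-class probabilities.
Code example: examples/online_serving/openai_classification_client.py
Example Requests
You can classify multiple texts by passing an array of strings:
Request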
curl -v "http://127.0.0.1:8000/classify" \
-H "Content-Type: application/json" \
-d '{
"model": "jason9693/Qwen2.5-1.5B-apeach",
"input": [
"Loved the new café—coffee was great.",
"This update broke everything. Frustrating."
]
}'
Response
{
"id": "classify-7c87cac407b749a6935d8c7ce2a8fba2",
"object": "list",
"created": 1745383065,
"model": "jason9693/Qwen2.5-1.5B-apeach",
"data": [
{
"index": 0,
"label": "Default",
"probs": [
0.565970778465271,
0.4340292513370514
],
"num_classes": 2
},
{
"index": 1,
"label": "Spoiled",
"probs": [
0.26448777318000793,
0.7355121970176697
],
"num_classes": 2
}
],
"usage": {
"prompt_tokens": 20,
"total_tokens": 20,
"completion_tokens": 0,
"prompt_tokens_details": null
}
}
You can also pass a string directly to the input field:
Request
curl -v "http://127.0.0.1:8000/classify" \
-H "Content-Type: application/json" \
-d '{
"model": "jason9693/Qwen2.5-1.5B-apeach",
"input": "Loved the new café—coffee was great."
}'
Response
{
"id": "classify-9bf17f2847b046c7b2d5495f4b4f9682",
"object": "list",
"created": 1745383213,
"model": "jason9693/Qwen2.5-1.5B-apeach",
"data": [
{
"index": 0,
"label": "Default",
"probs": [
0.565970778465271,
0.4340292513370514
],
"num_classes": 2
}
],
"usage": {
"prompt_tokens": 10,
"total_tokens": 10,
"completion_tokens": 0,
"prompt_tokens_details": null
}
}
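On the client side, the probs list in each response item can be reduced to a predicted class index by taking the position of the largest probability. The sketch below reuses the values from the first response above; `predicted_class` is a hypothetical helper name.

```python
def predicted_class(item: dict) -> int:
    """Return the index of the highest probability in item['probs']."""
    probs = item["probs"]
    return max(range(len(probs)), key=probs.__getitem__)


# The "data" entries copied from the multi-text response shown earlier.
data = [
    {"index": 0, "label": "Default",
     "probs": [0.565970778465271, 0.4340292513370514], "num_classes": 2},
    {"index": 1, "label": "Spoiled",
     "probs": [0.26448777318000793, 0.7355121970176697], "num_classes": 2},
]

summary = [(d["index"], d["label"], predicted_class(d)) for d in data]
print(summary)
```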
Extra Parameters
The following pooling parameters are supported.
The following extra parameters are supported by default:
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."),
)
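Score API
Our Score API can apply a cross-encoder model or an embedding model to predict scores for sentence pairs. When using an embedding model, the score corresponds to the cosine similarity between each embedding pair. Usually, the score for a sentence pair refers to the similarity between the two sentences, on a scale of 0 to 1.
You can find the documentation for cross-encoder models at sbert.net.
Code example: examples/online_serving/openai_cross_encoder_score.py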
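For the embedding-model case, the cosine similarity mentioned above can be written out in a few lines. This is a plain-Python sketch of the formula, not vLLM's implementation, and the example vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical 2-D embeddings: parallel vectors score 1, orthogonal ones 0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```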
Single inference
You can pass a string to both text_1 and text_2, forming a single sentence pair.
Request
curl -X 'POST' \
'http://127.0.0.1:8000/score' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-v2-m3",
"encoding_format": "float",
"text_1": "What is the capital of France?",
"text_2": "The capital of France is Paris."
}'
Response
{
"id": "score-request-id",
"object": "list",
"created": 693447,
"model": "BAAI/bge-reranker-v2-m3",
"data": [
{
"index": 0,
"object": "score",
"score": 1
}
],
"usage": {}
}
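Batch inference
You can pass a string to text_1 and a list to text_2, forming multiple sentence pairs where each pair is built from text_1 and a string in text_2. The total number of pairs is len(text_2).
Request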
curl -X 'POST' \
'http://127.0.0.1:8000/score' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-v2-m3",
"text_1": "What is the capital of France?",
"text_2": [
"The capital of Brazil is Brasilia.",
"The capital of France is Paris."
]
}'
Response
{
"id": "score-request-id",
"object": "list",
"created": 693570,
"model": "BAAI/bge-reranker-v2-m3",
"data": [
{
"index": 0,
"object": "score",
"score": 0.001094818115234375
},
{
"index": 1,
"object": "score",
"score": 1
}
],
"usage": {}
}
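You can pass a list to both text_1 and text_2, forming multiple sentence pairs where each pair is built from a string in text_1 and the corresponding string in text_2 (similar to zip()). The total number of pairs is len(text_2).
Request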
curl -X 'POST' \
'http://127.0.0.1:8000/score' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-v2-m3",
"encoding_format": "float",
"text_1": [
"What is the capital of Brazil?",
"What is the capital of France?"
],
"text_2": [
"The capital of Brazil is Brasilia.",
"The capital of France is Paris."
]
}'
Response
{
"id": "score-request-id",
"object": "list",
"created": 693447,
"model": "BAAI/bge-reranker-v2-m3",
"data": [
{
"index": 0,
"object": "score",
"score": 1
},
{
"index": 1,
"object": "score",
"score": 1
}
],
"usage": {}
}
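The three input shapes described in this section (string with string, string with list, list with list) can be summarized as one pairing rule. The sketch below is an illustration of that rule, not the server's code; `make_pairs` is a hypothetical name, and raising an error on mismatched list lengths is an assumption.

```python
def make_pairs(text_1, text_2):
    """Build the (text_1, text_2) sentence pairs that would be scored."""
    if isinstance(text_2, str):
        text_2 = [text_2]
    if isinstance(text_1, str):
        # One query is paired with every entry of text_2.
        return [(text_1, t2) for t2 in text_2]
    if len(text_1) != len(text_2):
        raise ValueError("text_1 and text_2 lists must have the same length")
    # Two lists are paired element by element, like zip().
    return list(zip(text_1, text_2))


pairs = make_pairs(
    "What is the capital of France?",
    ["The capital of Brazil is Brasilia.", "The capital of France is Paris."],
)
print(len(pairs))  # equals len(text_2)
```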
Extra Parameters
The following pooling parameters are supported.
The following extra parameters are supported by default:
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."),
)
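Re-rank API
Our Re-rank API can apply an embedding model or a cross-encoder model to predict relevance scores between a single query and each document in a list. Usually, the score for a sentence pair refers to the similarity between the two sentences, on a scale of 0 to 1.
You can find the documentation for cross-encoder models at sbert.net.
The rerank endpoints support popular re-rank models such as BAAI/bge-reranker-base and other models supporting the score task. Additionally, the /rerank, /v1/rerank, and /v2/rerank endpoints are compatible with both Jina AI's re-rank API interface and Cohere's re-rank API interface to ensure compatibility with popular open-source tools.
Code example: examples/online_serving/jinaai_rerank_client.py
Example Request
Note that the top_n request parameter is optional and defaults to the length of the documents field. The result documents are sorted by relevance, and the index property can be used to determine the original order.
Request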
curl -X 'POST' \
'http://127.0.0.1:8000/v1/rerank' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-base",
"query": "What is the capital of France?",
"documents": [
"The capital of Brazil is Brasilia.",
"The capital of France is Paris.",
"Horses and cows are both animals"
]
}'
Response
{
"id": "rerank-fae51b2b664d4ed38f5969b612edff77",
"model": "BAAI/bge-reranker-base",
"usage": {
"total_tokens": 56
},
"results": [
{
"index": 1,
"document": {
"text": "The capital of France is Paris."
},
"relevance_score": 0.99853515625
},
{
"index": 0,
"document": {
"text": "The capital of Brazil is Brasilia."
},
"relevance_score": 0.0005860328674316406
}
]
}
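Because results arrive sorted by relevance, a client that needs the documents back in their submitted order can sort on the index field. The sketch below uses the two result entries from the response above; the variable names are illustrative.

```python
# The "results" entries copied from the response shown above.
results = [
    {
        "index": 1,
        "document": {"text": "The capital of France is Paris."},
        "relevance_score": 0.99853515625,
    },
    {
        "index": 0,
        "document": {"text": "The capital of Brazil is Brasilia."},
        "relevance_score": 0.0005860328674316406,
    },
]

# Most relevant document: the first entry, since results are sorted by score.
best = results[0]

# Restore the order in which the documents were submitted.
original_order = sorted(results, key=lambda r: r["index"])

print(best["document"]["text"])
print([r["index"] for r in original_order])
```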
Extra Parameters
The following pooling parameters are supported.
The following extra parameters are supported by default: