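OpenAI-Compatible Server

vLLM provides an HTTP server that implements OpenAI's Completions API, Chat API, and more. This lets you serve models and interact with them using an HTTP client.

In your terminal, install vLLM, then start the server with the vllm serve command. (You can also use our Docker image.)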

vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
  --dtype auto \
  --api-key token-abc123

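To call the server, write a script in your preferred text editor that uses an HTTP client, include any messages you want to send to the model, and run it. Below is an example script using the official OpenAI Python client.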

from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

print(completion.choices[0].message)

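Tip

vLLM supports some parameters that the OpenAI API does not, such as top_k. You can pass them to vLLM through the OpenAI client's extra_body request argument; for example, extra_body={"top_k": 50} sets top_k.

Warning

By default, the server applies generation_config.json from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.

To disable this behavior, pass --generation-config vllm when launching the server.

Supported APIs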

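We currently support the following OpenAI APIs:

  • Completions API
  • Chat API
  • Embeddings API
  • Transcriptions API

In addition, we provide the following custom APIs:

  • Tokenizer API
  • Pooling API
  • Classification API
  • Score API
  • Re-rank API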

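Chat Template

For a language model to support the chat protocol, vLLM requires the model to include a chat template in its tokenizer configuration. The chat template is a Jinja2 template that specifies how roles, messages, and other chat-specific tokens are encoded in the input.

For an example, see the chat template of NousResearch/Meta-Llama-3-8B-Instruct.

Some models do not provide a chat template even though they are instruction/chat fine-tuned. For those models, you can specify the chat template manually with the --chat-template parameter, giving either a file path or the template itself as a string. Without a chat template, the server cannot process chat requests, and every chat request will fail.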

vllm serve <model> --chat-template ./path-to-chat-template.jinja

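The vLLM community provides a set of chat templates for popular models. You can find them under the examples directory.

With the inclusion of multi-modal chat APIs, the OpenAI spec now accepts chat messages in a new format that specifies both a type and a text field. An example is shown below: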

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": [{"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"}]}
    ]
)

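Most chat templates for LLMs expect the content field to be a string, but some newer models, such as meta-llama/Llama-Guard-3-1B, expect the content to be formatted according to the OpenAI schema in the request. vLLM makes a best-effort attempt to detect this automatically and logs a message like "Detected the chat template content format to be...". It then internally converts incoming requests to match the detected format, which can be one of:

  • "string": A string.
    • Example: "Hello world"
  • "openai": A list of dictionaries, similar to the OpenAI schema.
    • Example: [{"type": "text", "text": "Hello world!"}]

If the result is not what you expect, set the --chat-template-content-format CLI argument to override which format to use.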
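
The conversion between the two content formats can be sketched in plain Python. This is a minimal illustration, not vLLM's internal implementation: the helper names to_openai_format and to_string_format are hypothetical, and joining multiple text parts with a newline is an assumption.

```python
from typing import Union

# Message content is either a plain string or a list of OpenAI-style parts.
Content = Union[str, list]


def to_openai_format(content: Content) -> list:
    """Return content as a list of parts (the "openai" format)."""
    if isinstance(content, str):
        # A plain string becomes a single text part.
        return [{"type": "text", "text": content}]
    return list(content)


def to_string_format(content: Content) -> str:
    """Return content as a single string (the "string" format)."""
    if isinstance(content, str):
        return content
    # Assumption: only text parts are kept, joined with newlines.
    texts = [part["text"] for part in content if part.get("type") == "text"]
    return "\n".join(texts)


as_parts = to_openai_format("Hello world!")
as_text = to_string_format([{"type": "text", "text": "Hello world"}])
```

Either helper leaves content untouched when it is already in the target format, so a request can be normalized without first checking which format the client used.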

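Extra Parameters

vLLM supports a set of parameters that are not part of the OpenAI API. To use them, pass them as extra parameters in the OpenAI client. If you call the HTTP API directly, merge them into the JSON payload instead.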

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
    ],
    extra_body={
        "guided_choice": ["positive", "negative"]
    }
)
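
For direct HTTP calls, the same request might be built as follows. This is a sketch using only the Python standard library; the helper name build_chat_request is hypothetical, and it assumes the server from the first example (localhost:8000, API key token-abc123, sent as a Bearer token). Note that guided_choice sits at the top level of the JSON payload rather than under an extra_body key.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, messages, **extra_params):
    """Build a POST request whose JSON body merges vLLM-specific parameters."""
    payload = {"model": model, "messages": messages}
    # Extra parameters are merged directly into the top-level payload.
    payload.update(extra_params)
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000/v1",
    "token-abc123",
    "NousResearch/Meta-Llama-3-8B-Instruct",
    [{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}],
    guided_choice=["positive", "negative"],
)
body = json.loads(request.data)
```

With the server running, the request can be sent with urllib.request.urlopen(request) and the response body parsed with json.load.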

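Extra HTTP Headers

Only the X-Request-Id HTTP request header is supported for now. It can be enabled with --enable-request-id-headers.

Note that enabling HTTP headers can significantly impact performance at high QPS (queries per second). We therefore recommend implementing HTTP headers at the router level (e.g. via Istio) rather than within the vLLM layer. See the related PR for more details.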

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
    ],
    extra_headers={
        "x-request-id": "sentiment-classification-00001",
    }
)
print(completion._request_id)

completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="A robot may not injure a human being",
    extra_headers={
        "x-request-id": "completion-test",
    }
)
print(completion._request_id)

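API Reference

Completions API

Our Completions API is compatible with OpenAI's Completions API; you can use the official OpenAI Python client to interact with it.

Code example: examples/online_serving/openai_completion_client.py

Extra parameters

The following sampling parameters are supported: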

    use_beam_search: bool = False
    top_k: Optional[int] = None
    min_p: Optional[float] = None
    repetition_penalty: Optional[float] = None
    length_penalty: float = 1.0
    stop_token_ids: Optional[list[int]] = []
    include_stop_str_in_output: bool = False
    ignore_eos: bool = False
    min_tokens: int = 0
    skip_special_tokens: bool = True
    spaces_between_special_tokens: bool = True
    truncate_prompt_tokens: Optional[Annotated[int, Field(ge=1)]] = None
    allowed_token_ids: Optional[list[int]] = None
    prompt_logprobs: Optional[int] = None

The following extra parameters are supported by default:

    add_special_tokens: bool = Field(
        default=True,
        description=(
            "If true (the default), special tokens (e.g. BOS) will be added to "
            "the prompt."),
    )
    response_format: Optional[AnyResponseFormat] = Field(
        default=None,
        description=(
            "Similar to chat completion, this parameter specifies the format "
            "of output. Only {'type': 'json_object'}, {'type': 'json_schema'}"
            ", {'type': 'structural_tag'}, or {'type': 'text' } is supported."
        ),
    )
    guided_json: Optional[Union[str, dict, BaseModel]] = Field(
        default=None,
        description="If specified, the output will follow the JSON schema.",
    )
    guided_regex: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the output will follow the regex pattern."),
    )
    guided_choice: Optional[list[str]] = Field(
        default=None,
        description=(
            "If specified, the output will be exactly one of the choices."),
    )
    guided_grammar: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the output will follow the context free grammar."),
    )
    guided_decoding_backend: Optional[str] = Field(
        default=None,
        description=(
            "If specified, will override the default guided decoding backend "
            "of the server for this specific request. If set, must be one of "
            "'outlines' / 'lm-format-enforcer'"),
    )
    guided_whitespace_pattern: Optional[str] = Field(
        default=None,
        description=(
            "If specified, will override the default whitespace pattern "
            "for guided json decoding."),
    )
    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )
    logits_processors: Optional[LogitsProcessors] = Field(
        default=None,
        description=(
            "A list of either qualified names of logits processors, or "
            "constructor objects, to apply when sampling. A constructor is "
            "a JSON object with a required 'qualname' field specifying the "
            "qualified name of the processor class/factory, and optional "
            "'args' and 'kwargs' fields containing positional and keyword "
            "arguments. For example: {'qualname': "
            "'my_module.MyLogitsProcessor', 'args': [1, 2], 'kwargs': "
            "{'param': 'value'}}."))

    return_tokens_as_token_ids: Optional[bool] = Field(
        default=None,
        description=(
            "If specified with 'logprobs', tokens are represented "
            " as strings of the form 'token_id:{token_id}' so that tokens "
            "that are not JSON-encodable can be identified."))

    kv_transfer_params: Optional[dict[str, Any]] = Field(
        default=None,
        description="KVTransfer parameters used for disaggregated serving.")

Chat API

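Our Chat API is compatible with OpenAI's Chat Completions API; you can use the official OpenAI Python client to interact with it.

We support both Vision- and Audio-related parameters; see our Multimodal Inputs guide for more information.

  • Note: the image_url.detail parameter is not supported.

Code example: examples/online_serving/openai_chat_completion_client.py

Extra parameters

The following sampling parameters are supported: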

    best_of: Optional[int] = None
    use_beam_search: bool = False
    top_k: Optional[int] = None
    min_p: Optional[float] = None
    repetition_penalty: Optional[float] = None
    length_penalty: float = 1.0
    stop_token_ids: Optional[list[int]] = []
    include_stop_str_in_output: bool = False
    ignore_eos: bool = False
    min_tokens: int = 0
    skip_special_tokens: bool = True
    spaces_between_special_tokens: bool = True
    truncate_prompt_tokens: Optional[Annotated[int, Field(ge=1)]] = None
    prompt_logprobs: Optional[int] = None

The following extra parameters are supported by default:

    echo: bool = Field(
        default=False,
        description=(
            "If true, the new message will be prepended with the last message "
            "if they belong to the same role."),
    )
    add_generation_prompt: bool = Field(
        default=True,
        description=
        ("If true, the generation prompt will be added to the chat template. "
         "This is a parameter used by chat template in tokenizer config of the "
         "model."),
    )
    continue_final_message: bool = Field(
        default=False,
        description=
        ("If this is set, the chat will be formatted so that the final "
         "message in the chat is open-ended, without any EOS tokens. The "
         "model will continue this message rather than starting a new one. "
         "This allows you to \"prefill\" part of the model's response for it. "
         "Cannot be used at the same time as `add_generation_prompt`."),
    )
    add_special_tokens: bool = Field(
        default=False,
        description=(
            "If true, special tokens (e.g. BOS) will be added to the prompt "
            "on top of what is added by the chat template. "
            "For most models, the chat template takes care of adding the "
            "special tokens so this should be set to false (as is the "
            "default)."),
    )
    documents: Optional[list[dict[str, str]]] = Field(
        default=None,
        description=
        ("A list of dicts representing documents that will be accessible to "
         "the model if it is performing RAG (retrieval-augmented generation)."
         " If the template does not support RAG, this argument will have no "
         "effect. We recommend that each document should be a dict containing "
         "\"title\" and \"text\" keys."),
    )
    chat_template: Optional[str] = Field(
        default=None,
        description=(
            "A Jinja template to use for this conversion. "
            "As of transformers v4.44, default chat template is no longer "
            "allowed, so you must provide a chat template if the tokenizer "
            "does not define one."),
    )
    chat_template_kwargs: Optional[dict[str, Any]] = Field(
        default=None,
        description=("Additional kwargs to pass to the template renderer. "
                     "Will be accessible by the chat template."),
    )
    mm_processor_kwargs: Optional[dict[str, Any]] = Field(
        default=None,
        description=("Additional kwargs to pass to the HF processor."),
    )
    guided_json: Optional[Union[str, dict, BaseModel]] = Field(
        default=None,
        description=("If specified, the output will follow the JSON schema."),
    )
    guided_regex: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the output will follow the regex pattern."),
    )
    guided_choice: Optional[list[str]] = Field(
        default=None,
        description=(
            "If specified, the output will be exactly one of the choices."),
    )
    guided_grammar: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the output will follow the context free grammar."),
    )
    structural_tag: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the output will follow the structural tag schema."),
    )
    guided_decoding_backend: Optional[str] = Field(
        default=None,
        description=(
            "If specified, will override the default guided decoding backend "
            "of the server for this specific request. If set, must be either "
            "'outlines' / 'lm-format-enforcer'"),
    )
    guided_whitespace_pattern: Optional[str] = Field(
        default=None,
        description=(
            "If specified, will override the default whitespace pattern "
            "for guided json decoding."),
    )
    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )
    request_id: str = Field(
        default_factory=lambda: f"{random_uuid()}",
        description=(
            "The request_id related to this request. If the caller does "
            "not set it, a random_uuid will be generated. This id is used "
            "through out the inference process and return in response."),
    )
    logits_processors: Optional[LogitsProcessors] = Field(
        default=None,
        description=(
            "A list of either qualified names of logits processors, or "
            "constructor objects, to apply when sampling. A constructor is "
            "a JSON object with a required 'qualname' field specifying the "
            "qualified name of the processor class/factory, and optional "
            "'args' and 'kwargs' fields containing positional and keyword "
            "arguments. For example: {'qualname': "
            "'my_module.MyLogitsProcessor', 'args': [1, 2], 'kwargs': "
            "{'param': 'value'}}."))
    return_tokens_as_token_ids: Optional[bool] = Field(
        default=None,
        description=(
            "If specified with 'logprobs', tokens are represented "
            " as strings of the form 'token_id:{token_id}' so that tokens "
            "that are not JSON-encodable can be identified."))
    cache_salt: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the prefix cache will be salted with the provided "
            "string to prevent an attacker to guess prompts in multi-user "
            "environments. The salt should be random, protected from "
            "access by 3rd parties, and long enough to be "
            "unpredictable (e.g., 43 characters base64-encoded, corresponding "
            "to 256 bit). Not supported by vLLM engine V0."))
    kv_transfer_params: Optional[dict[str, Any]] = Field(
        default=None,
        description="KVTransfer parameters used for disaggregated serving.")
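
The cache_salt description above suggests a 43-character base64-encoded salt carrying 256 bits. One way to generate such a value is sketched below, assuming Python's standard secrets module is an acceptable source of randomness; the helper name new_cache_salt is hypothetical.

```python
import secrets


def new_cache_salt() -> str:
    """Return a random URL-safe base64 string carrying 256 bits of entropy."""
    # 32 random bytes = 256 bits; unpadded URL-safe base64 yields 43 characters.
    return secrets.token_urlsafe(32)


salt = new_cache_salt()
# The salt can then be passed as an extra parameter on each request.
extra_body = {"cache_salt": salt}
```

Requests that should share prefix-cache entries need the same salt, so in practice the value would be generated once per user or tenant and stored securely rather than regenerated per request.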

Embeddings API

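Our Embeddings API is compatible with OpenAI's Embeddings API; you can use the official OpenAI Python client to interact with it.

If the model has a chat template, you can replace inputs with a list of messages (same schema as the Chat API), which will be treated as a single prompt to the model.

Code example: examples/online_serving/openai_embedding_client.py

Multi-modal inputs

You can pass multi-modal inputs to embedding models by defining a custom chat template for the server and passing a list of messages in the request. The following examples illustrate this.

To serve TIGER-Lab/VLM2Vec-Full: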

vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
  --trust-remote-code \
  --max-model-len 4096 \
  --chat-template examples/template_vlm2vec.jinja

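Warning

Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass --task embed to run this model in embedding mode instead of text generation mode.

The custom chat template is completely different from the original one for this model, and can be found here: examples/template_vlm2vec.jinja

Since the request schema is not defined by the OpenAI client, we post a request to the server using the lower-level requests library: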

import requests

image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

response = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "TIGER-Lab/VLM2Vec-Full",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": "Represent the given image."},
            ],
        }],
        "encoding_format": "float",
    },
)
response.raise_for_status()
response_json = response.json()
print("Embedding output:", response_json["data"][0]["embedding"])

To serve MrLight/dse-qwen2-2b-mrl-v1:

vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
  --trust-remote-code \
  --max-model-len 8192 \
  --chat-template examples/template_dse_qwen2_vl.jinja

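Warning

As with VLM2Vec, we have to explicitly pass --task embed.

Additionally, MrLight/dse-qwen2-2b-mrl-v1 requires an EOS token for embeddings, which is handled by a custom chat template: examples/template_dse_qwen2_vl.jinja

Warning

MrLight/dse-qwen2-2b-mrl-v1 requires a placeholder image of the minimum image size for text query embeddings. See the full code example below for details.

Full example: examples/online_serving/openai_chat_embedding_client_for_multimodal.py

Extra parameters

The following pooling parameters are supported: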

    additional_data: Optional[Any] = None

The following extra parameters are supported by default:

    add_special_tokens: bool = Field(
        default=True,
        description=(
            "If true (the default), special tokens (e.g. BOS) will be added to "
            "the prompt."),
    )
    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )

For chat-like input (i.e. if messages is passed), the following extra parameters are supported instead:

    add_special_tokens: bool = Field(
        default=False,
        description=(
            "If true, special tokens (e.g. BOS) will be added to the prompt "
            "on top of what is added by the chat template. "
            "For most models, the chat template takes care of adding the "
            "special tokens so this should be set to false (as is the "
            "default)."),
    )
    chat_template: Optional[str] = Field(
        default=None,
        description=(
            "A Jinja template to use for this conversion. "
            "As of transformers v4.44, default chat template is no longer "
            "allowed, so you must provide a chat template if the tokenizer "
            "does not define one."),
    )
    chat_template_kwargs: Optional[dict[str, Any]] = Field(
        default=None,
        description=("Additional kwargs to pass to the template renderer. "
                     "Will be accessible by the chat template."),
    )
    mm_processor_kwargs: Optional[dict[str, Any]] = Field(
        default=None,
        description=("Additional kwargs to pass to the HF processor."),
    )
    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )

Transcriptions API

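Our Transcriptions API is compatible with OpenAI's Transcriptions API; you can use the official OpenAI Python client to interact with it.

Note

To use the Transcriptions API, install the additional audio dependencies with pip install vllm[audio].

Code example: examples/online_serving/openai_transcription_client.py

Extra parameters

The following sampling parameters are supported: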

    temperature: float = Field(default=0.0)
    """The sampling temperature, between 0 and 1.

    Higher values like 0.8 will make the output more random, while lower values
    like 0.2 will make it more focused / deterministic. If set to 0, the model
    will use [log probability](https://en.wikipedia.org/wiki/Log_probability)
    to automatically increase the temperature until certain thresholds are hit.
    """

    top_p: Optional[float] = None
    """Enables nucleus (top-p) sampling, where tokens are selected from the
    smallest possible set whose cumulative probability exceeds `p`.
    """

    top_k: Optional[int] = None
    """Limits sampling to the `k` most probable tokens at each step."""

    min_p: Optional[float] = None
    """Filters out tokens with a probability lower than `min_p`, ensuring a
    minimum likelihood threshold during sampling.
    """

    seed: Optional[int] = Field(None, ge=_LONG_INFO.min, le=_LONG_INFO.max)
    """The seed to use for sampling."""

    frequency_penalty: Optional[float] = 0.0
    """The frequency penalty to use for sampling."""

    repetition_penalty: Optional[float] = None
    """The repetition penalty to use for sampling."""

    presence_penalty: Optional[float] = 0.0
    """The presence penalty to use for sampling."""

The following extra parameters are supported by default:

    stream: Optional[bool] = False
    """Custom field not present in the original OpenAI definition. When set,
    it will enable output to be streamed in a similar fashion as the Chat
    Completion endpoint.
    """
    # Flattened stream option to simplify form data.
    stream_include_usage: Optional[bool] = False
    stream_continuous_usage_stats: Optional[bool] = False

Tokenizer API

Our Tokenizer API is a simple wrapper over HuggingFace-style tokenizers. It consists of two endpoints:

  • /tokenize corresponds to calling tokenizer.encode().
  • /detokenize corresponds to calling tokenizer.decode().
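
A round trip through the two endpoints might look like the sketch below, which uses only the Python standard library. The request field names ("prompt" for /tokenize, "tokens" for /detokenize) and the server address are assumptions not stated in this section, so check them against your vLLM version; the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: default local server address
MODEL = "NousResearch/Meta-Llama-3-8B-Instruct"


def json_post(path, payload):
    """Build a JSON POST request for the given endpoint path."""
    return urllib.request.Request(
        url=BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokenize_request(text):
    # Assumed schema: the text to encode is sent in a "prompt" field.
    return json_post("/tokenize", {"model": MODEL, "prompt": text})


def detokenize_request(token_ids):
    # Assumed schema: the ids to decode are sent in a "tokens" field.
    return json_post("/detokenize", {"model": MODEL, "tokens": token_ids})


def send(request):
    """Send a prepared request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


encode_req = tokenize_request("Hello!")
decode_req = detokenize_request([1, 2, 3])
```

Calling send(encode_req) against a running server would return the token IDs, which can then be passed to detokenize_request to recover the text.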

Pooling API

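Our Pooling API encodes input prompts using a pooling model and returns the corresponding hidden states.

The input format is the same as for the Embeddings API, but the output data can contain an arbitrary nested list, not just a 1-D list of floats.

Code example: examples/online_serving/openai_pooling_client.py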

Classification API

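Our Classification API directly supports Hugging Face sequence-classification models such as ai21labs/Jamba-tiny-reward-dev and jason9693/Qwen2.5-1.5B-apeach.

We automatically wrap any other transformer via as_classification_model(), which pools on the last token, attaches a RowParallelLinear head, and applies a softmax to produce per-class probabilities.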
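
The final softmax step can be illustrated in plain Python. This sketch covers only the conversion from head outputs (logits) to per-class probabilities; the logits and label names are hypothetical, and pooling and the linear head are not modeled.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def classify(logits, labels):
    """Pick the most probable label and report all class probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "probs": probs, "num_classes": len(probs)}


# Hypothetical two-class logits produced by a classification head.
result = classify([-0.5, 0.5], ["negative", "positive"])
```

The returned dictionary mirrors the label, probs, and num_classes fields that appear in each entry of a classification response.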

Code example: examples/online_serving/openai_classification_client.py

Example requests

You can classify multiple texts by passing an array of strings:

Request

curl -v "http://127.0.0.1:8000/classify" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jason9693/Qwen2.5-1.5B-apeach",
    "input": [
      "Loved the new café—coffee was great.",
      "This update broke everything. Frustrating."
    ]
  }'

Response

{
  "id": "classify-7c87cac407b749a6935d8c7ce2a8fba2",
  "object": "list",
  "created": 1745383065,
  "model": "jason9693/Qwen2.5-1.5B-apeach",
  "data": [
    {
      "index": 0,
      "label": "Default",
      "probs": [
        0.565970778465271,
        0.4340292513370514
      ],
      "num_classes": 2
    },
    {
      "index": 1,
      "label": "Spoiled",
      "probs": [
        0.26448777318000793,
        0.7355121970176697
      ],
      "num_classes": 2
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "total_tokens": 20,
    "completion_tokens": 0,
    "prompt_tokens_details": null
  }
}

You can also pass a string directly to the input field:

Request

curl -v "http://127.0.0.1:8000/classify" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jason9693/Qwen2.5-1.5B-apeach",
    "input": "Loved the new café—coffee was great."
  }'

Response

{
  "id": "classify-9bf17f2847b046c7b2d5495f4b4f9682",
  "object": "list",
  "created": 1745383213,
  "model": "jason9693/Qwen2.5-1.5B-apeach",
  "data": [
    {
      "index": 0,
      "label": "Default",
      "probs": [
        0.565970778465271,
        0.4340292513370514
      ],
      "num_classes": 2
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "total_tokens": 10,
    "completion_tokens": 0,
    "prompt_tokens_details": null
  }
}

Extra parameters

The following pooling parameters are supported:

    additional_data: Optional[Any] = None

The following extra parameters are supported by default:

    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )

Score API

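Our Score API can apply a cross-encoder model or an embedding model to predict scores for sentence pairs. When using an embedding model, the score corresponds to the cosine similarity between each embedding pair. Usually, the score for a sentence pair refers to the similarity between the two sentences, on a scale of 0 to 1.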
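
For reference, the cosine similarity between two embedding vectors can be computed as in the sketch below. It is a plain-Python illustration of the formula, not vLLM's implementation, and the example vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Hypothetical 2-D embeddings: parallel vectors and orthogonal vectors.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Vectors pointing in the same direction score close to 1 regardless of their magnitude, while orthogonal vectors score 0.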

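You can find the documentation for cross-encoder models at sbert.net.

Code example: examples/online_serving/openai_cross_encoder_score.py

Single inference

You can pass a string to both text_1 and text_2, forming a single sentence pair.

Request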

curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "encoding_format": "float",
  "text_1": "What is the capital of France?",
  "text_2": "The capital of France is Paris."
}'

Response

{
  "id": "score-request-id",
  "object": "list",
  "created": 693447,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}

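Batch inference

You can pass a string to text_1 and a list to text_2, forming multiple sentence pairs where each pair is built from text_1 and a string in text_2. The total number of pairs is len(text_2).

Request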

curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "text_1": "What is the capital of France?",
  "text_2": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris."
  ]
}'

Response

{
  "id": "score-request-id",
  "object": "list",
  "created": 693570,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 0.001094818115234375
    },
    {
      "index": 1,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}

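You can pass a list to both text_1 and text_2, forming multiple sentence pairs where each pair is built from a string in text_1 and the corresponding string in text_2 (similar to zip()). The total number of pairs is len(text_2).

Request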

curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "encoding_format": "float",
  "text_1": [
    "What is the capital of Brazil?",
    "What is the capital of France?"
  ],
  "text_2": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris."
  ]
}'

Response

{
  "id": "score-request-id",
  "object": "list",
  "created": 693447,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 1
    },
    {
      "index": 1,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}
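
The three pairing modes covered in this section (string with string, string with list, list with list) can be summarized by the sketch below. The function name make_pairs is hypothetical, and raising an error for unsupported or mismatched inputs is an assumption about behavior this page does not specify.

```python
def make_pairs(text_1, text_2):
    """Return the (text_1, text_2) sentence pairs a score request would cover."""
    if isinstance(text_1, str):
        if isinstance(text_2, str):
            # Single inference: exactly one pair.
            return [(text_1, text_2)]
        # One-to-many: text_1 is paired with every string in text_2.
        return [(text_1, candidate) for candidate in text_2]
    if isinstance(text_2, str):
        # Assumption: a list in text_1 with a single string in text_2 is rejected.
        raise ValueError("text_2 must be a list when text_1 is a list")
    if len(text_1) != len(text_2):
        # Assumption: element-wise pairing needs lists of equal length.
        raise ValueError("text_1 and text_2 must have the same length")
    # Many-to-many: element-wise pairing, similar to zip().
    return list(zip(text_1, text_2))


one_to_many = make_pairs(
    "What is the capital of France?",
    ["The capital of Brazil is Brasilia.", "The capital of France is Paris."],
)
many_to_many = make_pairs(["q1", "q2"], ["d1", "d2"])
```

In every supported mode the number of pairs equals the number of score entries in the response, with index giving each pair's position.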

Extra parameters

The following pooling parameters are supported:

    additional_data: Optional[Any] = None

The following extra parameters are supported by default:

    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )

Re-rank API

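Our Re-rank API can apply an embedding model or a cross-encoder model to predict relevance scores between a single query and each document in a list. Usually, the score for a sentence pair refers to the similarity between the two sentences, on a scale of 0 to 1.

You can find the documentation for cross-encoder models at sbert.net.

The rerank endpoints support popular re-rank models such as BAAI/bge-reranker-base and other models supporting the score task. Additionally, the /rerank, /v1/rerank, and /v2/rerank endpoints are compatible with both Jina AI's re-rank API interface and Cohere's re-rank API interface to ensure compatibility with popular open-source tools.

Code example: examples/online_serving/jinaai_rerank_client.py

Example request

Note that the top_n request parameter is optional and defaults to the length of the documents field. Result documents are sorted by relevance, and the index property can be used to determine the original order.

Request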

curl -X 'POST' \
  'http://127.0.0.1:8000/v1/rerank' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-base",
  "query": "What is the capital of France?",
  "documents": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris.",
    "Horses and cows are both animals"
  ]
}'

Response

{
  "id": "rerank-fae51b2b664d4ed38f5969b612edff77",
  "model": "BAAI/bge-reranker-base",
  "usage": {
    "total_tokens": 56
  },
  "results": [
    {
      "index": 1,
      "document": {
        "text": "The capital of France is Paris."
      },
      "relevance_score": 0.99853515625
    },
    {
      "index": 0,
      "document": {
        "text": "The capital of Brazil is Brasilia."
      },
      "relevance_score": 0.0005860328674316406
    }
  ]
}
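
The ordering rules (results sorted by relevance, top_n defaulting to the number of documents, index pointing back to the original position) can be sketched as follows. The function name rank_documents is hypothetical and the third score is made up for illustration; in the real API the scores come from the model.

```python
def rank_documents(documents, scores, top_n=None):
    """Sort documents by descending score and keep the top_n best."""
    if top_n is None:
        # By default every document is returned.
        top_n = len(documents)
    order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [
        {
            "index": i,  # position in the original documents list
            "document": {"text": documents[i]},
            "relevance_score": scores[i],
        }
        for i in order[:top_n]
    ]


documents = [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris.",
    "Horses and cows are both animals",
]
# The first two scores match the sample response; the third is hypothetical.
scores = [0.0005860328674316406, 0.99853515625, 0.0001]

all_results = rank_documents(documents, scores)
top_two = rank_documents(documents, scores, top_n=2)
```

Reading index from each result lets a client map the ranked entries back to the documents it originally submitted.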

Extra parameters

The following pooling parameters are supported:

    additional_data: Optional[Any] = None

The following extra parameters are supported by default:

    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )