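OpenAI-Compatible Server¶

vLLM provides an HTTP server that implements OpenAI's Completions API, Chat API, and more. This lets you serve models and interact with them using an HTTP client.

In your terminal, you can install vLLM and then start the server with the vllm serve command. (You can also use our Docker image.)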

vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
  --dtype auto \
  --api-key token-abc123

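To call the server, create a script in your preferred text editor that uses an HTTP client. Include any messages you want to send to the model, then run the script. Below is an example script using the official OpenAI Python client.

Code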

from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

print(completion.choices[0].message)

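Tip

vLLM supports some parameters that OpenAI does not, such as top_k. You can pass these to vLLM through the OpenAI client in the extra_body parameter of your request, for example extra_body={"top_k": 50} for top_k.

Important

By default, the server applies generation_config.json from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.

To disable this behavior, pass --generation-config vllm when launching the server.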

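Supported APIs¶

We currently support the following OpenAI APIs:

Completions API (/v1/completions)
- Only applicable to text generation models (--task generate).
- Note: the suffix parameter is not supported.
Chat Completions API (/v1/chat/completions)
- Only applicable to text generation models (--task generate) with a chat template.
- Note: the parallel_tool_calls and user parameters are ignored.
Embeddings API (/v1/embeddings)
- Only applicable to embedding models (--task embed).
Transcriptions API (/v1/audio/transcriptions)
- Only applicable to Automatic Speech Recognition (ASR) models (OpenAI Whisper) (--task generate).
Translations API (/v1/audio/translations)
- Only applicable to Automatic Speech Recognition (ASR) models (OpenAI Whisper) (--task generate).

In addition, we have the following custom APIs:

Tokenizer API (/tokenize, /detokenize)
- Applicable to any model with a tokenizer.
Pooling API (/pooling)
- Applicable to all pooling models.
Classification API (/classify)
- Only applicable to classification models (--task classify).
Score API (/score)
- Applicable to embedding models and cross-encoder models (--task score).
Re-rank API (/rerank, /v1/rerank, /v2/rerank)
- Implements Jina AI's v1 re-rank API.
- Also compatible with Cohere's v1 and v2 re-rank APIs.
- Jina's and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
- Only applicable to cross-encoder models (--task score).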
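Chat Template¶

For a language model to support the chat protocol, vLLM requires the model to include a chat template in its tokenizer configuration. The chat template is a Jinja2 template that specifies how roles, messages, and other chat-specific tokens are encoded in the input.

An example chat template for NousResearch/Meta-Llama-3-8B-Instruct can be found here.

Some models do not provide a chat template even though they are instruction/chat fine-tuned. For those models, you can specify the chat template manually with the --chat-template parameter, giving either a file path or the template itself as a string. Without a chat template, the server cannot process chat, and all chat requests will fail with an error.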

vllm serve <model> --chat-template ./path-to-chat-template.jinja

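The vLLM community provides a set of chat templates for popular models. You can find them under the examples directory.

With the introduction of the multi-modal chat API, the OpenAI spec now accepts chat messages in a new format that specifies both a type and a text field. An example is shown below: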

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": [{"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"}]}
    ]
)

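Most chat templates for LLMs expect the content field to be a string, but some newer models, such as meta-llama/Llama-Guard-3-1B, expect the content to be formatted according to the OpenAI schema in the request. vLLM makes a best-effort attempt to detect this automatically, logs the result as a string like "Detected the chat template content format to be...", and internally converts incoming requests to match the detected format, which can be one of:

"string": A string.
- Example: "Hello world"
"openai": A list of dictionaries, similar to the OpenAI schema.
- Example: [{"type": "text", "text": "Hello world!"}]

If the result is not what you expect, you can set the --chat-template-content-format CLI argument to override which format to use.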
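The difference between the two formats can be illustrated with a small conversion sketch. This is not vLLM's implementation: the helper names to_openai_content and to_string_content are hypothetical, and joining several text parts with a newline is an assumption made for this example only.

```python
def to_openai_content(content):
    """Return content in the "openai" format: a list of typed parts."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return list(content)


def to_string_content(content):
    """Return content in the "string" format: one plain string."""
    if isinstance(content, str):
        return content
    texts = []
    for part in content:
        # Only text parts can be flattened; images or audio have no string form.
        if part.get("type") != "text":
            raise ValueError("only text parts can be flattened into a string")
        texts.append(part["text"])
    # Assumption for this sketch: multiple text parts are joined by a newline.
    return "\n".join(texts)


as_parts = to_openai_content("Hello world!")
as_text = to_string_content([{"type": "text", "text": "Hello world"}])
print(as_parts)
print(as_text)
```

Either shape can be sent in the content field of a chat message; the server converts it to whichever format the chat template was detected (or configured) to expect.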

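Extra Parameters¶

vLLM supports a set of parameters that are not part of the OpenAI API. To use them, pass them as extra parameters in the OpenAI client. If you call the HTTP endpoint directly, merge them into the JSON payload instead.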

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
    ],
    extra_body={
        "guided_choice": ["positive", "negative"]
    }
)
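For the direct-HTTP case, the same request might look like the sketch below, which uses only the Python standard library. The helper build_chat_request is hypothetical, the URL assumes a server listening on localhost:8000, and the Bearer authorization header mirrors how the OpenAI client normally sends the API key. The request is built but not sent, so the snippet runs without a server.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, messages, **extra_params):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {"model": model, "messages": messages}
    # vLLM-specific parameters sit at the top level, next to the standard fields.
    payload.update(extra_params)
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000/v1",  # assumed server address
    "token-abc123",
    "NousResearch/Meta-Llama-3-8B-Instruct",
    [{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}],
    guided_choice=["positive", "negative"],
)
body = json.loads(request.data)
print(sorted(body))
# With a server running, send it with: urllib.request.urlopen(request)
```

Note that there is no extra_body wrapper in the raw payload: extra_body is a feature of the OpenAI client, which merges its contents into the request body before sending.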

Extra HTTP Headers¶

Only the X-Request-Id HTTP request header is supported for now. It can be enabled with --enable-request-id-headers.

Code

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
    ],
    extra_headers={
        "x-request-id": "sentiment-classification-00001",
    }
)
print(completion._request_id)

completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="A robot may not injure a human being",
    extra_headers={
        "x-request-id": "completion-test",
    }
)
print(completion._request_id)

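API Reference¶

Completions API¶

Our Completions API is compatible with OpenAI's Completions API; you can use the official OpenAI Python client to interact with it.

Code example: examples/online_serving/openai_completion_client.py

Extra parameters¶

The following sampling parameters are supported.

Code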

    use_beam_search: bool = False
    top_k: Optional[int] = None
    min_p: Optional[float] = None
    repetition_penalty: Optional[float] = None
    length_penalty: float = 1.0
    stop_token_ids: Optional[list[int]] = []
    include_stop_str_in_output: bool = False
    ignore_eos: bool = False
    min_tokens: int = 0
    skip_special_tokens: bool = True
    spaces_between_special_tokens: bool = True
    truncate_prompt_tokens: Optional[Annotated[int, Field(ge=1)]] = None
    allowed_token_ids: Optional[list[int]] = None
    prompt_logprobs: Optional[int] = None
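As a rough illustration of how these fields might be used from a client, the sketch below collects vLLM-specific sampling parameters into a dictionary suitable for the OpenAI client's extra_body argument. The helper vllm_sampling_extras and its name check are hypothetical conveniences; the server performs its own validation.

```python
# Names copied from the sampling parameter list above.
VLLM_SAMPLING_FIELDS = {
    "use_beam_search", "top_k", "min_p", "repetition_penalty",
    "length_penalty", "stop_token_ids", "include_stop_str_in_output",
    "ignore_eos", "min_tokens", "skip_special_tokens",
    "spaces_between_special_tokens", "truncate_prompt_tokens",
    "allowed_token_ids", "prompt_logprobs",
}


def vllm_sampling_extras(**params):
    """Return a dict for extra_body, rejecting misspelled parameter names."""
    unknown = set(params) - VLLM_SAMPLING_FIELDS
    if unknown:
        raise ValueError(f"unknown sampling parameters: {sorted(unknown)}")
    return dict(params)


extra_body = vllm_sampling_extras(top_k=50, min_p=0.05, repetition_penalty=1.1)
print(extra_body)
# Usage with the OpenAI client (requires a running server):
# client.completions.create(model=..., prompt=..., extra_body=extra_body)
```

Checking names on the client side catches typos early; standard OpenAI parameters such as temperature are still passed as ordinary keyword arguments rather than through extra_body.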

The following extra parameters are supported:

Code

    add_special_tokens: bool = Field(
        default=True,
        description=(
            "If true (the default), special tokens (e.g. BOS) will be added to "
            "the prompt."),
    )
    response_format: Optional[AnyResponseFormat] = Field(
        default=None,
        description=(
            "Similar to chat completion, this parameter specifies the format "
            "of output. Only {'type': 'json_object'}, {'type': 'json_schema'}"
            ", {'type': 'structural_tag'}, or {'type': 'text' } is supported."
        ),
    )
    guided_json: Optional[Union[str, dict, BaseModel]] = Field(
        default=None,
        description="If specified, the output will follow the JSON schema.",
    )
    guided_regex: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the output will follow the regex pattern."),
    )
    guided_choice: Optional[list[str]] = Field(
        default=None,
        description=(
            "If specified, the output will be exactly one of the choices."),
    )
    guided_grammar: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the output will follow the context free grammar."),
    )
    guided_decoding_backend: Optional[str] = Field(
        default=None,
        description=(
            "If specified, will override the default guided decoding backend "
            "of the server for this specific request. If set, must be one of "
            "'outlines' / 'lm-format-enforcer'"),
    )
    guided_whitespace_pattern: Optional[str] = Field(
        default=None,
        description=(
            "If specified, will override the default whitespace pattern "
            "for guided json decoding."),
    )
    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )
    request_id: str = Field(
        default_factory=lambda: f"{random_uuid()}",
        description=(
            "The request_id related to this request. If the caller does "
            "not set it, a random_uuid will be generated. This id is used "
            "through out the inference process and return in response."),
    )
    logits_processors: Optional[LogitsProcessors] = Field(
        default=None,
        description=(
            "A list of either qualified names of logits processors, or "
            "constructor objects, to apply when sampling. A constructor is "
            "a JSON object with a required 'qualname' field specifying the "
            "qualified name of the processor class/factory, and optional "
            "'args' and 'kwargs' fields containing positional and keyword "
            "arguments. For example: {'qualname': "
            "'my_module.MyLogitsProcessor', 'args': [1, 2], 'kwargs': "
            "{'param': 'value'}}."))

    return_tokens_as_token_ids: Optional[bool] = Field(
        default=None,
        description=(
            "If specified with 'logprobs', tokens are represented "
            " as strings of the form 'token_id:{token_id}' so that tokens "
            "that are not JSON-encodable can be identified."))

    cache_salt: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the prefix cache will be salted with the provided "
            "string to prevent an attacker to guess prompts in multi-user "
            "environments. The salt should be random, protected from "
            "access by 3rd parties, and long enough to be "
            "unpredictable (e.g., 43 characters base64-encoded, corresponding "
            "to 256 bit). Not supported by vLLM engine V0."))

    kv_transfer_params: Optional[dict[str, Any]] = Field(
        default=None,
        description="KVTransfer parameters used for disaggregated serving.")

    vllm_xargs: Optional[dict[str, Union[str, int, float]]] = Field(
        default=None,
        description=("Additional request parameters with string or "
                     "numeric values, used by custom extensions."),
    )

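Chat API¶

Our Chat API is compatible with OpenAI's Chat Completions API; you can use the official OpenAI Python client to interact with it.

We support both vision- and audio-related parameters; see our Multimodal Inputs guide for more information.
- Note: the image_url.detail parameter is not supported.

Code example: examples/online_serving/openai_chat_completion_client.py

Extra parameters¶

The following sampling parameters are supported.

Code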

    best_of: Optional[int] = None
    use_beam_search: bool = False
    top_k: Optional[int] = None
    min_p: Optional[float] = None
    repetition_penalty: Optional[float] = None
    length_penalty: float = 1.0
    stop_token_ids: Optional[list[int]] = []
    include_stop_str_in_output: bool = False
    ignore_eos: bool = False
    min_tokens: int = 0
    skip_special_tokens: bool = True
    spaces_between_special_tokens: bool = True
    truncate_prompt_tokens: Optional[Annotated[int, Field(ge=1)]] = None
    prompt_logprobs: Optional[int] = None
    allowed_token_ids: Optional[list[int]] = None
    bad_words: list[str] = Field(default_factory=list)

The following extra parameters are supported:

Code

    echo: bool = Field(
        default=False,
        description=(
            "If true, the new message will be prepended with the last message "
            "if they belong to the same role."),
    )
    add_generation_prompt: bool = Field(
        default=True,
        description=
        ("If true, the generation prompt will be added to the chat template. "
         "This is a parameter used by chat template in tokenizer config of the "
         "model."),
    )
    continue_final_message: bool = Field(
        default=False,
        description=
        ("If this is set, the chat will be formatted so that the final "
         "message in the chat is open-ended, without any EOS tokens. The "
         "model will continue this message rather than starting a new one. "
         "This allows you to \"prefill\" part of the model's response for it. "
         "Cannot be used at the same time as `add_generation_prompt`."),
    )
    add_special_tokens: bool = Field(
        default=False,
        description=(
            "If true, special tokens (e.g. BOS) will be added to the prompt "
            "on top of what is added by the chat template. "
            "For most models, the chat template takes care of adding the "
            "special tokens so this should be set to false (as is the "
            "default)."),
    )
    documents: Optional[list[dict[str, str]]] = Field(
        default=None,
        description=
        ("A list of dicts representing documents that will be accessible to "
         "the model if it is performing RAG (retrieval-augmented generation)."
         " If the template does not support RAG, this argument will have no "
         "effect. We recommend that each document should be a dict containing "
         "\"title\" and \"text\" keys."),
    )
    chat_template: Optional[str] = Field(
        default=None,
        description=(
            "A Jinja template to use for this conversion. "
            "As of transformers v4.44, default chat template is no longer "
            "allowed, so you must provide a chat template if the tokenizer "
            "does not define one."),
    )
    chat_template_kwargs: Optional[dict[str, Any]] = Field(
        default=None,
        description=(
            "Additional keyword args to pass to the template renderer. "
            "Will be accessible by the chat template."),
    )
    mm_processor_kwargs: Optional[dict[str, Any]] = Field(
        default=None,
        description=("Additional kwargs to pass to the HF processor."),
    )
    guided_json: Optional[Union[str, dict, BaseModel]] = Field(
        default=None,
        description=("If specified, the output will follow the JSON schema."),
    )
    guided_regex: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the output will follow the regex pattern."),
    )
    guided_choice: Optional[list[str]] = Field(
        default=None,
        description=(
            "If specified, the output will be exactly one of the choices."),
    )
    guided_grammar: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the output will follow the context free grammar."),
    )
    structural_tag: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the output will follow the structural tag schema."),
    )
    guided_decoding_backend: Optional[str] = Field(
        default=None,
        description=(
            "If specified, will override the default guided decoding backend "
            "of the server for this specific request. If set, must be either "
            "'outlines' / 'lm-format-enforcer'"),
    )
    guided_whitespace_pattern: Optional[str] = Field(
        default=None,
        description=(
            "If specified, will override the default whitespace pattern "
            "for guided json decoding."),
    )
    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )
    request_id: str = Field(
        default_factory=lambda: f"{random_uuid()}",
        description=(
            "The request_id related to this request. If the caller does "
            "not set it, a random_uuid will be generated. This id is used "
            "through out the inference process and return in response."),
    )
    logits_processors: Optional[LogitsProcessors] = Field(
        default=None,
        description=(
            "A list of either qualified names of logits processors, or "
            "constructor objects, to apply when sampling. A constructor is "
            "a JSON object with a required 'qualname' field specifying the "
            "qualified name of the processor class/factory, and optional "
            "'args' and 'kwargs' fields containing positional and keyword "
            "arguments. For example: {'qualname': "
            "'my_module.MyLogitsProcessor', 'args': [1, 2], 'kwargs': "
            "{'param': 'value'}}."))
    return_tokens_as_token_ids: Optional[bool] = Field(
        default=None,
        description=(
            "If specified with 'logprobs', tokens are represented "
            " as strings of the form 'token_id:{token_id}' so that tokens "
            "that are not JSON-encodable can be identified."))
    cache_salt: Optional[str] = Field(
        default=None,
        description=(
            "If specified, the prefix cache will be salted with the provided "
            "string to prevent an attacker to guess prompts in multi-user "
            "environments. The salt should be random, protected from "
            "access by 3rd parties, and long enough to be "
            "unpredictable (e.g., 43 characters base64-encoded, corresponding "
            "to 256 bit). Not supported by vLLM engine V0."))
    kv_transfer_params: Optional[dict[str, Any]] = Field(
        default=None,
        description="KVTransfer parameters used for disaggregated serving.")

    vllm_xargs: Optional[dict[str, Union[str, int, float]]] = Field(
        default=None,
        description=("Additional request parameters with string or "
                     "numeric values, used by custom extensions."),
    )
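One of the less obvious parameters above is continue_final_message, which cannot be combined with add_generation_prompt. A minimal sketch of a request body that prefills part of the assistant's reply is shown below; the helper prefill_chat_payload is hypothetical, and the payload is only built, not sent.

```python
import json


def prefill_chat_payload(model, messages, prefill):
    """Build a chat request asking the model to continue a partial reply."""
    return {
        "model": model,
        # The partial assistant message goes last so the model continues it.
        "messages": list(messages) + [{"role": "assistant", "content": prefill}],
        "continue_final_message": True,
        # add_generation_prompt defaults to True, so it is disabled explicitly
        # because the two options cannot be used at the same time.
        "add_generation_prompt": False,
    }


payload = prefill_chat_payload(
    "NousResearch/Meta-Llama-3-8B-Instruct",
    [{"role": "user", "content": "List three primary colors as JSON."}],
    '{"colors": [',
)
print(json.dumps(payload, indent=2))
```

When using the OpenAI Python client, the two boolean fields would go into extra_body, while model and messages remain regular arguments.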

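Embeddings API¶

Our Embeddings API is compatible with OpenAI's Embeddings API; you can use the official OpenAI Python client to interact with it.

If the model has a chat template, you can replace inputs with a list of messages (same schema as the Chat API), which will be treated as a single prompt to the model.

Code example: examples/online_serving/openai_embedding_client.py

Multi-modal inputs¶

You can pass multi-modal inputs to embedding models by defining a custom chat template for the server and passing a list of messages in the request. Refer to the examples below for illustration.

VLM2Vec

To serve the model: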

vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
  --trust-remote-code \
  --max-model-len 4096 \
  --chat-template examples/template_vlm2vec.jinja

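Important

Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass --task embed to run this model in embedding mode instead of text generation mode.

The custom chat template is completely different from the original one for this model, and can be found here: examples/template_vlm2vec.jinja

Since the request schema is not defined by the OpenAI client, we post a request to the server using the lower-level requests library.

Code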

import requests

image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

response = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "TIGER-Lab/VLM2Vec-Full",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": "Represent the given image."},
            ],
        }],
        "encoding_format": "float",
    },
)
response.raise_for_status()
response_json = response.json()
print("Embedding output:", response_json["data"][0]["embedding"])

DSE-Qwen2-MRL

To serve the model:

vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
  --trust-remote-code \
  --max-model-len 8192 \
  --chat-template examples/template_dse_qwen2_vl.jinja

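Important

Like with VLM2Vec, we have to explicitly pass --task embed.

Additionally, MrLight/dse-qwen2-2b-mrl-v1 requires an EOS token for embeddings, which is handled by a custom chat template: examples/template_dse_qwen2_vl.jinja

Important

MrLight/dse-qwen2-2b-mrl-v1 requires a placeholder image of the minimum image size for text query embeddings. See the full code example below for details.

Full example: examples/online_serving/openai_chat_embedding_client_for_multimodal.py

Extra parameters¶

The following pooling parameters are supported.

The following extra parameters are supported by default:

Code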

    add_special_tokens: bool = Field(
        default=True,
        description=(
            "If true (the default), special tokens (e.g. BOS) will be added to "
            "the prompt."),
    )
    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )
    request_id: str = Field(
        default_factory=lambda: f"{random_uuid()}",
        description=(
            "The request_id related to this request. If the caller does "
            "not set it, a random_uuid will be generated. This id is used "
            "through out the inference process and return in response."),
    )

For chat-like input (that is, if messages is passed), these extra parameters are supported instead:

Code

    add_special_tokens: bool = Field(
        default=False,
        description=(
            "If true, special tokens (e.g. BOS) will be added to the prompt "
            "on top of what is added by the chat template. "
            "For most models, the chat template takes care of adding the "
            "special tokens so this should be set to false (as is the "
            "default)."),
    )
    chat_template: Optional[str] = Field(
        default=None,
        description=(
            "A Jinja template to use for this conversion. "
            "As of transformers v4.44, default chat template is no longer "
            "allowed, so you must provide a chat template if the tokenizer "
            "does not define one."),
    )
    chat_template_kwargs: Optional[dict[str, Any]] = Field(
        default=None,
        description=(
            "Additional keyword args to pass to the template renderer. "
            "Will be accessible by the chat template."),
    )
    mm_processor_kwargs: Optional[dict[str, Any]] = Field(
        default=None,
        description=("Additional kwargs to pass to the HF processor."),
    )
    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )
    request_id: str = Field(
        default_factory=lambda: f"{random_uuid()}",
        description=(
            "The request_id related to this request. If the caller does "
            "not set it, a random_uuid will be generated. This id is used "
            "through out the inference process and return in response."),
    )

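Transcriptions API¶

Our Transcriptions API is compatible with OpenAI's Transcriptions API; you can use the official OpenAI Python client to interact with it.

Note

To use the Transcriptions API, install the extra audio dependencies with pip install vllm[audio].

Code example: examples/online_serving/openai_transcription_client.py

API Enforced Limits¶

Set the maximum audio file size (in MB) that vLLM will accept via the VLLM_MAX_AUDIO_CLIP_FILESIZE_MB environment variable. The default is 25 MB.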
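A client can avoid a rejected upload by checking the file size before sending it. The sketch below is only an illustration: the helper names are hypothetical, it assumes 1 MB means 1024 * 1024 bytes, and it reads the variable from a mapping you pass in, whereas the real limit is whatever the server process was started with.

```python
import os


def max_audio_bytes(environ=os.environ):
    """Return the size limit in bytes, falling back to the 25 MB default."""
    limit_mb = int(environ.get("VLLM_MAX_AUDIO_CLIP_FILESIZE_MB", "25"))
    # Assumption for this sketch: 1 MB = 1024 * 1024 bytes.
    return limit_mb * 1024 * 1024


def fits_limit(num_bytes, environ=os.environ):
    """Check whether an audio clip of num_bytes would be within the limit."""
    return num_bytes <= max_audio_bytes(environ)


print(max_audio_bytes({}))  # limit in bytes when the variable is unset
print(fits_limit(30 * 1024 * 1024, {"VLLM_MAX_AUDIO_CLIP_FILESIZE_MB": "50"}))
```

In practice you would pass os.path.getsize(path) as num_bytes and mirror the server's configured value on the client side.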

Extra Parameters¶

The following sampling parameters are supported.

Code

    temperature: float = Field(default=0.0)
    """The sampling temperature, between 0 and 1.

    Higher values like 0.8 will make the output more random, while lower values
    like 0.2 will make it more focused / deterministic. If set to 0, the model
    will use [log probability](https://en.wikipedia.org/wiki/Log_probability)
    to automatically increase the temperature until certain thresholds are hit.
    """

    top_p: Optional[float] = None
    """Enables nucleus (top-p) sampling, where tokens are selected from the
    smallest possible set whose cumulative probability exceeds `p`.
    """

    top_k: Optional[int] = None
    """Limits sampling to the `k` most probable tokens at each step."""

    min_p: Optional[float] = None
    """Filters out tokens with a probability lower than `min_p`, ensuring a
    minimum likelihood threshold during sampling.
    """

    seed: Optional[int] = Field(None, ge=_LONG_INFO.min, le=_LONG_INFO.max)
    """The seed to use for sampling."""

    frequency_penalty: Optional[float] = 0.0
    """The frequency penalty to use for sampling."""

    repetition_penalty: Optional[float] = None
    """The repetition penalty to use for sampling."""

    presence_penalty: Optional[float] = 0.0
    """The presence penalty to use for sampling."""

The following extra parameters are supported:

Code

    # Flattened stream option to simplify form data.
    stream_include_usage: Optional[bool] = False
    stream_continuous_usage_stats: Optional[bool] = False

    vllm_xargs: Optional[dict[str, Union[str, int, float]]] = Field(
        default=None,
        description=("Additional request parameters with string or "
                     "numeric values, used by custom extensions."),
    )

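Translations API¶

Our Translations API is compatible with OpenAI's Translations API; you can use the official OpenAI Python client to interact with it. Whisper models can translate audio from any of the 55 supported non-English languages into English. Note that the popular openai/whisper-large-v3-turbo model does not support translation.

Note

To use the Translations API, install the extra audio dependencies with pip install vllm[audio].

Code example: examples/online_serving/openai_translation_client.py

Extra Parameters¶

The following sampling parameters are supported.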

    temperature: float = Field(default=0.0)
    """The sampling temperature, between 0 and 1.

    Higher values like 0.8 will make the output more random, while lower values
    like 0.2 will make it more focused / deterministic. If set to 0, the model
    will use [log probability](https://en.wikipedia.org/wiki/Log_probability)
    to automatically increase the temperature until certain thresholds are hit.
    """

The following extra parameters are supported:

    language: Optional[str] = None
    """The language of the input audio we translate from.

    Supplying the input language in
    [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) format
    will improve accuracy.
    """

    stream: Optional[bool] = False
    """Custom field not present in the original OpenAI definition. When set,
    it will enable output to be streamed in a similar fashion as the Chat
    Completion endpoint.
    """
    # Flattened stream option to simplify form data.
    stream_include_usage: Optional[bool] = False
    stream_continuous_usage_stats: Optional[bool] = False

Tokenizer API¶

Our Tokenizer API is a simple wrapper over HuggingFace-style tokenizers. It consists of two endpoints:

/tokenize corresponds to calling tokenizer.encode().
/detokenize corresponds to calling tokenizer.decode().
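A minimal sketch of calling the two endpoints with the standard library follows. The request field names (model, prompt, tokens) and the tokens field read from the response are assumptions not stated on this page, the helper names are hypothetical, and the server address is assumed to be localhost:8000. Only the payload is built when the snippet runs; post_json is defined but not called.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed server address


def tokenize_payload(model, prompt):
    """Request body for /tokenize (field names are assumptions)."""
    return {"model": model, "prompt": prompt}


def detokenize_payload(model, tokens):
    """Request body for /detokenize (field names are assumptions)."""
    return {"model": model, "tokens": list(tokens)}


def post_json(path, payload):
    """Send a JSON POST and decode the JSON reply (needs a running server)."""
    request = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


payload = tokenize_payload("NousResearch/Meta-Llama-3-8B-Instruct", "Hello world")
print(json.dumps(payload))
# With a server running, a round trip might look like:
# tokens = post_json("/tokenize", payload)["tokens"]
# text = post_json("/detokenize", detokenize_payload(payload["model"], tokens))
```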

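Pooling API¶

Our Pooling API encodes input prompts using a pooling model and returns the corresponding hidden states.

The input format is the same as for the Embeddings API, but the output data can contain an arbitrarily nested list, not just a 1-D list of floats.

Code example: examples/online_serving/openai_pooling_client.py

Classification API¶

Our Classification API directly supports Hugging Face sequence-classification models such as ai21labs/Jamba-tiny-reward-dev and jason9693/Qwen2.5-1.5B-apeach.

We automatically wrap any other transformer via as_seq_cls_model(), which pools on the last token, attaches a RowParallelLinear head, and applies a softmax to produce per-class probabilities.

Code example: examples/online_serving/openai_classification_client.py

Example Requests¶

You can classify multiple texts by passing an array of strings: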

curl -v "http://127.0.0.1:8000/classify" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jason9693/Qwen2.5-1.5B-apeach",
    "input": [
      "Loved the new café—coffee was great.",
      "This update broke everything. Frustrating."
    ]
  }'

Response

{
  "id": "classify-7c87cac407b749a6935d8c7ce2a8fba2",
  "object": "list",
  "created": 1745383065,
  "model": "jason9693/Qwen2.5-1.5B-apeach",
  "data": [
    {
      "index": 0,
      "label": "Default",
      "probs": [
        0.565970778465271,
        0.4340292513370514
      ],
      "num_classes": 2
    },
    {
      "index": 1,
      "label": "Spoiled",
      "probs": [
        0.26448777318000793,
        0.7355121970176697
      ],
      "num_classes": 2
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "total_tokens": 20,
    "completion_tokens": 0,
    "prompt_tokens_details": null
  }
}

You can also pass a single string directly to the input field:

curl -v "http://127.0.0.1:8000/classify" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jason9693/Qwen2.5-1.5B-apeach",
    "input": "Loved the new café—coffee was great."
  }'

Response

{
  "id": "classify-9bf17f2847b046c7b2d5495f4b4f9682",
  "object": "list",
  "created": 1745383213,
  "model": "jason9693/Qwen2.5-1.5B-apeach",
  "data": [
    {
      "index": 0,
      "label": "Default",
      "probs": [
        0.565970778465271,
        0.4340292513370514
      ],
      "num_classes": 2
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "total_tokens": 10,
    "completion_tokens": 0,
    "prompt_tokens_details": null
  }
}
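To work with such a response in code, you might pick the most probable class for each item as in the sketch below. The helper top_class is hypothetical, and the sample items are trimmed copies of the responses shown above; how class indices map to label names depends on the model's configuration.

```python
def top_class(item):
    """Return (class_index, probability) of the most likely class."""
    probs = item["probs"]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


# Trimmed items from the example responses above.
data = [
    {"index": 0, "label": "Default",
     "probs": [0.565970778465271, 0.4340292513370514]},
    {"index": 1, "label": "Spoiled",
     "probs": [0.26448777318000793, 0.7355121970176697]},
]

for item in data:
    class_index, probability = top_class(item)
    print(item["index"], item["label"], class_index, probability)
```

In both sample items, the returned label matches the class with the highest probability, which is what the softmax head described earlier would suggest.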

Extra Parameters¶

The following pooling parameters are supported.

The following extra parameters are supported:

    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )

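Score API¶

Our Score API can apply a cross-encoder model or an embedding model to predict scores for sentence pairs or multimodal pairs. When using an embedding model, the score corresponds to the cosine similarity between each embedding pair. Usually, the score for a sentence pair refers to the similarity between the two sentences, on a scale of 0 to 1.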
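For reference, cosine similarity between two embedding vectors can be computed as in the following sketch. It is a plain-Python illustration of the formula, not the code vLLM runs, and the function name and sample vectors are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Mathematically the value can fall anywhere between -1 and 1; the 0 to 1 range mentioned above describes typical scores rather than a hard bound of the formula.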

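You can find the documentation for cross-encoder models at sbert.net.

Code example: examples/online_serving/openai_cross_encoder_score.py

Single inference¶

You can pass a string to both text_1 and text_2, forming a single sentence pair.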

curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "encoding_format": "float",
  "text_1": "What is the capital of France?",
  "text_2": "The capital of France is Paris."
}'

Response

{
  "id": "score-request-id",
  "object": "list",
  "created": 693447,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}

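Batch inference¶

You can pass a string to text_1 and a list to text_2, forming multiple sentence pairs in which each pair is built from text_1 and one string in text_2. The total number of pairs is len(text_2).

Request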

curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "text_1": "What is the capital of France?",
  "text_2": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris."
  ]
}'

Response

{
  "id": "score-request-id",
  "object": "list",
  "created": 693570,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 0.001094818115234375
    },
    {
      "index": 1,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}

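You can pass a list to both text_1 and text_2, forming multiple sentence pairs in which each pair is built from a string in text_1 and the corresponding string in text_2 (similar to zip()). The total number of pairs is len(text_2).

Request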

curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "encoding_format": "float",
  "text_1": [
    "What is the capital of Brazil?",
    "What is the capital of France?"
  ],
  "text_2": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris."
  ]
}'

Response

{
  "id": "score-request-id",
  "object": "list",
  "created": 693447,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 1
    },
    {
      "index": 1,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}
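The pairing rules for single and batch inference can be summarized in a short sketch. The function make_pairs is hypothetical and only mirrors the behavior described above (one-to-one, one-to-many, and zip-style pairing); rejecting lists of different lengths is an assumption of this example.

```python
def make_pairs(text_1, text_2):
    """Build the (text_1, text_2) pairs that would be scored."""
    if isinstance(text_1, str):
        text_1 = [text_1]
    if isinstance(text_2, str):
        text_2 = [text_2]
    if len(text_1) == 1:
        # One query against every candidate: len(text_2) pairs.
        return [(text_1[0], candidate) for candidate in text_2]
    if len(text_1) != len(text_2):
        # Assumption: zip-style pairing needs lists of equal length.
        raise ValueError("text_1 and text_2 must have the same length")
    return list(zip(text_1, text_2))


one_to_many = make_pairs(
    "What is the capital of France?",
    ["The capital of Brazil is Brasilia.", "The capital of France is Paris."],
)
zipped = make_pairs(
    ["What is the capital of Brazil?", "What is the capital of France?"],
    ["The capital of Brazil is Brasilia.", "The capital of France is Paris."],
)
print(len(one_to_many), len(zipped))
```

Each pair corresponds to one entry in the data list of the response, in the same order as the pairs were formed.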

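Multi-modal inputs¶

You can pass multi-modal inputs to scoring models by passing content that includes a list of multi-modal inputs (images, etc.) in the request. Refer to the example below for illustration.

JinaVL-Reranker

To serve the model:

vllm serve jinaai/jina-reranker-m0

Since the request schema is not defined by the OpenAI client, we post a request to the server using the lower-level requests library.

Code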

import requests

response = requests.post(
    "http://localhost:8000/v1/score",
    json={
        "model": "jinaai/jina-reranker-m0",
        "text_1": "slm markdown",
        "text_2": {
          "content": [
                  {
                      "type": "image_url",
                      "image_url": {
                          "url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
                      },
                  },
                  {
                      "type": "image_url",
                      "image_url": {
                          "url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png"
                      },
                  },
              ]
          }
        },
)
response.raise_for_status()
response_json = response.json()
print("Scoring output:", response_json["data"][0]["score"])
print("Scoring output:", response_json["data"][1]["score"])

Full example: examples/online_serving/openai_cross_encoder_score_for_multimodal.py

Extra Parameters¶

The following pooling parameters are supported.

The following extra parameters are supported:

    mm_processor_kwargs: Optional[dict[str, Any]] = Field(
        default=None,
        description=("Additional kwargs to pass to the HF processor."),
    )

    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )

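Re-rank API¶

Our Re-rank API can apply an embedding model or a cross-encoder model to predict relevance scores between a single query and each document in a list. Usually, the score for a sentence pair refers to the similarity between two sentences or multi-modal inputs (images, etc.), on a scale of 0 to 1.

You can find the documentation for cross-encoder models at sbert.net.

The rerank endpoints support popular re-rank models such as BAAI/bge-reranker-base and other models that support the score task. Additionally, the /rerank, /v1/rerank, and /v2/rerank endpoints are compatible with both Jina AI's re-rank API interface and Cohere's re-rank API interface, to ensure compatibility with popular open-source tools.

Code example: examples/online_serving/jinaai_rerank_client.py

Example Request¶

Note that the top_n request parameter is optional and defaults to the length of the documents field. Result documents are sorted by relevance, and the index property can be used to determine the original order.

Request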

curl -X 'POST' \
  'http://127.0.0.1:8000/v1/rerank' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-base",
  "query": "What is the capital of France?",
  "documents": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris.",
    "Horses and cows are both animals"
  ]
}'

Response

{
  "id": "rerank-fae51b2b664d4ed38f5969b612edff77",
  "model": "BAAI/bge-reranker-base",
  "usage": {
    "total_tokens": 56
  },
  "results": [
    {
      "index": 1,
      "document": {
        "text": "The capital of France is Paris."
      },
      "relevance_score": 0.99853515625
    },
    {
      "index": 0,
      "document": {
        "text": "The capital of Brazil is Brasilia."
      },
      "relevance_score": 0.0005860328674316406
    }
  ]
}
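The relationship between top_n, the sort order, and the index field can be sketched as follows. The function rank_documents is a hypothetical illustration rather than the server's code, and the score for the third document is a made-up value, since the example response above lists only two results.

```python
def rank_documents(scores, documents, top_n=None):
    """Sort documents by score (descending) and keep their original index."""
    if top_n is None:
        top_n = len(documents)  # default: return every document
    order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [
        {
            "index": i,  # position in the original documents list
            "document": {"text": documents[i]},
            "relevance_score": scores[i],
        }
        for i in order[:top_n]
    ]


documents = [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris.",
    "Horses and cows are both animals",
]
# First two scores come from the example response; the third is invented.
scores = [0.0005860328674316406, 0.99853515625, 0.0001]

results = rank_documents(scores, documents, top_n=2)
print([result["index"] for result in results])
```

Because each result carries its original index, a client can map the ranked results back to its own document records without comparing text.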

Extra Parameters¶

The following pooling parameters are supported.

The following extra parameters are supported:

    mm_processor_kwargs: Optional[dict[str, Any]] = Field(
        default=None,
        description=("Additional kwargs to pass to the HF processor."),
    )

    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."),
    )

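Ray Serve LLM¶

Ray Serve LLM enables scalable, production-grade serving of the vLLM engine. It integrates tightly with vLLM and extends it with features such as auto-scaling, load balancing, and back-pressure.

Key capabilities:

- Exposes an OpenAI-compatible HTTP API as well as a Pythonic API.
- Scales from a single GPU to a multi-node cluster without code changes.
- Provides observability and autoscaling policies through Ray dashboards and metrics.

The following example shows how to deploy a large model such as DeepSeek R1 with Ray Serve LLM: examples/online_serving/ray_serve_deepseek.py.

Learn more about Ray Serve LLM in the official Ray Serve LLM documentation.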
