OpenAI 兼容服务器¶

vLLM 提供了一个 HTTP 服务器，实现了 OpenAI 的 Completions API、Chat API 等！通过此功能，您可以提供模型并通过 HTTP 客户端与之交互。

在您的终端中，您可以安装 vLLM，然后使用 vllm serve 命令启动服务器。（您也可以使用我们的 Docker 镜像。）

vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
  --dtype auto \
  --api-key token-abc123

要调用服务器，请在您喜欢的文本编辑器中创建一个使用 HTTP 客户端的脚本。在脚本中包含您想发送给模型的任何消息。然后运行该脚本。下面是一个使用官方 OpenAI Python 客户端的示例脚本。

代码

from openai import OpenAI
client = OpenAI(
    base_url="https://:8000/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"},
    ],
)

print(completion.choices[0].message)

提示

vLLM 支持一些 OpenAI API 不支持的参数，例如 top_k。您可以通过 OpenAI 客户端在请求的 extra_body 参数中传递这些参数给 vLLM，即对于 top_k，可以设置为 extra_body={"top_k": 50}。

重要

默认情况下，服务器会应用 Hugging Face 模型存储库中的 generation_config.json 文件（如果存在）。这意味着某些采样参数的默认值可以被模型创建者推荐的值覆盖。

要禁用此行为，请在启动服务器时传递 --generation-config vllm。

支持的 API¶

我们目前支持以下 OpenAI API

Completions API (/v1/completions)
- 仅适用于文本生成模型。
- 注意：suffix 参数不受支持。
Responses API (/v1/responses)
- 仅适用于文本生成模型。
Chat Completions API (/v1/chat/completions)
- 仅适用于具有聊天模板的文本生成模型。
- 注意：user 参数被忽略。
- 注意：将 parallel_tool_calls 参数设置为 false 可确保 vLLM 每个请求仅返回零个或一个工具调用。将其设置为 true（默认值）则允许每个请求返回多个工具调用。如果设置为 true，也不能保证会返回多个工具调用，因为这种行为取决于模型，并非所有模型都设计为支持并行工具调用。
Embeddings API (/v1/embeddings)
- 仅适用于嵌入模型。
Transcriptions API (/v1/audio/transcriptions)
- 仅适用于语音识别 (ASR) 模型。
Translation API (/v1/audio/translations)
- 仅适用于语音识别 (ASR) 模型。

此外，我们还有以下自定义 API

Tokenizer API (/tokenize, /detokenize)
- 适用于任何带有分词器的模型。
Pooling API (/pooling)
- 适用于所有池化模型。
Classification API (/classify)
- 仅适用于分类模型。
Score API (/score)
- 适用于嵌入模型和交叉编码器模型。
Re-rank API (/rerank, /v1/rerank, /v2/rerank)
- 实现了 Jina AI 的 v1 re-rank API
- 同时兼容 Cohere 的 v1 & v2 re-rank API
- Jina 和 Cohere 的 API 非常相似；Jina 的 API 在 rerank 端点的响应中包含额外信息。
- 仅适用于交叉编码器模型。

聊天模板¶

为了让语言模型支持聊天协议，vLLM 要求模型在其分词器配置中包含一个聊天模板。聊天模板是一个 Jinja2 模板，它指定了如何将角色、消息和其他特定于聊天的 token 编码到输入中。

NousResearch/Meta-Llama-3-8B-Instruct 的示例聊天模板可以在此处找到

有些模型即使经过指令/聊天微调，也没有提供聊天模板。对于这些模型，您可以通过 --chat-template 参数手动指定其聊天模板，参数值可以是聊天模板的文件路径或模板本身（字符串形式）。没有聊天模板，服务器将无法处理聊天，所有聊天请求都会出错。

vllm serve <model> --chat-template ./path-to-chat-template.jinja

vLLM 社区为热门模型提供了一组聊天模板。您可以在示例目录中找到。

随着多模态聊天 API 的加入，OpenAI 规范现在接受一种新的消息格式，该格式同时指定了 type 和 text 字段。下面提供了一个示例。

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"},
            ],
        },
    ],
)

大多数 LLM 的聊天模板都期望 content 字段是一个字符串，但有些较新的模型（如 meta-llama/Llama-Guard-3-1B）期望内容根据请求中的 OpenAI schema 进行格式化。vLLM 提供尽力支持来自动检测此格式，并会记录一条类似于“Detected the chat template content format to be…”的字符串，同时在内部转换传入的请求以匹配检测到的格式，该格式可以是以下之一：

"string"：字符串。
- 示例："Hello world"
"openai"：字典列表，类似于 OpenAI schema。
- 示例：[{"type": "text", "text": "Hello world!"}]

如果结果不是您期望的，您可以设置 --chat-template-content-format 命令行参数来覆盖要使用的格式。

额外参数¶

vLLM 支持一组不属于 OpenAI API 的参数。要使用它们，您可以将它们作为额外参数传递给 OpenAI 客户端。或者，如果您直接使用 HTTP 调用，可以将它们直接合并到 JSON 负载中。

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
    ],
    extra_body={
        "structured_outputs": {"choice": ["positive", "negative"]},
    },
)

额外 HTTP 头¶

目前仅支持 X-Request-Id HTTP 请求头。可以通过 --enable-request-id-headers 启用它。

代码

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
    ],
    extra_headers={
        "x-request-id": "sentiment-classification-00001",
    },
)
print(completion._request_id)

completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="A robot may not injure a human being",
    extra_headers={
        "x-request-id": "completion-test",
    },
)
print(completion._request_id)

API 参考¶

Completions API¶

我们的 Completions API 与 OpenAI 的 Completions API 兼容；您可以使用官方 OpenAI Python 客户端与之交互。

代码示例：示例/online_serving/openai_completion_client.py

额外参数¶

支持以下采样参数。

代码

    use_beam_search: bool = False
    top_k: int | None = None
    min_p: float | None = None
    repetition_penalty: float | None = None
    length_penalty: float = 1.0
    stop_token_ids: list[int] | None = []
    include_stop_str_in_output: bool = False
    ignore_eos: bool = False
    min_tokens: int = 0
    skip_special_tokens: bool = True
    spaces_between_special_tokens: bool = True
    truncate_prompt_tokens: Annotated[int, Field(ge=-1)] | None = None
    allowed_token_ids: list[int] | None = None
    prompt_logprobs: int | None = None

支持以下额外参数

代码

    prompt_embeds: bytes | list[bytes] | None = None
    add_special_tokens: bool = Field(
        default=True,
        description=(
            "If true (the default), special tokens (e.g. BOS) will be added to "
            "the prompt."
        ),
    )
    response_format: AnyResponseFormat | None = Field(
        default=None,
        description=(
            "Similar to chat completion, this parameter specifies the format "
            "of output. Only {'type': 'json_object'}, {'type': 'json_schema'}"
            ", {'type': 'structural_tag'}, or {'type': 'text' } is supported."
        ),
    )
    structured_outputs: StructuredOutputsParams | None = Field(
        default=None,
        description="Additional kwargs for structured outputs",
    )
    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."
        ),
    )
    request_id: str = Field(
        default_factory=random_uuid,
        description=(
            "The request_id related to this request. If the caller does "
            "not set it, a random_uuid will be generated. This id is used "
            "through out the inference process and return in response."
        ),
    )
    logits_processors: LogitsProcessors | None = Field(
        default=None,
        description=(
            "A list of either qualified names of logits processors, or "
            "constructor objects, to apply when sampling. A constructor is "
            "a JSON object with a required 'qualname' field specifying the "
            "qualified name of the processor class/factory, and optional "
            "'args' and 'kwargs' fields containing positional and keyword "
            "arguments. For example: {'qualname': "
            "'my_module.MyLogitsProcessor', 'args': [1, 2], 'kwargs': "
            "{'param': 'value'}}."
        ),
    )

    return_tokens_as_token_ids: bool | None = Field(
        default=None,
        description=(
            "If specified with 'logprobs', tokens are represented "
            " as strings of the form 'token_id:{token_id}' so that tokens "
            "that are not JSON-encodable can be identified."
        ),
    )
    return_token_ids: bool | None = Field(
        default=None,
        description=(
            "If specified, the result will include token IDs alongside the "
            "generated text. In streaming mode, prompt_token_ids is included "
            "only in the first chunk, and token_ids contains the delta tokens "
            "for each chunk. This is useful for debugging or when you "
            "need to map generated text back to input tokens."
        ),
    )

    cache_salt: str | None = Field(
        default=None,
        description=(
            "If specified, the prefix cache will be salted with the provided "
            "string to prevent an attacker to guess prompts in multi-user "
            "environments. The salt should be random, protected from "
            "access by 3rd parties, and long enough to be "
            "unpredictable (e.g., 43 characters base64-encoded, corresponding "
            "to 256 bit)."
        ),
    )

    kv_transfer_params: dict[str, Any] | None = Field(
        default=None,
        description="KVTransfer parameters used for disaggregated serving.",
    )

    vllm_xargs: dict[str, str | int | float] | None = Field(
        default=None,
        description=(
            "Additional request parameters with string or "
            "numeric values, used by custom extensions."
        ),
    )

Chat API¶

我们的 Chat API 与 OpenAI 的 Chat Completions API 兼容；您可以使用官方 OpenAI Python 客户端与之交互。

我们同时支持 Vision 和 Audio 相关的参数；有关更多信息，请参阅我们的多模态输入指南。

注意：image_url.detail 参数不受支持。

代码示例：示例/online_serving/openai_chat_completion_client.py

额外参数¶

支持以下采样参数。

代码

    use_beam_search: bool = False
    top_k: int | None = None
    min_p: float | None = None
    repetition_penalty: float | None = None
    length_penalty: float = 1.0
    stop_token_ids: list[int] | None = []
    include_stop_str_in_output: bool = False
    ignore_eos: bool = False
    min_tokens: int = 0
    skip_special_tokens: bool = True
    spaces_between_special_tokens: bool = True
    truncate_prompt_tokens: Annotated[int, Field(ge=-1)] | None = None
    prompt_logprobs: int | None = None
    allowed_token_ids: list[int] | None = None
    bad_words: list[str] = Field(default_factory=list)

支持以下额外参数

代码

    echo: bool = Field(
        default=False,
        description=(
            "If true, the new message will be prepended with the last message "
            "if they belong to the same role."
        ),
    )
    add_generation_prompt: bool = Field(
        default=True,
        description=(
            "If true, the generation prompt will be added to the chat template. "
            "This is a parameter used by chat template in tokenizer config of the "
            "model."
        ),
    )
    continue_final_message: bool = Field(
        default=False,
        description=(
            "If this is set, the chat will be formatted so that the final "
            "message in the chat is open-ended, without any EOS tokens. The "
            "model will continue this message rather than starting a new one. "
            'This allows you to "prefill" part of the model\'s response for it. '
            "Cannot be used at the same time as `add_generation_prompt`."
        ),
    )
    add_special_tokens: bool = Field(
        default=False,
        description=(
            "If true, special tokens (e.g. BOS) will be added to the prompt "
            "on top of what is added by the chat template. "
            "For most models, the chat template takes care of adding the "
            "special tokens so this should be set to false (as is the "
            "default)."
        ),
    )
    documents: list[dict[str, str]] | None = Field(
        default=None,
        description=(
            "A list of dicts representing documents that will be accessible to "
            "the model if it is performing RAG (retrieval-augmented generation)."
            " If the template does not support RAG, this argument will have no "
            "effect. We recommend that each document should be a dict containing "
            '"title" and "text" keys.'
        ),
    )
    chat_template: str | None = Field(
        default=None,
        description=(
            "A Jinja template to use for this conversion. "
            "As of transformers v4.44, default chat template is no longer "
            "allowed, so you must provide a chat template if the tokenizer "
            "does not define one."
        ),
    )
    chat_template_kwargs: dict[str, Any] | None = Field(
        default=None,
        description=(
            "Additional keyword args to pass to the template renderer. "
            "Will be accessible by the chat template."
        ),
    )
    mm_processor_kwargs: dict[str, Any] | None = Field(
        default=None,
        description=("Additional kwargs to pass to the HF processor."),
    )
    structured_outputs: StructuredOutputsParams | None = Field(
        default=None,
        description="Additional kwargs for structured outputs",
    )
    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."
        ),
    )
    request_id: str = Field(
        default_factory=random_uuid,
        description=(
            "The request_id related to this request. If the caller does "
            "not set it, a random_uuid will be generated. This id is used "
            "through out the inference process and return in response."
        ),
    )
    logits_processors: LogitsProcessors | None = Field(
        default=None,
        description=(
            "A list of either qualified names of logits processors, or "
            "constructor objects, to apply when sampling. A constructor is "
            "a JSON object with a required 'qualname' field specifying the "
            "qualified name of the processor class/factory, and optional "
            "'args' and 'kwargs' fields containing positional and keyword "
            "arguments. For example: {'qualname': "
            "'my_module.MyLogitsProcessor', 'args': [1, 2], 'kwargs': "
            "{'param': 'value'}}."
        ),
    )
    return_tokens_as_token_ids: bool | None = Field(
        default=None,
        description=(
            "If specified with 'logprobs', tokens are represented "
            " as strings of the form 'token_id:{token_id}' so that tokens "
            "that are not JSON-encodable can be identified."
        ),
    )
    return_token_ids: bool | None = Field(
        default=None,
        description=(
            "If specified, the result will include token IDs alongside the "
            "generated text. In streaming mode, prompt_token_ids is included "
            "only in the first chunk, and token_ids contains the delta tokens "
            "for each chunk. This is useful for debugging or when you "
            "need to map generated text back to input tokens."
        ),
    )
    cache_salt: str | None = Field(
        default=None,
        description=(
            "If specified, the prefix cache will be salted with the provided "
            "string to prevent an attacker to guess prompts in multi-user "
            "environments. The salt should be random, protected from "
            "access by 3rd parties, and long enough to be "
            "unpredictable (e.g., 43 characters base64-encoded, corresponding "
            "to 256 bit)."
        ),
    )
    kv_transfer_params: dict[str, Any] | None = Field(
        default=None,
        description="KVTransfer parameters used for disaggregated serving.",
    )

    vllm_xargs: dict[str, str | int | float | list[str | int | float]] | None = Field(
        default=None,
        description=(
            "Additional request parameters with (list of) string or "
            "numeric values, used by custom extensions."
        ),
    )

Responses API¶

我们的 Responses API 与 OpenAI 的 Responses API 兼容；您可以使用官方 OpenAI Python 客户端与之交互。

代码示例：示例/online_serving/openai_responses_client_with_tools.py

额外参数¶

支持请求对象中的以下额外参数

代码

    request_id: str = Field(
        default_factory=lambda: f"resp_{random_uuid()}",
        description=(
            "The request_id related to this request. If the caller does "
            "not set it, a random_uuid will be generated. This id is used "
            "through out the inference process and return in response."
        ),
    )
    mm_processor_kwargs: dict[str, Any] | None = Field(
        default=None,
        description=("Additional kwargs to pass to the HF processor."),
    )
    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."
        ),
    )
    cache_salt: str | None = Field(
        default=None,
        description=(
            "If specified, the prefix cache will be salted with the provided "
            "string to prevent an attacker to guess prompts in multi-user "
            "environments. The salt should be random, protected from "
            "access by 3rd parties, and long enough to be "
            "unpredictable (e.g., 43 characters base64-encoded, corresponding "
            "to 256 bit)."
        ),
    )

    enable_response_messages: bool = Field(
        default=False,
        description=(
            "Dictates whether or not to return messages as part of the "
            "response object. Currently only supported for"
            "non-background and gpt-oss only. "
        ),
    )
    # similar to input_messages / output_messages in ResponsesResponse
    # we take in previous_input_messages (ie in harmony format)
    # this cannot be used in conjunction with previous_response_id
    # TODO: consider supporting non harmony messages as well
    previous_input_messages: list[OpenAIHarmonyMessage | dict] | None = None

支持响应对象中的以下额外参数

代码

    # These are populated when enable_response_messages is set to True
    # NOTE: custom serialization is needed
    # see serialize_input_messages and serialize_output_messages
    input_messages: ResponseInputOutputMessage | None = Field(
        default=None,
        description=(
            "If enable_response_messages, we can show raw token input to model."
        ),
    )
    output_messages: ResponseInputOutputMessage | None = Field(
        default=None,
        description=(
            "If enable_response_messages, we can show raw token output of model."
        ),
    )

Embeddings API¶

我们的 Embeddings API 与 OpenAI 的 Embeddings API 兼容；您可以使用官方 OpenAI Python 客户端与之交互。

代码示例：示例/pooling/embed/openai_embedding_client.py

如果模型具有聊天模板，您可以将 inputs 替换为 messages 列表（与 Chat API 相同的 schema），这将作为单个提示输入到模型中。下面是一个用于在保留 OpenAI 类型注解的情况下调用 API 的便捷函数。

代码

from openai import OpenAI
from openai._types import NOT_GIVEN, NotGiven
from openai.types.chat import ChatCompletionMessageParam
from openai.types.create_embedding_response import CreateEmbeddingResponse

def create_chat_embeddings(
    client: OpenAI,
    *,
    messages: list[ChatCompletionMessageParam],
    model: str,
    encoding_format: Union[Literal["base64", "float"], NotGiven] = NOT_GIVEN,
) -> CreateEmbeddingResponse:
    return client.post(
        "/embeddings",
        cast_to=CreateEmbeddingResponse,
        body={"messages": messages, "model": model, "encoding_format": encoding_format},
    )

您可以通过定义服务器的自定义聊天模板并传递 messages 列表到请求中来将多模态输入传递给嵌入模型。请参阅下面的示例进行说明。

VLM2VecDSE-Qwen2-MRL

要提供模型

vllm serve TIGER-Lab/VLM2Vec-Full --runner pooling \
  --trust-remote-code \
  --max-model-len 4096 \
  --chat-template examples/template_vlm2vec_phi3v.jinja

重要

由于 VLM2Vec 的模型架构与 Phi-3.5-Vision 相同，我们必须显式传递 --runner pooling 以在嵌入模式下运行此模型，而不是文本生成模式。

自定义聊天模板与该模型的原始模板完全不同，可以在此处找到：示例/template_vlm2vec_phi3v.jinja

由于请求 schema 不是由 OpenAI 客户端定义的，我们使用较低级的 requests 库向服务器发送请求。

代码

from openai import OpenAI
client = OpenAI(
    base_url="https://:8000/v1",
    api_key="EMPTY",
)
image_url = "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

response = create_chat_embeddings(
    client,
    model="TIGER-Lab/VLM2Vec-Full",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": "Represent the given image."},
            ],
        }
    ],
    encoding_format="float",
)

print("Image embedding output:", response.data[0].embedding)

要提供模型

vllm serve MrLight/dse-qwen2-2b-mrl-v1 --runner pooling \
  --trust-remote-code \
  --max-model-len 8192 \
  --chat-template examples/template_dse_qwen2_vl.jinja

重要

与 VLM2Vec 一样，我们也必须显式传递 --runner pooling。

此外，MrLight/dse-qwen2-2b-mrl-v1 需要一个最小图像尺寸的占位符图像用于文本查询嵌入。请参阅下面的完整代码示例了解详情。

重要

MrLight/dse-qwen2-2b-mrl-v1 需要一个最小图像尺寸的占位符图像用于文本查询嵌入。请参阅下面的完整代码示例了解详情。

完整示例：示例/pooling/embed/openai_chat_embedding_client_for_multimodal.py

额外参数¶

支持以下池化参数。

    truncate_prompt_tokens: Annotated[int, msgspec.Meta(ge=-1)] | None = None
    dimensions: int | None = None
    normalize: bool | None = None

默认支持以下额外参数

代码

    add_special_tokens: bool = Field(
        default=True,
        description=(
            "If true (the default), special tokens (e.g. BOS) will be added to "
            "the prompt."
        ),
    )
    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."
        ),
    )
    request_id: str = Field(
        default_factory=random_uuid,
        description=(
            "The request_id related to this request. If the caller does "
            "not set it, a random_uuid will be generated. This id is used "
            "through out the inference process and return in response."
        ),
    )
    normalize: bool | None = Field(
        default=None,
        description="Whether to normalize the embeddings outputs. Default is True.",
    )
    embed_dtype: EmbedDType = Field(
        default="float32",
        description=(
            "What dtype to use for encoding. Default to using float32 for base64 "
            "encoding to match the OpenAI python client behavior. "
            "This parameter will affect base64 and binary_response."
        ),
    )
    endianness: Endianness = Field(
        default="native",
        description=(
            "What endianness to use for encoding. Default to using native for "
            "base64 encoding to match the OpenAI python client behavior."
            "This parameter will affect base64 and binary_response."
        ),
    )

对于类似聊天的输入（即如果传递了 messages），则支持以下额外参数

代码

    add_generation_prompt: bool = Field(
        default=False,
        description=(
            "If true, the generation prompt will be added to the chat template. "
            "This is a parameter used by chat template in tokenizer config of the "
            "model."
        ),
    )

    add_special_tokens: bool = Field(
        default=False,
        description=(
            "If true, special tokens (e.g. BOS) will be added to the prompt "
            "on top of what is added by the chat template. "
            "For most models, the chat template takes care of adding the "
            "special tokens so this should be set to false (as is the "
            "default)."
        ),
    )
    chat_template: str | None = Field(
        default=None,
        description=(
            "A Jinja template to use for this conversion. "
            "As of transformers v4.44, default chat template is no longer "
            "allowed, so you must provide a chat template if the tokenizer "
            "does not define one."
        ),
    )
    chat_template_kwargs: dict[str, Any] | None = Field(
        default=None,
        description=(
            "Additional keyword args to pass to the template renderer. "
            "Will be accessible by the chat template."
        ),
    )
    mm_processor_kwargs: dict[str, Any] | None = Field(
        default=None,
        description=("Additional kwargs to pass to the HF processor."),
    )
    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."
        ),
    )
    request_id: str = Field(
        default_factory=random_uuid,
        description=(
            "The request_id related to this request. If the caller does "
            "not set it, a random_uuid will be generated. This id is used "
            "through out the inference process and return in response."
        ),
    )
    normalize: bool | None = Field(
        default=None,
        description="Whether to normalize the embeddings outputs. Default is True.",
    )
    embed_dtype: EmbedDType = Field(
        default="float32",
        description=(
            "What dtype to use for encoding. Default to using float32 for base64 "
            "encoding to match the OpenAI python client behavior. "
            "This parameter will affect base64 and binary_response."
        ),
    )
    endianness: Endianness = Field(
        default="native",
        description=(
            "What endianness to use for encoding. Default to using native for "
            "base64 encoding to match the OpenAI python client behavior."
            "This parameter will affect base64 and binary_response."
        ),
    )

Transcriptions API¶

我们的 Transcriptions API 与 OpenAI 的 Transcriptions API 兼容；您可以使用官方 OpenAI Python 客户端与之交互。

注意

要使用 Transcriptions API，请使用 pip install vllm[audio] 安装带有额外音频依赖项的版本。

代码示例：示例/online_serving/openai_transcription_client.py

API 强制限制¶

通过 VLLM_MAX_AUDIO_CLIP_FILESIZE_MB 环境变量设置 VLLM 将接受的最大音频文件大小（以 MB 为单位）。默认为 25 MB。

上传音频文件¶

Transcriptions API 支持上传多种格式的音频文件，包括 FLAC、MP3、MP4、MPEG、MPGA、M4A、OGG、WAV 和 WEBM。

使用 OpenAI Python 客户端

代码

from openai import OpenAI

client = OpenAI(
    base_url="https://:8000/v1",
    api_key="token-abc123",
)

# Upload audio file from disk
with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3-turbo",
        file=audio_file,
        language="en",
        response_format="verbose_json",
    )

print(transcription.text)

使用 curl 进行 multipart/form-data 上传

代码

curl -X POST "https://:8000/v1/audio/transcriptions" \
  -H "Authorization: Bearer token-abc123" \
  -F "[email protected]" \
  -F "model=openai/whisper-large-v3-turbo" \
  -F "language=en" \
  -F "response_format=verbose_json"

支持的参数

file：要转录的音频文件（必需）
model：用于转录的模型（必需）
language：语言代码（例如，“en”、“zh”）（可选）
prompt：可选文本，用于指导转录风格（可选）
response_format：响应格式（“json”、“text”）（可选）
temperature：采样温度，介于 0 和 1 之间（可选）

有关支持的参数的完整列表，包括采样参数和 vLLM 扩展，请参阅协议定义。

响应格式

对于 verbose_json 响应格式

代码

{
  "text": "Hello, this is a transcription of the audio file.",
  "language": "en",
  "duration": 5.42,
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, this is a transcription",
      "tokens": [50364, 938, 428, 307, 275, 28347],
      "temperature": 0.0,
      "avg_logprob": -0.245,
      "compression_ratio": 1.235,
      "no_speech_prob": 0.012
    }
  ]
}

目前“verbose_json”响应格式不支持 avg_logprob、compression_ratio、no_speech_prob。

额外参数¶

支持以下采样参数。

代码

    temperature: float = Field(default=0.0)
    """The sampling temperature, between 0 and 1.

    Higher values like 0.8 will make the output more random, while lower values
    like 0.2 will make it more focused / deterministic. If set to 0, the model
    will use [log probability](https://en.wikipedia.org/wiki/Log_probability)
    to automatically increase the temperature until certain thresholds are hit.
    """

    top_p: float | None = None
    """Enables nucleus (top-p) sampling, where tokens are selected from the
    smallest possible set whose cumulative probability exceeds `p`.
    """

    top_k: int | None = None
    """Limits sampling to the `k` most probable tokens at each step."""

    min_p: float | None = None
    """Filters out tokens with a probability lower than `min_p`, ensuring a
    minimum likelihood threshold during sampling.
    """

    seed: int | None = Field(None, ge=_LONG_INFO.min, le=_LONG_INFO.max)
    """The seed to use for sampling."""

    frequency_penalty: float | None = 0.0
    """The frequency penalty to use for sampling."""

    repetition_penalty: float | None = None
    """The repetition penalty to use for sampling."""

    presence_penalty: float | None = 0.0
    """The presence penalty to use for sampling."""

    max_completion_tokens: int | None = None
    """The maximum number of tokens to generate."""

支持以下额外参数

代码

    # Flattened stream option to simplify form data.
    stream_include_usage: bool | None = False
    stream_continuous_usage_stats: bool | None = False

    vllm_xargs: dict[str, str | int | float] | None = Field(
        default=None,
        description=(
            "Additional request parameters with string or "
            "numeric values, used by custom extensions."
        ),
    )

Translations API¶

我们的 Translation API 与 OpenAI 的 Translations API 兼容；您可以使用官方 OpenAI Python 客户端与之交互。Whisper 模型可以将 55 种非英语支持语言中的任何一种的音频翻译成英语。请注意，流行的 openai/whisper-large-v3-turbo 模型不支持翻译。

注意

要使用 Translation API，请使用 pip install vllm[audio] 安装带有额外音频依赖项的版本。

代码示例：示例/online_serving/openai_translation_client.py

额外参数¶

支持以下采样参数。

    seed: int | None = Field(None, ge=_LONG_INFO.min, le=_LONG_INFO.max)
    """The seed to use for sampling."""

    temperature: float = Field(default=0.0)
    """The sampling temperature, between 0 and 1.

    Higher values like 0.8 will make the output more random, while lower values
    like 0.2 will make it more focused / deterministic. If set to 0, the model
    will use [log probability](https://en.wikipedia.org/wiki/Log_probability)
    to automatically increase the temperature until certain thresholds are hit.
    """

支持以下额外参数

    language: str | None = None
    """The language of the input audio we translate from.

    Supplying the input language in
    [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) format
    will improve accuracy.
    """

    to_language: str | None = None
    """The language of the input audio we translate to.

    Please note that this is not supported by all models, refer to the specific
    model documentation for more details.
    For instance, Whisper only supports `to_language=en`.
    """

    stream: bool | None = False
    """Custom field not present in the original OpenAI definition. When set,
    it will enable output to be streamed in a similar fashion as the Chat
    Completion endpoint.
    """
    # Flattened stream option to simplify form data.
    stream_include_usage: bool | None = False
    stream_continuous_usage_stats: bool | None = False

    max_completion_tokens: int | None = None
    """The maximum number of tokens to generate."""

Tokenizer API¶

我们的 Tokenizer API 是 HuggingFace 风格分词器的简单包装器。它包含两个端点：

/tokenize 对应调用 tokenizer.encode()。
/detokenize 对应调用 tokenizer.decode()。

Pooling API¶

我们的 Pooling API 使用池化模型对输入提示进行编码，并返回相应的隐藏状态。

输入格式与 Embeddings API 相同，但输出数据可以包含任意的嵌套列表，而不仅仅是浮点数的一维列表。

代码示例：示例/pooling/pooling/openai_pooling_client.py

Classification API¶

我们的 Classification API 直接支持 Hugging Face 的 sequence-classification 模型，例如 ai21labs/Jamba-tiny-reward-dev 和 jason9693/Qwen2.5-1.5B-apeach。

我们通过 as_seq_cls_model() 自动包装任何其他 Transformer 模型，该方法会在最后一个 token 上进行池化，附加一个 RowParallelLinear 头，并应用 softmax 来产生每个类的概率。

代码示例：示例/pooling/classify/openai_classification_client.py

示例请求¶

您可以通过传递字符串数组来对多个文本进行分类。

curl -v "http://127.0.0.1:8000/classify" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jason9693/Qwen2.5-1.5B-apeach",
    "input": [
      "Loved the new café—coffee was great.",
      "This update broke everything. Frustrating."
    ]
  }'

响应

{
  "id": "classify-7c87cac407b749a6935d8c7ce2a8fba2",
  "object": "list",
  "created": 1745383065,
  "model": "jason9693/Qwen2.5-1.5B-apeach",
  "data": [
    {
      "index": 0,
      "label": "Default",
      "probs": [
        0.565970778465271,
        0.4340292513370514
      ],
      "num_classes": 2
    },
    {
      "index": 1,
      "label": "Spoiled",
      "probs": [
        0.26448777318000793,
        0.7355121970176697
      ],
      "num_classes": 2
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "total_tokens": 20,
    "completion_tokens": 0,
    "prompt_tokens_details": null
  }
}

您也可以直接将字符串传递给 input 字段。

curl -v "http://127.0.0.1:8000/classify" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jason9693/Qwen2.5-1.5B-apeach",
    "input": "Loved the new café—coffee was great."
  }'

响应

{
  "id": "classify-9bf17f2847b046c7b2d5495f4b4f9682",
  "object": "list",
  "created": 1745383213,
  "model": "jason9693/Qwen2.5-1.5B-apeach",
  "data": [
    {
      "index": 0,
      "label": "Default",
      "probs": [
        0.565970778465271,
        0.4340292513370514
      ],
      "num_classes": 2
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "total_tokens": 10,
    "completion_tokens": 0,
    "prompt_tokens_details": null
  }
}

额外参数¶

支持以下池化参数。

    truncate_prompt_tokens: Annotated[int, msgspec.Meta(ge=-1)] | None = None
    softmax: bool | None = None
    activation: bool | None = None
    use_activation: bool | None = None

支持以下额外参数

    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."
        ),
    )
    add_special_tokens: bool = Field(
        default=True,
        description=(
            "If true (the default), special tokens (e.g. BOS) will be added to "
            "the prompt."
        ),
    )
    request_id: str = Field(
        default_factory=random_uuid,
        description=(
            "The request_id related to this request. If the caller does "
            "not set it, a random_uuid will be generated. This id is used "
            "through out the inference process and return in response."
        ),
    )
    softmax: bool | None = Field(
        default=None,
        description="softmax will be deprecated, please use use_activation instead.",
    )

    activation: bool | None = Field(
        default=None,
        description="activation will be deprecated, please use use_activation instead.",
    )

    use_activation: bool | None = Field(
        default=None,
        description="Whether to use activation for classification outputs. "
        "Default is True.",
    )

Score API¶

我们的 Score API 可以应用交叉编码器模型或嵌入模型来预测句子对或多模态对的分数。当使用嵌入模型时，分数对应于每个嵌入对之间的余弦相似度。通常，句子对的分数是指两个句子之间的相似度，范围从 0 到 1。

您可以在 sbert.net 上找到交叉编码器模型的文档。

代码示例：示例/pooling/score/openai_cross_encoder_score.py

单次推理¶

您可以将字符串传递给 text_1 和 text_2，形成一个句子对。

curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "encoding_format": "float",
  "text_1": "What is the capital of France?",
  "text_2": "The capital of France is Paris."
}'

响应

{
  "id": "score-request-id",
  "object": "list",
  "created": 693447,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}

批量推理¶

您可以将字符串传递给 text_1，将列表传递给 text_2，形成多个句子对，其中每个对由 text_1 和 text_2 中的字符串构成。对的总数为 len(text_2)。

请求

curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "text_1": "What is the capital of France?",
  "text_2": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris."
  ]
}'

响应

{
  "id": "score-request-id",
  "object": "list",
  "created": 693570,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 0.001094818115234375
    },
    {
      "index": 1,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}

您可以将列表传递给 text_1 和 text_2，形成多个句子对，其中每个对由 text_1 中的字符串和 text_2 中的相应字符串构成（类似于 zip()）。对的总数为 len(text_2)。

请求

curl -X 'POST' \
  'http://127.0.0.1:8000/score' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-v2-m3",
  "encoding_format": "float",
  "text_1": [
    "What is the capital of Brazil?",
    "What is the capital of France?"
  ],
  "text_2": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris."
  ]
}'

响应

{
  "id": "score-request-id",
  "object": "list",
  "created": 693447,
  "model": "BAAI/bge-reranker-v2-m3",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 1
    },
    {
      "index": 1,
      "object": "score",
      "score": 1
    }
  ],
  "usage": {}
}

您可以通过在请求中传递包含多模态输入（图像等）列表的 content 来将多模态输入传递给评分模型。请参阅下面的示例进行说明。

JinaVL-Reranker

要提供模型

vllm serve jinaai/jina-reranker-m0

由于请求 schema 不是由 OpenAI 客户端定义的，我们使用较低级的 requests 库向服务器发送请求。

代码

import requests

response = requests.post(
    "https://:8000/v1/score",
    json={
        "model": "jinaai/jina-reranker-m0",
        "text_1": "slm markdown",
        "text_2": {
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png"
                    },
                },
            ],
        },
    },
)
response.raise_for_status()
response_json = response.json()
print("Scoring output:", response_json["data"][0]["score"])
print("Scoring output:", response_json["data"][1]["score"])

完整示例：示例/pooling/score/openai_cross_encoder_score_for_multimodal.py

额外参数¶

支持以下池化参数。

    truncate_prompt_tokens: Annotated[int, msgspec.Meta(ge=-1)] | None = None
    softmax: bool | None = None
    activation: bool | None = None
    use_activation: bool | None = None

支持以下额外参数

    mm_processor_kwargs: dict[str, Any] | None = Field(
        default=None,
        description=("Additional kwargs to pass to the HF processor."),
    )

    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."
        ),
    )

    softmax: bool | None = Field(
        default=None,
        description="softmax will be deprecated, please use use_activation instead.",
    )

    activation: bool | None = Field(
        default=None,
        description="activation will be deprecated, please use use_activation instead.",
    )

    use_activation: bool | None = Field(
        default=None,
        description="Whether to use activation for classification outputs. "
        "Default is True.",
    )

Re-rank API¶

我们的 Re-rank API 可以应用嵌入模型或交叉编码器模型来预测单个查询与一组文档之间的相关分数。通常，句子对的分数是指两个句子或多模态输入（图像等）之间的相似度，范围为 0 到 1。

您可以在 sbert.net 上找到交叉编码器模型的文档。

rerank 端点支持流行的 re-rank 模型，如 BAAI/bge-reranker-base 以及其他支持 score 任务的模型。此外，/rerank、/v1/rerank 和 /v2/rerank 端点与 Jina AI 的 re-rank API 接口和 Cohere 的 re-rank API 接口兼容，以确保与流行的开源工具兼容。

代码示例：示例/pooling/score/openai_reranker.py

示例请求¶

请注意，top_n 请求参数是可选的，默认值为 documents 字段的长度。结果文档将按相关性排序，index 属性可用于确定原始顺序。

请求

curl -X 'POST' \
  'http://127.0.0.1:8000/v1/rerank' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "BAAI/bge-reranker-base",
  "query": "What is the capital of France?",
  "documents": [
    "The capital of Brazil is Brasilia.",
    "The capital of France is Paris.",
    "Horses and cows are both animals"
  ]
}'

响应

{
  "id": "rerank-fae51b2b664d4ed38f5969b612edff77",
  "model": "BAAI/bge-reranker-base",
  "usage": {
    "total_tokens": 56
  },
  "results": [
    {
      "index": 1,
      "document": {
        "text": "The capital of France is Paris."
      },
      "relevance_score": 0.99853515625
    },
    {
      "index": 0,
      "document": {
        "text": "The capital of Brazil is Brasilia."
      },
      "relevance_score": 0.0005860328674316406
    }
  ]
}

额外参数¶

支持以下池化参数。

    truncate_prompt_tokens: Annotated[int, msgspec.Meta(ge=-1)] | None = None
    softmax: bool | None = None
    activation: bool | None = None
    use_activation: bool | None = None

支持以下额外参数

    mm_processor_kwargs: dict[str, Any] | None = Field(
        default=None,
        description=("Additional kwargs to pass to the HF processor."),
    )

    priority: int = Field(
        default=0,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."
        ),
    )

    softmax: bool | None = Field(
        default=None,
        description="softmax will be deprecated, please use use_activation instead.",
    )

    activation: bool | None = Field(
        default=None,
        description="activation will be deprecated, please use use_activation instead.",
    )

    use_activation: bool | None = Field(
        default=None,
        description="Whether to use activation for classification outputs. "
        "Default is True.",
    )

Ray Serve LLM¶

Ray Serve LLM 使 vLLM 引擎能够进行可扩展的、生产级的服务。它与 vLLM 紧密集成，并增加了自动扩展、负载均衡和背压等功能。

关键功能

公开 OpenAI 兼容的 HTTP API 以及 Pythonic API。
无需更改代码即可从单 GPU 扩展到多节点集群。
通过 Ray Dashboard 和指标提供可观测性和自动扩展策略。

以下示例展示了如何使用 Ray Serve LLM 部署像 DeepSeek R1 这样的大模型：示例/online_serving/ray_serve_deepseek.py。

通过官方 Ray Serve LLM 文档了解更多关于 Ray Serve LLM 的信息。

OpenAI 兼容服务器¶

支持的 API¶

聊天模板¶

额外参数¶

额外 HTTP 头¶

API 参考¶

Completions API¶

额外参数¶

Chat API¶

额外参数¶

Responses API¶

额外参数¶

Embeddings API¶

多模态输入¶

额外参数¶

Transcriptions API¶

API 强制限制¶

上传音频文件¶

额外参数¶

Translations API¶

额外参数¶

Tokenizer API¶

Pooling API¶

Classification API¶

示例请求¶

额外参数¶

Score API¶

单次推理¶

批量推理¶

多模态输入¶

额外参数¶

Re-rank API¶

示例请求¶

额外参数¶

Ray Serve LLM¶