
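OpenAI-Compatible Server

vLLM provides an HTTP server that implements OpenAI's Completions API, Chat API, and more. This lets you serve models and interact with them using an HTTP client.

In your terminal, you can install vLLM and then start the server with the vllm serve command. (You can also use our Docker image.)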

vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
  --dtype auto \
  --api-key token-abc123

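To call the server, create a script in your preferred text editor that uses an HTTP client. Include any messages that you want to send to the model, then run the script. Below is an example script using the official OpenAI Python client.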

from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"},
    ],
)

print(completion.choices[0].message)

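Tip

vLLM supports some parameters that OpenAI does not, top_k for example. You can pass these parameters to vLLM through the extra_body parameter of an OpenAI client request, for example extra_body={"top_k": 50} for top_k.

Important

By default, the server applies generation_config.json from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.

To disable this behavior, pass --generation-config vllm when launching the server.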

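Supported APIs

We currently support the following OpenAI APIs

In addition, we have the following custom APIs

Chat Template

For a language model to support the chat protocol, vLLM requires the model to include a chat template in its tokenizer configuration. The chat template is a Jinja2 template that specifies how roles, messages, and other chat-specific tokens are encoded in the input.

An example chat template for NousResearch/Meta-Llama-3-8B-Instruct can be found in that model's tokenizer configuration.

Some models do not provide a chat template even though they are instruction/chat fine-tuned. For those models, you can specify the chat template manually with the --chat-template parameter, either as a file path to the template or as the template itself in string form. Without a chat template, the server cannot process chat, and all chat requests will fail with an error.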

vllm serve <model> --chat-template ./path-to-chat-template.jinja

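The vLLM community provides a set of chat templates for popular models. You can find them under the examples directory.

With the inclusion of multimodal chat APIs, the OpenAI spec now accepts chat messages in a new format that specifies both a type and a text field. An example is provided below.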

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"},
            ],
        },
    ],
)

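Most chat templates for LLMs expect the content field to be a string, but some newer models, such as meta-llama/Llama-Guard-3-1B, expect the content to be formatted according to the OpenAI schema in the request. vLLM provides best-effort support for detecting this automatically. It logs a string like "Detected the chat template content format to be..." and internally converts incoming requests to match the detected format, which can be one of:

  • "string": A string.
    • Example: "Hello world"
  • "openai": A list of dictionaries, similar to the OpenAI schema.
    • Example: [{"type": "text", "text": "Hello world!"}]

If the result is not what you expect, you can set the --chat-template-content-format CLI argument to override which format to use.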
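To make the two content formats concrete, here is a minimal sketch of converting a message's content between them. The helper names `to_openai_format` and `to_string_format` are hypothetical, and joining text parts with a newline is an assumption; this illustrates the idea and is not vLLM's internal conversion code.

```python
from typing import Union

# A content value is either a plain string or an OpenAI-style list of parts.
Content = Union[str, list[dict[str, str]]]


def to_openai_format(content: Content) -> list[dict[str, str]]:
    """Wrap a plain string as a single text part; pass lists through."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return content


def to_string_format(content: Content) -> str:
    """Collapse the text parts of an OpenAI-style list into one string."""
    if isinstance(content, str):
        return content
    # Assumption: text parts are joined with a newline.
    return "\n".join(p["text"] for p in content if p.get("type") == "text")


as_list = to_openai_format("Hello world!")
as_text = to_string_format(as_list)
print(as_list)
print(as_text)
```

Both helpers accept either form, so they can be applied to a request regardless of how the client wrote the content field.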

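Extra Parameters

vLLM supports a set of parameters that are not part of the OpenAI API. To use them, pass them as extra parameters in the OpenAI client, or merge them directly into the JSON payload if you are calling the HTTP endpoint directly.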

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
    ],
    extra_body={
        "structured_outputs": {"choice": ["positive", "negative"]},
    },
)
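For the direct-HTTP case, the sketch below builds a request with the extra parameters merged into the top level of the JSON body, using only the standard library. The helper name `build_chat_request` and the `http://localhost:8000/v1` address are assumptions for illustration. The request is constructed but not sent.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, messages, **extra_params):
    """Build a POST request whose JSON body carries vLLM-specific extras."""
    body = {"model": model, "messages": messages}
    # Extra parameters sit next to the standard OpenAI fields.
    body.update(extra_params)
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000/v1",  # assumed local server address
    "token-abc123",
    "NousResearch/Meta-Llama-3-8B-Instruct",
    [{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}],
    structured_outputs={"choice": ["positive", "negative"]},
    top_k=50,
)
payload = json.loads(request.data)
print(sorted(payload))
```

With a server running, `urllib.request.urlopen(request)` would send it.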

Extra HTTP Headers

Only the X-Request-Id HTTP request header is supported for now. It can be enabled with --enable-request-id-headers.

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
    ],
    extra_headers={
        "x-request-id": "sentiment-classification-00001",
    },
)
print(completion._request_id)

completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="A robot may not injure a human being",
    extra_headers={
        "x-request-id": "completion-test",
    },
)
print(completion._request_id)

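Offline API Documentation

The FastAPI /docs endpoint requires an internet connection by default. To enable offline access in air-gapped environments, use the --enable-offline-docs flag.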

vllm serve NousResearch/Meta-Llama-3-8B-Instruct --enable-offline-docs

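API Reference

Completions API

Our Completions API is compatible with OpenAI's Completions API; you can use the official OpenAI Python client to interact with it.

Code example: examples/basic/online_serving/openai_completion_client.py

Extra parameters

The following sampling parameters are supported.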

    use_beam_search: bool = False
    top_k: int | None = None
    min_p: float | None = None
    repetition_penalty: float | None = None
    length_penalty: float = 1.0
    stop_token_ids: list[int] | None = []
    include_stop_str_in_output: bool = False
    ignore_eos: bool = False
    min_tokens: int = 0
    skip_special_tokens: bool = True
    spaces_between_special_tokens: bool = True
    truncate_prompt_tokens: Annotated[int, Field(ge=-1, le=_INT64_MAX)] | None = None
    allowed_token_ids: list[int] | None = None
    prompt_logprobs: int | None = None

The following extra parameters are supported:

    prompt_embeds: bytes | list[bytes] | None = None
    add_special_tokens: bool = Field(
        default=True,
        description=(
            "If true (the default), special tokens (e.g. BOS) will be added to "
            "the prompt."
        ),
    )
    response_format: AnyResponseFormat | None = Field(
        default=None,
        description=(
            "Similar to chat completion, this parameter specifies the format "
            "of output. Only {'type': 'json_object'}, {'type': 'json_schema'}"
            ", {'type': 'structural_tag'}, or {'type': 'text' } is supported."
        ),
    )
    structured_outputs: StructuredOutputsParams | None = Field(
        default=None,
        description="Additional kwargs for structured outputs",
    )
    priority: int = Field(
        default=0,
        ge=_INT64_MIN,
        le=_INT64_MAX,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."
        ),
    )
    request_id: str = Field(
        default_factory=random_uuid,
        description=(
            "The request_id related to this request. If the caller does "
            "not set it, a random_uuid will be generated. This id is used "
            "through out the inference process and return in response."
        ),
    )

    return_tokens_as_token_ids: bool | None = Field(
        default=None,
        description=(
            "If specified with 'logprobs', tokens are represented "
            " as strings of the form 'token_id:{token_id}' so that tokens "
            "that are not JSON-encodable can be identified."
        ),
    )
    return_token_ids: bool | None = Field(
        default=None,
        description=(
            "If specified, the result will include token IDs alongside the "
            "generated text. In streaming mode, prompt_token_ids is included "
            "only in the first chunk, and token_ids contains the delta tokens "
            "for each chunk. This is useful for debugging or when you "
            "need to map generated text back to input tokens."
        ),
    )

    cache_salt: str | None = Field(
        default=None,
        description=(
            "If specified, the prefix cache will be salted with the provided "
            "string to prevent an attacker to guess prompts in multi-user "
            "environments. The salt should be random, protected from "
            "access by 3rd parties, and long enough to be "
            "unpredictable (e.g., 43 characters base64-encoded, corresponding "
            "to 256 bit)."
        ),
    )

    kv_transfer_params: dict[str, Any] | None = Field(
        default=None,
        description="KVTransfer parameters used for disaggregated serving.",
    )

    vllm_xargs: dict[str, str | int | float] | None = Field(
        default=None,
        description=(
            "Additional request parameters with string or "
            "numeric values, used by custom extensions."
        ),
    )

    repetition_detection: RepetitionDetectionParams | None = Field(
        default=None,
        description="Parameters for detecting repetitive N-gram patterns "
        "in output tokens. If such repetition is detected, generation will "
        "be ended early. LLMs can sometimes generate repetitive, unhelpful "
        "token patterns, stopping only when they hit the maximum output length "
        "(e.g. 'abcdabcdabcd...' or '\\emoji \\emoji \\emoji ...'). This feature "
        "can detect such behavior and terminate early, saving time and tokens.",
    )
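The cache_salt description above asks for a random, unpredictable value of about 256 bits. As a minimal sketch, Python's `secrets` module can produce one; the helper name `make_cache_salt` is hypothetical, and the URL-safe base64 alphabet is an assumption about what the server accepts.

```python
import secrets


def make_cache_salt(num_bytes: int = 32) -> str:
    """Return a URL-safe base64 string built from num_bytes random bytes."""
    # 32 bytes = 256 bits of entropy, encoded without padding.
    return secrets.token_urlsafe(num_bytes)


salt = make_cache_salt()
# With the OpenAI client, the salt would travel in extra_body.
extra_body = {"cache_salt": salt}
print(len(salt))
```

Store the salt like any other secret, and reuse the same salt for requests that are meant to share prefix-cache entries.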

Chat API

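Our Chat API is compatible with OpenAI's Chat Completions API; you can use the official OpenAI Python client to interact with it.

We support both Vision- and Audio-related parameters; see our Multimodal Inputs guide for more information.

  • Note: the image_url.detail parameter is not supported.

Code example: examples/basic/online_serving/openai_chat_completion_client.py

Extra parameters

The following sampling parameters are supported.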

    use_beam_search: bool = False
    top_k: int | None = None
    min_p: float | None = None
    repetition_penalty: float | None = None
    length_penalty: float = 1.0
    stop_token_ids: list[int] | None = []
    include_stop_str_in_output: bool = False
    ignore_eos: bool = False
    min_tokens: int = 0
    skip_special_tokens: bool = True
    spaces_between_special_tokens: bool = True
    truncate_prompt_tokens: Annotated[int, Field(ge=-1, le=_INT64_MAX)] | None = None
    prompt_logprobs: int | None = None
    allowed_token_ids: list[int] | None = None
    bad_words: list[str] = Field(default_factory=list)

The following extra parameters are supported:

    echo: bool = Field(
        default=False,
        description=(
            "If true, the new message will be prepended with the last message "
            "if they belong to the same role."
        ),
    )
    add_generation_prompt: bool = Field(
        default=True,
        description=(
            "If true, the generation prompt will be added to the chat template. "
            "This is a parameter used by chat template in tokenizer config of the "
            "model."
        ),
    )
    continue_final_message: bool = Field(
        default=False,
        description=(
            "If this is set, the chat will be formatted so that the final "
            "message in the chat is open-ended, without any EOS tokens. The "
            "model will continue this message rather than starting a new one. "
            'This allows you to "prefill" part of the model\'s response for it. '
            "Cannot be used at the same time as `add_generation_prompt`."
        ),
    )
    add_special_tokens: bool = Field(
        default=False,
        description=(
            "If true, special tokens (e.g. BOS) will be added to the prompt "
            "on top of what is added by the chat template. "
            "For most models, the chat template takes care of adding the "
            "special tokens so this should be set to false (as is the "
            "default)."
        ),
    )
    documents: list[dict[str, str]] | None = Field(
        default=None,
        description=(
            "A list of dicts representing documents that will be accessible to "
            "the model if it is performing RAG (retrieval-augmented generation)."
            " If the template does not support RAG, this argument will have no "
            "effect. We recommend that each document should be a dict containing "
            '"title" and "text" keys.'
        ),
    )
    chat_template: str | None = Field(
        default=None,
        description=(
            "A Jinja template to use for this conversion. "
            "As of transformers v4.44, default chat template is no longer "
            "allowed, so you must provide a chat template if the tokenizer "
            "does not define one."
        ),
    )
    chat_template_kwargs: dict[str, Any] | None = Field(
        default=None,
        description=(
            "Additional keyword args to pass to the template renderer. "
            "Will be accessible by the chat template."
        ),
    )
    media_io_kwargs: dict[str, dict[str, Any]] | None = Field(
        default=None,
        description=(
            "Additional kwargs to pass to the media IO connectors, "
            "keyed by modality. Merged with engine-level media_io_kwargs."
        ),
    )
    mm_processor_kwargs: dict[str, Any] | None = Field(
        default=None,
        description=("Additional kwargs to pass to the HF processor."),
    )
    structured_outputs: StructuredOutputsParams | None = Field(
        default=None,
        description="Additional kwargs for structured outputs",
    )
    priority: int = Field(
        default=0,
        ge=_INT64_MIN,
        le=_INT64_MAX,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."
        ),
    )
    request_id: str = Field(
        default_factory=random_uuid,
        description=(
            "The request_id related to this request. If the caller does "
            "not set it, a random_uuid will be generated. This id is used "
            "through out the inference process and return in response."
        ),
    )

    return_tokens_as_token_ids: bool | None = Field(
        default=None,
        description=(
            "If specified with 'logprobs', tokens are represented "
            " as strings of the form 'token_id:{token_id}' so that tokens "
            "that are not JSON-encodable can be identified."
        ),
    )
    return_token_ids: bool | None = Field(
        default=None,
        description=(
            "If specified, the result will include token IDs alongside the "
            "generated text. In streaming mode, prompt_token_ids is included "
            "only in the first chunk, and token_ids contains the delta tokens "
            "for each chunk. This is useful for debugging or when you "
            "need to map generated text back to input tokens."
        ),
    )

    cache_salt: str | None = Field(
        default=None,
        description=(
            "If specified, the prefix cache will be salted with the provided "
            "string to prevent an attacker to guess prompts in multi-user "
            "environments. The salt should be random, protected from "
            "access by 3rd parties, and long enough to be "
            "unpredictable (e.g., 43 characters base64-encoded, corresponding "
            "to 256 bit)."
        ),
    )

    kv_transfer_params: dict[str, Any] | None = Field(
        default=None,
        description="KVTransfer parameters used for disaggregated serving.",
    )

    vllm_xargs: dict[str, str | int | float | list[str | int | float]] | None = Field(
        default=None,
        description=(
            "Additional request parameters with (list of) string or "
            "numeric values, used by custom extensions."
        ),
    )

    repetition_detection: RepetitionDetectionParams | None = Field(
        default=None,
        description="Parameters for detecting repetitive N-gram patterns "
        "in output tokens. If such repetition is detected, generation will "
        "be ended early. LLMs can sometimes generate repetitive, unhelpful "
        "token patterns, stopping only when they hit the maximum output length "
        "(e.g. 'abcdabcdabcd...' or '\\emoji \\emoji \\emoji ...'). This feature "
        "can detect such behavior and terminate early, saving time and tokens.",
    )
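As an illustration of how two of these fields interact, the sketch below prepares a "prefill" request. continue_final_message is turned on and add_generation_prompt is turned off, since the description above says the two cannot be used together. The helper name `build_prefill_request` and the example strings are hypothetical.

```python
def build_prefill_request(user_message: str, assistant_prefix: str):
    """Return (messages, extra_body) for continuing a partial assistant reply."""
    messages = [
        {"role": "user", "content": user_message},
        # The final message is the partial reply the model should continue.
        {"role": "assistant", "content": assistant_prefix},
    ]
    extra_body = {
        "continue_final_message": True,
        # Must be disabled when continue_final_message is set.
        "add_generation_prompt": False,
    }
    return messages, extra_body


messages, extra_body = build_prefill_request(
    "Write a haiku about GPUs.", "Silicon rivers"
)
print(messages[-1]["role"], extra_body)
```

With the OpenAI client, these would be passed as `client.chat.completions.create(model=..., messages=messages, extra_body=extra_body)`.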

Responses API

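Our Responses API is compatible with OpenAI's Responses API; you can use the official OpenAI Python client to interact with it.

Code example: examples/online_serving/openai_responses_client_with_tools.py

Extra parameters

The following extra parameters in the request object are supported: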

    request_id: str = Field(
        default_factory=lambda: f"resp_{random_uuid()}",
        description=(
            "The request_id related to this request. If the caller does "
            "not set it, a random_uuid will be generated. This id is used "
            "through out the inference process and return in response."
        ),
    )
    media_io_kwargs: dict[str, dict[str, Any]] | None = Field(
        default=None,
        description=(
            "Additional kwargs to pass to the media IO connectors, "
            "keyed by modality. Merged with engine-level media_io_kwargs."
        ),
    )
    mm_processor_kwargs: dict[str, Any] | None = Field(
        default=None,
        description=("Additional kwargs to pass to the HF processor."),
    )
    priority: int = Field(
        default=0,
        ge=_INT64_MIN,
        le=_INT64_MAX,
        description=(
            "The priority of the request (lower means earlier handling; "
            "default: 0). Any priority other than 0 will raise an error "
            "if the served model does not use priority scheduling."
        ),
    )
    cache_salt: str | None = Field(
        default=None,
        description=(
            "If specified, the prefix cache will be salted with the provided "
            "string to prevent an attacker to guess prompts in multi-user "
            "environments. The salt should be random, protected from "
            "access by 3rd parties, and long enough to be "
            "unpredictable (e.g., 43 characters base64-encoded, corresponding "
            "to 256 bit)."
        ),
    )

    enable_response_messages: bool = Field(
        default=False,
        description=(
            "Dictates whether or not to return messages as part of the "
            "response object. Currently only supported for non-background."
        ),
    )
    # similar to input_messages / output_messages in ResponsesResponse
    # we take in previous_input_messages (ie in harmony format)
    # this cannot be used in conjunction with previous_response_id
    # TODO: consider supporting non harmony messages as well
    previous_input_messages: list[OpenAIHarmonyMessage | dict] | None = None
    structured_outputs: StructuredOutputsParams | None = Field(
        default=None,
        description="Additional kwargs for structured outputs",
    )

    repetition_penalty: float | None = None
    seed: int | None = Field(None, ge=_INT64_MIN, le=_INT64_MAX)
    stop: str | list[str] | None = []
    ignore_eos: bool = False
    vllm_xargs: dict[str, str | int | float | list[str | int | float]] | None = Field(
        default=None,
        description=(
            "Additional request parameters with (list of) string or "
            "numeric values, used by custom extensions."
        ),
    )
    kv_transfer_params: dict[str, Any] | None = Field(
        default=None,
        description="KVTransfer parameters used for disaggregated serving.",
    )

The following extra parameters in the response object are supported:

    # These are populated when enable_response_messages is set to True
    # NOTE: custom serialization is needed
    # see serialize_input_messages and serialize_output_messages
    input_messages: ResponseInputOutputMessage | None = Field(
        default=None,
        description=(
            "If enable_response_messages, we can show raw token input to model."
        ),
    )
    output_messages: ResponseInputOutputMessage | None = Field(
        default=None,
        description=(
            "If enable_response_messages, we can show raw token output of model."
        ),
    )

Transcriptions API

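Our Transcriptions API is compatible with OpenAI's Transcriptions API; you can use the official OpenAI Python client to interact with it.

Note

To use the Transcriptions API, install the additional audio dependencies with pip install vllm[audio].

Code example: examples/online_serving/openai_transcription_client.py

Note: the transcription endpoint currently supports beam search for encoder-decoder multimodal models such as Whisper, but it is highly inefficient because work on handling the encoder/decoder cache is still in progress. This is an active point of optimization and will be handled properly in the near future.

API Enforced Limits

Set the maximum audio file size (in MB) that vLLM will accept through the VLLM_MAX_AUDIO_CLIP_FILESIZE_MB environment variable. The default is 25 MB.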
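A client can check a file against the same limit before uploading. The sketch below is a client-side pre-check only: the server reads the variable itself, and both the helper names and the interpretation of 1 MB as 1024 * 1024 bytes are assumptions.

```python
import os
from typing import Mapping

DEFAULT_LIMIT_MB = 25  # default stated in the documentation


def max_audio_bytes(environ: Mapping[str, str] = os.environ) -> int:
    """Return the size limit in bytes, honoring the environment variable."""
    limit_mb = int(environ.get("VLLM_MAX_AUDIO_CLIP_FILESIZE_MB", DEFAULT_LIMIT_MB))
    # Assumption: one MB is counted as 1024 * 1024 bytes.
    return limit_mb * 1024 * 1024


def fits_limit(num_bytes: int, environ: Mapping[str, str] = os.environ) -> bool:
    """True if a file of num_bytes would be within the configured limit."""
    return num_bytes <= max_audio_bytes(environ)


print(max_audio_bytes({}))
print(fits_limit(30 * 1024 * 1024, {}))
```

In practice `num_bytes` would come from `os.path.getsize("audio.mp3")`.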

Uploading Audio Files

The Transcriptions API supports uploading audio files in various formats, including FLAC, MP3, MP4, MPEG, MPGA, M4A, OGG, WAV, and WEBM.

Using the OpenAI Python Client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

# Upload audio file from disk
with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3-turbo",
        file=audio_file,
        language="en",
        response_format="verbose_json",
    )

print(transcription.text)

Using curl with multipart/form-data

curl -X POST "http://localhost:8000/v1/audio/transcriptions" \
  -H "Authorization: Bearer token-abc123" \
  -F "file=@audio.mp3" \
  -F "model=openai/whisper-large-v3-turbo" \
  -F "language=en" \
  -F "response_format=verbose_json"

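Supported Parameters

  • file: The audio file to transcribe (required)
  • model: The model to use for transcription (required)
  • language: The language code (e.g. "en", "zh") (optional)
  • prompt: Text to guide the transcription style (optional)
  • response_format: The format of the response ("json", "text") (optional)
  • temperature: Sampling temperature between 0 and 1 (optional)

For the complete list of supported parameters, including sampling parameters and vLLM extensions, see the protocol definitions.

Response Format

For the verbose_json response format: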

{
  "text": "Hello, this is a transcription of the audio file.",
  "language": "en",
  "duration": 5.42,
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, this is a transcription",
      "tokens": [50364, 938, 428, 307, 275, 28347],
      "temperature": 0.0,
      "avg_logprob": -0.245,
      "compression_ratio": 1.235,
      "no_speech_prob": 0.012
    }
  ]
}

Currently, the "verbose_json" response format does not support no_speech_prob.
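A minimal sketch of consuming a verbose_json response: it parses a trimmed copy of the sample above and prints each segment with its time range. The helper name `format_segments` is hypothetical, and the code reads only the start, end, and text fields.

```python
import json

# Trimmed copy of the sample verbose_json response.
response_text = """
{
  "text": "Hello, this is a transcription of the audio file.",
  "language": "en",
  "duration": 5.42,
  "segments": [
    {"id": 0, "start": 0.0, "end": 2.5, "text": "Hello, this is a transcription"}
  ]
}
"""


def format_segments(response: dict) -> list[str]:
    """Render each segment as '[start - end] text'."""
    return [
        f"[{seg['start']:.2f}s - {seg['end']:.2f}s] {seg['text']}"
        for seg in response.get("segments", [])
    ]


lines = format_segments(json.loads(response_text))
for line in lines:
    print(line)
```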

Extra parameters

The following sampling parameters are supported.

    use_beam_search: bool = False
    """Whether or not beam search should be used."""

    n: int = 1
    """The number of beams to be used in beam search."""

    length_penalty: float = 1.0
    """Length penalty to be used for beam search."""

    include_stop_str_in_output: bool = False
    """Whether to include the stop strings in output text."""

    temperature: float = Field(default=0.0)
    """The sampling temperature, between 0 and 1.

    Higher values like 0.8 will make the output more random, while lower values
    like 0.2 will make it more focused / deterministic. If set to 0, the model
    will use [log probability](https://en.wikipedia.org/wiki/Log_probability)
    to automatically increase the temperature until certain thresholds are hit.
    """

    top_p: float | None = None
    """Enables nucleus (top-p) sampling, where tokens are selected from the
    smallest possible set whose cumulative probability exceeds `p`.
    """

    top_k: int | None = None
    """Limits sampling to the `k` most probable tokens at each step."""

    min_p: float | None = None
    """Filters out tokens with a probability lower than `min_p`, ensuring a
    minimum likelihood threshold during sampling.
    """

    seed: int | None = Field(None, ge=_LONG_INFO.min, le=_LONG_INFO.max)
    """The seed to use for sampling."""

    frequency_penalty: float | None = 0.0
    """The frequency penalty to use for sampling."""

    repetition_penalty: float | None = None
    """The repetition penalty to use for sampling."""

    presence_penalty: float | None = 0.0
    """The presence penalty to use for sampling."""

    max_completion_tokens: int | None = None
    """The maximum number of tokens to generate."""

The following extra parameters are supported:

    # Flattened stream option to simplify form data.
    stream_include_usage: bool | None = False
    stream_continuous_usage_stats: bool | None = False

    vllm_xargs: dict[str, str | int | float | bool] | None = Field(
        default=None,
        description=(
            "Additional request parameters with string or "
            "numeric values, used by custom extensions."
        ),
    )

Translations API

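Our Translation API is compatible with OpenAI's Translations API; you can use the official OpenAI Python client to interact with it. Whisper models can translate audio from one of the 55 supported non-English languages into English. Note that the popular openai/whisper-large-v3-turbo model does not support translation.

Note

To use the Translation API, install the additional audio dependencies with pip install vllm[audio].

Code example: examples/online_serving/openai_translation_client.py

Extra parameters

The following sampling parameters are supported.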

    use_beam_search: bool = False
    """Whether or not beam search should be used."""

    n: int = 1
    """The number of beams to be used in beam search."""

    length_penalty: float = 1.0
    """Length penalty to be used for beam search."""

    include_stop_str_in_output: bool = False
    """Whether to include the stop strings in output text."""

    seed: int | None = Field(None, ge=_LONG_INFO.min, le=_LONG_INFO.max)
    """The seed to use for sampling."""

    temperature: float = Field(default=0.0)
    """The sampling temperature, between 0 and 1.

    Higher values like 0.8 will make the output more random, while lower values
    like 0.2 will make it more focused / deterministic. If set to 0, the model
    will use [log probability](https://en.wikipedia.org/wiki/Log_probability)
    to automatically increase the temperature until certain thresholds are hit.
    """

The following extra parameters are supported:

    language: str | None = None
    """The language of the input audio we translate from.

    Supplying the input language in
    [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) format
    will improve accuracy.
    """

    to_language: str | None = None
    """The language of the input audio we translate to.

    Please note that this is not supported by all models, refer to the specific
    model documentation for more details.
    For instance, Whisper only supports `to_language=en`.
    """

    stream: bool | None = False
    """Custom field not present in the original OpenAI definition. When set,
    it will enable output to be streamed in a similar fashion as the Chat
    Completion endpoint.
    """
    # Flattened stream option to simplify form data.
    stream_include_usage: bool | None = False
    stream_continuous_usage_stats: bool | None = False

    max_completion_tokens: int | None = None
    """The maximum number of tokens to generate."""

Realtime API

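The Realtime API provides WebSocket-based streaming audio transcription, enabling real-time speech-to-text while audio is being recorded.

Note

To use the Realtime API, install the additional audio dependencies with uv pip install vllm[audio].

Audio Format

Audio must be sent as base64-encoded PCM16 audio at a 16 kHz sample rate, mono channel.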
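As a rough sketch of preparing audio in this format, the code below converts floating-point mono samples in [-1.0, 1.0] to 16-bit PCM and base64-encodes the bytes. The helper name `encode_pcm16_base64` is hypothetical, and little-endian byte order is an assumption, since the text above does not state the byte order.

```python
import base64
import math
import struct

SAMPLE_RATE_HZ = 16_000  # 16 kHz, mono


def encode_pcm16_base64(samples) -> str:
    """Encode float samples in [-1.0, 1.0] as base64 PCM16."""
    # Clip first so out-of-range values cannot overflow a 16-bit integer.
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    # "<h" is a little-endian signed 16-bit integer (assumed byte order).
    raw = struct.pack(f"<{len(ints)}h", *ints)
    return base64.b64encode(raw).decode("ascii")


# 100 ms of a 440 Hz tone as test input.
tone = [
    0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE_HZ)
    for n in range(SAMPLE_RATE_HZ // 10)
]
chunk = encode_pcm16_base64(tone)
print(len(base64.b64decode(chunk)))
```

Each sample takes two bytes, so 100 ms of audio at 16 kHz decodes to 3200 bytes.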

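Protocol Overview

  1. The client connects to ws://host/v1/realtime
  2. The server sends a session.created event
  3. The client optionally sends session.update with the model/parameters
  4. When ready, the client sends input_audio_buffer.commit
  5. The client sends input_audio_buffer.append events containing base64 PCM16 chunks
  6. The server sends transcription.delta events containing incremental text
  7. The server sends a transcription.done event with the final text and usage
  8. For the next utterance, repeat from step 5
  9. The client optionally sends input_audio_buffer.commit with final=True to signal that audio input has ended. This is useful when streaming an audio file.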
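The client side of this sequence can be sketched as a list of JSON messages, in the order a client would send them when streaming one audio file. The helper name `build_client_events` is hypothetical, and sending "final": false on the first commit is an assumption. A real client would send these over a WebSocket connection and read the server events in between.

```python
import json


def build_client_events(model: str, audio_chunks_b64: list[str]) -> list[str]:
    """Return the client-to-server messages for one streamed file, in order."""
    events = [
        # Step 3: configure the session (optional in the protocol).
        {"type": "session.update", "model": model},
        # Step 4: signal readiness. "final": False is an assumed value.
        {"type": "input_audio_buffer.commit", "final": False},
    ]
    # Step 5: one append event per base64 PCM16 chunk.
    events.extend(
        {"type": "input_audio_buffer.append", "audio": chunk}
        for chunk in audio_chunks_b64
    )
    # Step 9: mark the end of the audio input.
    events.append({"type": "input_audio_buffer.commit", "final": True})
    return [json.dumps(event) for event in events]


# The model name and chunk strings are placeholders.
messages = build_client_events("model-name", ["AAAA", "BBBB"])
for message in messages:
    print(message)
```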

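Client → Server Events

Event | Description
input_audio_buffer.append | Send a base64-encoded audio chunk: {"type": "input_audio_buffer.append", "audio": "<base64>"}
input_audio_buffer.commit | Trigger transcription processing or end the input: {"type": "input_audio_buffer.commit", "final": bool}
session.update | Configure the session: {"type": "session.update", "model": "model-name"}

Server → Client Events

Event | Description
session.created | Connection established, with the session ID and a timestamp
transcription.delta | Incremental transcription text: {"type": "transcription.delta", "delta": "text"}
transcription.done | Final transcription with usage statistics
error | Error notification with a message and an optional code

Example Clients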

Tokenizer API

Our Tokenizer API is a simple wrapper over HuggingFace-style tokenizers. It consists of two endpoints:

  • /tokenize corresponds to calling tokenizer.encode()
  • /detokenize corresponds to calling tokenizer.decode()
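The sketch below shows how requests to these two endpoints might be built with the standard library. The request fields (`model`, `prompt`, `tokens`), the server address, and the helper names are all assumptions, since the text above does not spell out the schema. Check the server's /docs page for the exact shape. The requests are constructed but not sent.

```python
import json
import urllib.request


def _post_json(url: str, body: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        url=url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_tokenize_request(base_url: str, model: str, prompt: str):
    # Assumed fields: "model" and "prompt".
    return _post_json(f"{base_url}/tokenize", {"model": model, "prompt": prompt})


def build_detokenize_request(base_url: str, model: str, tokens: list[int]):
    # Assumed fields: "model" and "tokens".
    return _post_json(f"{base_url}/detokenize", {"model": model, "tokens": tokens})


base = "http://localhost:8000"  # assumed local server address
tok_req = build_tokenize_request(base, "NousResearch/Meta-Llama-3-8B-Instruct", "Hello!")
detok_req = build_detokenize_request(base, "NousResearch/Meta-Llama-3-8B-Instruct", [1, 2, 3])
print(tok_req.full_url, detok_req.full_url)
```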

Score API

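Score Template

Some scoring models require a specific prompt format to work correctly. You can specify a custom score template with the --chat-template parameter (see Chat Template).

Score templates apply only to cross-encoder models. If you use an embedding model for scoring, vLLM does not apply a score template.

Like chat templates, a score template receives a messages list. For scoring, each message has a role attribute that is either "query" or "document". For the usual pointwise cross-encoder, you can expect exactly two messages: one query and one document. To access the query and document content, use Jinja's selectattr filter:

  • Query: {{ (messages | selectattr("role", "eq", "query") | first).content }}
  • Document: {{ (messages | selectattr("role", "eq", "document") | first).content }}

This approach is more robust than index-based access (messages[0], messages[1]) because it selects messages by their semantic role. It also avoids assumptions about message ordering if additional message types are added to messages in the future.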
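The same role-based selection can be expressed in plain Python, which makes the difference from index-based access easy to see. This is an illustrative equivalent of the selectattr expressions, not code that vLLM runs. The helper name `select_content` and the sample messages are hypothetical.

```python
def select_content(messages: list[dict], role: str) -> str:
    """Return the content of the first message with the given role."""
    match = next((m for m in messages if m["role"] == role), None)
    if match is None:
        raise ValueError(f"no message with role {role!r}")
    return match["content"]


# The document is deliberately listed first, so messages[0] is not the query.
messages = [
    {"role": "document", "content": "Paris is the capital of France."},
    {"role": "query", "content": "What is the capital of France?"},
]
query = select_content(messages, "query")
document = select_content(messages, "document")
print(query)
print(document)
```

Index-based access would have swapped the two values here, while selection by role is unaffected by the order.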

Example template file: examples/pooling/score/template/nemotron-rerank.jinja

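Generative Scoring API

The /generative_scoring endpoint uses a CausalLM model (such as Llama, Qwen, or Mistral) to compute the probability that specified token IDs appear as the next token. Each item (document) is concatenated with the query to form a prompt, and the model predicts how likely each label token is to be the next token after that prompt. This lets you score items against a query, for example by asking "Is this the capital of France?" and scoring each city by how likely the model is to answer "Yes".

The endpoint is automatically available when the server is started with a generative model (task "generate"). It is separate from the pooling-based Score API, which uses cross-encoder, bi-encoder, or late-interaction models.

Requirements

  • The label_token_ids parameter is required and must contain at least 1 token ID.
  • When 2 label tokens are provided, the score equals P(label_token_ids[0]) / (P(label_token_ids[0]) + P(label_token_ids[1])) (softmax over the two labels).
  • When more labels are provided, the score is the softmax-normalized probability of the first label token across all label tokens.

Example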

curl -X POST http://localhost:8000/generative_scoring \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "query": "Is this city the capital of France?",
    "items": ["Paris", "London", "Berlin"],
    "label_token_ids": [9454, 2753]
  }'

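Here, each item is appended to the query to form prompts such as "Is this city the capital of France? Paris", "... London", and so on. The model then predicts the next token, and the score reflects the probability of "Yes" (token 9454) versus "No" (token 2753).

Response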
{
  "id": "generative-scoring-abc123",
  "object": "list",
  "created": 1234567890,
  "model": "Qwen/Qwen3-0.6B",
  "data": [
    {"index": 0, "object": "score", "score": 0.95},
    {"index": 1, "object": "score", "score": 0.12},
    {"index": 2, "object": "score", "score": 0.08}
  ],
  "usage": {"prompt_tokens": 45, "total_tokens": 48, "completion_tokens": 3}
}

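How it works

  1. Prompt construction: for each item, build prompt = query + item (or item + query if item_first=true)
  2. Forward pass: run the model on each prompt to get the next-token logits
  3. Probability extraction: extract the logprobs for the specified label_token_ids
  4. Softmax normalization: apply softmax over the label tokens only (when apply_softmax=true)
  5. Scoring: return the normalized probability of the first label token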
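Steps 1, 4, and 5 can be sketched in a few lines, starting from label logprobs that a model has already produced. The function names, the `separator` argument, and returning the raw probability of the first label when softmax is disabled are assumptions. This illustrates the arithmetic and is not the server's implementation.

```python
import math


def build_prompt(query: str, item: str, item_first: bool = False,
                 separator: str = "") -> str:
    """Step 1: concatenate query and item in the requested order."""
    return item + separator + query if item_first else query + separator + item


def score_from_logprobs(label_logprobs: list[float],
                        apply_softmax: bool = True) -> float:
    """Steps 4-5: normalize over the label tokens and score the first one."""
    if not label_logprobs:
        raise ValueError("at least one label logprob is required")
    if not apply_softmax:
        # Assumption: without softmax, report the raw first-label probability.
        return math.exp(label_logprobs[0])
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(label_logprobs)
    weights = [math.exp(lp - peak) for lp in label_logprobs]
    return weights[0] / sum(weights)


# Made-up probabilities: P("Yes") = 0.6 and P("No") = 0.2.
score = score_from_logprobs([math.log(0.6), math.log(0.2)])
print(round(score, 4))  # 0.6 / (0.6 + 0.2) = 0.75
```

With two labels this reduces to the P(a) / (P(a) + P(b)) form given in the requirements.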

Finding Token IDs

To find the token IDs for your labels, use the tokenizer.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
no_id = tokenizer.encode("No", add_special_tokens=False)[0]
print(f"Yes: {yes_id}, No: {no_id}")

Ray Serve LLM

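Ray Serve LLM enables scalable, production-grade serving of the vLLM engine. It integrates tightly with vLLM and extends it with features such as auto-scaling, load balancing, and back-pressure.

Key capabilities

  • Exposes an OpenAI-compatible HTTP API as well as a Python API.
  • Scales from a single GPU to a multi-node cluster without code changes.
  • Provides observability and autoscaling policies through Ray dashboards and metrics.

The following example shows how to deploy a large model like DeepSeek R1 with Ray Serve LLM: examples/online_serving/ray_serve_deepseek.py

Learn more about Ray Serve LLM in the official Ray Serve LLM documentation.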