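OpenAI-Compatible Server¶
vLLM provides an HTTP server that implements OpenAI's Completions API, Chat API, and more. This lets you serve models and interact with them using an HTTP client.
In your terminal, you can install vLLM and then start the server with the vllm serve command. (You can also use our Docker image.)
To call the server, create a script in your preferred text editor that uses an HTTP client. Include any messages you want to send to the model, then run the script. The official OpenAI Python client works well for such a script.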
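A minimal sketch of such a script follows. It assumes the server was started with a model such as NousResearch/Meta-Llama-3-8B-Instruct and listens on http://localhost:8000/v1. The address, the placeholder API key, and the helper name build_chat_request are illustrative assumptions, not fixed values.

```python
def build_chat_request(model: str, user_message: str) -> dict:
    """Assemble keyword arguments for client.chat.completions.create()."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }


def main() -> None:
    # Imported lazily so the helper above works without the client installed.
    from openai import OpenAI

    # Assumed address of a locally running vLLM server; adjust to your setup.
    # The API key is a placeholder and only matters if the server enforces one.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    completion = client.chat.completions.create(
        **build_chat_request("NousResearch/Meta-Llama-3-8B-Instruct", "Hello!")
    )
    print(completion.choices[0].message.content)
```

Call main() once the server is running; build_chat_request can be reused for any model name the server is serving.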
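Tip
vLLM supports some parameters that OpenAI does not, such as top_k. You can pass these to vLLM through the OpenAI client's extra_body request parameter; for top_k, for example, use extra_body={"top_k": 50}.
Important
By default, the server applies generation_config.json from the Hugging Face model repository if it exists. This means the default values of some sampling parameters may be overridden by those recommended by the model creator.
To disable this behavior, pass --generation-config vllm when launching the server.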
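Supported APIs¶
We currently support the following OpenAI APIs:
- Completions API (/v1/completions)
  - Only applicable to text generation models.
  - Note: the suffix parameter is not supported.
- Chat Completions API (/v1/chat/completions)
- Embeddings API (/v1/embeddings)
  - Only applicable to embedding models.
- Transcriptions API (/v1/audio/transcriptions)
  - Only applicable to automatic speech recognition (ASR) models.
- Translation API (/v1/audio/translations)
  - Only applicable to automatic speech recognition (ASR) models.
In addition, we have the following custom APIs:
- Tokenizer API (/tokenize, /detokenize)
  - Applicable to any model with a tokenizer.
- Pooling API (/pooling)
  - Applicable to all pooling models.
- Classification API (/classify)
  - Only applicable to classification models.
- Score API (/score)
  - Applicable to embedding models and cross-encoder models.
- Re-rank API (/rerank, /v1/rerank, /v2/rerank)
  - Implements Jina AI's v1 re-rank API.
  - Also compatible with Cohere's v1 and v2 re-rank APIs.
  - Jina's and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
  - Only applicable to cross-encoder models.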
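Chat Template¶
For a language model to support the chat protocol, vLLM requires the model to include a chat template in its tokenizer configuration. The chat template is a Jinja2 template that specifies how roles, messages, and other chat-specific tokens are encoded in the input.
An example chat template can be found in the tokenizer configuration of NousResearch/Meta-Llama-3-8B-Instruct.
Some models do not provide a chat template even though they are instruction/chat fine-tuned. For those models, you can specify the chat template manually with the --chat-template parameter, giving either a file path to the template or the template itself as a string. Without a chat template, the server cannot process chat and all chat requests will fail with an error.
The vLLM community provides a set of chat templates for popular models. You can find them under the examples directory.
With the inclusion of multi-modal chat APIs, the OpenAI spec now accepts chat messages in a new format that specifies both a type and a text field. An example is provided below.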
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"},
            ],
        },
    ],
)
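Most chat templates for LLMs expect the content field to be a string, but some newer models, such as meta-llama/Llama-Guard-3-1B, expect the content to be formatted according to the OpenAI schema in the request. vLLM provides best-effort support for detecting this automatically, which is logged as a string like "Detected the chat template content format to be...", and internally converts incoming requests to match the detected format, which can be one of:
- "string": A string.
  - Example: "Hello world"
- "openai": A list of dictionaries, similar to the OpenAI schema.
  - Example: [{"type": "text", "text": "Hello world!"}]
If the result is not what you expect, you can set the --chat-template-content-format CLI argument to override which format to use.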
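To make the two formats concrete, here is a minimal sketch of converting message content between them. The helper names to_openai_format and to_string_format are hypothetical, and joining text parts with a newline is an assumption; this illustrates the idea and is not vLLM's internal conversion code.

```python
from typing import Union

ContentPart = dict[str, str]
Content = Union[str, list[ContentPart]]


def to_openai_format(content: Content) -> list[ContentPart]:
    """Wrap a plain string as a single OpenAI-style text part."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return content


def to_string_format(content: Content) -> str:
    """Flatten OpenAI-style parts to a string, keeping only text parts."""
    if isinstance(content, str):
        return content
    # Assumption: text parts are joined with a newline.
    return "\n".join(p["text"] for p in content if p.get("type") == "text")
```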
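Extra Parameters¶
vLLM supports a set of parameters that are not part of the OpenAI API. To use them, pass them as extra parameters in the OpenAI client, or merge them directly into the JSON payload if you are calling the HTTP API directly.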
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
    ],
    extra_body={
        "structured_outputs": {"choice": ["positive", "negative"]},
    },
)
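For direct HTTP calls, the same request can be sketched with only the standard library by merging the extra parameters into the top level of the JSON body. The helper name build_chat_http_request and the address http://localhost:8000/v1 are assumptions for illustration.

```python
import json
import urllib.request


def build_chat_http_request(
    base_url: str, payload: dict, extra_params: dict
) -> urllib.request.Request:
    """Merge vLLM-specific parameters into the top level of the JSON body."""
    body = {**payload, **extra_params}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_http_request(
    "http://localhost:8000/v1",  # assumed server address
    {
        "model": "NousResearch/Meta-Llama-3-8B-Instruct",
        "messages": [
            {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
        ],
    },
    {"structured_outputs": {"choice": ["positive", "negative"]}},
)
# urllib.request.urlopen(request) would send it to a running server.
```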
Extra HTTP Headers¶
Only the X-Request-Id HTTP request header is supported for now. It can be enabled with --enable-request-id-headers.
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
    ],
    extra_headers={
        "x-request-id": "sentiment-classification-00001",
    },
)
print(completion._request_id)

completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="A robot may not injure a human being",
    extra_headers={
        "x-request-id": "completion-test",
    },
)
print(completion._request_id)
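API Reference¶
Completions API¶
Our Completions API is compatible with OpenAI's Completions API; you can use the official OpenAI Python client to interact with it.
Code example: examples/online_serving/openai_completion_client.py
Extra parameters¶
The following sampling parameters are supported.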
use_beam_search: bool = False
top_k: int | None = None
min_p: float | None = None
repetition_penalty: float | None = None
length_penalty: float = 1.0
stop_token_ids: list[int] | None = []
include_stop_str_in_output: bool = False
ignore_eos: bool = False
min_tokens: int = 0
skip_special_tokens: bool = True
spaces_between_special_tokens: bool = True
truncate_prompt_tokens: Annotated[int, Field(ge=-1)] | None = None
allowed_token_ids: list[int] | None = None
prompt_logprobs: int | None = None
The following extra parameters are supported:
prompt_embeds: bytes | list[bytes] | None = None
add_special_tokens: bool = Field(
default=True,
description=(
"If true (the default), special tokens (e.g. BOS) will be added to "
"the prompt."
),
)
response_format: AnyResponseFormat | None = Field(
default=None,
description=(
"Similar to chat completion, this parameter specifies the format "
"of output. Only {'type': 'json_object'}, {'type': 'json_schema'}"
", {'type': 'structural_tag'}, or {'type': 'text' } is supported."
),
)
structured_outputs: StructuredOutputsParams | None = Field(
default=None,
description="Additional kwargs for structured outputs",
)
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."
),
)
request_id: str = Field(
default_factory=random_uuid,
description=(
"The request_id related to this request. If the caller does "
"not set it, a random_uuid will be generated. This id is used "
"through out the inference process and return in response."
),
)
logits_processors: LogitsProcessors | None = Field(
default=None,
description=(
"A list of either qualified names of logits processors, or "
"constructor objects, to apply when sampling. A constructor is "
"a JSON object with a required 'qualname' field specifying the "
"qualified name of the processor class/factory, and optional "
"'args' and 'kwargs' fields containing positional and keyword "
"arguments. For example: {'qualname': "
"'my_module.MyLogitsProcessor', 'args': [1, 2], 'kwargs': "
"{'param': 'value'}}."
),
)
return_tokens_as_token_ids: bool | None = Field(
default=None,
description=(
"If specified with 'logprobs', tokens are represented "
" as strings of the form 'token_id:{token_id}' so that tokens "
"that are not JSON-encodable can be identified."
),
)
return_token_ids: bool | None = Field(
default=None,
description=(
"If specified, the result will include token IDs alongside the "
"generated text. In streaming mode, prompt_token_ids is included "
"only in the first chunk, and token_ids contains the delta tokens "
"for each chunk. This is useful for debugging or when you "
"need to map generated text back to input tokens."
),
)
cache_salt: str | None = Field(
default=None,
description=(
"If specified, the prefix cache will be salted with the provided "
"string to prevent an attacker to guess prompts in multi-user "
"environments. The salt should be random, protected from "
"access by 3rd parties, and long enough to be "
"unpredictable (e.g., 43 characters base64-encoded, corresponding "
"to 256 bit)."
),
)
kv_transfer_params: dict[str, Any] | None = Field(
default=None,
description="KVTransfer parameters used for disaggregated serving.",
)
vllm_xargs: dict[str, str | int | float] | None = Field(
default=None,
description=(
"Additional request parameters with string or "
"numeric values, used by custom extensions."
),
)
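The cache_salt description above suggests a random salt of about 43 base64 characters (256 bits). One way to produce such a value, sketched here with Python's standard secrets module; the helper name new_cache_salt is hypothetical:

```python
import secrets


def new_cache_salt() -> str:
    """Return a random salt suitable for the cache_salt request field."""
    # 32 random bytes = 256 bits; URL-safe base64 without padding
    # encodes them as 43 characters.
    return secrets.token_urlsafe(32)


# Passed like any other extra parameter, e.g. via the client's extra_body.
extra_body = {"cache_salt": new_cache_salt()}
```

Keep the salt secret and reuse it only across requests that are allowed to share a prefix cache.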
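Chat API¶
Our Chat API is compatible with OpenAI's Chat Completions API; you can use the official OpenAI Python client to interact with it.
We support both vision- and audio-related parameters; see our Multimodal Inputs guide for more information.
- Note: the image_url.detail parameter is not supported.
Code example: examples/online_serving/openai_chat_completion_client.py
Extra parameters¶
The following sampling parameters are supported.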
use_beam_search: bool = False
top_k: int | None = None
min_p: float | None = None
repetition_penalty: float | None = None
length_penalty: float = 1.0
stop_token_ids: list[int] | None = []
include_stop_str_in_output: bool = False
ignore_eos: bool = False
min_tokens: int = 0
skip_special_tokens: bool = True
spaces_between_special_tokens: bool = True
truncate_prompt_tokens: Annotated[int, Field(ge=-1)] | None = None
prompt_logprobs: int | None = None
allowed_token_ids: list[int] | None = None
bad_words: list[str] = Field(default_factory=list)
The following extra parameters are supported:
echo: bool = Field(
default=False,
description=(
"If true, the new message will be prepended with the last message "
"if they belong to the same role."
),
)
add_generation_prompt: bool = Field(
default=True,
description=(
"If true, the generation prompt will be added to the chat template. "
"This is a parameter used by chat template in tokenizer config of the "
"model."
),
)
continue_final_message: bool = Field(
default=False,
description=(
"If this is set, the chat will be formatted so that the final "
"message in the chat is open-ended, without any EOS tokens. The "
"model will continue this message rather than starting a new one. "
'This allows you to "prefill" part of the model\'s response for it. '
"Cannot be used at the same time as `add_generation_prompt`."
),
)
add_special_tokens: bool = Field(
default=False,
description=(
"If true, special tokens (e.g. BOS) will be added to the prompt "
"on top of what is added by the chat template. "
"For most models, the chat template takes care of adding the "
"special tokens so this should be set to false (as is the "
"default)."
),
)
documents: list[dict[str, str]] | None = Field(
default=None,
description=(
"A list of dicts representing documents that will be accessible to "
"the model if it is performing RAG (retrieval-augmented generation)."
" If the template does not support RAG, this argument will have no "
"effect. We recommend that each document should be a dict containing "
'"title" and "text" keys.'
),
)
chat_template: str | None = Field(
default=None,
description=(
"A Jinja template to use for this conversion. "
"As of transformers v4.44, default chat template is no longer "
"allowed, so you must provide a chat template if the tokenizer "
"does not define one."
),
)
chat_template_kwargs: dict[str, Any] | None = Field(
default=None,
description=(
"Additional keyword args to pass to the template renderer. "
"Will be accessible by the chat template."
),
)
mm_processor_kwargs: dict[str, Any] | None = Field(
default=None,
description=("Additional kwargs to pass to the HF processor."),
)
structured_outputs: StructuredOutputsParams | None = Field(
default=None,
description="Additional kwargs for structured outputs",
)
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."
),
)
request_id: str = Field(
default_factory=random_uuid,
description=(
"The request_id related to this request. If the caller does "
"not set it, a random_uuid will be generated. This id is used "
"through out the inference process and return in response."
),
)
logits_processors: LogitsProcessors | None = Field(
default=None,
description=(
"A list of either qualified names of logits processors, or "
"constructor objects, to apply when sampling. A constructor is "
"a JSON object with a required 'qualname' field specifying the "
"qualified name of the processor class/factory, and optional "
"'args' and 'kwargs' fields containing positional and keyword "
"arguments. For example: {'qualname': "
"'my_module.MyLogitsProcessor', 'args': [1, 2], 'kwargs': "
"{'param': 'value'}}."
),
)
return_tokens_as_token_ids: bool | None = Field(
default=None,
description=(
"If specified with 'logprobs', tokens are represented "
" as strings of the form 'token_id:{token_id}' so that tokens "
"that are not JSON-encodable can be identified."
),
)
return_token_ids: bool | None = Field(
default=None,
description=(
"If specified, the result will include token IDs alongside the "
"generated text. In streaming mode, prompt_token_ids is included "
"only in the first chunk, and token_ids contains the delta tokens "
"for each chunk. This is useful for debugging or when you "
"need to map generated text back to input tokens."
),
)
cache_salt: str | None = Field(
default=None,
description=(
"If specified, the prefix cache will be salted with the provided "
"string to prevent an attacker to guess prompts in multi-user "
"environments. The salt should be random, protected from "
"access by 3rd parties, and long enough to be "
"unpredictable (e.g., 43 characters base64-encoded, corresponding "
"to 256 bit)."
),
)
kv_transfer_params: dict[str, Any] | None = Field(
default=None,
description="KVTransfer parameters used for disaggregated serving.",
)
vllm_xargs: dict[str, str | int | float | list[str | int | float]] | None = Field(
default=None,
description=(
"Additional request parameters with (list of) string or "
"numeric values, used by custom extensions."
),
)
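Embeddings API¶
Our Embeddings API is compatible with OpenAI's Embeddings API; you can use the official OpenAI Python client to interact with it.
Code example: examples/pooling/embed/openai_embedding_client.py
If the model has a chat template, you can replace inputs with a list of messages (same schema as the Chat API), which will be treated as a single prompt to the model. Here is a convenience function for calling the API while retaining OpenAI's type annotations: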
from typing import Literal, Union

from openai import OpenAI
from openai._types import NOT_GIVEN, NotGiven
from openai.types.chat import ChatCompletionMessageParam
from openai.types.create_embedding_response import CreateEmbeddingResponse


def create_chat_embeddings(
    client: OpenAI,
    *,
    messages: list[ChatCompletionMessageParam],
    model: str,
    encoding_format: Union[Literal["base64", "float"], NotGiven] = NOT_GIVEN,
) -> CreateEmbeddingResponse:
    # Post directly to the embeddings endpoint, since the typed client
    # method does not accept a messages field.
    return client.post(
        "/embeddings",
        cast_to=CreateEmbeddingResponse,
        body={"messages": messages, "model": model, "encoding_format": encoding_format},
    )
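Multi-modal inputs¶
You can pass multi-modal inputs to embedding models by defining a custom chat template for the server and passing a list of messages in the request. Refer to the examples below for illustration.
To serve the model: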
vllm serve TIGER-Lab/VLM2Vec-Full --runner pooling \
--trust-remote-code \
--max-model-len 4096 \
--chat-template examples/template_vlm2vec_phi3v.jinja
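Important
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass --runner pooling to run this model in embedding mode instead of text generation mode.
The custom chat template is completely different from the original one for this model, and can be found here: examples/template_vlm2vec_phi3v.jinja
Since the request schema is not defined by the OpenAI client, the example below posts the request through the create_chat_embeddings helper defined above, which uses the client's lower-level post method.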
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)
image_url = "https://vllm-public-assets.s3.us-west-2.amazonaws.com/vision_model_images/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

response = create_chat_embeddings(
    client,
    model="TIGER-Lab/VLM2Vec-Full",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": "Represent the given image."},
            ],
        }
    ],
    encoding_format="float",
)

print("Image embedding output:", response.data[0].embedding)
To serve the model:
vllm serve MrLight/dse-qwen2-2b-mrl-v1 --runner pooling \
--trust-remote-code \
--max-model-len 8192 \
--chat-template examples/template_dse_qwen2_vl.jinja
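Important
As with VLM2Vec, we have to explicitly pass --runner pooling.
Additionally, MrLight/dse-qwen2-2b-mrl-v1 requires an EOS token for embeddings, which is handled by a custom chat template: examples/template_dse_qwen2_vl.jinja
Important
MrLight/dse-qwen2-2b-mrl-v1 requires a placeholder image of the minimum image size for text query embeddings. See the full code example below for details.
Full example: examples/pooling/embed/openai_chat_embedding_client_for_multimodal.py
Extra parameters¶
The following pooling parameters are supported.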
truncate_prompt_tokens: Annotated[int, msgspec.Meta(ge=-1)] | None = None
dimensions: int | None = None
normalize: bool | None = None
The following extra parameters are supported by default:
add_special_tokens: bool = Field(
default=True,
description=(
"If true (the default), special tokens (e.g. BOS) will be added to "
"the prompt."
),
)
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."
),
)
request_id: str = Field(
default_factory=random_uuid,
description=(
"The request_id related to this request. If the caller does "
"not set it, a random_uuid will be generated. This id is used "
"through out the inference process and return in response."
),
)
normalize: bool | None = Field(
default=None,
description="Whether to normalize the embeddings outputs. Default is True.",
)
embed_dtype: EmbedDType = Field(
default="float32",
description=(
"What dtype to use for encoding. Default to using float32 for base64 "
"encoding to match the OpenAI python client behavior. "
"This parameter will affect base64 and binary_response."
),
)
endianness: Endianness = Field(
default="native",
description=(
"What endianness to use for encoding. Default to using native for "
"base64 encoding to match the OpenAI python client behavior."
"This parameter will affect base64 and binary_response."
),
)
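When base64 encoding is requested, the embed_dtype and endianness fields above determine how the returned bytes should be read. Below is a minimal sketch of decoding such a payload on the client, assuming the default float32 dtype and that endianness takes one of "native", "little", or "big". The helper name decode_embedding is hypothetical.

```python
import base64
import struct

# Map each assumed endianness value to the matching struct byte-order prefix.
_BYTE_ORDER = {"native": "=", "little": "<", "big": ">"}


def decode_embedding(encoded: str, endianness: str = "native") -> list[float]:
    """Decode a base64 string holding packed float32 values."""
    raw = base64.b64decode(encoded)
    count = len(raw) // 4  # float32 uses 4 bytes per value
    return list(struct.unpack(f"{_BYTE_ORDER[endianness]}{count}f", raw))
```

Other embed_dtype values would need a different element size and format code, which this sketch does not cover.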
For chat-like input (i.e. if messages is passed), these extra parameters are supported:
add_generation_prompt: bool = Field(
default=False,
description=(
"If true, the generation prompt will be added to the chat template. "
"This is a parameter used by chat template in tokenizer config of the "
"model."
),
)
add_special_tokens: bool = Field(
default=False,
description=(
"If true, special tokens (e.g. BOS) will be added to the prompt "
"on top of what is added by the chat template. "
"For most models, the chat template takes care of adding the "
"special tokens so this should be set to false (as is the "
"default)."
),
)
chat_template: str | None = Field(
default=None,
description=(
"A Jinja template to use for this conversion. "
"As of transformers v4.44, default chat template is no longer "
"allowed, so you must provide a chat template if the tokenizer "
"does not define one."
),
)
chat_template_kwargs: dict[str, Any] | None = Field(
default=None,
description=(
"Additional keyword args to pass to the template renderer. "
"Will be accessible by the chat template."
),
)
mm_processor_kwargs: dict[str, Any] | None = Field(
default=None,
description=("Additional kwargs to pass to the HF processor."),
)
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."
),
)
request_id: str = Field(
default_factory=random_uuid,
description=(
"The request_id related to this request. If the caller does "
"not set it, a random_uuid will be generated. This id is used "
"through out the inference process and return in response."
),
)
normalize: bool | None = Field(
default=None,
description="Whether to normalize the embeddings outputs. Default is True.",
)
embed_dtype: EmbedDType = Field(
default="float32",
description=(
"What dtype to use for encoding. Default to using float32 for base64 "
"encoding to match the OpenAI python client behavior. "
"This parameter will affect base64 and binary_response."
),
)
endianness: Endianness = Field(
default="native",
description=(
"What endianness to use for encoding. Default to using native for "
"base64 encoding to match the OpenAI python client behavior."
"This parameter will affect base64 and binary_response."
),
)
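Transcriptions API¶
Our Transcriptions API is compatible with OpenAI's Transcriptions API; you can use the official OpenAI Python client to interact with it.
Note
To use the Transcriptions API, install the additional audio dependencies with pip install vllm[audio].
Code example: examples/online_serving/openai_transcription_client.py
API Enforced Limits¶
Set the maximum audio file size (in MB) that vLLM will accept via the VLLM_MAX_AUDIO_CLIP_FILESIZE_MB environment variable. The default is 25 MB.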
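A client can check a file against this limit before uploading it. The sketch below assumes 1 MB is counted as 1024 * 1024 bytes and that the client knows the server's configured limit; the helper names are hypothetical and this is not the server's own validation code.

```python
import os

DEFAULT_MAX_AUDIO_MB = 25  # default limit stated above


def size_within_limit(size_bytes: int, max_mb: float = DEFAULT_MAX_AUDIO_MB) -> bool:
    """Return True if size_bytes does not exceed max_mb megabytes."""
    # Assumption: one megabyte is 1024 * 1024 bytes.
    return size_bytes / (1024 * 1024) <= max_mb


def audio_file_within_limit(path: str, max_mb: float = DEFAULT_MAX_AUDIO_MB) -> bool:
    """Check a file on disk before uploading it for transcription."""
    return size_within_limit(os.path.getsize(path), max_mb)
```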
Uploading Audio Files¶
The Transcriptions API supports uploading audio files in various formats, including FLAC, MP3, MP4, MPEG, MPGA, M4A, OGG, WAV, and WEBM.
Using the OpenAI Python client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

# Upload audio file from disk
with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3-turbo",
        file=audio_file,
        language="en",
        response_format="verbose_json",
    )

print(transcription.text)
Using curl with multipart/form-data
curl -X POST "http://localhost:8000/v1/audio/transcriptions" \
  -H "Authorization: Bearer token-abc123" \
  -F "file=@audio.mp3" \
  -F "model=openai/whisper-large-v3-turbo" \
  -F "language=en" \
  -F "response_format=verbose_json"
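Supported parameters
- file: The audio file to transcribe (required)
- model: The model to use for transcription (required)
- language: The language code (e.g., "en", "zh") (optional)
- prompt: Text to guide the transcription style (optional)
- response_format: The response format ("json", "text") (optional)
- temperature: The sampling temperature, between 0 and 1 (optional)
For the complete list of supported parameters, including sampling parameters and vLLM extensions, see the protocol definitions.
Response format
For the verbose_json response format: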
{
"text": "Hello, this is a transcription of the audio file.",
"language": "en",
"duration": 5.42,
"segments": [
{
"id": 0,
"seek": 0,
"start": 0.0,
"end": 2.5,
"text": "Hello, this is a transcription",
"tokens": [50364, 938, 428, 307, 275, 28347],
"temperature": 0.0,
"avg_logprob": -0.245,
"compression_ratio": 1.235,
"no_speech_prob": 0.012
}
]
}
Currently the "verbose_json" response format does not support avg_logprob, compression_ratio, or no_speech_prob.
Extra parameters¶
The following sampling parameters are supported.
temperature: float = Field(default=0.0)
"""The sampling temperature, between 0 and 1.
Higher values like 0.8 will make the output more random, while lower values
like 0.2 will make it more focused / deterministic. If set to 0, the model
will use [log probability](https://en.wikipedia.org/wiki/Log_probability)
to automatically increase the temperature until certain thresholds are hit.
"""
top_p: float | None = None
"""Enables nucleus (top-p) sampling, where tokens are selected from the
smallest possible set whose cumulative probability exceeds `p`.
"""
top_k: int | None = None
"""Limits sampling to the `k` most probable tokens at each step."""
min_p: float | None = None
"""Filters out tokens with a probability lower than `min_p`, ensuring a
minimum likelihood threshold during sampling.
"""
seed: int | None = Field(None, ge=_LONG_INFO.min, le=_LONG_INFO.max)
"""The seed to use for sampling."""
frequency_penalty: float | None = 0.0
"""The frequency penalty to use for sampling."""
repetition_penalty: float | None = None
"""The repetition penalty to use for sampling."""
presence_penalty: float | None = 0.0
"""The presence penalty to use for sampling."""
The following extra parameters are supported:
# Flattened stream option to simplify form data.
stream_include_usage: bool | None = False
stream_continuous_usage_stats: bool | None = False
vllm_xargs: dict[str, str | int | float] | None = Field(
default=None,
description=(
"Additional request parameters with string or "
"numeric values, used by custom extensions."
),
)
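Translations API¶
Our Translation API is compatible with OpenAI's Translations API; you can use the official OpenAI Python client to interact with it. Whisper models can translate audio from one of the 55 supported non-English languages into English. Note that the popular openai/whisper-large-v3-turbo model does not support translation.
Note
To use the Translation API, install the additional audio dependencies with pip install vllm[audio].
Code example: examples/online_serving/openai_translation_client.py
Extra parameters¶
The following sampling parameters are supported.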
seed: int | None = Field(None, ge=_LONG_INFO.min, le=_LONG_INFO.max)
"""The seed to use for sampling."""
temperature: float = Field(default=0.0)
"""The sampling temperature, between 0 and 1.
Higher values like 0.8 will make the output more random, while lower values
like 0.2 will make it more focused / deterministic. If set to 0, the model
will use [log probability](https://en.wikipedia.org/wiki/Log_probability)
to automatically increase the temperature until certain thresholds are hit.
"""
The following extra parameters are supported:
language: str | None = None
"""The language of the input audio we translate from.
Supplying the input language in
[ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) format
will improve accuracy.
"""
to_language: str | None = None
"""The language of the input audio we translate to.
Please note that this is not supported by all models, refer to the specific
model documentation for more details.
For instance, Whisper only supports `to_language=en`.
"""
stream: bool | None = False
"""Custom field not present in the original OpenAI definition. When set,
it will enable output to be streamed in a similar fashion as the Chat
Completion endpoint.
"""
# Flattened stream option to simplify form data.
stream_include_usage: bool | None = False
stream_continuous_usage_stats: bool | None = False
Tokenizer API¶
Our Tokenizer API is a simple wrapper over HuggingFace-style tokenizers. It consists of two endpoints:
- /tokenize corresponds to calling tokenizer.encode().
- /detokenize corresponds to calling tokenizer.decode().
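The two endpoints can be exercised with plain JSON requests. The sketch below only builds the requests; the field names (model, prompt, tokens), the server address, and the helper names are assumptions that should be checked against the protocol definitions of your vLLM version.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed server address


def _post_json(path: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokenize_request(model: str, prompt: str) -> urllib.request.Request:
    # Assumed request fields: "model" and "prompt".
    return _post_json("/tokenize", {"model": model, "prompt": prompt})


def detokenize_request(model: str, tokens: list[int]) -> urllib.request.Request:
    # Assumed request fields: "model" and "tokens".
    return _post_json("/detokenize", {"model": model, "tokens": tokens})
```

Passing either request to urllib.request.urlopen sends it to a running server.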
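Pooling API¶
Our Pooling API encodes input prompts using a pooling model and returns the corresponding hidden states.
The input format is the same as for the Embeddings API, but the output data can contain an arbitrarily nested list, not just a 1-D list of floats.
Code example: examples/pooling/pooling/openai_pooling_client.py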
Classification API¶
Our Classification API directly supports Hugging Face sequence-classification models such as ai21labs/Jamba-tiny-reward-dev and jason9693/Qwen2.5-1.5B-apeach.
We automatically wrap any other transformer via as_seq_cls_model(), which pools on the last token, attaches a RowParallelLinear head, and applies a softmax to produce per-class probabilities.
Code example: examples/pooling/classify/openai_classification_client.py
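The final softmax step can be illustrated in a few lines. This is a conceptual sketch of how per-class logits become probabilities and a predicted label; the function names, the logits, and the id-to-label mapping are hypothetical, and this is not vLLM's implementation.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits: list[float], id2label: dict[int, str]) -> dict:
    """Turn class logits into probabilities and the most likely label."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "probs": probs, "num_classes": len(probs)}
```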
Example Requests¶
You can classify multiple texts by passing an array of strings:
curl -v "http://127.0.0.1:8000/classify" \
-H "Content-Type: application/json" \
-d '{
"model": "jason9693/Qwen2.5-1.5B-apeach",
"input": [
"Loved the new café—coffee was great.",
"This update broke everything. Frustrating."
]
}'
Response
{
"id": "classify-7c87cac407b749a6935d8c7ce2a8fba2",
"object": "list",
"created": 1745383065,
"model": "jason9693/Qwen2.5-1.5B-apeach",
"data": [
{
"index": 0,
"label": "Default",
"probs": [
0.565970778465271,
0.4340292513370514
],
"num_classes": 2
},
{
"index": 1,
"label": "Spoiled",
"probs": [
0.26448777318000793,
0.7355121970176697
],
"num_classes": 2
}
],
"usage": {
"prompt_tokens": 20,
"total_tokens": 20,
"completion_tokens": 0,
"prompt_tokens_details": null
}
}
You can also pass a string directly to the input field:
curl -v "http://127.0.0.1:8000/classify" \
-H "Content-Type: application/json" \
-d '{
"model": "jason9693/Qwen2.5-1.5B-apeach",
"input": "Loved the new café—coffee was great."
}'
Response
{
"id": "classify-9bf17f2847b046c7b2d5495f4b4f9682",
"object": "list",
"created": 1745383213,
"model": "jason9693/Qwen2.5-1.5B-apeach",
"data": [
{
"index": 0,
"label": "Default",
"probs": [
0.565970778465271,
0.4340292513370514
],
"num_classes": 2
}
],
"usage": {
"prompt_tokens": 10,
"total_tokens": 10,
"completion_tokens": 0,
"prompt_tokens_details": null
}
}
Extra parameters¶
The following pooling parameters are supported.
truncate_prompt_tokens: Annotated[int, msgspec.Meta(ge=-1)] | None = None
softmax: bool | None = None
activation: bool | None = None
use_activation: bool | None = None
The following extra parameters are supported:
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."
),
)
add_special_tokens: bool = Field(
default=True,
description=(
"If true (the default), special tokens (e.g. BOS) will be added to "
"the prompt."
),
)
request_id: str = Field(
default_factory=random_uuid,
description=(
"The request_id related to this request. If the caller does "
"not set it, a random_uuid will be generated. This id is used "
"through out the inference process and return in response."
),
)
softmax: bool | None = Field(
default=None,
description="softmax will be deprecated, please use use_activation instead.",
)
activation: bool | None = Field(
default=None,
description="activation will be deprecated, please use use_activation instead.",
)
use_activation: bool | None = Field(
default=None,
description="Whether to use activation for classification outputs. "
"Default is True.",
)
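Score API¶
Our Score API can apply a cross-encoder model or an embedding model to predict scores for sentence or multimodal pairs. When using an embedding model, the score corresponds to the cosine similarity between each embedding pair. Usually, the score for a sentence pair refers to the similarity between two sentences, on a scale of 0 to 1.
You can find the documentation for cross-encoder models at sbert.net.
Code example: examples/pooling/score/openai_cross_encoder_score.py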
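For the embedding-model case, the cosine similarity mentioned above is the dot product of the two embeddings divided by the product of their norms. A minimal pure-Python sketch follows; the function name is illustrative, and a real deployment computes this on the server.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```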
Single inference¶
You can pass a string to both text_1 and text_2, forming a single sentence pair.
curl -X 'POST' \
'http://127.0.0.1:8000/score' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-v2-m3",
"encoding_format": "float",
"text_1": "What is the capital of France?",
"text_2": "The capital of France is Paris."
}'
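Batch inference¶
You can pass a string to text_1 and a list to text_2, forming multiple sentence pairs in which each pair is built from text_1 and one string in text_2. The total number of pairs is len(text_2).
You can pass a list to both text_1 and text_2, forming multiple sentence pairs in which each pair is built from a string in text_1 and the corresponding string in text_2 (similar to zip()). The total number of pairs is len(text_2).
Request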
curl -X 'POST' \
'http://127.0.0.1:8000/score' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-v2-m3",
"encoding_format": "float",
"text_1": [
"What is the capital of Brazil?",
"What is the capital of France?"
],
"text_2": [
"The capital of Brazil is Brasilia.",
"The capital of France is Paris."
]
}'
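The pairing rules described in this section can be sketched as a small helper. The function name make_pairs is hypothetical, and raising on mismatched list lengths is an assumption about client-side validation rather than a description of the server's behavior.

```python
from typing import Union

Texts = Union[str, list[str]]


def make_pairs(text_1: Texts, text_2: Texts) -> list[tuple[str, str]]:
    """Expand text_1/text_2 into the sentence pairs that get scored."""
    if isinstance(text_2, str):
        text_2 = [text_2]
    if isinstance(text_1, str):
        # One-to-many: the single text_1 is paired with every text_2 entry.
        return [(text_1, t) for t in text_2]
    if len(text_1) != len(text_2):
        raise ValueError("text_1 and text_2 lists must have the same length")
    # Many-to-many: element-wise pairing, like zip().
    return list(zip(text_1, text_2))
```

In every case the number of pairs equals the number of entries in text_2.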
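Multi-modal inputs¶
You can pass multi-modal inputs to scoring models by passing content that includes a list of multi-modal inputs (images, etc.) in the request. Refer to the example below for illustration.
The example assumes the server is already serving jinaai/jina-reranker-m0.
Since the request schema is not defined by the OpenAI client, we post a request to the server using the lower-level requests library.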
import requests

response = requests.post(
    "http://localhost:8000/v1/score",
    json={
        "model": "jinaai/jina-reranker-m0",
        "text_1": "slm markdown",
        "text_2": {
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png"
                    },
                },
            ],
        },
    },
)
response.raise_for_status()
response_json = response.json()
print("Scoring output:", response_json["data"][0]["score"])
print("Scoring output:", response_json["data"][1]["score"])
Full example: examples/pooling/score/openai_cross_encoder_score_for_multimodal.py
Extra parameters¶
The following pooling parameters are supported.
truncate_prompt_tokens: Annotated[int, msgspec.Meta(ge=-1)] | None = None
softmax: bool | None = None
activation: bool | None = None
use_activation: bool | None = None
The following extra parameters are supported:
mm_processor_kwargs: dict[str, Any] | None = Field(
default=None,
description=("Additional kwargs to pass to the HF processor."),
)
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."
),
)
softmax: bool | None = Field(
default=None,
description="softmax will be deprecated, please use use_activation instead.",
)
activation: bool | None = Field(
default=None,
description="activation will be deprecated, please use use_activation instead.",
)
use_activation: bool | None = Field(
default=None,
description="Whether to use activation for classification outputs. "
"Default is True.",
)
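Re-rank API¶
Our Re-rank API can apply an embedding model or a cross-encoder model to predict relevance scores between a single query and each document in a list. Usually, the score for a sentence pair refers to the similarity between two sentences or multi-modal inputs (images, etc.), on a scale of 0 to 1.
You can find the documentation for cross-encoder models at sbert.net.
The rerank endpoints support popular re-rank models such as BAAI/bge-reranker-base and other models that support the score task. Additionally, the /rerank, /v1/rerank, and /v2/rerank endpoints are compatible with both Jina AI's re-rank API interface and Cohere's re-rank API interface, to ensure compatibility with popular open-source tools.
Code example: examples/pooling/score/jinaai_rerank_client.py
Example Request¶
Note that the top_n request parameter is optional and defaults to the length of the documents field. Result documents are sorted by relevance, and the index property can be used to determine the original order.
Request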
curl -X 'POST' \
'http://127.0.0.1:8000/v1/rerank' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "BAAI/bge-reranker-base",
"query": "What is the capital of France?",
"documents": [
"The capital of Brazil is Brasilia.",
"The capital of France is Paris.",
"Horses and cows are both animals"
]
}'
Response
{
"id": "rerank-fae51b2b664d4ed38f5969b612edff77",
"model": "BAAI/bge-reranker-base",
"usage": {
"total_tokens": 56
},
"results": [
{
"index": 1,
"document": {
"text": "The capital of France is Paris."
},
"relevance_score": 0.99853515625
},
{
"index": 0,
"document": {
"text": "The capital of Brazil is Brasilia."
},
"relevance_score": 0.0005860328674316406
}
]
}
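Because results arrive sorted by relevance, the index property is what maps each score back to its position in the submitted documents list. A small sketch of that mapping follows; the helper name is hypothetical, and documents missing from the results (for example when top_n trims the list) are left as None.

```python
from typing import Optional


def scores_in_original_order(
    results: list[dict], num_documents: int
) -> list[Optional[float]]:
    """Place each relevance score at its document's original position."""
    scores: list[Optional[float]] = [None] * num_documents
    for item in results:
        scores[item["index"]] = item["relevance_score"]
    return scores


# The results from the response above, for a request with three documents.
results = [
    {"index": 1, "relevance_score": 0.99853515625},
    {"index": 0, "relevance_score": 0.0005860328674316406},
]
ordered = scores_in_original_order(results, num_documents=3)
```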
Extra parameters¶
The following pooling parameters are supported.
truncate_prompt_tokens: Annotated[int, msgspec.Meta(ge=-1)] | None = None
softmax: bool | None = None
activation: bool | None = None
use_activation: bool | None = None
The following extra parameters are supported:
mm_processor_kwargs: dict[str, Any] | None = Field(
default=None,
description=("Additional kwargs to pass to the HF processor."),
)
priority: int = Field(
default=0,
description=(
"The priority of the request (lower means earlier handling; "
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."
),
)
softmax: bool | None = Field(
default=None,
description="softmax will be deprecated, please use use_activation instead.",
)
activation: bool | None = Field(
default=None,
description="activation will be deprecated, please use use_activation instead.",
)
use_activation: bool | None = Field(
default=None,
description="Whether to use activation for classification outputs. "
"Default is True.",
)
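Ray Serve LLM¶
Ray Serve LLM enables scalable, production-grade serving of the vLLM engine. It integrates tightly with vLLM and extends it with features such as auto-scaling, load balancing, and back-pressure.
Key capabilities:
- Exposes an OpenAI-compatible HTTP API as well as a Pythonic API.
- Scales from a single GPU to a multi-node cluster without code changes.
- Provides observability and autoscaling policies through Ray dashboards and metrics.
The following example shows how to deploy a large model like DeepSeek R1 with Ray Serve LLM: examples/online_serving/ray_serve_deepseek.py.
Learn more about Ray Serve LLM in the official Ray Serve LLM documentation.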