LLM 类

LLM 类#

class vllm.LLM(model: str, tokenizer: str | None = None, tokenizer_mode: str = 'auto', skip_tokenizer_init: bool = False, trust_remote_code: bool = False, allowed_local_media_path: str = '', tensor_parallel_size: int = 1, dtype: str = 'auto', quantization: str | None = None, revision: str | None = None, tokenizer_revision: str | None = None, seed: int | None = None, gpu_memory_utilization: float = 0.9, swap_space: float = 4, cpu_offload_gb: float = 0, enforce_eager: bool | None = None, max_seq_len_to_capture: int = 8192, disable_custom_all_reduce: bool = False, disable_async_output_proc: bool = False, hf_token: bool | str | None = None, hf_overrides: dict[str, Any] | Callable[[transformers.PretrainedConfig], transformers.PretrainedConfig] | None = None, mm_processor_kwargs: dict[str, Any] | None = None, task: Literal['auto', 'generate', 'embedding', 'embed', 'classify', 'score', 'reward', 'transcription'] = 'auto', override_pooler_config: PoolerConfig | None = None, compilation_config: int | dict[str, Any] | None = None, **kwargs)[source]#

用于从给定提示和采样参数生成文本的 LLM。

此类包括一个分词器、一个语言模型（可能分布在多个 GPU 上），以及为中间状态（又称 KV 缓存）分配的 GPU 内存空间。给定一批提示和采样参数，此类使用智能批处理机制和高效的内存管理，从模型生成文本。

参数:

model – HuggingFace Transformers 模型的名称或路径。
tokenizer – HuggingFace Transformers 分词器的名称或路径。
tokenizer_mode – 分词器模式。“auto” 将在可用时使用快速分词器，“slow” 将始终使用慢速分词器。
skip_tokenizer_init – 如果为 true，则跳过分词器和反分词器的初始化。期望来自输入的有效 prompt_token_ids 和 None 作为 prompt。
trust_remote_code – 下载模型和分词器时信任远程代码（例如，来自 HuggingFace）。
allowed_local_media_path – 允许 API 请求从服务器文件系统指定的目录读取本地图像或视频。这是一个安全风险。应仅在受信任的环境中启用。
tensor_parallel_size – 用于张量并行分布式执行的 GPU 数量。
dtype – 模型权重和激活的数据类型。目前，我们支持 float32、float16 和 bfloat16。如果为 auto，我们使用模型配置文件中指定的 torch_dtype 属性。但是，如果配置中的 torch_dtype 为 float32，我们将改用 float16。
quantization – 用于量化模型权重的方法。目前，我们支持 “awq”、“gptq” 和 “fp8”（实验性）。如果为 None，我们首先检查模型配置文件中的 quantization_config 属性。如果为 None，我们假设模型权重未量化，并使用 dtype 来确定权重的数据类型。
revision – 要使用的特定模型版本。它可以是分支名称、标签名称或提交 ID。
tokenizer_revision – 要使用的特定分词器版本。它可以是分支名称、标签名称或提交 ID。
seed – 用于初始化采样随机数生成器的种子。
gpu_memory_utilization – 为模型权重、激活和 KV 缓存保留的 GPU 内存比率（介于 0 和 1 之间）。较高的值将增加 KV 缓存大小，从而提高模型的吞吐量。但是，如果该值过高，则可能导致内存不足 (OOM) 错误。
swap_space – 用作交换空间的每个 GPU 的 CPU 内存大小 (GiB)。当请求的 best_of 采样参数大于 1 时，这可用于临时存储请求的状态。如果所有请求都将具有 best_of=1，则可以安全地将其设置为 0。请注意，best_of 仅在 V0 中受支持。否则，太小的值可能会导致内存不足 (OOM) 错误。
cpu_offload_gb – 用于卸载模型权重的 CPU 内存大小 (GiB)。这实际上增加了可用于保存模型权重的 GPU 内存空间，但代价是每次前向传递都需要 CPU-GPU 数据传输。
enforce_eager – 是否强制执行 eager 执行。如果为 True，我们将禁用 CUDA 图，并始终在 eager 模式下执行模型。如果为 False，我们将混合使用 CUDA 图和 eager 执行。
max_seq_len_to_capture – CUDA 图覆盖的最大序列长度。当序列的上下文长度大于此值时，我们将回退到 eager 模式。此外，对于编码器-解码器模型，如果编码器输入的序列长度大于此值，我们将回退到 eager 模式。
disable_custom_all_reduce – 请参阅 ParallelConfig
disable_async_output_proc – 禁用异步输出处理。这可能会导致性能降低。
hf_token – 用作远程文件 HTTP Bearer 授权的令牌。如果为 True，将使用运行 huggingface-cli login 时生成的令牌（存储在 ~/.huggingface 中）。
hf_overrides – 如果是字典，则包含要转发到 HuggingFace 配置的参数。如果是可调用对象，则调用它来更新 HuggingFace 配置。
compilation_config – 整数或字典。如果是整数，则用作编译优化级别。如果是字典，则可以指定完整的编译配置。
**kwargs – EngineArgs 的参数。（请参阅引擎参数）

注意

此类旨在用于离线推理。对于在线服务，请改用 AsyncLLMEngine 类。

DEPRECATE_INIT_POSARGS: ClassVar[bool] = True[source]#: 一个标志，用于切换是否弃用 LLM.__init__() 中的位置参数。

DEPRECATE_LEGACY: ClassVar[bool] = True[source]#: 一个标志，用于切换是否弃用旧版 generate/encode API。

apply_model(func: Callable[[torch.nn.Module], _R]) → list[_R][source]#: 在每个工作进程内部的模型上直接运行一个函数，并返回每个工作进程的结果。

beam_search(prompts: list[Union[vllm.inputs.data.TokensPrompt, vllm.inputs.data.TextPrompt]], params: BeamSearchParams) → list[vllm.beam_search.BeamSearchOutput][source]#

使用束搜索生成序列。

参数:

prompts – 提示列表。每个提示可以是字符串或标记 ID 列表。
params – 束搜索参数。

TODO：束搜索如何与长度惩罚、频率惩罚和停止标准等协同工作？

chat(messages: list[Union[openai.types.chat.chat_completion_developer_message_param.ChatCompletionDeveloperMessageParam, openai.types.chat.chat_completion_system_message_param.ChatCompletionSystemMessageParam, openai.types.chat.chat_completion_user_message_param.ChatCompletionUserMessageParam, openai.types.chat.chat_completion_assistant_message_param.ChatCompletionAssistantMessageParam, openai.types.chat.chat_completion_tool_message_param.ChatCompletionToolMessageParam, openai.types.chat.chat_completion_function_message_param.ChatCompletionFunctionMessageParam, vllm.entrypoints.chat_utils.CustomChatCompletionMessageParam]] | list[list[Union[openai.types.chat.chat_completion_developer_message_param.ChatCompletionDeveloperMessageParam, openai.types.chat.chat_completion_system_message_param.ChatCompletionSystemMessageParam, openai.types.chat.chat_completion_user_message_param.ChatCompletionUserMessageParam, openai.types.chat.chat_completion_assistant_message_param.ChatCompletionAssistantMessageParam, openai.types.chat.chat_completion_tool_message_param.ChatCompletionToolMessageParam, openai.types.chat.chat_completion_function_message_param.ChatCompletionFunctionMessageParam, vllm.entrypoints.chat_utils.CustomChatCompletionMessageParam]]], sampling_params: SamplingParams | list[vllm.sampling_params.SamplingParams] | None = None, use_tqdm: bool = True, lora_request: LoRARequest | None = None, chat_template: str | None = None, chat_template_content_format: Literal['auto', 'string', 'openai'] = 'auto', add_generation_prompt: bool = True, continue_final_message: bool = False, tools: list[dict[str, Any]] | None = None, mm_processor_kwargs: dict[str, Any] | None = None) → list[vllm.outputs.RequestOutput][source]#

为聊天对话生成回复。

聊天对话会使用分词器转换为文本提示，并调用 generate() 方法来生成回复。

多模态输入可以像传递给 OpenAI API 一样传递。

参数:

messages –
对话列表或单个对话。
- 每个对话都表示为消息列表。
- 每条消息都是一个字典，包含 ‘role’ 和 ‘content’ 键。
sampling_params – 文本生成的采样参数。如果为 None，则使用默认采样参数。当为单个值时，它将应用于每个提示。当为列表时，该列表的长度必须与提示的长度相同，并且它与提示逐个配对。
use_tqdm – 是否使用 tqdm 显示进度条。
lora_request – 用于生成的 LoRA 请求（如果有）。
chat_template – 用于构建聊天的模板。如果未提供，将使用模型的默认聊天模板。
chat_template_content_format –
渲染消息内容的格式。
- “string” 将内容渲染为字符串。示例："Who are you?"
- “openai” 将内容渲染为字典列表，类似于 OpenAI 模式。示例：[{"type": "text", "text": "Who are you?"}]
add_generation_prompt – 如果为 True，则向每条消息添加生成模板。
continue_final_message – 如果为 True，则继续对话中的最后一条消息，而不是开始新消息。如果 add_generation_prompt 也为 True，则不能为 True。
mm_processor_kwargs – 此聊天请求的多模态处理器 kwarg 覆盖。仅用于离线请求。

返回:

一个 RequestOutput 对象列表，其中包含生成的回复，顺序与输入消息相同。

为每个提示生成类别 logits。

此类会自动批量处理给定的提示，同时考虑内存约束。为了获得最佳性能，请将所有提示放入一个列表中，并将其传递给此方法。

参数:

prompts – LLM 的提示。您可以传递提示序列以进行批量推理。有关每个提示格式的更多详细信息，请参阅 PromptType。
use_tqdm – 是否使用 tqdm 显示进度条。
lora_request – 用于生成的 LoRA 请求（如果有）。
prompt_adapter_request – 用于生成的 Prompt Adapter 请求（如果有）。

返回:

一个 ClassificationRequestOutput 对象列表，其中包含的嵌入向量的顺序与输入提示相同。

collective_rpc(method: str | Callable[[...], _R], timeout: float | None = None, args: tuple = (), kwargs: dict[str, Any] | None = None) → list[_R][source]#

在所有工作进程上执行 RPC 调用。

参数:

method –
要执行的工作进程方法名称，或可序列化并发送到所有工作进程以执行的可调用对象。

如果该方法是可调用对象，则除了在 args 和 kwargs 中传递的参数之外，它还应该接受一个额外的 self 参数。self 参数将是工作进程对象。
timeout – 等待执行的最长时间（秒）。超时时引发 TimeoutError。None 表示无限期等待。
args – 传递给工作进程方法的位置参数。
kwargs – 传递给工作进程方法的关键字参数。

返回:

包含来自每个工作进程的结果的列表。

注意

建议使用此 API 仅传递控制消息，并设置数据平面通信以传递数据。

为每个提示生成嵌入向量。

此类会自动批量处理给定的提示，同时考虑内存约束。为了获得最佳性能，请将所有提示放入一个列表中，并将其传递给此方法。

参数:

prompts – LLM 的提示。您可以传递提示序列以进行批量推理。有关每个提示格式的更多详细信息，请参阅 PromptType。
use_tqdm – 是否使用 tqdm 显示进度条。
lora_request – 用于生成的 LoRA 请求（如果有）。
prompt_adapter_request – 用于生成的 Prompt Adapter 请求（如果有）。

返回:

一个 EmbeddingRequestOutput 对象列表，其中包含的嵌入向量的顺序与输入提示相同。

encode(prompts: PromptType | Sequence[PromptType], /, pooling_params: PoolingParams | Sequence[PoolingParams] | None = None, *, use_tqdm: bool = True, lora_request: list[LoRARequest] | LoRARequest | None = None, prompt_adapter_request: PromptAdapterRequest | None = None) → list[PoolingRequestOutput][source]#

encode(prompts: str, pooling_params: PoolingParams | Sequence[PoolingParams] | None = None, prompt_token_ids: list[int] | None = None, use_tqdm: bool = True, lora_request: list[LoRARequest] | LoRARequest | None = None, prompt_adapter_request: PromptAdapterRequest | None = None) → list[PoolingRequestOutput]

encode(prompts: list[str], pooling_params: PoolingParams | Sequence[PoolingParams] | None = None, prompt_token_ids: list[list[int]] | None = None, use_tqdm: bool = True, lora_request: list[LoRARequest] | LoRARequest | None = None, prompt_adapter_request: PromptAdapterRequest | None = None) → list[PoolingRequestOutput]

encode(prompts: str | None = None, pooling_params: PoolingParams | Sequence[PoolingParams] | None = None, *, prompt_token_ids: list[int], use_tqdm: bool = True, lora_request: list[LoRARequest] | LoRARequest | None = None, prompt_adapter_request: PromptAdapterRequest | None = None) → list[PoolingRequestOutput]

encode(prompts: list[str] | None = None, pooling_params: PoolingParams | Sequence[PoolingParams] | None = None, *, prompt_token_ids: list[list[int]], use_tqdm: bool = True, lora_request: list[LoRARequest] | LoRARequest | None = None, prompt_adapter_request: PromptAdapterRequest | None = None) → list[PoolingRequestOutput]

encode(prompts: None, pooling_params: None, prompt_token_ids: list[int] | list[list[int]], use_tqdm: bool = True, lora_request: list[LoRARequest] | LoRARequest | None = None, prompt_adapter_request: PromptAdapterRequest | None = None) → list[PoolingRequestOutput]

对与输入提示对应的隐藏状态应用池化。

此类会自动批量处理给定的提示，同时考虑内存约束。为了获得最佳性能，请将所有提示放入一个列表中，并将其传递给此方法。

参数:

prompts – LLM 的提示。您可以传递提示序列以进行批量推理。有关每个提示格式的更多详细信息，请参阅 PromptType。
pooling_params – 池化参数。如果为 None，我们将使用默认池化参数。
use_tqdm – 是否使用 tqdm 显示进度条。
lora_request – 用于生成的 LoRA 请求（如果有）。
prompt_adapter_request – 用于生成的 Prompt Adapter 请求（如果有）。

返回:

一个 PoolingRequestOutput 对象列表，其中包含与输入提示顺序相同的池化隐藏状态。

注意

使用 prompts 和 prompt_token_ids 作为关键字参数被认为是过时的，将来可能会被弃用。您应该改为通过 inputs 参数传递它们。

generate(prompts: PromptType | Sequence[PromptType], /, sampling_params: SamplingParams | Sequence[SamplingParams] | None = None, *, use_tqdm: bool = True, lora_request: list[LoRARequest] | LoRARequest | None = None, prompt_adapter_request: PromptAdapterRequest | None = None, guided_options_request: LLMGuidedOptions | GuidedDecodingRequest | None = None) → list[RequestOutput][source]#

generate(prompts: str, sampling_params: SamplingParams | list[SamplingParams] | None = None, prompt_token_ids: list[int] | None = None, use_tqdm: bool = True, lora_request: list[LoRARequest] | LoRARequest | None = None, prompt_adapter_request: PromptAdapterRequest | None = None, guided_options_request: LLMGuidedOptions | GuidedDecodingRequest | None = None) → list[RequestOutput]

generate(prompts: list[str], sampling_params: SamplingParams | list[SamplingParams] | None = None, prompt_token_ids: list[list[int]] | None = None, use_tqdm: bool = True, lora_request: list[LoRARequest] | LoRARequest | None = None, prompt_adapter_request: PromptAdapterRequest | None = None, guided_options_request: LLMGuidedOptions | GuidedDecodingRequest | None = None) → list[RequestOutput]

generate(prompts: str | None = None, sampling_params: SamplingParams | list[SamplingParams] | None = None, *, prompt_token_ids: list[int], use_tqdm: bool = True, lora_request: list[LoRARequest] | LoRARequest | None = None, prompt_adapter_request: PromptAdapterRequest | None = None, guided_options_request: LLMGuidedOptions | GuidedDecodingRequest | None = None) → list[RequestOutput]

generate(prompts: list[str] | None = None, sampling_params: SamplingParams | list[SamplingParams] | None = None, *, prompt_token_ids: list[list[int]], use_tqdm: bool = True, lora_request: list[LoRARequest] | LoRARequest | None = None, prompt_adapter_request: PromptAdapterRequest | None = None, guided_options_request: LLMGuidedOptions | GuidedDecodingRequest | None = None) → list[RequestOutput]

generate(prompts: None, sampling_params: None, prompt_token_ids: list[int] | list[list[int]], use_tqdm: bool = True, lora_request: list[LoRARequest] | LoRARequest | None = None, prompt_adapter_request: PromptAdapterRequest | None = None, guided_options_request: LLMGuidedOptions | GuidedDecodingRequest | None = None) → list[RequestOutput]

生成输入提示的补全。

此类会自动批量处理给定的提示，同时考虑内存约束。为了获得最佳性能，请将所有提示放入一个列表中，并将其传递给此方法。

参数:

prompts – LLM 的提示。您可以传递提示序列以进行批量推理。有关每个提示格式的更多详细信息，请参阅 PromptType。
sampling_params – 文本生成的采样参数。如果为 None，则使用默认采样参数。当为单个值时，它将应用于每个提示。当为列表时，该列表的长度必须与提示的长度相同，并且它与提示逐个配对。
use_tqdm – 是否使用 tqdm 显示进度条。
lora_request – 用于生成的 LoRA 请求（如果有）。
prompt_adapter_request – 用于生成的 Prompt Adapter 请求（如果有）。
priority – 请求的优先级（如果有）。仅在启用优先级调度策略时适用。

返回:

包含生成的补全的 RequestOutput 对象列表，顺序与输入提示相同。

注意

使用 prompts 和 prompt_token_ids 作为关键字参数被认为是过时的，将来可能会被弃用。您应该改为通过 inputs 参数传递它们。

为所有 <text,text_pair> 对生成相似度评分。

输入可以是 1 -> 1、1 -> N 或 N -> N。在 1 - N 情况下，text_1 句子将被复制 N 次，以与 text_2 句子配对。输入对用于构建交叉编码器模型的提示列表。此类会自动批量处理提示，同时考虑内存约束。为了获得最佳性能，请将所有文本放入单个列表并将其传递给此方法。

参数:

text_1 – 可以是单个提示或提示列表，在这种情况下，它必须与 text_2 列表的长度相同
text_2 – 要与查询配对以形成 LLM 输入的文本。有关每个提示格式的更多详细信息，请参阅 PromptType。
use_tqdm – 是否使用 tqdm 显示进度条。
lora_request – 用于生成的 LoRA 请求（如果有）。
prompt_adapter_request – 用于生成的 Prompt Adapter 请求（如果有）。

返回:

包含生成的评分的 ScoringRequestOutput 对象列表，顺序与输入提示相同。

sleep(level: int = 1)[source]#

使引擎进入休眠状态。引擎不应处理任何请求。调用者应保证在休眠期间（在调用 wake_up 之前）没有正在处理的请求。

参数:: level – 休眠级别。级别 1 休眠将卸载模型权重并丢弃 kv 缓存。kv 缓存的内容将被遗忘。级别 1 休眠适用于休眠和唤醒引擎以再次运行同一模型。模型权重在 CPU 内存中备份。请确保有足够的 CPU 内存来存储模型权重。级别 2 休眠将同时丢弃模型权重和 kv 缓存。模型权重和 kv 缓存的内容都将被遗忘。级别 2 休眠适用于休眠和唤醒引擎以运行不同的模型或更新模型，在这种情况下，不需要先前的模型权重。它可以减少 CPU 内存压力。

wake_up(tags: list[str] | None = None)[source]#

将引擎从休眠模式唤醒。有关更多详细信息，请参阅 sleep() 方法。

参数:: tags – 用于为特定内存分配重新分配引擎内存的可选标签列表。值必须在 (“weights”, “kv_cache”,) 中。如果为 None，则重新分配所有内存。在再次使用引擎之前，应使用所有标签（或 None）调用 wake_up。

LLM 类

目录

LLM 类#