AsyncLLMEngine


class vllm.AsyncLLMEngine(*args, log_requests: bool = True, start_engine_loop: bool = True, **kwargs)[source]#

Bases: EngineClient

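An asynchronous wrapper for LLMEngine.

This class wraps the LLMEngine class to make it asynchronous. It uses asyncio to create a background loop that keeps processing incoming requests. The LLMEngine is kicked by the generate method whenever there are requests in the waiting queue. The generate method yields the outputs from the LLMEngine to the caller.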

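Parameters:

log_requests – Whether to log the requests.
start_engine_loop – If True, the background task that runs the engine is started automatically in the generate call.
*args – Positional arguments for LLMEngine.
**kwargs – Keyword arguments for LLMEngine.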

async abort(request_id: str) → None[source]#

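Abort a request.

Abort a submitted request. If the request has already finished or is not found, this method is a no-op.

Parameters: request_id – The unique ID of the request.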
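The no-op behavior described above can be illustrated with a small toy. This is only a sketch of the idea, not vLLM's implementation: `ToyTracker`, `add`, `is_active`, and `demo_abort` are hypothetical names introduced for illustration.

```python
import asyncio


class ToyTracker:
    """Hypothetical stand-in for the engine's request bookkeeping."""

    def __init__(self) -> None:
        self._active = set()

    def add(self, request_id: str) -> None:
        self._active.add(request_id)

    async def abort(self, request_id: str) -> None:
        # set.discard() does nothing for unknown or already removed IDs,
        # which mirrors the documented "no-op" behavior.
        self._active.discard(request_id)

    def is_active(self, request_id: str) -> bool:
        return request_id in self._active


async def demo_abort() -> list:
    tracker = ToyTracker()
    tracker.add("req-1")
    await tracker.abort("req-1")    # aborts the submitted request
    await tracker.abort("req-1")    # already gone: no error
    await tracker.abort("missing")  # never submitted: no error
    return [tracker.is_active("req-1"), tracker.is_active("missing")]
```

Because aborting is safe to repeat, a caller can invoke it from a cleanup path without first checking whether the request is still running.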

async add_lora(lora_request: LoRARequest) → None[source]#: Load a new LoRA adapter into the engine for use by future requests.

async check_health() → None[source]#: Raise an error if the engine is unhealthy.

async encode(prompt: str | TextPrompt | TokensPrompt | ExplicitEncoderDecoderPrompt, pooling_params: PoolingParams, request_id: str, lora_request: LoRARequest | None = None, trace_headers: Mapping[str, str] | None = None, priority: int = 0) → AsyncGenerator[PoolingRequestOutput, None][source]#

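Generate outputs for a request from a pooling model.

This method is a coroutine. It adds the request to the waiting queue of the LLMEngine and streams the outputs from the LLMEngine to the caller.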

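Parameters:

prompt – The prompt to the LLM. See PromptType for more details about the format of each input.
pooling_params – The pooling parameters of the request.
request_id – The unique ID of the request.
lora_request – The LoRA request to use for generation, if any.
trace_headers – OpenTelemetry trace headers.
priority – The priority of the request. Only applicable with priority scheduling.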

Yields:

The PoolingRequestOutput objects produced by the LLMEngine for the request.

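Details

If the engine is not running, start the background loop, which iteratively invokes engine_step() to process the waiting requests.
Add the request to the engine's RequestTracker. On the next iteration of the background loop, this request is sent to the underlying engine. A corresponding AsyncStream is also created.
Wait for the request outputs from the AsyncStream and yield them.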
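The three steps above can be sketched with plain asyncio. This is a toy model under stated assumptions, not vLLM's code: `ToyAsyncEngine`, `stream_outputs`, `shutdown`, and `collect` are hypothetical names, one `asyncio.Queue` stands in for the RequestTracker, a per-request `asyncio.Queue` stands in for the AsyncStream, and the "engine" simply upper-cases each word of the prompt.

```python
import asyncio
from typing import AsyncIterator, Optional


class ToyAsyncEngine:
    """Toy model of the three steps; not vLLM's implementation."""

    def __init__(self) -> None:
        self._new_requests: asyncio.Queue = asyncio.Queue()  # tracker stand-in
        self._loop_task: Optional[asyncio.Task] = None

    async def _run_loop(self) -> None:
        while True:
            # A request is picked up on the next loop iteration.
            prompt, stream = await self._new_requests.get()
            for word in prompt.split():
                await stream.put(word.upper())  # fake engine output
            await stream.put(None)  # sentinel: the request has finished

    async def stream_outputs(self, prompt: str) -> AsyncIterator[str]:
        # Step 1: start the background loop if it is not running.
        if self._loop_task is None:
            self._loop_task = asyncio.create_task(self._run_loop())
        # Step 2: register the request together with its output stream.
        stream: asyncio.Queue = asyncio.Queue()
        await self._new_requests.put((prompt, stream))
        # Step 3: wait for outputs from the stream and yield them.
        while (item := await stream.get()) is not None:
            yield item

    def shutdown(self) -> None:
        # Cancel the background task and drop the reference to it.
        if self._loop_task is not None:
            self._loop_task.cancel()
            self._loop_task = None


async def collect(prompt: str) -> list:
    engine = ToyAsyncEngine()
    try:
        return [out async for out in engine.stream_outputs(prompt)]
    finally:
        engine.shutdown()
```

Giving each request its own queue keeps the outputs of concurrent requests separate even though a single background loop serves all of them.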

Example

>>> # Please refer to entrypoints/api_server.py for
>>> # the complete example.
>>> from vllm import AsyncLLMEngine, PoolingParams
>>>
>>> # Initialize the engine and the example input.
>>> # Note that engine_args here is an AsyncEngineArgs instance.
>>> engine = AsyncLLMEngine.from_engine_args(engine_args)
>>> example_input = {
>>>     "input": "What is LLM?",
>>>     "request_id": "0",  # request_id must be a str
>>> }
>>>
>>> # Start the generation.
>>> results_generator = engine.encode(
>>>     example_input["input"],
>>>     PoolingParams(),
>>>     example_input["request_id"])
>>>
>>> # Get the results. `request` is assumed to be the incoming
>>> # HTTP request object of the server handler.
>>> final_output = None
>>> async for request_output in results_generator:
>>>     if await request.is_disconnected():
>>>         # Abort the request if the client disconnects.
>>>         await engine.abort(example_input["request_id"])
>>>         # Return or raise an error
>>>         ...
>>>     final_output = request_output
>>>
>>> # Process and return the final output
>>> ...

async engine_step(virtual_engine: int) → bool[source]#

Kick the engine to process the waiting requests.

Returns True if there are in-progress requests.
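The boolean return value lends itself to a simple driver loop: keep stepping while requests are still in progress. The sketch below is a toy, not vLLM's scheduler. `ToyStepEngine` and `drive` are hypothetical names, each request is modeled only as a count of remaining steps, and the `virtual_engine` argument is accepted but ignored.

```python
import asyncio


class ToyStepEngine:
    """Hypothetical engine whose requests each need a fixed number of steps."""

    def __init__(self, work: list) -> None:
        # Each entry is the number of steps a request still needs.
        self._remaining = list(work)

    async def engine_step(self, virtual_engine: int = 0) -> bool:
        """Advance every request by one step; True if any are still in progress."""
        await asyncio.sleep(0)  # yield control to other tasks
        self._remaining = [n - 1 for n in self._remaining if n - 1 > 0]
        return bool(self._remaining)


async def drive(work: list) -> int:
    """Call engine_step() until nothing is in progress; return the call count."""
    engine = ToyStepEngine(work)
    calls = 1
    while await engine.engine_step(0):
        calls += 1
    return calls
```

With three requests needing 1, 3, and 2 steps, `drive([1, 3, 2])` makes three calls: the first two return True and the third returns False.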

classmethod from_engine_args(engine_args: AsyncEngineArgs, start_engine_loop: bool = True, usage_context: UsageContext = UsageContext.ENGINE_CONTEXT, stat_loggers: Dict[str, StatLoggerBase] | None = None) → AsyncLLMEngine[source]#: Create an asynchronous LLM engine from the engine arguments.

classmethod from_vllm_config(vllm_config: VllmConfig, start_engine_loop: bool = True, usage_context: UsageContext = UsageContext.ENGINE_CONTEXT, stat_loggers: dict[str, vllm.engine.metrics_types.StatLoggerBase] | None = None, disable_log_requests: bool = False, disable_log_stats: bool = False) → AsyncLLMEngine[source]#: Create an AsyncLLMEngine from a VllmConfig.

async generate(prompt: str | TextPrompt | TokensPrompt | ExplicitEncoderDecoderPrompt, sampling_params: SamplingParams, request_id: str, lora_request: LoRARequest | None = None, trace_headers: Mapping[str, str] | None = None, prompt_adapter_request: PromptAdapterRequest | None = None, priority: int = 0) → AsyncGenerator[RequestOutput, None][source]#

Generate outputs for a request.

This method is a coroutine. It adds the request to the waiting queue of the LLMEngine and streams the outputs from the LLMEngine to the caller.

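Parameters:

prompt – The prompt to the LLM. See PromptType for more details about the format of each input.
sampling_params – The sampling parameters of the request.
request_id – The unique ID of the request.
lora_request – The LoRA request to use for generation, if any.
trace_headers – OpenTelemetry trace headers.
prompt_adapter_request – The Prompt Adapter request to use for generation, if any.
priority – The priority of the request. Only applicable with priority scheduling.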

Yields:

The RequestOutput objects produced by the LLMEngine for the request.

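Details

If the engine is not running, start the background loop, which iteratively invokes engine_step() to process the waiting requests.
Add the request to the engine's RequestTracker. On the next iteration of the background loop, this request is sent to the underlying engine. A corresponding AsyncStream is also created.
Wait for the request outputs from the AsyncStream and yield them.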

Example

>>> # Please refer to entrypoints/api_server.py for
>>> # the complete example.
>>> from vllm import AsyncLLMEngine, SamplingParams
>>>
>>> # Initialize the engine and the example input.
>>> # Note that engine_args here is an AsyncEngineArgs instance.
>>> engine = AsyncLLMEngine.from_engine_args(engine_args)
>>> example_input = {
>>>     "prompt": "What is LLM?",
>>>     "stream": False,  # assume the non-streaming case
>>>     "temperature": 0.0,
>>>     "request_id": "0",  # request_id must be a str
>>> }
>>>
>>> # Start the generation.
>>> results_generator = engine.generate(
>>>     example_input["prompt"],
>>>     SamplingParams(temperature=example_input["temperature"]),
>>>     example_input["request_id"])
>>>
>>> # Get the results. `request` is assumed to be the incoming
>>> # HTTP request object of the server handler.
>>> final_output = None
>>> async for request_output in results_generator:
>>>     if await request.is_disconnected():
>>>         # Abort the request if the client disconnects.
>>>         await engine.abort(example_input["request_id"])
>>>         # Return or raise an error
>>>         ...
>>>     final_output = request_output
>>>
>>> # Process and return the final output
>>> ...
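The consumption pattern in the example (keep only the most recent output, and abort when the client disconnects) can be tried without a running engine. The sketch below uses stand-ins throughout: `fake_results`, `last_output`, and `demo` are hypothetical names, `fake_results` replaces the generator returned by generate, and the disconnect check and abort are passed in as plain callables.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Optional


async def fake_results(n: int) -> AsyncIterator[str]:
    """Hypothetical stand-in for the generator returned by generate()."""
    for i in range(n):
        await asyncio.sleep(0)
        yield f"partial-{i}"


async def last_output(
    results: AsyncIterator[str],
    is_disconnected: Callable[[], Awaitable[bool]],
    abort: Callable[[], Awaitable[None]],
) -> Optional[str]:
    final = None
    async for output in results:
        if await is_disconnected():
            await abort()  # ask the engine to drop the request
            return None    # nothing to send to a client that is gone
        final = output     # keep only the most recent output
    return final


async def demo() -> tuple:
    aborted = []

    async def connected() -> bool:
        return False

    async def gone() -> bool:
        return True

    async def abort() -> None:
        aborted.append(True)

    ok = await last_output(fake_results(3), connected, abort)
    dropped = await last_output(fake_results(3), gone, abort)
    return ok, dropped, len(aborted)
```

Returning right after the abort avoids continuing to iterate a request that no longer has a consumer.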

async get_decoding_config() → DecodingConfig[source]#: Get the decoding configuration of the vLLM engine.

async get_input_preprocessor() → InputPreprocessor[source]#: Get the input preprocessor of the vLLM engine.

async get_lora_config() → LoRAConfig[source]#: Get the LoRA configuration of the vLLM engine.

async get_model_config() → ModelConfig[source]#: Get the model configuration of the vLLM engine.

async get_parallel_config() → ParallelConfig[source]#: Get the parallel configuration of the vLLM engine.

async get_scheduler_config() → SchedulerConfig[source]#: Get the scheduling configuration of the vLLM engine.

async get_tokenizer(lora_request: LoRARequest | None = None) → transformers.PreTrainedTokenizer | transformers.PreTrainedTokenizerFast | TokenizerBase[source]#: Get the appropriate tokenizer for the request.

async is_sleeping() → bool[source]#: Check whether the engine is sleeping.

async reset_prefix_cache(device: Device | None = None) → None[source]#: Reset the prefix cache.

async static run_engine_loop(engine_ref: ReferenceType)[source]#: We use a weak reference to the engine so that the running loop does not prevent the engine from being garbage collected.
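The weak-reference idea can be demonstrated in isolation. The following is a minimal sketch rather than vLLM's loop: `ToyEngine`, `run_loop`, and `demo_weak_loop` are hypothetical names. The loop dereferences the weak reference on every iteration and exits once the engine has been collected.

```python
import asyncio
import gc
import weakref


class ToyEngine:
    """Hypothetical engine; it only counts how often it was stepped."""

    def __init__(self) -> None:
        self.steps = 0


async def run_loop(engine_ref: weakref.ReferenceType) -> str:
    while True:
        engine = engine_ref()  # temporarily take a strong reference
        if engine is None:
            return "engine collected, loop exits"
        engine.steps += 1
        del engine             # drop it before yielding control
        await asyncio.sleep(0)


async def demo_weak_loop() -> str:
    engine = ToyEngine()
    task = asyncio.create_task(run_loop(weakref.ref(engine)))
    await asyncio.sleep(0)     # let the loop run at least once
    del engine                 # the last strong reference goes away
    gc.collect()
    return await task
```

Had the loop held the engine directly, the pending task would keep it alive and the `del engine` in the caller would have no effect.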

shutdown_background_loop() → None[source]#

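Shut down the background loop.

This method needs to be called during cleanup to remove references to self and to properly garbage-collect the resources held by the async LLM engine (for example, the executors and their resources).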

async sleep(level: int = 1) → None[source]#: Put the engine to sleep.

start_background_loop() → None[source]#: Start the background loop.

async start_profile() → None[source]#: Start profiling the engine.

async stop_profile() → None[source]#: Stop profiling the engine.

async wake_up(tags: list[str] | None = None) → None[source]#: Wake up the engine.
