Speculative Decoding

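Warning

Please note that speculative decoding in vLLM is not yet optimized and does not usually reduce inter-token latency for all prompt datasets or sampling parameters. Optimization work is ongoing and can be followed here: issue #4630

Warning

Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.

This document shows how to use speculative decoding with vLLM. Speculative decoding is a technique that improves inter-token latency in memory-bound LLM inference.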
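As a rough illustration of the draft-and-verify idea behind speculative decoding, the sketch below runs a toy greedy loop in pure Python. It is not vLLM's implementation: `target_next` and `draft_next` are hypothetical stand-ins for a large and a small model, and the verification loop stands in for what a real engine would do in one batched forward pass of the target model.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function mapping a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def target_next(tokens: List[int]) -> int:
    """Toy stand-in for the large target model (hypothetical rule)."""
    return (sum(tokens) * 31 + len(tokens)) % 50


def draft_next(tokens: List[int]) -> int:
    """Toy stand-in for the small draft model: wrong whenever len(tokens) % 4 == 0."""
    guess = target_next(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 50


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain autoregressive greedy decoding: one model call per new token."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(model(tokens))
    return tokens[len(prompt):]


def speculative_greedy_decode(
    target: NextToken, draft: NextToken, prompt: List[int], max_new: int, k: int
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (new tokens, number of verification steps)."""
    tokens = list(prompt)
    verify_steps = 0
    while len(tokens) - len(prompt) < max_new:
        # 1. Draft k tokens autoregressively with the cheap model.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. Verify. A real engine scores all k + 1 positions in one target pass.
        verify_steps += 1
        accepted: List[int] = []
        for i in range(k + 1):
            expected = target(tokens + accepted)
            accepted.append(expected)  # always the target's own token
            if i == k or proposal[i] != expected:
                break  # bonus token after k matches, or correction after a mismatch
        tokens.extend(accepted)
    return tokens[len(prompt):len(prompt) + max_new], verify_steps


if __name__ == "__main__":
    new_tokens, steps = speculative_greedy_decode(target_next, draft_next, [1, 2, 3], 20, 5)
    print(new_tokens == greedy_decode(target_next, [1, 2, 3], 20), steps)
```

Every emitted token is the target model's own greedy choice, so the output matches plain greedy decoding while the target is consulted fewer times than there are new tokens.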

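Speculating with a draft model

The following code configures vLLM in offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.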

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="facebook/opt-6.7b",
    tensor_parallel_size=1,
    speculative_config={
        "model": "facebook/opt-125m",
        "num_speculative_tokens": 5,
    },
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
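To get a feel for what a setting such as `num_speculative_tokens` can buy, the sketch below evaluates a commonly used idealized model of speculative decoding. It assumes every draft token is accepted independently with the same probability, which real workloads do not satisfy, and `acceptance_rate` is a hypothetical input rather than something vLLM reports under that name.

```python
def expected_tokens_per_step(acceptance_rate: float, num_speculative_tokens: int) -> float:
    """Expected tokens emitted per target forward pass under an idealized model.

    Assumes each draft token is accepted independently with probability
    `acceptance_rate`; the step ends at the first rejection, and the target
    always contributes one token (a correction or a bonus token).
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if num_speculative_tokens < 0:
        raise ValueError("num_speculative_tokens must be non-negative")
    k = num_speculative_tokens
    if acceptance_rate == 1.0:
        return float(k + 1)  # every draft token accepted, plus the bonus token
    # Geometric series: 1 + a + a^2 + ... + a^k
    return (1.0 - acceptance_rate ** (k + 1)) / (1.0 - acceptance_rate)


if __name__ == "__main__":
    for rate in (0.5, 0.7, 0.9):
        print(rate, round(expected_tokens_per_step(rate, 5), 3))
```

Under this model the gain from additional speculative tokens flattens quickly when the acceptance rate is low, which is one reason the best value depends on the workload.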

To do the same in online mode, start the server:

python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model facebook/opt-6.7b \
    --seed 42 \
    -tp 1 \
    --gpu_memory_utilization 0.8 \
    --speculative_config '{"model": "facebook/opt-125m", "num_speculative_tokens": 5}'

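Warning

Note: Use --speculative_config to set all configuration related to speculative decoding. The previous method of specifying the model through --speculative_model and adding related parameters separately (e.g., --num_speculative_tokens) is now deprecated.

Then use a client: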
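Because --speculative_config takes a JSON string, quoting mistakes are easy to make when the command is assembled by a script. The following sketch, which uses only the Python standard library and the model names from the example above, shows one way to serialize the dictionary and print a shell-safe command line. It only prints the command and does not launch a server.

```python
import json
import shlex

speculative_config = {
    "model": "facebook/opt-125m",
    "num_speculative_tokens": 5,
}

# json.dumps produces valid JSON with double-quoted keys and strings.
flag_value = json.dumps(speculative_config)

command = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "facebook/opt-6.7b",
    "--speculative_config", flag_value,
]

# shlex.join quotes each argument so the JSON survives a POSIX shell intact.
print(shlex.join(command))
```

Passing `command` as a list to `subprocess.run` would avoid shell quoting altogether.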

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

# Completion API
stream = False
completion = client.completions.create(
    model=model,
    prompt="The future of AI is",
    echo=False,
    n=1,
    stream=stream,
)

print("Completion results:")
if stream:
    for c in completion:
        print(c)
else:
    print(completion)

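Speculating by matching n-grams in the prompt

The following code configures vLLM to use speculative decoding where proposals are generated by matching n-grams in the prompt. For more information, read this thread.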

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="facebook/opt-6.7b",
    tensor_parallel_size=1,
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 4,
    },
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
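The idea behind n-gram proposals can be sketched in a few lines of plain Python. This is a simplified illustration, not vLLM's proposer: the function name and its parameters are hypothetical, and it assumes that `prompt_lookup_max` bounds the length of the n-gram being matched, which is what `max_ngram` stands for here.

```python
from typing import List


def propose_by_prompt_lookup(
    tokens: List[int],
    max_ngram: int,
    num_speculative_tokens: int,
    min_ngram: int = 1,
) -> List[int]:
    """Propose draft tokens by copying what followed an earlier match of the suffix.

    Tries the longest suffix n-gram first and falls back to shorter ones.
    Returns an empty list when no earlier occurrence exists.
    """
    longest = min(max_ngram, len(tokens) - 1)
    for n in range(longest, min_ngram - 1, -1):
        suffix = tokens[-n:]
        # Scan earlier start positions, most recent first, excluding the suffix itself.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                # Copy the tokens that followed the earlier occurrence.
                return tokens[start + n:start + n + num_speculative_tokens]
    return []


if __name__ == "__main__":
    context = [1, 2, 3, 4, 5, 1, 2, 3]
    print(propose_by_prompt_lookup(context, max_ngram=4, num_speculative_tokens=5))
```

This kind of proposer needs no draft model, so it tends to pay off when the output repeats spans of the input, as in summarization or code editing.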

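Speculating using MLP speculators

The following code configures vLLM to use speculative decoding where proposals are generated by draft models that condition draft predictions on both context vectors and sampled tokens. For more information, see this blog or this technical report.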

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    speculative_config={
        "model": "ibm-ai-platform/llama3-70b-accelerator",
        "draft_tensor_parallel_size": 1,
    },
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

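Note that these speculative models currently need to be run without tensor parallelism, although the main model can be run with tensor parallelism (see the example above). Since the speculative models are relatively small, we still see significant speedups. This limitation will be fixed in a future release.

A variety of speculative models of this type are available on the HF hub.

Speculating using EAGLE based draft models

The following code configures vLLM to use speculative decoding where proposals are generated by an EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) based draft model. A more detailed example for offline mode, including how to extract the request-level acceptance rate, can be found here.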

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=4,
    speculative_config={
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
        "draft_tensor_parallel_size": 1,
        "num_speculative_tokens": 2,
    },
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

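A few important things to consider when using EAGLE based draft models:

The EAGLE draft models available in the HF repository for EAGLE models should be loadable and usable directly by vLLM after pull request #12304. If you are using a vLLM version from before pull request #12304, use the script to convert the speculative model, and specify "model": "path/to/modified/eagle/model" in speculative_config. If weight-loading problems persist with the latest version of vLLM, please leave a comment or raise an issue.
EAGLE based draft models need to be run without tensor parallelism (i.e. draft_tensor_parallel_size is set to 1 in speculative_config), although the main model can be run with tensor parallelism (see the example above).
When using an EAGLE-based speculator with vLLM, the observed speedup is lower than what is reported in the reference implementation here. This issue is under investigation and tracked here: issue #9565.

A variety of EAGLE draft models are available on the Hugging Face hub:

Base Model	EAGLE on Hugging Face	# EAGLE Parameters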
Vicuna-7B-v1.3	yuhuili/EAGLE-Vicuna-7B-v1.3	0.24B
Vicuna-13B-v1.3	yuhuili/EAGLE-Vicuna-13B-v1.3	0.37B
Vicuna-33B-v1.3	yuhuili/EAGLE-Vicuna-33B-v1.3	0.56B
LLaMA2-Chat 7B	yuhuili/EAGLE-llama2-chat-7B	0.24B
LLaMA2-Chat 13B	yuhuili/EAGLE-llama2-chat-13B	0.37B
LLaMA2-Chat 70B	yuhuili/EAGLE-llama2-chat-70B	0.99B
Mixtral-8x7B-Instruct-v0.1	yuhuili/EAGLE-mixtral-instruct-8x7B	0.28B
LLaMA3-Instruct 8B	yuhuili/EAGLE-LLaMA3-Instruct-8B	0.25B
LLaMA3-Instruct 70B	yuhuili/EAGLE-LLaMA3-Instruct-70B	0.99B
Qwen2-7B-Instruct	yuhuili/EAGLE-Qwen2-7B-Instruct	0.26B
Qwen2-72B-Instruct	yuhuili/EAGLE-Qwen2-72B-Instruct	1.05B

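Lossless guarantees of Speculative Decoding

In vLLM, speculative decoding aims to improve inference efficiency while maintaining accuracy. This section describes the lossless guarantees of speculative decoding, breaking them down into three key areas:

Theoretical Losslessness - Speculative decoding sampling is theoretically lossless up to the precision limits of hardware numerics. Floating-point errors might cause slight variations in output distributions, as discussed in Accelerating Large Language Model Decoding with Speculative Sampling.
Algorithmic Losslessness - vLLM's implementation of speculative decoding is algorithmically validated to be lossless. Key validation tests include:
- Rejection Sampler Convergence: Ensures that samples from vLLM's rejection sampler align with the target distribution. View Test Code
- Greedy Sampling Equality: Confirms that greedy sampling with speculative decoding matches greedy sampling without it. This verifies that vLLM's speculative decoding framework, when integrated with the vLLM forward pass and the vLLM rejection sampler, provides a lossless guarantee. Almost all of the tests in tests/spec_decode/e2e verify this property using this assertion implementation.
vLLM Logprob Stability - vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the same request across runs. For more details, see the FAQ section titled Can the output of a prompt vary across runs in vLLM? in the FAQs.

While vLLM strives to ensure losslessness in speculative decoding, outputs generated with and without speculative decoding can differ due to the following factors:

Floating-Point Precision: Differences in hardware numerical precision may lead to slight discrepancies in the output distribution.
Batch Size and Numerical Stability: Changes in batch size may cause variations in logprobs and output probabilities, potentially due to non-deterministic behavior in batched operations or numerical instability.

For mitigation strategies, please refer to the FAQ entry Can the output of a prompt vary across runs in vLLM? in the FAQs.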
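The theoretical losslessness claim can be checked by hand for a single token position. The sketch below implements the textbook speculative sampling rule (accept a draft token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized residual max(0, p - q)) and computes the resulting output distribution exactly. It is an illustration under that assumption, not vLLM's rejection sampler, and the distributions `p` and `q` are made-up examples.

```python
import random
from typing import List


def speculative_output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of the token emitted by one accept/reject step.

    p: target model probabilities, q: draft model probabilities (same vocabulary).
    """
    # P(x proposed and accepted) = q[x] * min(1, p[x] / q[x]) = min(p[x], q[x]).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_prob = 1.0 - sum(accepted)
    # On rejection, the token is resampled from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    residual_mass = sum(residual)
    if residual_mass == 0.0:  # p == q: nothing is ever rejected
        return accepted
    return [a + reject_prob * r / residual_mass for a, r in zip(accepted, residual)]


def sample_step(p: List[float], q: List[float], rng: random.Random) -> int:
    """Draw one token with the same accept/reject rule."""
    x = rng.choices(range(len(q)), weights=q)[0]  # draft proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):
        return x  # accepted
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0]  # corrected token


if __name__ == "__main__":
    p = [0.6, 0.3, 0.1]  # made-up target distribution
    q = [0.2, 0.5, 0.3]  # made-up draft distribution
    print(speculative_output_distribution(p, q))
```

The computed distribution equals `p` up to floating-point rounding regardless of how poor the draft distribution is; a worse draft only lowers the acceptance probability, not the correctness of the samples.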
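Both factors trace back to the fact that floating-point addition is not associative, so a kernel that reduces the same values in a different order (for example, because the batch size changed) can produce slightly different results. The snippet below shows the effect in pure Python; it is a generic illustration of the arithmetic, not a measurement of vLLM, and the tiny `argmax` helper is defined here only for the example.

```python
import math
from typing import List

values = [0.1, 0.2, 0.3]

# The same three numbers, summed in two different orders.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

print(left_to_right == right_to_left)              # the two sums are not bit-identical
print(math.isclose(left_to_right, right_to_left))  # but they agree to within rounding


def argmax(scores: List[float]) -> int:
    """Index of the largest score; ties resolve to the first index."""
    return max(range(len(scores)), key=scores.__getitem__)


# When two candidates are nearly tied, the rounding difference decides the winner.
print(argmax([0.6, left_to_right]), argmax([0.6, right_to_left]))
```

A one-ulp difference like this is harmless on its own, but once it changes a sampled or greedily selected token, every later token is conditioned on a different prefix and the outputs can diverge.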