Speculative Decoding Guide

This guide shows how to use speculative decoding with vLLM Ascend. Speculative decoding is a technique that reduces inter-token latency in memory-bound LLM inference.
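To make the idea concrete, the following is a minimal sketch of one draft-then-verify step under greedy decoding. It is an illustration only and not how vLLM Ascend implements the feature: `speculative_step`, `target_model`, and `draft_model` are hypothetical names, and the "models" are toy functions over integer tokens.

```python
from typing import Callable, List

# A "model" here is any function that maps a token sequence to the next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     tokens: List[int], k: int) -> List[int]:
    """Run one draft-then-verify step and return the newly accepted tokens."""
    # 1. The cheap draft model proposes k tokens one after another.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(tokens + proposal))

    # 2. The target model checks every proposed position. A real engine would
    #    score all positions in a single forward pass; this loop is sequential
    #    only to keep the sketch readable.
    accepted: List[int] = []
    for guess in proposal:
        expected = target(tokens + accepted)
        if guess != expected:
            # First mismatch: keep the target's token and drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(guess)

    # 3. Every guess matched, so the target adds one extra token for free.
    accepted.append(target(tokens + accepted))
    return accepted


# Toy models (assumptions for illustration): the target counts upward modulo 10,
# and the draft agrees with it except right after the token 3.
def target_model(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10


def draft_model(tokens: List[int]) -> int:
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10


all_accepted = speculative_step(target_model, draft_model, [0], k=3)
one_rejected = speculative_step(target_model, draft_model, [2], k=3)
```

In this sketch the accepted tokens are always the ones the target model would have produced on its own, so a good draft model only changes how many tokens are emitted per target step, not which tokens.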

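Speculating by matching n-grams in the prompt

The following code configures vLLM Ascend to use speculative decoding where proposals are generated by matching n-grams in the prompt.

Offline inference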

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 4,
    },
)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
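The matching step itself can be pictured with a small, self-contained sketch. This is not vLLM's matcher: `propose_from_ngram` is a hypothetical helper, and treating its `max_ngram` argument as the counterpart of `prompt_lookup_max` is an assumption based on the option name.

```python
from typing import List


def propose_from_ngram(tokens: List[int], max_ngram: int,
                       num_speculative_tokens: int) -> List[int]:
    """Propose draft tokens by finding an earlier copy of the current suffix."""
    # Try the longest suffix first, since a longer match is a stronger signal.
    for n in range(min(max_ngram, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        # Scan earlier windows from newest to oldest, skipping the suffix itself.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                # Propose whatever followed the earlier occurrence.
                return tokens[start + n:start + n + num_speculative_tokens]
    # No earlier occurrence of any suffix: propose nothing for this step.
    return []


# The sequence ends in (2, 3), which also appears earlier followed by 4, 5.
draft = propose_from_ngram([1, 2, 3, 4, 5, 9, 2, 3], max_ngram=4,
                           num_speculative_tokens=2)
no_match = propose_from_ngram([1, 2, 3], max_ngram=4, num_speculative_tokens=2)
```

Because the proposals come from a lookup rather than a second model, this style of drafting tends to pay off when the output repeats spans of the input, for example in summarization or code editing.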

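Speculating using EAGLE-based draft models

The following code configures vLLM Ascend to use speculative decoding where proposals are generated by an EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) based draft model.

As of vLLM Ascend v0.12.0rc1, the async scheduler is more stable and ready to be enabled. It has been adapted to support EAGLE, and you can use it by setting async_scheduling=True as in the example below. If you run into any problems, feel free to open an issue on GitHub. As a workaround, you can disable the feature by removing async_scheduling=True when initializing the model.

Offline inference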

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    tensor_parallel_size=4,
    distributed_executor_backend="mp",
    enforce_eager=True,
    async_scheduling=True,
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
        "draft_tensor_parallel_size": 1,
        "num_speculative_tokens": 2,
    },
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

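A few things to note when using EAGLE-based draft models

The EAGLE draft models available in the EAGLE model repositories on Hugging Face can be loaded and used directly by vLLM. This support was added in PR #4893. If you are using a vLLM version released before that pull request was merged, update to the latest version.
EAGLE-based draft models must run without tensor parallelism (that is, draft_tensor_parallel_size is set to 1 in speculative_config), although the main model can still run with tensor parallelism (see the example above).
When using an EAGLE-3 based draft model, the "method" option must be set to "eagle3". That is, specify "method": "eagle3" in speculative_config.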
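The notes above can be folded into a small helper that builds the speculative_config dictionary. This is only a convenience sketch: `build_eagle_speculative_config` is a hypothetical function, not part of vLLM, and the EAGLE-3 model path is a placeholder to replace with a real draft model.

```python
from typing import Any, Dict


def build_eagle_speculative_config(draft_model: str,
                                   num_speculative_tokens: int,
                                   eagle3: bool = False) -> Dict[str, Any]:
    """Build a speculative_config dict that follows the notes above."""
    if num_speculative_tokens < 1:
        raise ValueError("num_speculative_tokens must be at least 1")
    return {
        # EAGLE-3 draft models need "eagle3" rather than "eagle".
        "method": "eagle3" if eagle3 else "eagle",
        "model": draft_model,
        # The draft model runs without tensor parallelism.
        "draft_tensor_parallel_size": 1,
        "num_speculative_tokens": num_speculative_tokens,
    }


# Same settings as the offline example above.
eagle_config = build_eagle_speculative_config(
    "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", num_speculative_tokens=2)

# Placeholder path (hypothetical): substitute an actual EAGLE-3 draft model.
eagle3_config = build_eagle_speculative_config(
    "path/to/eagle3-draft-model", num_speculative_tokens=2, eagle3=True)
```

Either dictionary can then be passed as the speculative_config argument of LLM, in place of the literal dictionary used in the example above.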

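Speculating using MTP speculators

The following code configures vLLM Ascend to use speculative decoding where proposals are generated by MTP (Multi-Token Prediction), which improves inference performance by predicting multiple tokens in parallel. For more information about MTP, see Multi_Token_Prediction.

Online inference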

vllm serve /deepseek-ai/DeepSeek-V3.2-Exp-W8A8 \
--port 20004 \
--data-parallel-size 1 \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name dsv3 \
--max-model-len 36768 \
--max-num-batched-tokens 5000 \
--max-num-seqs 10 \
--quantization ascend \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--speculative-config '{"num_speculative_tokens": 2, "method": "deepseek_mtp", "disable_padded_drafter_batch": false}'
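Once the server is running, it can be queried like any other vLLM deployment, and speculative decoding stays transparent to the client. The sketch below uses only the Python standard library. It assumes the server is reachable on localhost and is called through the OpenAI-compatible /v1/completions route; the port and model name are taken from the command above, while `build_completion_request` and `complete` are hypothetical helper names.

```python
import json
import urllib.request


def build_completion_request(prompt: str,
                             base_url: str = "http://localhost:20004",
                             model: str = "dsv3",
                             max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {
        "model": model,          # must match --served-model-name
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str) -> str:
    """Send the request to a running server and return the generated text."""
    request = build_completion_request(prompt)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server from the command above running, calling complete("The future of AI is") should return the generated continuation as a string. Adjust base_url if the server runs on another host.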

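Speculating using suffix decoding

The following code configures vLLM to use speculative decoding where proposals are generated using suffix decoding (SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications).

Like n-gram, suffix decoding generates draft tokens by pattern matching on the last n generated tokens. Unlike n-gram, suffix decoding (1) can match against both the prompt and previous generations, (2) uses frequency counts to propose the most likely continuation, and (3) speculates an adaptive number of tokens for each request at each iteration to achieve better acceptance rates.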
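The role of frequency counts and the adaptive draft length can be illustrated with a toy sketch. It is not the algorithm shipped in Arctic Inference: `propose_by_frequency` is a hypothetical function, it scans a flat token list instead of using a dedicated index, and the `min_count` stopping rule is an assumption chosen to show how the draft length can vary.

```python
from collections import Counter
from typing import List, Sequence


def propose_by_frequency(history: Sequence[int], context_len: int,
                         max_tokens: int, min_count: int = 2) -> List[int]:
    """Extend the current context with the most frequent continuation."""
    tokens = list(history)  # prompt plus everything generated so far
    draft: List[int] = []
    for _ in range(max_tokens):
        context = (tokens + draft)[-context_len:]
        # Count the tokens that followed earlier occurrences of the context.
        counts = Counter(
            tokens[i + len(context)]
            for i in range(len(tokens) - len(context))
            if tokens[i:i + len(context)] == context
        )
        if not counts:
            break
        token, count = counts.most_common(1)[0]
        # Adaptive length: stop drafting once the evidence becomes too weak.
        if count < min_count:
            break
        draft.append(token)
    return draft


# After (1, 2, 3) the history continues with 7 twice and with 9 once.
history = [1, 2, 3, 9, 1, 2, 3, 7, 1, 2, 3, 7, 1, 2]
long_draft = propose_by_frequency(history, context_len=2, max_tokens=4)
short_draft = propose_by_frequency(history, context_len=2, max_tokens=4,
                                   min_count=3)
```

With the default threshold the sketch drafts four tokens, while the stricter threshold stops after one. That mirrors the idea of speculating more when the pattern is well supported and less when it is not.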

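Suffix decoding can improve performance on highly repetitive tasks such as code editing, agentic loops (for example, self-reflection and self-consistency), and RL rollouts.

[!NOTE] Suffix decoding requires Arctic Inference. You can install it with pip install arctic-inference.

Offline inference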

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
    enforce_eager=True,
    speculative_config={
        "method": "suffix",
        "num_speculative_tokens": 15,
    },
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
