EAGLE Draft Models

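The following code configures vLLM to use speculative decoding, with drafts generated by a draft model based on EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency). A more detailed offline-mode example, including how to extract request-level acceptance rates, can be found in examples/offline_inference/spec_decode.py.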

Eagle draft model example

from vllm import LLM, SamplingParams

prompts = ["The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=4,
    speculative_config={
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
        "draft_tensor_parallel_size": 1,
        "num_speculative_tokens": 2,
        "method": "eagle",
    },
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Eagle3 draft model example

from vllm import LLM, SamplingParams

prompts = ["The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=2,
    speculative_config={
        "model": "RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3",
        "draft_tensor_parallel_size": 2,
        "num_speculative_tokens": 2,
        "method": "eagle3",
    },
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
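
Both examples above set num_speculative_tokens to 2. As a rough guide to what that setting can buy, the sketch below estimates how many tokens one target-model verification step emits on average. It is a back-of-the-envelope model and not part of vLLM: the function name expected_tokens_per_step is hypothetical, and it assumes every draft token is accepted independently with the same probability, which real workloads only approximate.

```python
def expected_tokens_per_step(acceptance_rate: float, num_speculative_tokens: int) -> float:
    """Estimate tokens emitted per target-model verification step.

    Assumption (not from vLLM): each draft token is accepted independently
    with probability `acceptance_rate`. Under that assumption the first i
    draft tokens are all accepted with probability acceptance_rate**i, and
    the target model always contributes one token of its own, so the
    expectation is the geometric sum 1 + a + a**2 + ... + a**k.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if num_speculative_tokens < 0:
        raise ValueError("num_speculative_tokens must be non-negative")
    return sum(acceptance_rate**i for i in range(num_speculative_tokens + 1))


if __name__ == "__main__":
    # Compare a few draft lengths at an illustrative (made-up) acceptance rate.
    for k in (1, 2, 4):
        print(k, expected_tokens_per_step(0.5, k))
```

Under this model, increasing num_speculative_tokens gives diminishing returns when the acceptance rate is low, so the measured acceptance rate from the offline example is the number to check before raising it.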

Pre-trained EAGLE draft models

A variety of EAGLE draft models are available on the Hugging Face hub.

Warning

If you are using vllm<0.7.0, use this script to convert the speculative model, and specify "model": "path/to/modified/eagle/model" in speculative_config.
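
To decide whether the warning applies to an installed environment, a version check along these lines may help. This is a minimal sketch: needs_eagle_conversion is a hypothetical helper, not a vLLM function, and it compares only the numeric major.minor.patch prefix, ignoring suffixes such as .post1 or rc1.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def needs_eagle_conversion(vllm_version: str) -> bool:
    """Return True if the version string is older than 0.7.0.

    Hypothetical helper: only the leading numeric components are compared,
    so pre-release and post-release suffixes are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", vllm_version)
    if match is None:
        raise ValueError(f"unrecognized version string: {vllm_version!r}")
    # A missing patch component is treated as 0.
    parts = tuple(int(p or 0) for p in match.groups())
    return parts < (0, 7, 0)


if __name__ == "__main__":
    try:
        installed = version("vllm")  # reads installed package metadata
    except PackageNotFoundError:
        print("vllm is not installed")
    else:
        print(installed, "needs conversion:", needs_eagle_conversion(installed))
```

Comparing integer tuples avoids the string-comparison pitfall where "0.10.0" would sort before "0.7.0".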