EAGLE Draft Models
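The following code configures vLLM to use speculative decoding, with draft tokens proposed by a draft model based on EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency). A more detailed offline-mode example, including how to extract request-level acceptance rates, is available in examples/offline_inference/spec_decode.py.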
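The request-level acceptance rate mentioned above is a ratio of accepted to proposed draft tokens. The following is a minimal sketch of that arithmetic, assuming the per-request counts have already been collected. The names `RequestSpecStats`, `acceptance_rate`, and `mean_acceptance_length` are hypothetical and are not part of the vLLM API.

```python
from dataclasses import dataclass


@dataclass
class RequestSpecStats:
    """Hypothetical per-request counters (not a vLLM type)."""

    num_drafts: int           # number of draft-and-verify steps
    num_draft_tokens: int     # total draft tokens proposed
    num_accepted_tokens: int  # draft tokens accepted by the target model


def acceptance_rate(stats: RequestSpecStats) -> float:
    """Fraction of proposed draft tokens that the target model accepted."""
    if stats.num_draft_tokens == 0:
        return 0.0
    return stats.num_accepted_tokens / stats.num_draft_tokens


def mean_acceptance_length(stats: RequestSpecStats) -> float:
    """Average tokens emitted per verification step.

    Assumes each step emits the accepted draft tokens plus one token
    produced by the target model itself.
    """
    if stats.num_drafts == 0:
        return 0.0
    return 1.0 + stats.num_accepted_tokens / stats.num_drafts


# Illustrative numbers only: 8 steps with 2 draft tokens each, 12 accepted.
stats = RequestSpecStats(num_drafts=8, num_draft_tokens=16, num_accepted_tokens=12)
print(acceptance_rate(stats))         # 0.75
print(mean_acceptance_length(stats))  # 2.5
```

A higher acceptance rate means more draft tokens survive verification, which is what makes a larger `num_speculative_tokens` worthwhile.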
Eagle Draft Model Example
from vllm import LLM, SamplingParams

prompts = ["The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=4,
    speculative_config={
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
        "draft_tensor_parallel_size": 1,
        "num_speculative_tokens": 2,
        "method": "eagle",
    },
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Eagle3 Draft Model Example
from vllm import LLM, SamplingParams

prompts = ["The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=2,
    speculative_config={
        "model": "RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3",
        "draft_tensor_parallel_size": 2,
        "num_speculative_tokens": 2,
        "method": "eagle3",
    },
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
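The two examples differ only in the draft model, the method string, and the tensor-parallel sizes. The following is a minimal sketch of a helper that assembles the `speculative_config` dictionary and rejects obviously invalid values before the model is loaded. `build_speculative_config` and `SUPPORTED_METHODS` are hypothetical names for illustration, not part of vLLM. The dictionary keys are the ones used in the examples in this section.

```python
from typing import Any

# Only the two methods shown in this section; this is an assumption of the
# sketch, not an exhaustive list of what vLLM accepts.
SUPPORTED_METHODS = {"eagle", "eagle3"}


def build_speculative_config(
    draft_model: str,
    method: str,
    num_speculative_tokens: int = 2,
    draft_tensor_parallel_size: int = 1,
) -> dict[str, Any]:
    """Return a dict shaped like the speculative_config used above."""
    if method not in SUPPORTED_METHODS:
        raise ValueError(f"unsupported method: {method!r}")
    if num_speculative_tokens < 1:
        raise ValueError("num_speculative_tokens must be at least 1")
    if draft_tensor_parallel_size < 1:
        raise ValueError("draft_tensor_parallel_size must be at least 1")
    return {
        "model": draft_model,
        "draft_tensor_parallel_size": draft_tensor_parallel_size,
        "num_speculative_tokens": num_speculative_tokens,
        "method": method,
    }


config = build_speculative_config(
    draft_model="yuhuili/EAGLE-LLaMA3-Instruct-8B",
    method="eagle",
)
print(config)
# The result can then be passed as LLM(..., speculative_config=config).
```

Validating the values up front surfaces typos such as a misspelled method name before any model weights are downloaded.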
Pre-Trained EAGLE Draft Models
A variety of EAGLE draft models are available on the Hugging Face Hub.
Warning
If you are using vllm<0.7.0, use this script to convert the speculative model, and specify "model": "path/to/modified/eagle/model" in speculative_config.
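Whether the conversion step applies depends only on the installed vLLM version. The following is a minimal sketch of that check on a version string. `needs_eagle_conversion` is a hypothetical helper, not part of vLLM, and it compares only the numeric major, minor, and patch components, so suffixes such as `.post1` or `rc1` are ignored.

```python
import re


def needs_eagle_conversion(vllm_version: str) -> bool:
    """Return True when the version string is older than 0.7.0."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", vllm_version)
    if match is None:
        raise ValueError(f"unrecognized version string: {vllm_version!r}")
    # A missing patch component is treated as 0.
    parts = tuple(int(p or 0) for p in match.groups())
    return parts < (0, 7, 0)


# In a real environment the string could come from
# importlib.metadata.version("vllm"); fixed strings are used here so the
# sketch runs without vLLM installed.
for version in ("0.6.3.post1", "0.7.0", "0.10.1"):
    print(version, needs_eagle_conversion(version))
```

Comparing integer tuples rather than raw strings avoids the lexicographic pitfall where "0.10.1" would sort before "0.7.0".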