
Performing Inference

After you have set up and run the vLLM Intel® Gaudi® hardware plugin, you can start performing inference to generate model outputs. This document demonstrates several ways to run inference; choose the one that best fits your workflow.

Offline Batched Inference

Offline inference processes multiple prompts in a batch without running a server, which makes it well suited to batch jobs and testing.

from vllm import LLM, SamplingParams

def main():
    # Prompts are processed together as a single batch.
    prompts = [
        "Hello, my name is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Load the model once; vLLM handles batching and scheduling internally.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

    # Generate completions for all prompts in one call.
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

if __name__ == "__main__":
    main()
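
The same entry point also accepts additional engine and sampling options. The following sketch is illustrative only: the values chosen for dtype, tensor_parallel_size, max_model_len, and max_tokens are assumptions to tune for your own Gaudi setup, not required settings.

from vllm import LLM, SamplingParams

def main():
    # Illustrative values; adjust for your own hardware and model (assumptions).
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        dtype="bfloat16",          # common choice on Gaudi; not a requirement
        tensor_parallel_size=1,    # increase to shard the model across multiple cards
        max_model_len=4096,        # cap the context length to bound memory use
    )
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

    outputs = llm.generate(["The future of AI is"], sampling_params)
    print(outputs[0].outputs[0].text)

if __name__ == "__main__":
    main()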

Online Inference

Online inference provides real-time text generation through a running vLLM server. To follow this procedure, first start the server:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000
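
Before sending requests, you can optionally confirm that the server is up. A minimal check, assuming the default OpenAI-compatible routes, is to list the served models:

curl http://localhost:8000/v1/models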

Then, query it from Python:

import requests

def main():
    url = "https://:8000/v1/completions"
    headers = {"Content-Type": "application/json"}

    payload = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "The future of AI is",
        "max_tokens": 50,
        "temperature": 0.8
    }

    response = requests.post(url, headers=headers, json=payload)
    result = response.json()
    print(result["choices"][0]["text"])}

if __name__ == "__main__":
    main()
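
The completions endpoint can also stream tokens as they are generated. The sketch below assumes the standard OpenAI-style server-sent-events format, in which each line is prefixed with "data: " and the stream ends with "data: [DONE]":

import json
import requests

def main():
    url = "http://localhost:8000/v1/completions"
    payload = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "The future of AI is",
        "max_tokens": 50,
        "temperature": 0.8,
        "stream": True,
    }

    # Read the server-sent events line by line and print text as it arrives.
    with requests.post(url, json=payload, stream=True) as response:
        for line in response.iter_lines():
            if not line:
                continue
            data = line.decode("utf-8").removeprefix("data: ")
            if data == "[DONE]":
                break
            chunk = json.loads(data)
            print(chunk["choices"][0]["text"], end="", flush=True)
    print()

if __name__ == "__main__":
    main()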

OpenAI Completions API

vLLM provides an OpenAI-compatible completions API. To follow this procedure, start the server:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000

Then, use the OpenAI Python client or curl:

from openai import OpenAI

def main():
    client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
    result = client.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        prompt="Explain quantum computing in simple terms:",
        max_tokens=100,
        temperature=0.7
    )
    print(result.choices[0].text)

if __name__ == "__main__":
    main()

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "Explain quantum computing in simple terms:",
        "max_tokens": 100,
        "temperature": 0.7
    }'

OpenAI Chat Completions API with vLLM

vLLM also supports the OpenAI chat completions API format. To follow this procedure, start the server:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000

Then, use the OpenAI Python client or curl:

from openai import OpenAI

def main():
    client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
    chat = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"}
        ],
        max_tokens=50,
        temperature=0.7
    )
    print(chat.choices[0].message.content)

if __name__ == "__main__":
    main()

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "max_tokens": 50,
        "temperature": 0.7
    }'
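
Chat completions can also be streamed with the OpenAI client. A minimal sketch, assuming the same local server as above:

from openai import OpenAI

def main():
    client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "What is the capital of France?"}],
        max_tokens=50,
        temperature=0.7,
        stream=True,
    )
    # Each chunk carries an incremental delta of the assistant message.
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
    print()

if __name__ == "__main__":
    main()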