Running Inference
After setting up and running the vLLM Intel® Gaudi® hardware plugin, you can start running inference to generate model outputs. This document demonstrates several ways to run inference; choose the one that best fits your workflow.
Offline Batched Inference
Offline inference processes multiple prompts in a single batch without running a server. This is ideal for batch jobs and testing.
from vllm import LLM, SamplingParams

def main():
    prompts = [
        "Hello, my name is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

if __name__ == "__main__":
    main()
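Save the script to a file and run it with Python. The filename offline_inference.py below is only an example:

python offline_inference.py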
Online Inference
Online inference provides real-time text generation through a running vLLM server. To follow this procedure, start the server:
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000
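The server needs a moment to load the model before it can accept requests. One way to confirm it is ready (assuming the default host and port used above) is to list the served models:

curl http://localhost:8000/v1/models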
Then, query it from Python:
import requests

def main():
    url = "http://localhost:8000/v1/completions"
    headers = {"Content-Type": "application/json"}
    payload = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "The future of AI is",
        "max_tokens": 50,
        "temperature": 0.8
    }
    response = requests.post(url, headers=headers, json=payload)
    result = response.json()
    print(result["choices"][0]["text"])

if __name__ == "__main__":
    main()
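The same request can also be sent with curl; the JSON payload below mirrors the Python example above:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "The future of AI is",
        "max_tokens": 50,
        "temperature": 0.8
    }'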
OpenAI Completions API
vLLM provides an OpenAI-compatible completions API. To follow this procedure, start the server:
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000
Then, use the OpenAI Python client or curl:
from openai import OpenAI

def main():
    client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
    result = client.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        prompt="Explain quantum computing in simple terms:",
        max_tokens=100,
        temperature=0.7
    )
    print(result.choices[0].text)

if __name__ == "__main__":
    main()
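The completions endpoint also supports streaming, so tokens can be printed as they are generated rather than after the full response is ready. A minimal sketch, assuming the same server and model as above:

from openai import OpenAI

def main():
    client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
    stream = client.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        prompt="Explain quantum computing in simple terms:",
        max_tokens=100,
        temperature=0.7,
        stream=True,  # ask the server to send tokens incrementally
    )
    for chunk in stream:
        # each chunk carries the newly generated text fragment
        print(chunk.choices[0].text, end="", flush=True)
    print()

if __name__ == "__main__":
    main()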
OpenAI Chat Completions API with vLLM
vLLM also supports the OpenAI chat completions API format. To follow this procedure, start the server:
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000
Then, use the OpenAI Python client or curl:
from openai import OpenAI

def main():
    client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
    chat = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"}
        ],
        max_tokens=50,
        temperature=0.7
    )
    print(chat.choices[0].message.content)

if __name__ == "__main__":
    main()
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "max_tokens": 50,
        "temperature": 0.7
    }'
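Because the chat API is stateless, a multi-turn conversation is built on the client side by appending each assistant reply to the message list before sending the next user turn. A minimal sketch, assuming the same server and model as above (the follow-up question is only an example):

from openai import OpenAI

def main():
    client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ]
    first = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=messages,
        max_tokens=50,
        temperature=0.7,
    )
    answer = first.choices[0].message.content
    print(answer)

    # Append the assistant reply plus a follow-up question, then ask again.
    messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": "And what is its population?"})
    second = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=messages,
        max_tokens=50,
        temperature=0.7,
    )
    print(second.choices[0].message.content)

if __name__ == "__main__":
    main()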