Cerebrium
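vLLM can be run on cloud-based GPU machines with Cerebrium, a serverless AI infrastructure platform that makes it easier for companies to build and deploy AI-based applications.
To install the Cerebrium client, run: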
pip install cerebrium
cerebrium login
Next, create your Cerebrium project by running:
cerebrium init vllm-project
Next, to install the required packages, add the following to your cerebrium.toml file:
[cerebrium.deployment]
docker_base_image_url = "nvidia/cuda:12.1.1-runtime-ubuntu22.04"
[cerebrium.dependencies.pip]
vllm = "latest"
Next, add the code that handles inference for the LLM of your choice (this example uses mistralai/Mistral-7B-Instruct-v0.1). Add the following to your main.py file:
from vllm import LLM, SamplingParams

# Load the model once at import time so it is reused across requests.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")


def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
    sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
    outputs = llm.generate(prompts, sampling_params)

    # Collect each prompt together with its first generated completion.
    results = []
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        results.append({"prompt": prompt, "generated_text": generated_text})

    return {"results": results}
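The result-building loop in run can be checked locally without a GPU or a model download. The following is a minimal sketch, assuming the loop is factored into a hypothetical helper named format_outputs (not part of vLLM or Cerebrium). It uses SimpleNamespace stand-ins that mimic only the prompt and outputs[0].text attributes read by the code above.

```python
from types import SimpleNamespace


def format_outputs(outputs):
    """Hypothetical helper: build the same payload that run() returns."""
    results = []
    for output in outputs:
        results.append(
            {"prompt": output.prompt, "generated_text": output.outputs[0].text}
        )
    return {"results": results}


def fake_output(prompt, text):
    """Stand-in for a vLLM request output; only the fields used above."""
    return SimpleNamespace(prompt=prompt, outputs=[SimpleNamespace(text=text)])


payload = format_outputs([fake_output("The capital of France is", " Paris.\n")])
print(payload)
```

Keeping the formatting step separate from llm.generate makes it possible to confirm the response shape before paying for a deployment.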
Then run the following command to deploy it to the cloud:
cerebrium deploy
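If the deployment succeeds, you should be given a curl command that you can use to call inference. Remember to end the URL with the name of the function you are calling (in this case, /run):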
curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
  -H 'Content-Type: application/json' \
  -H 'Authorization: <JWT TOKEN>' \
  --data '{
    "prompts": [
      "Hello, my name is",
      "The president of the United States is",
      "The capital of France is",
      "The future of AI is"
    ]
  }'
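The same request can be sent from Python. This is a sketch using only the standard library. The helper names build_request and call_run are hypothetical, and the URL and token are the placeholders from the curl command, to be replaced with the values Cerebrium gives you. It also assumes the endpoint maps JSON keys onto the keyword arguments of run, as the prompts key in the curl command suggests.

```python
import json
import urllib.request


def build_request(url, token, prompts, temperature=0.8, top_p=0.95):
    """Build a POST request whose JSON body mirrors the arguments of run()."""
    body = json.dumps(
        {"prompts": prompts, "temperature": temperature, "top_p": top_p}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "Authorization": token},
        method="POST",
    )


def call_run(url, token, prompts, timeout=60):
    """Send the request and decode the JSON response (needs network access)."""
    request = build_request(url, token, prompts)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder values copied from the curl command above.
ENDPOINT = "https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run"
request = build_request(ENDPOINT, "<JWT TOKEN>", ["Hello, my name is"])
print(request.get_method(), request.full_url)
```

Building the request separately from sending it lets you inspect the headers and body before making a billable call. Once real values are filled in, call_run(ENDPOINT, token, prompts) returns the decoded response.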
You should receive a response like the following:
{
  "run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
  "result": {
    "result": [
      {
        "prompt": "Hello, my name is",
        "generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of"
      },
      {
        "prompt": "The president of the United States is",
        "generated_text": " elected every four years. This is a democratic system.\n\n5. What"
      },
      {
        "prompt": "The capital of France is",
        "generated_text": " Paris.\n"
      },
      {
        "prompt": "The future of AI is",
        "generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective."
      }
    ]
  },
  "run_time_ms": 152.53663063049316
}
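To pull the generated text out of such a response, a small helper is enough. This is a sketch with a hypothetical function name, extract_texts. It assumes the envelope shown above, with the handler's return value under the outer result key. Since run returns its list under results while the sample shows result, the helper accepts either inner key.

```python
def extract_texts(response):
    """Return (prompt, generated_text) pairs from a decoded response dict."""
    inner = response.get("result", {})
    # Prefer the key returned by run(); fall back to the key in the sample.
    items = inner.get("results")
    if items is None:
        items = inner.get("result", [])
    return [(item["prompt"], item["generated_text"]) for item in items]


# Shortened copy of the sample response above.
sample = {
    "run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
    "result": {
        "result": [
            {"prompt": "The capital of France is", "generated_text": " Paris.\n"}
        ]
    },
    "run_time_ms": 152.53663063049316,
}

for prompt, text in extract_texts(sample):
    print(prompt + text)
```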
You now have an autoscaling endpoint where you pay only for the compute you use!