Gateway Inference Extension
This tutorial walks you through setting up and using the Gateway Inference Extension in a production environment. The extension serves inference traffic through a gateway and supports both individual inference models and inference pools.
Prerequisites
Before starting this tutorial, make sure you have:
A Kubernetes cluster with available GPU nodes
kubectl configured to access your cluster
helm installed
A Hugging Face account with an API token
A basic understanding of Kubernetes concepts
Overview
The Gateway Inference Extension provides:
Individual model inference: direct access to a specific model
Inference pools: load-balanced access across multiple model instances
Gateway API integration: routing via the standard Kubernetes Gateway API
vLLM integration: support for a high-performance inference engine
Step 1: Environment Setup
1.1 Create the Hugging Face Token Secret
First, create a Kubernetes secret containing your Hugging Face token:
# Replace <YOUR_HF_TOKEN> with your actual Hugging Face token
kubectl create secret generic hf-token --from-literal=token=<YOUR_HF_TOKEN>
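Optionally, verify that the secret was created and exposes the expected key (this lists only key names and sizes, not the token value):
# Confirm the secret exists and contains a "token" key
kubectl describe secret hf-token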
1.2 Install the Gateway API CRDs
Install the required Custom Resource Definitions (CRDs):
# Install Kgateway CRDs
KGTW_VERSION=v2.0.2
helm upgrade -i --create-namespace --namespace kgateway-system --version $KGTW_VERSION kgateway-crds oci://cr.kgateway.dev/kgateway-dev/charts/kgateway-crds
# Install Gateway API CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/standard-install.yaml
# Install Gateway API inference extension CRDs
VERSION=v0.3.0
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/$VERSION/manifests.yaml
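Optionally, confirm that the CRDs are now registered; the names below are the ones shipped by Gateway API v1.3.0 and the inference extension v0.3.0:
# Verify the Gateway API CRDs
kubectl get crd gateways.gateway.networking.k8s.io httproutes.gateway.networking.k8s.io
# Verify the inference extension CRDs
kubectl get crd inferencemodels.inference.networking.x-k8s.io inferencepools.inference.networking.x-k8s.io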
1.3 Install Kgateway with the Inference Extension
# Install Kgateway with inference extension enabled
helm upgrade -i --namespace kgateway-system --version $KGTW_VERSION kgateway oci://cr.kgateway.dev/kgateway-dev/charts/kgateway --set inferenceExtension.enabled=true
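Before moving on, check that the Kgateway control plane came up in the kgateway-system namespace:
# Wait for the Kgateway controller pods to become Ready
kubectl get pods -n kgateway-system
kubectl wait --for=condition=Ready pod --all -n kgateway-system --timeout=120s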
Step 2: Deploy the vLLM Model
2.1 Understanding the vLLM Runtime
The vLLM runtime is the custom resource that manages the model deployment. See the configs/vllm/gpu-deployment.yaml file for an example configuration.
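The exact contents of that file depend on your repository. As a minimal sketch, assuming it follows the upstream gateway-api-inference-extension examples and contains a standard Kubernetes Deployment running the vLLM OpenAI-compatible server, it could look roughly like this (the image, model name, and labels are illustrative assumptions):
# Sketch of configs/vllm/gpu-deployment.yaml (assumed layout)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-1b-instruct
spec:
  replicas: 1                              # increase for high availability
  selector:
    matchLabels:
      app: vllm-llama3-1b-instruct
  template:
    metadata:
      labels:
        app: vllm-llama3-1b-instruct
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest     # pin an exact version in production
        args:
        - --model=meta-llama/Llama-3.2-1B-Instruct   # assumed model
        - --port=8000
        env:
        - name: HUGGING_FACE_HUB_TOKEN     # token from the secret created in step 1.1
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        ports:
        - containerPort: 8000
        resources:
          requests:
            nvidia.com/gpu: "1"            # adjust to your model size and GPU capacity
          limits:
            nvidia.com/gpu: "1"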
2.2 Apply the vLLM Deployment
# Apply the vLLM deployment configuration
kubectl apply -f configs/vllm/gpu-deployment.yaml
Production considerations:
Adjust resource requests/limits to match your model size and GPU capacity
Consider running multiple replicas for high availability
Monitor GPU utilization and adjust accordingly (see the commands below)
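For the last point, two generic ways to observe utilization (the label and deployment name below match the sketch in step 2.1 and are assumptions; kubectl top requires metrics-server):
# CPU/memory usage per pod (requires metrics-server)
kubectl top pod -l app=vllm-llama3-1b-instruct
# GPU utilization inside a vLLM pod
kubectl exec deploy/vllm-llama3-1b-instruct -- nvidia-smi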
Step 3: Configure Inference Resources
3.1 Individual Model Configuration
Create an InferenceModel resource for direct access to a model:
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: legogpt
spec:
  modelName: legogpt
  criticality: Standard
  poolRef:
    name: vllm-llama3-1b-instruct
  targetModels:
  - name: legogpt
    weight: 100
3.2 Inference Pool Configuration
For routing to multiple model instances, see the configs/inferencepool-resources.yaml file for an example; a sketch of what such a file can contain is shown below.
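An InferencePool (API version v1alpha2, matching the CRDs installed in step 1.2) selects the backing vLLM pods and references an endpoint-picker extension. A minimal sketch, in which the selector labels and extension name are assumptions:
# Sketch of an InferencePool from configs/inferencepool-resources.yaml (assumed)
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-1b-instruct
spec:
  targetPortNumber: 8000                 # port the vLLM containers listen on
  selector:
    app: vllm-llama3-1b-instruct         # must match the vLLM pod labels
  extensionRef:
    name: vllm-llama3-1b-instruct-epp    # endpoint picker Service (assumed name)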
3.3 Apply the Inference Resources
# Apply individual model configuration
kubectl apply -f configs/inferencemodel.yaml
# Apply inference pool configuration
kubectl apply -f configs/inferencepool-resources.yaml
Step 4: Configure Gateway Routing
4.1 Gateway Configuration
The Gateway serves as the entry point for inference requests:
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: kgateway
  listeners:
  - name: http
    port: 80
    protocol: HTTP
4.2 HTTPRoute Configuration
The HTTPRoute defines how requests are routed to the inference resources:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama3-1b-instruct
    matches:
    - path:
        type: PathPrefix
        value: /
4.3 Apply the Gateway Resources
# Apply gateway configuration
kubectl apply -f configs/gateway/kgateway/gateway.yaml
# Apply HTTP route configuration
kubectl apply -f configs/httproute.yaml
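Before testing, it can help to wait until Kgateway has accepted and programmed the gateway and assigned it an address:
# Wait for the gateway to be programmed, then inspect its status
kubectl wait gateway/inference-gateway --for=condition=Programmed --timeout=120s
kubectl get gateway inference-gateway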
Step 5: Test the Setup
5.1 Get the Gateway IP Address
# Get the external IP of the gateway
IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
PORT=80
echo "Gateway IP: $IP"
echo "Gateway Port: $PORT"
5.2 Send a Test Inference Request
# Test with a simple completion request
curl -i http://${IP}:${PORT}/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "legogpt",
    "prompt": "Write as if you were a critic: San Francisco",
    "max_tokens": 100,
    "temperature": 0.5
  }'
5.3 Test Chat Completions
# Test chat completion endpoint
curl -i http://${IP}:${PORT}/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "legogpt",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "max_tokens": 50,
    "temperature": 0.7
  }'
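vLLM's OpenAI-compatible server also supports streaming responses; the same chat request with streaming enabled (curl -N disables output buffering so tokens appear as they arrive):
# Stream the chat completion as server-sent events
curl -N http://${IP}:${PORT}/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "legogpt",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "max_tokens": 50,
    "stream": true
  }'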
Step 6: Monitoring and Troubleshooting
6.1 Check Resource Status
# Check vLLM runtime status
kubectl get vllmruntime
# Check inference model status
kubectl get inferencemodel
# Check inference pool status
kubectl get inferencepool
# Check gateway status
kubectl get gateway
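If any of these resources look wrong, kubectl describe shows their status conditions and recent events, for example:
# Inspect status conditions and events on the pool and route
kubectl describe inferencepool vllm-llama3-1b-instruct
kubectl describe httproute llm-route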
6.2 View Logs
# Get vLLM runtime logs
kubectl logs -l app=vllm-runtime
# Get gateway logs
kubectl logs -n kgateway-system -l app=kgateway
Step 7: Uninstall
To remove all resources installed in the cluster, run the following commands:
# Delete the gateway
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/kgateway/gateway.yaml --ignore-not-found=true
# Delete the inference model and pool resources
kubectl delete -f configs/inferencemodel.yaml --ignore-not-found=true
kubectl delete -f configs/inferencepool-resources.yaml --ignore-not-found=true
# Delete the vLLM deployment
kubectl delete -f configs/vllm/gpu-deployment.yaml --ignore-not-found=true
# Delete the inference extension CRDs
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.3.0/manifests.yaml --ignore-not-found=true
# Delete helm releases
helm uninstall kgateway -n kgateway-system
helm uninstall kgateway-crds -n kgateway-system
# Delete the namespace last to ensure all resources are removed
kubectl delete ns kgateway-system --ignore-not-found=true