Gateway Inference Extension

This tutorial walks you through setting up and using the Gateway Inference Extension in a production environment. The extension serves inference through a gateway and supports both individual inference models and inference pools.

Prerequisites

Before starting this tutorial, make sure you have:

  • A Kubernetes cluster with available GPU nodes

  • kubectl configured to access your cluster

  • helm installed

  • A Hugging Face account with an API token

  • A basic understanding of Kubernetes concepts

Overview

The Gateway Inference Extension provides:

  • Individual model inference: direct access to a specific model

  • Inference pools: load-balanced access across multiple model instances

  • Gateway API integration: routing through the standard Kubernetes Gateway API

  • vLLM integration: support for the high-performance vLLM inference engine

Step 1: Environment Setup

1.1 Create a Hugging Face Token Secret

First, create a Kubernetes secret containing your Hugging Face token:

# Replace <YOUR_HF_TOKEN> with your actual Hugging Face token
kubectl create secret generic hf-token --from-literal=token=<YOUR_HF_TOKEN>
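
You can optionally verify that the secret was created and contains the expected key (the token value itself is not printed):

# Verify the secret exists; describe shows the key names and sizes, not the values
kubectl get secret hf-token
kubectl describe secret hf-token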

1.2 Install the Gateway API CRDs

Install the required Custom Resource Definitions (CRDs):

# Install Kgateway CRDs
KGTW_VERSION=v2.0.2
helm upgrade -i --create-namespace --namespace kgateway-system --version $KGTW_VERSION kgateway-crds oci://cr.kgateway.dev/kgateway-dev/charts/kgateway-crds

# Install Gateway API CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/standard-install.yaml

# Install Gateway API inference extension CRDs
VERSION=v0.3.0
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/$VERSION/manifests.yaml
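
Before moving on, you can confirm that the Gateway API and inference extension CRDs are registered in the cluster:

# Confirm the Gateway API CRDs are installed
kubectl get crd gateways.gateway.networking.k8s.io httproutes.gateway.networking.k8s.io

# Confirm the inference extension CRDs are installed
kubectl get crd inferencemodels.inference.networking.x-k8s.io inferencepools.inference.networking.x-k8s.io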

1.3 Install Kgateway with the Inference Extension

# Install Kgateway with inference extension enabled
helm upgrade -i --namespace kgateway-system --version $KGTW_VERSION kgateway oci://cr.kgateway.dev/kgateway-dev/charts/kgateway --set inferenceExtension.enabled=true
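
Check that the kgateway control plane is running and that its GatewayClass is registered (the Gateway in step 4 references gatewayClassName: kgateway):

# Verify the kgateway controller pods are up
kubectl get pods -n kgateway-system

# Verify the GatewayClass used later by the Gateway resource
kubectl get gatewayclass kgateway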

Step 2: Deploy the vLLM Model

2.1 Understanding the vLLM Runtime

The vLLM runtime is a custom resource that manages model deployments. See the configs/vllm/gpu-deployment.yaml file for an example configuration.

2.2 Apply the vLLM Deployment

# Apply the vLLM deployment configuration
kubectl apply -f configs/vllm/gpu-deployment.yaml
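
Before continuing, wait for the model server pods to become ready. The exact deployment name and labels come from configs/vllm/gpu-deployment.yaml; the commands below assume a deployment named vllm-llama3-1b-instruct (matching the pool referenced in step 3), so adjust them to your manifest:

# Wait for the vLLM deployment to become available (deployment name assumed)
kubectl rollout status deployment/vllm-llama3-1b-instruct --timeout=10m

# Inspect the pods if the rollout does not complete
kubectl get pods -o wide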

Production considerations:

  • Adjust resource requests/limits according to your model size and GPU capacity

  • Consider running multiple replicas for high availability (see the sketch after this list)

  • Monitor GPU utilization and adjust accordingly
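
As a minimal sketch of the replica recommendation above, again assuming the deployment name used earlier, you can scale the model server directly:

# Scale the model server for high availability; each replica needs its own GPU
kubectl scale deployment/vllm-llama3-1b-instruct --replicas=2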

Step 3: Configure Inference Resources

3.1 Individual Model Configuration

Create an InferenceModel resource for direct access to a model:

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: legogpt
spec:
  modelName: legogpt
  criticality: Standard
  poolRef:
    name: vllm-llama3-1b-instruct
  targetModels:
  - name: legogpt
    weight: 100
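
The targetModels weight controls how traffic for the requested modelName is split across served model variants. As a purely illustrative sketch (the variant names below are placeholders, not part of this setup), two targets could share traffic 90/10; the file is only written locally, not applied:

# Hypothetical traffic split across two served model variants (placeholder names)
cat <<'EOF' > inferencemodel-split-example.yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: legogpt-split
spec:
  modelName: legogpt
  criticality: Standard
  poolRef:
    name: vllm-llama3-1b-instruct
  targetModels:
  - name: legogpt-v1    # placeholder variant name
    weight: 90
  - name: legogpt-v2    # placeholder variant name
    weight: 10
EOF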

3.2 Inference Pool Configuration

For routing to multiple model instances, see the configs/inferencepool-resources.yaml file for an example.
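
For reference, an InferencePool ties the gateway to a set of model server pods via a label selector and port, and delegates endpoint selection to an extension service. The sketch below uses assumed values for the label, port, and extension service name, so prefer the values in configs/inferencepool-resources.yaml; the file is only written locally, not applied:

# Minimal InferencePool sketch (illustrative values only)
cat <<'EOF' > inferencepool-example.yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-1b-instruct
spec:
  targetPortNumber: 8000              # port the vLLM pods listen on (assumed)
  selector:
    app: vllm-llama3-1b-instruct      # label on the vLLM pods (assumed)
  extensionRef:
    name: vllm-llama3-1b-instruct-epp # endpoint picker Service (assumed)
EOF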

3.3 Apply the Inference Resources

# Apply individual model configuration
kubectl apply -f configs/inferencemodel.yaml

# Apply inference pool configuration
kubectl apply -f configs/inferencepool-resources.yaml

Step 4: Configure Gateway Routing

4.1 Gateway Configuration

The Gateway acts as the entry point for inference requests:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: kgateway
  listeners:
  - name: http
    port: 80
    protocol: HTTP

4.2 HTTPRoute Configuration

The HTTPRoute defines how requests are routed to the inference resources:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama3-1b-instruct
    matches:
    - path:
        type: PathPrefix
        value: /

4.3 Apply the Gateway Resources

# Apply gateway configuration
kubectl apply -f configs/gateway/kgateway/gateway.yaml

# Apply HTTP route configuration
kubectl apply -f configs/httproute.yaml
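
Before testing, confirm that the Gateway has been programmed and the route attached:

# Wait until the gateway is programmed and has an address
kubectl wait --for=condition=Programmed gateway/inference-gateway --timeout=5m

# Inspect the route status; the Accepted condition should be True
kubectl get httproute llm-route -o yaml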

Step 5: Test the Setup

5.1 Get the Gateway IP Address

# Get the external IP of the gateway
IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
PORT=80

echo "Gateway IP: $IP"
echo "Gateway Port: $PORT"

5.2 Send a Test Inference Request

# Test with a simple completion request
curl -i http://${IP}:${PORT}/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "legogpt",
    "prompt": "Write as if you were a critic: San Francisco",
    "max_tokens": 100,
    "temperature": 0.5
  }'

5.3 Test Chat Completions

# Test chat completion endpoint
curl -i http://${IP}:${PORT}/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "legogpt",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "max_tokens": 50,
    "temperature": 0.7
  }'
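
The responses follow the OpenAI API format. To extract only the generated text, you can pipe the output through jq (assuming jq is installed):

# Extract just the generated text from a completion response
curl -s http://${IP}:${PORT}/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "legogpt", "prompt": "Hello", "max_tokens": 20}' \
  | jq -r '.choices[0].text'

# For chat completions, the text is under message.content
curl -s http://${IP}:${PORT}/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "legogpt", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 20}' \
  | jq -r '.choices[0].message.content'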

Step 6: Monitoring and Troubleshooting

6.1 Check Resource Status

# Check vLLM runtime status
kubectl get vllmruntime

# Check inference model status
kubectl get inferencemodel

# Check inference pool status
kubectl get inferencepool

# Check gateway status
kubectl get gateway

6.2 View Logs

# Get vLLM runtime logs
kubectl logs -l app=vllm-runtime

# Get gateway logs
kubectl logs -n kgateway-system -l app=kgateway
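
If requests fail, describing the inference resources and checking recent events often points to the problem:

# Describe the pool and model to see their status conditions
kubectl describe inferencepool vllm-llama3-1b-instruct
kubectl describe inferencemodel legogpt

# Recent cluster events, most recent last
kubectl get events --sort-by=.metadata.creationTimestamp | tail -20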

Step 7: Uninstall

To remove all resources installed in the cluster, run the following commands:

# Delete the gateway and route resources applied in step 4
kubectl delete -f configs/httproute.yaml --ignore-not-found=true
kubectl delete -f configs/gateway/kgateway/gateway.yaml --ignore-not-found=true

# Delete the inference model and pool resources
kubectl delete -f configs/inferencemodel.yaml --ignore-not-found=true
kubectl delete -f configs/inferencepool-resources.yaml --ignore-not-found=true

# Delete the vLLM deployment
kubectl delete -f configs/vllm/gpu-deployment.yaml --ignore-not-found=true

# Delete the inference extension CRDs
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.3.0/manifests.yaml --ignore-not-found=true

# Delete helm releases
helm uninstall kgateway -n kgateway-system
helm uninstall kgateway-crds -n kgateway-system

# Delete the namespace last to ensure all resources are removed
kubectl delete ns kgateway-system --ignore-not-found=true