# CRD Deployment

Deploy a minimal vLLM production stack on Kubernetes using Custom Resource Definitions (CRDs) and Custom Resources (CRs).
> **Note:** This deployment method is recommended for production because the Kubernetes Operator provides better resource management, monitoring, and lifecycle management.
## Prerequisites

- kubectl version v1.11.3+
- Access to a Kubernetes v1.11.3+ cluster
## Installation

### Clone the Repository

First, clone the vLLM production stack repository:

```bash
git clone https://github.com/vllm-project/production-stack.git
```
### Deploy the Operator

Deploy the production stack operator by running:

```bash
kubectl create -f operator/config/default.yaml
```
This command does the following:

- **Namespace creation**: creates a namespace named `production-stack-system` in which the operator runs.
- **Custom Resource Definitions (CRDs)**: defines four new custom resources that this operator manages:
  - `CacheServer`: manages cache servers
  - `LoraAdapter`: manages LoRA adapters (used for model fine-tuning)
  - `VLLMRouter`: manages vLLM routing
  - `VLLMRuntime`: manages vLLM runtime instances
- **RBAC (role-based access control)**: creates various roles and role bindings that control access to these resources, including:
  - an admin role (full access)
  - an editor role (create/update/delete)
  - a viewer role (read-only)
  - roles for metrics and leader election
- **Service account**: creates a service account named `production-stack-controller-manager` for the operator.
- **Deployment**: deploys the operator controller manager as a Deployment, with health checks, resource limits, and security settings, using the image `lmcache/production-stack-operator:latest`.
- **Service**: creates a metrics service for monitoring the operator.
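To confirm the four CRDs were registered, you can list them by their API group (the group `production-stack.vllm.ai` is taken from the sample manifests later in this guide):

```bash
# List the CRDs installed by the operator.
kubectl get crds | grep production-stack.vllm.ai
```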
### Verify the Operator Deployment

Check the status of the operator deployment:
```bash
kubectl get pods -n production-stack-system
kubectl get deployment -n production-stack-system
```
You should see output similar to:
```
NAME                                                               READY   STATUS    RESTARTS   AGE
production-stack-production-stack-controller-manager-65b86brxm6   1/1     Running   0          21s

NAME                                                    READY   UP-TO-DATE   AVAILABLE   AGE
production-stack-production-stack-controller-manager    1/1     1            1           25s
```
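If the pod is still starting, you can block until the Deployment becomes ready (the Deployment name below is taken from the output above):

```bash
# Wait for the operator Deployment to finish rolling out.
kubectl rollout status deployment/production-stack-production-stack-controller-manager \
  -n production-stack-system
```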
## Deploy vLLM Resources

### Deploy the vLLM Runtime

(Optional) If your model requires a Hugging Face token (for example, Llama-3.1-8B), create a secret:
```bash
kubectl create secret generic huggingface-token \
  --from-literal=token=<your-hf-token> \
  --namespace=default
```
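Before the runtime tries to use the token, you can verify that the secret holds the value you expect (this simply reads back the secret created above):

```bash
# Decode the stored token to confirm it was created correctly.
kubectl get secret huggingface-token --namespace=default \
  -o jsonpath='{.data.token}' | base64 -d
```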
Deploy the vLLM runtime:

```bash
kubectl apply -f operator/config/samples/production-stack_v1alpha1_vllmruntime.yaml
```
This creates a vLLM runtime instance in your Kubernetes cluster.
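Because `VLLMRuntime` is now a registered resource type, kubectl can query it directly. The commands below use the object name from the sample manifest; the pod label `app=vllmruntime-sample` is an assumption based on the `k8sLabelSelector` in the router sample later in this guide:

```bash
# Inspect the custom resource, then watch its pod start up.
kubectl get vllmruntime vllmruntime-sample
kubectl get pods -l app=vllmruntime-sample -w
```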
### Deploy the vLLM Router

Start the vLLM router:

```bash
kubectl apply -f operator/config/samples/production-stack_v1alpha1_vllmrouter.yaml
```
Verify that both components are running:

```bash
kubectl get pods
```

You should see:
```
NAME                                  READY   STATUS    RESTARTS   AGE
vllmrouter-sample-6fc78b7f85-lt5n7    1/1     Running   0          3m31s
vllmruntime-sample-7448f7547c-pdfml   1/1     Running   0          6m10s
```
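You can also confirm the router's Service, which the port-forward step later in this guide targets (the Service name matches the `VLLMRouter` object name `vllmrouter-sample`):

```bash
kubectl get svc vllmrouter-sample
```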
### Troubleshooting a First Deployment

If you encounter a `RunContainerError`, check the logs:

```bash
kubectl get pods
kubectl logs <pod-name>
kubectl describe pod <pod-name>
```
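For the runtime pod specifically, a common cause of scheduling or startup failures is an unavailable GPU. Assuming your cluster exposes GPUs through the NVIDIA device plugin (the `nvidia.com/gpu` resource name is that plugin's convention, not something this stack defines), you can check what each node advertises:

```bash
# Show node names alongside their GPU capacity/allocation lines.
kubectl describe nodes | grep -E "Name:|nvidia.com/gpu"
```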
## Example Configurations

### VLLMRuntime Example (production-stack_v1alpha1_vllmruntime.yaml)
```yaml
apiVersion: production-stack.vllm.ai/v1alpha1
kind: VLLMRuntime
metadata:
  labels:
    app.kubernetes.io/name: production-stack
    app.kubernetes.io/managed-by: kustomize
  name: vllmruntime-sample
spec:
  # Model configuration
  model:
    modelURL: "meta-llama/Llama-3.1-8B"
    enableLoRA: false
    enableTool: false
    toolCallParser: ""
    maxModelLen: 4096
    dtype: "bfloat16"
    maxNumSeqs: 32
  # HuggingFace token secret (optional)
  hfTokenSecret:
    name: "huggingface-token"
    hfTokenName: "token"
  # vLLM server configuration
  vllmConfig:
    # vLLM specific configurations
    enableChunkedPrefill: false
    enablePrefixCaching: false
    tensorParallelSize: 1
    gpuMemoryUtilization: "0.8"
    maxLoras: 4
    extraArgs: ["--disable-log-requests"]
    v1: true
    port: 8000
  # Environment variables
  env:
    - name: HF_HOME
      value: "/data"
  # LM Cache configuration
  lmCacheConfig:
    enabled: true
    cpuOffloadingBufferSize: "15"
    diskOffloadingBufferSize: "0"
    remoteUrl: "lm://cacheserver-sample.default.svc.cluster.local:80"
    remoteSerde: "naive"
  # Deployment configuration
  deploymentConfig:
    # Resource requirements
    resources:
      cpu: "10"
      memory: "32Gi"
      gpu: "1"
    # Image configuration
    image:
      registry: "docker.io"
      name: "lmcache/vllm-openai:2025-05-27-v1"
      pullPolicy: "IfNotPresent"
      pullSecretName: ""
    # Number of replicas
    replicas: 1
    # Deployment strategy
    deploymentStrategy: "Recreate"
```
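When editing this sample, note that vLLM's `tensorParallelSize` is the number of GPUs a single replica shards the model across, so the GPU count in `deploymentConfig.resources` should match it. A sketch of a two-GPU variant (illustrative values, not a tested configuration):

```yaml
vllmConfig:
  tensorParallelSize: 2   # shard the model across 2 GPUs per replica
deploymentConfig:
  resources:
    gpu: "2"              # request a matching number of GPUs per pod
```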
### VLLMRouter Example (production-stack_v1alpha1_vllmrouter.yaml)
```yaml
apiVersion: production-stack.vllm.ai/v1alpha1
kind: VLLMRouter
metadata:
  labels:
    app.kubernetes.io/name: production-stack
    app.kubernetes.io/managed-by: kustomize
  name: vllmrouter-sample
spec:
  # Enable the router deployment
  enableRouter: true
  # Number of router replicas
  replicas: 1
  # Service discovery method (k8s or static)
  serviceDiscovery: k8s
  # Label selector for vLLM runtime pods
  k8sLabelSelector: "app=vllmruntime-sample"
  # Routing strategy (roundrobin or session)
  routingLogic: roundrobin
  # Engine statistics collection interval
  engineScrapeInterval: 30
  # Request statistics window
  requestStatsWindow: 60
  # Container port for the router service
  port: 80
  # Service account name
  serviceAccountName: vllmrouter-sa
  # Image configuration
  image:
    registry: docker.io
    name: lmcache/lmstack-router
    pullPolicy: IfNotPresent
  # Resource requirements
  resources:
    cpu: "2"
    memory: "8Gi"
  # Environment variables
  env:
    - name: LOG_LEVEL
      value: "info"
    - name: METRICS_ENABLED
      value: "true"
  # Node selector for pod scheduling
  nodeSelectorTerms:
    - matchExpressions:
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
```
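The comment on `routingLogic` above notes a second strategy, `session`, which pins requests from the same session to the same backend pod; this is useful when prefix caching is enabled, so repeated prompts from one client hit a warm KV cache. A minimal sketch of the change (whether additional fields such as a session key are required depends on the CRD schema; check with `kubectl explain vllmrouter.spec`):

```yaml
spec:
  # Route by session affinity instead of round-robin
  routingLogic: session
```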
## Test the Deployment

### Port-Forward the Router

Expose the router service locally:

```bash
kubectl port-forward svc/vllmrouter-sample 30080:80 --address 0.0.0.0
```
### Test with a Simple Request

In a separate terminal, test the deployment with curl:

```bash
curl -X POST http://localhost:30080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B",
    "prompt": "1 plus 1 equals to",
    "max_tokens": 100
  }'
```
A successful response looks like:
{ "id": "cmpl-0c3a06af79df4cb2a5e6f8c3fb1f1215", "object": "text_completion", "created": 1750121964, "model": "meta-llama/Llama-3.1-8B", "choices": [ { "index": 0, "text": " 2\nThis is a very simple equation...", "logprobs": null, "finish_reason": "length", "stop_reason": null, "prompt_logprobs": null } ], "usage": { "prompt_tokens": 8, "total_tokens": 108, "completion_tokens": 100, "prompt_tokens_details": null }, "kv_transfer_params": null }
## Uninstall

### Delete the Custom Resources

```bash
kubectl delete vllmrouter vllmrouter-sample
kubectl delete vllmruntime vllmruntime-sample
```
### Delete the Secret (If Created)

```bash
kubectl delete secret huggingface-token --namespace=default
```
### Delete the Operator and CRDs

Delete the entire operator deployment and the custom resource definitions:

```bash
kubectl delete -f operator/config/default.yaml
```
### Verify Cleanup

Confirm that all resources have been removed:

```bash
kubectl get namespace production-stack-system
kubectl get crd | grep production-stack
kubectl get pods --all-namespaces | grep -E "(vllmruntime|vllmrouter)"
```
These commands should return no results, which indicates a successful cleanup.

Happy deploying! 🚀