# CRD Deployment

Deploy a minimal vLLM production stack on Kubernetes using Custom Resource Definitions (CRDs) and Custom Resources (CRs).
> **Note:** This deployment method is recommended for production because the Kubernetes Operator provides better resource management, monitoring, and lifecycle management.
## Prerequisites

- kubectl version v1.11.3+
- Access to a Kubernetes v1.11.3+ cluster
## Installation

### Clone the Repository

First, clone the vLLM production stack repository:

```bash
git clone https://github.com/vllm-project/production-stack.git
```
### Deploy the Operator

Deploy the production stack operator by running:

```bash
kubectl create -f operator/config/default.yaml
```
This command does the following:

- **Namespace creation**: creates a namespace named `production-stack-system` in which the operator runs.
- **Custom Resource Definitions (CRDs)**: defines four new custom resources that this operator manages:
  - `CacheServer`: manages cache servers
  - `LoraAdapter`: manages LoRA adapters (used for model fine-tuning)
  - `VLLMRouter`: manages vLLM routing
  - `VLLMRuntime`: manages vLLM runtime instances
- **RBAC (role-based access control)**: creates various roles and role bindings that control access to these resources, including:
  - an admin role (full access)
  - an editor role (create/update/delete)
  - a viewer role (read-only)
  - roles for metrics and leader election
- **Service account**: creates a service account named `production-stack-controller-manager` for the operator.
- **Deployment**: deploys the operator controller manager as a Deployment, with health checks, resource limits, and security settings, using the image `lmcache/production-stack-operator:latest`.
- **Service**: creates a metrics service for monitoring the operator.
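To confirm the four CRDs were registered, you can list them by their API group (the group `production-stack.vllm.ai` is taken from the sample manifests later in this guide):

```bash
# List the CRDs installed by the operator.
kubectl get crds | grep production-stack.vllm.ai
```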
### Verify the Operator Deployment

Check the status of the operator deployment:
```bash
kubectl get pods -n production-stack-system
kubectl get deployment -n production-stack-system
```
You should see output similar to:
```
NAME                                                               READY   STATUS    RESTARTS   AGE
production-stack-production-stack-controller-manager-65b86brxm6   1/1     Running   0          21s

NAME                                                    READY   UP-TO-DATE   AVAILABLE   AGE
production-stack-production-stack-controller-manager    1/1     1            1           25s
```
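If the pod is still starting, you can block until the Deployment becomes ready (the Deployment name below is taken from the output above):

```bash
# Wait for the operator Deployment to finish rolling out.
kubectl rollout status deployment/production-stack-production-stack-controller-manager \
  -n production-stack-system
```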
## Deploy vLLM Resources

### Deploy the vLLM Runtime

(Optional) If your model requires a Hugging Face token (for example, Llama-3.1-8B), create a secret:
```bash
kubectl create secret generic huggingface-token \
  --from-literal=token=<your-hf-token> \
  --namespace=default
```
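Before the runtime tries to use the token, you can verify that the secret holds the value you expect (this simply reads back the secret created above):

```bash
# Decode the stored token to confirm it was created correctly.
kubectl get secret huggingface-token --namespace=default \
  -o jsonpath='{.data.token}' | base64 -d
```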
Deploy the vLLM runtime:

```bash
kubectl apply -f operator/config/samples/production-stack_v1alpha1_vllmruntime.yaml
```
This creates a vLLM runtime instance in your Kubernetes cluster.
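Because `VLLMRuntime` is now a registered resource type, kubectl can query it directly. The commands below use the object name from the sample manifest; the pod label `app=vllmruntime-sample` is an assumption based on the `k8sLabelSelector` in the router sample later in this guide:

```bash
# Inspect the custom resource, then watch its pod start up.
kubectl get vllmruntime vllmruntime-sample
kubectl get pods -l app=vllmruntime-sample -w
```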
### Deploy the vLLM Router

Start the vLLM router:

```bash
kubectl apply -f operator/config/samples/production-stack_v1alpha1_vllmrouter.yaml
```
Verify that both components are running:

```bash
kubectl get pods
```

You should see:
```
NAME                                  READY   STATUS    RESTARTS   AGE
vllmrouter-sample-6fc78b7f85-lt5n7    1/1     Running   0          3m31s
vllmruntime-sample-7448f7547c-pdfml   1/1     Running   0          6m10s
```
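You can also confirm the router's Service, which the port-forward step later in this guide targets (the Service name matches the `VLLMRouter` object name `vllmrouter-sample`):

```bash
kubectl get svc vllmrouter-sample
```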
### Troubleshooting a First Deployment

If you encounter a `RunContainerError`, check the logs:

```bash
kubectl get pods
kubectl logs <pod-name>
kubectl describe pod <pod-name>
```
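For the runtime pod specifically, a common cause of scheduling or startup failures is an unavailable GPU. Assuming your cluster exposes GPUs through the NVIDIA device plugin (the `nvidia.com/gpu` resource name is that plugin's convention, not something this stack defines), you can check what each node advertises:

```bash
# Show node names alongside their GPU capacity/allocation lines.
kubectl describe nodes | grep -E "Name:|nvidia.com/gpu"
```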
## Example Configurations

### VLLMRuntime Example (production-stack_v1alpha1_vllmruntime.yaml)
```yaml
apiVersion: production-stack.vllm.ai/v1alpha1
kind: VLLMRuntime
metadata:
  labels:
    app.kubernetes.io/name: production-stack
    app.kubernetes.io/managed-by: kustomize
  name: vllmruntime-sample
spec:
  # Model configuration
  model:
    modelURL: "meta-llama/Llama-3.1-8B"
    enableLoRA: false
    enableTool: false
    toolCallParser: ""
    maxModelLen: 4096
    dtype: "bfloat16"
    maxNumSeqs: 32
  # HuggingFace token secret (optional)
  hfTokenSecret:
    name: "huggingface-token"
    hfTokenName: "token"
  # vLLM server configuration
  vllmConfig:
    # vLLM specific configurations
    enableChunkedPrefill: false
    enablePrefixCaching: false
    tensorParallelSize: 1
    gpuMemoryUtilization: "0.8"
    maxLoras: 4
    extraArgs: ["--disable-log-requests"]
    v1: true
    port: 8000
  # Environment variables
  env:
    - name: HF_HOME
      value: "/data"
  # LM Cache configuration
  lmCacheConfig:
    enabled: true
    cpuOffloadingBufferSize: "15"
    diskOffloadingBufferSize: "0"
    remoteUrl: "lm://cacheserver-sample.default.svc.cluster.local:80"
    remoteSerde: "naive"
  # Deployment configuration
  deploymentConfig:
    # Resource requirements
    resources:
      cpu: "10"
      memory: "32Gi"
      gpu: "1"
    # Image configuration
    image:
      registry: "docker.io"
      name: "lmcache/vllm-openai:2025-05-27-v1"
      pullPolicy: "IfNotPresent"
      pullSecretName: ""
    # Number of replicas
    replicas: 1
    # Deployment strategy
    deploymentStrategy: "Recreate"
```
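When editing this sample, note that vLLM's `tensorParallelSize` is the number of GPUs a single replica shards the model across, so the GPU count in `deploymentConfig.resources` should match it. A sketch of a two-GPU variant (illustrative values, not a tested configuration):

```yaml
vllmConfig:
  tensorParallelSize: 2   # shard the model across 2 GPUs per replica
deploymentConfig:
  resources:
    gpu: "2"              # request a matching number of GPUs per pod
```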
### VLLMRouter Example (production-stack_v1alpha1_vllmrouter.yaml)
```yaml
apiVersion: production-stack.vllm.ai/v1alpha1
kind: VLLMRouter
metadata:
  labels:
    app.kubernetes.io/name: production-stack
    app.kubernetes.io/managed-by: kustomize
  name: vllmrouter-sample
spec:
  # Enable the router deployment
  enableRouter: true
  # Number of router replicas
  replicas: 1
  # Service discovery method (k8s or static)
  serviceDiscovery: k8s
  # Label selector for vLLM runtime pods
  k8sLabelSelector: "app=vllmruntime-sample"
  # Routing strategy (roundrobin or session)
  routingLogic: roundrobin
  # Engine statistics collection interval
  engineScrapeInterval: 30
  # Request statistics window
  requestStatsWindow: 60
  # Container port for the router service
  port: 80
  # Service account name
  serviceAccountName: vllmrouter-sa
  # Image configuration
  image:
    registry: docker.io
    name: lmcache/lmstack-router
    pullPolicy: IfNotPresent
  # Resource requirements
  resources:
    cpu: "2"
    memory: "8Gi"
  # Environment variables
  env:
    - name: LOG_LEVEL
      value: "info"
    - name: METRICS_ENABLED
      value: "true"
  # Node selector for pod scheduling
  nodeSelectorTerms:
    - matchExpressions:
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
```
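The comment on `routingLogic` above notes a second strategy, `session`, which pins requests from the same session to the same backend pod; this is useful when prefix caching is enabled, so repeated prompts from one client hit a warm KV cache. A minimal sketch of the change (whether additional fields such as a session key are required depends on the CRD schema; check with `kubectl explain vllmrouter.spec`):

```yaml
spec:
  # Route by session affinity instead of round-robin
  routingLogic: session
```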
## Test the Deployment

### Port-Forward the Router

Expose the router service locally:

```bash
kubectl port-forward svc/vllmrouter-sample 30080:80 --address 0.0.0.0
```
### Test with a Simple Request

In a separate terminal, test the deployment with curl:

```bash
curl -X POST http://localhost:30080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B",
    "prompt": "1 plus 1 equals to",
    "max_tokens": 100
  }'
```
A successful response looks like:
{ "id": "cmpl-0c3a06af79df4cb2a5e6f8c3fb1f1215", "object": "text_completion", "created": 1750121964, "model": "meta-llama/Llama-3.1-8B", "choices": [ { "index": 0, "text": " 2\nThis is a very simple equation...", "logprobs": null, "finish_reason": "length", "stop_reason": null, "prompt_logprobs": null } ], "usage": { "prompt_tokens": 8, "total_tokens": 108, "completion_tokens": 100, "prompt_tokens_details": null }, "kv_transfer_params": null }
## Uninstall

### Delete the Custom Resources

```bash
kubectl delete vllmrouter vllmrouter-sample
kubectl delete vllmruntime vllmruntime-sample
```
### Delete the Secret (If Created)

```bash
kubectl delete secret huggingface-token --namespace=default
```
### Delete the Operator and CRDs

Delete the entire operator deployment and the custom resource definitions:

```bash
kubectl delete -f operator/config/default.yaml
```
### Verify Cleanup

Confirm that all resources have been removed:

```bash
kubectl get namespace production-stack-system
kubectl get crd | grep production-stack
kubectl get pods --all-namespaces | grep -E "(vllmruntime|vllmrouter)"
```
These commands should return no results, which indicates a successful cleanup.

Happy deploying! 🚀