CRD Deployment#

Deploy a minimal vLLM production stack on Kubernetes using Custom Resource Definitions (CRDs) and Custom Resources (CRs).

Note

This deployment method is recommended for production because the Kubernetes Operator provides better resource management, monitoring, and lifecycle management.

Prerequisites#

  • kubectl version v1.11.3+

  • Access to a Kubernetes v1.11.3+ cluster

Installation#

  1. Clone the repository

    First, clone the vLLM production stack repository:

    git clone https://github.com/vllm-project/production-stack.git
    
  2. Deploy the Operator

    Deploy the production stack Operator by running:

    kubectl create -f operator/config/default.yaml
    

    This command does the following:

    • Namespace creation: creates a namespace named production-stack-system in which the Operator runs

    • Custom Resource Definitions (CRDs): defines 4 new custom resources managed by this Operator

      • CacheServer: manages cache servers

      • LoraAdapter: manages LoRA adapters (for model fine-tuning)

      • VLLMRouter: manages the vLLM router

      • VLLMRuntime: manages vLLM runtime instances

    • RBAC (Role-Based Access Control): creates the roles and role bindings that control access to these resources, including:

      • Manager role (full access)

      • Editor role (create/update/delete)

      • Viewer role (read-only)

      • Metrics and leader-election roles

    • Service account: creates a service account named production-stack-controller-manager for the Operator

    • Deployment: deploys the Operator controller manager as a Deployment, with health checks, resource limits, and security settings, using the image lmcache/production-stack-operator:latest

    • Service: creates a metrics service for monitoring the Operator

  3. Verify the Operator deployment

    Check the status of the Operator deployment:

    kubectl get pods -n production-stack-system
    kubectl get deployment -n production-stack-system
    

    You should see output similar to:

    NAME                                                              READY   STATUS    RESTARTS   AGE
    production-stack-production-stack-controller-manager-65b86brxm6   1/1     Running   0          21s
    
    NAME                                                   READY   UP-TO-DATE   AVAILABLE   AGE
    production-stack-production-stack-controller-manager   1/1     1            1           25s
    

Deploying vLLM Resources#

  1. Deploy the vLLM runtime

    (Optional) If your model requires a Hugging Face token (e.g. Llama-3.1-8B), create a secret:

    kubectl create secret generic huggingface-token \
      --from-literal=token=<your-hf-token> \
      --namespace=default
    
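    If you prefer a declarative manifest over the imperative `kubectl create secret` command, the snippet below is a minimal sketch of generating the equivalent Secret in Python. `kubectl create secret generic` base64-encodes each `--from-literal` value into the Secret's `data` field; the token value here is a placeholder, and `kubectl apply -f` accepts JSON manifests as well as YAML.

```python
import base64
import json

# Placeholder token -- substitute your real Hugging Face token.
hf_token = "hf_xxx"

# Equivalent of:
#   kubectl create secret generic huggingface-token \
#     --from-literal=token=<your-hf-token> --namespace=default
# Kubernetes Secrets carry base64-encoded values in `data`.
secret = {
    "apiVersion": "v1",
    "kind": "Secret",
    "metadata": {"name": "huggingface-token", "namespace": "default"},
    "type": "Opaque",
    "data": {"token": base64.b64encode(hf_token.encode()).decode()},
}

# Save and apply with: kubectl apply -f secret.json
print(json.dumps(secret, indent=2))
```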

    Deploy the vLLM runtime:

    kubectl apply -f operator/config/samples/production-stack_v1alpha1_vllmruntime.yaml
    

    This creates a vLLM runtime instance in your Kubernetes cluster.

  2. Deploy the vLLM Router

    Start the vLLM Router:

    kubectl apply -f operator/config/samples/production-stack_v1alpha1_vllmrouter.yaml
    

    Verify that both components are running:

    kubectl get pods
    

    You should see:

    NAME                                  READY   STATUS    RESTARTS   AGE
    vllmrouter-sample-6fc78b7f85-lt5n7    1/1     Running   0          3m31s
    vllmruntime-sample-7448f7547c-pdfml   1/1     Running   0          6m10s
    
  3. Troubleshooting a first deployment

    If you encounter a RunContainerError, check the logs:

    kubectl get pods
    kubectl logs <pod-name>
    kubectl describe pod <pod-name>
    

Example Configurations#

VLLMRuntime example (production-stack_v1alpha1_vllmruntime.yaml)

apiVersion: production-stack.vllm.ai/v1alpha1
kind: VLLMRuntime
metadata:
  labels:
    app.kubernetes.io/name: production-stack
    app.kubernetes.io/managed-by: kustomize
  name: vllmruntime-sample
spec:
  # Model configuration
  model:
    modelURL: "meta-llama/Llama-3.1-8B"
    enableLoRA: false
    enableTool: false
    toolCallParser: ""
    maxModelLen: 4096
    dtype: "bfloat16"
    maxNumSeqs: 32
    # HuggingFace token secret (optional)
    hfTokenSecret:
      name: "huggingface-token"
    hfTokenName: "token"

  # vLLM server configuration
  vllmConfig:
    # vLLM specific configurations
    enableChunkedPrefill: false
    enablePrefixCaching: false
    tensorParallelSize: 1
    gpuMemoryUtilization: "0.8"
    maxLoras: 4
    extraArgs: ["--disable-log-requests"]
    v1: true
    port: 8000
    # Environment variables
    env:
      - name: HF_HOME
        value: "/data"

  # LM Cache configuration
  lmCacheConfig:
    enabled: true
    cpuOffloadingBufferSize: "15"
    diskOffloadingBufferSize: "0"
    remoteUrl: "lm://cacheserver-sample.default.svc.cluster.local:80"
    remoteSerde: "naive"

  # Deployment configuration
  deploymentConfig:
    # Resource requirements
    resources:
      cpu: "10"
      memory: "32Gi"
      gpu: "1"

    # Image configuration
    image:
      registry: "docker.io"
      name: "lmcache/vllm-openai:2025-05-27-v1"
      pullPolicy: "IfNotPresent"
      pullSecretName: ""

    # Number of replicas
    replicas: 1

    # Deployment strategy
    deploymentStrategy: "Recreate"
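
The sample above can also be generated programmatically, e.g. when templating many runtimes. The sketch below builds a reduced VLLMRuntime manifest as a Python dict and emits JSON (which `kubectl apply -f -` accepts); it only uses fields shown in the sample YAML, and anything beyond that reduced subset is an assumption about what the CRD tolerates as optional.

```python
import json

def vllm_runtime(name: str, model_url: str, replicas: int = 1) -> dict:
    """Build a minimal VLLMRuntime manifest mirroring the sample above."""
    return {
        "apiVersion": "production-stack.vllm.ai/v1alpha1",
        "kind": "VLLMRuntime",
        "metadata": {"name": name},
        "spec": {
            "model": {
                "modelURL": model_url,
                "maxModelLen": 4096,
                "dtype": "bfloat16",
            },
            "deploymentConfig": {"replicas": replicas},
        },
    }

manifest = vllm_runtime("vllmruntime-sample", "meta-llama/Llama-3.1-8B")
# Pipe to the cluster with: python gen.py | kubectl apply -f -
print(json.dumps(manifest, indent=2))
```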

VLLMRouter example (production-stack_v1alpha1_vllmrouter.yaml)

apiVersion: production-stack.vllm.ai/v1alpha1
kind: VLLMRouter
metadata:
  labels:
    app.kubernetes.io/name: production-stack
    app.kubernetes.io/managed-by: kustomize
  name: vllmrouter-sample
spec:
  # Enable the router deployment
  enableRouter: true

  # Number of router replicas
  replicas: 1

  # Service discovery method (k8s or static)
  serviceDiscovery: k8s

  # Label selector for vLLM runtime pods
  k8sLabelSelector: "app=vllmruntime-sample"

  # Routing strategy (roundrobin or session)
  routingLogic: roundrobin

  # Engine statistics collection interval
  engineScrapeInterval: 30

  # Request statistics window
  requestStatsWindow: 60

  # Container port for the router service
  port: 80

  # Service account name
  serviceAccountName: vllmrouter-sa

  # Image configuration
  image:
    registry: docker.io
    name: lmcache/lmstack-router
    pullPolicy: IfNotPresent

  # Resource requirements
  resources:
    cpu: "2"
    memory: "8Gi"

  # Environment variables
  env:
    - name: LOG_LEVEL
      value: "info"
    - name: METRICS_ENABLED
      value: "true"

  # Node selector for pod scheduling
  nodeSelectorTerms:
    - matchExpressions:
        - key: kubernetes.io/os
          operator: In
          values:
            - linux

Testing the Deployment#

  1. Port-forward the Router

    Expose the Router service locally:

    kubectl port-forward svc/vllmrouter-sample 30080:80 --address 0.0.0.0
    
  2. Test with a simple request

    In a separate terminal, test the deployment with curl:

    curl -X POST http://localhost:30080/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "meta-llama/Llama-3.1-8B",
        "prompt": "1 plus 1 equals to",
        "max_tokens": 100
      }'
    

    A successful response looks like:

    {
      "id": "cmpl-0c3a06af79df4cb2a5e6f8c3fb1f1215",
      "object": "text_completion",
      "created": 1750121964,
      "model": "meta-llama/Llama-3.1-8B",
      "choices": [
        {
          "index": 0,
          "text": " 2\nThis is a very simple equation...",
          "logprobs": null,
          "finish_reason": "length",
          "stop_reason": null,
          "prompt_logprobs": null
        }
      ],
      "usage": {
        "prompt_tokens": 8,
        "total_tokens": 108,
        "completion_tokens": 100,
        "prompt_tokens_details": null
      },
      "kv_transfer_params": null
    }
    
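    The curl call above can also be scripted with Python's standard library. This is a minimal sketch that assumes the port-forward from step 1 is active; the network call is wrapped in a function so the response-parsing demo below it runs offline against the sample response shown above.

```python
import json
import urllib.request

def complete(prompt: str, base_url: str = "http://localhost:30080") -> dict:
    """Send the same request as the curl example through the port-forwarded router."""
    payload = {
        "model": "meta-llama/Llama-3.1-8B",
        "prompt": prompt,
        "max_tokens": 100,
    }
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def extract_text(response: dict) -> str:
    """Pull the generated text out of an OpenAI-style completion response."""
    return response["choices"][0]["text"]

# Parsing an abridged copy of the sample response above (no cluster needed):
sample = {
    "model": "meta-llama/Llama-3.1-8B",
    "choices": [{"index": 0,
                 "text": " 2\nThis is a very simple equation...",
                 "finish_reason": "length"}],
    "usage": {"prompt_tokens": 8, "total_tokens": 108, "completion_tokens": 100},
}
print(extract_text(sample))  # prints the sample completion text
```

    In a live test, `extract_text(complete("1 plus 1 equals to"))` returns the generated continuation directly.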

Uninstallation#

  1. Delete the custom resources

    kubectl delete vllmrouter vllmrouter-sample
    kubectl delete vllmruntime vllmruntime-sample
    
  2. Delete the secret (if created)

    kubectl delete secret huggingface-token --namespace=default
    
  3. Delete the Operator and CRDs

    Delete the entire Operator deployment and the custom resource definitions:

    kubectl delete -f operator/config/default.yaml
    
  4. Verify cleanup

    Confirm that all resources have been removed:

    kubectl get namespace production-stack-system
    
    kubectl get crd | grep production-stack
    
    kubectl get pods --all-namespaces | grep -E "(vllmruntime|vllmrouter)"
    

    These commands should return no results, indicating a successful cleanup.

Happy deploying! 🚀