Sleep and Wake-up Mode

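This tutorial demonstrates how to use the sleep and wake_up features of vLLM v1 with the vLLM Production Stack. A sleeping engine does not process any requests and frees its resources (for example, GPU memory). The router can serve vLLM engines that have sleep mode enabled.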

Table of Contents

Prerequisites
Step 1: Deploy with Sleep Mode Enabled
Step 2: Port Forwarding
Step 3: Test the Engine's Sleep Mode
Step 4: Clean Up

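Prerequisites

- Completion of the following tutorials:
  - Prerequisites
  - Quickstart
- A Kubernetes environment with GPU support
- Basic familiarity with Kubernetes and Helm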

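Step 1: Deploy with Sleep Mode Enabled

We will use the predefined configuration file values-19-sleep-mode-aware.yaml, which sets up a vLLM instance with sleep mode enabled.

Deploy the Helm chart with this configuration:

helm install vllm helm/ -f tutorials/assets/values-19-sleep-mode-aware.yaml

Wait for the deployment to complete: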

kubectl get pods -w

Step 2: Port Forwarding

Forward the router service port to your local machine:

kubectl port-forward svc/vllm-router-service 30080:80

Step 3: Test the Engine's Sleep Mode

First, list the available engines:

curl -o- http://localhost:30080/engines | jq

The expected output is similar to:

[
  {
    "engine_id": "b36921ab-6611-58c0-a941-16c51296446b",
    "serving_models": [
      "ibm-granite/granite-3.0-3b-a800m-instruct"
    ],
    "created": 1750035988
  }
]
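When these steps are scripted, the engine id can be read from the /engines response instead of being copied by hand. The following is a minimal sketch in Python, assuming the response has the shape shown above; the helper name `find_engine_ids` is hypothetical and not part of the Production Stack.

```python
import json
from typing import List


def find_engine_ids(engines_json: str, model: str) -> List[str]:
    """Return the ids of all engines that list `model` in serving_models."""
    engines = json.loads(engines_json)
    return [
        engine["engine_id"]
        for engine in engines
        if model in engine.get("serving_models", [])
    ]


# Sample copied from the expected output above.
SAMPLE_ENGINES = """
[
  {
    "engine_id": "b36921ab-6611-58c0-a941-16c51296446b",
    "serving_models": ["ibm-granite/granite-3.0-3b-a800m-instruct"],
    "created": 1750035988
  }
]
"""

print(find_engine_ids(SAMPLE_ENGINES, "ibm-granite/granite-3.0-3b-a800m-instruct"))
```

Returning a list rather than a single id keeps the helper usable when several engines serve the same model.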

Using the id of the target engine, check its sleep status:

curl -o- "http://localhost:30080/is_sleeping?id=b36921ab-6611-58c0-a941-16c51296446b" | jq

Expected output:

{
  "is_sleeping": false
}
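The same status check can be made from Python using only the standard library. This is a sketch under stated assumptions: the router is reachable at `http://localhost:30080` through the port-forward from Step 2, and the helper names `engine_url` and `is_sleeping` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the port-forward from Step 2 is active on this address.
ROUTER = "http://localhost:30080"


def engine_url(action: str, engine_id: str, base: str = ROUTER) -> str:
    """Build a router URL such as <base>/is_sleeping?id=<engine_id>."""
    query = urllib.parse.urlencode({"id": engine_id})
    return f"{base}/{action}?{query}"


def is_sleeping(engine_id: str, base: str = ROUTER) -> bool:
    """GET /is_sleeping and return the boolean from the JSON body."""
    with urllib.request.urlopen(engine_url("is_sleeping", engine_id, base)) as resp:
        return bool(json.load(resp)["is_sleeping"])


print(engine_url("is_sleeping", "b36921ab-6611-58c0-a941-16c51296446b"))
```

Calling `is_sleeping("b36921ab-6611-58c0-a941-16c51296446b")` performs the actual request; only the URL construction runs when the block is executed as written.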

Put the engine to sleep, then check its sleep status again:

curl -X POST "http://localhost:30080/sleep?id=b36921ab-6611-58c0-a941-16c51296446b" | jq

Expected output:

{
  "status": "success"
}

curl -o- "http://localhost:30080/is_sleeping?id=b36921ab-6611-58c0-a941-16c51296446b" | jq

Expected output:

{
  "is_sleeping": true
}

The vLLM pod logs show that the engine has gone to sleep:

INFO 06-15 18:08:18 [gpu_worker.py:81] Sleep mode freed 39.26 GiB memory, 1.20 GiB memory is still in use.
INFO 06-15 18:08:18 [executor_base.py:210] It took 5.749613 seconds to fall asleep.
INFO:     10.130.2.172:47082 - "POST /sleep HTTP/1.1" 200 OK
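To track how much memory sleep mode releases, the figures can be extracted from the log line. The sketch below assumes the log message keeps the wording shown above, which may differ between vLLM versions; `parse_sleep_log` is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# Pattern matched against the gpu_worker log message shown above.
SLEEP_LOG = re.compile(
    r"Sleep mode freed (?P<freed>[\d.]+) GiB memory, "
    r"(?P<in_use>[\d.]+) GiB memory is still in use"
)


def parse_sleep_log(line: str) -> Optional[Tuple[float, float]]:
    """Return (freed_gib, in_use_gib), or None if the line does not match."""
    match = SLEEP_LOG.search(line)
    if match is None:
        return None
    return float(match["freed"]), float(match["in_use"])


LOG_LINE = (
    "INFO 06-15 18:08:18 [gpu_worker.py:81] Sleep mode freed 39.26 GiB memory, "
    "1.20 GiB memory is still in use."
)
freed, in_use = parse_sleep_log(LOG_LINE)
# Share of the memory reported in this log line that was released.
print(f"{freed / (freed + in_use):.1%} released")
```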

Next, send a request to the router:

curl "http://localhost:30080/v1/completions?id=b36921ab-6611-58c0-a941-16c51296446b" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ibm-granite/granite-3.0-3b-a800m-instruct",
    "prompt": "What is the capital of France?",
    "max_tokens": 100
  }' | jq

Expected output:

{
  "error": "Model ibm-granite/granite-3.0-3b-a800m-instruct not found or vLLM engine is sleeping."
}
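The router rejects the request on its own, but a client can also check the sleep status first and avoid sending a request that is known to fail. The following is a sketch of that client-side guard; `guarded_completion` and its two callables are hypothetical names, and the stubs stand in for real HTTP calls.

```python
from typing import Callable, Optional


def guarded_completion(
    engine_id: str,
    is_sleeping: Callable[[str], bool],
    send: Callable[[str], dict],
) -> Optional[dict]:
    """Send a completion only if the engine is awake; return None otherwise."""
    if is_sleeping(engine_id):
        return None  # The caller can wake the engine or pick another one.
    return send(engine_id)


# Stubs standing in for the /is_sleeping and /v1/completions requests.
def asleep(engine_id: str) -> bool:
    return True


def awake(engine_id: str) -> bool:
    return False


def fake_send(engine_id: str) -> dict:
    return {"object": "text_completion", "engine": engine_id}


print(guarded_completion("engine-1", asleep, fake_send))
print(guarded_completion("engine-1", awake, fake_send))
```

Passing the HTTP calls in as callables keeps the guard logic testable without a running cluster.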

Now wake up the vLLM engine and check its sleep status:

curl -X POST "http://localhost:30080/wake_up?id=b36921ab-6611-58c0-a941-16c51296446b" | jq

Expected output:

{
  "status": "success"
}

curl -o- "http://localhost:30080/is_sleeping?id=b36921ab-6611-58c0-a941-16c51296446b" | jq

Expected output:

{
  "is_sleeping": false
}
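In automation it can help to poll the sleep status after a wake-up call until the engine reports that it is awake. The sketch below shows one way to do this; `wait_until_awake` is a hypothetical helper, `probe` is assumed to return True while the engine is still sleeping, and the timeout values are arbitrary.

```python
import time
from typing import Callable


def wait_until_awake(
    probe: Callable[[], bool],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns False (awake) or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if not probe():
            return True  # Engine reports is_sleeping == false.
        if time.monotonic() >= deadline:
            return False  # Still sleeping when the timeout expired.
        sleep(interval_s)


# Simulated sequence of /is_sleeping results: sleeping twice, then awake.
states = iter([True, True, False])
print(wait_until_awake(lambda: next(states), sleep=lambda _: None))
```

In real use, `probe` would wrap a request to the router's /is_sleeping endpoint for the target engine.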

The vLLM pod logs show that the engine has woken up:

INFO 06-15 18:11:37 [api_server.py:719] wake up the engine with tags: None
INFO 06-15 18:11:37 [executor_base.py:226] It took 0.284914 seconds to wake up tags {'kv_cache', 'weights'}.
INFO:     10.130.2.172:46672 - "POST /wake_up HTTP/1.1" 200 OK

Finally, send the request to the router again:

curl "http://localhost:30080/v1/completions?id=b36921ab-6611-58c0-a941-16c51296446b" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ibm-granite/granite-3.0-3b-a800m-instruct",
    "prompt": "What is the capital of France?",
    "max_tokens": 100
  }' | jq

The expected output is similar to:

{
  "id": "cmpl-125b905e89a34384af754a24bc8ea686",
  "object": "text_completion",
  "created": 1750036455,
  "model": "ibm-granite/granite-3.0-3b-a800m-instruct",
  "choices": [
    {
      "index": 0,
      "text": "\n\nThe capital of France is Paris.\n\n[Answer] The capital of France is Paris.",
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 31,
    "completion_tokens": 24,
    "prompt_tokens_details": null
  }
}
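A script that consumes this response usually needs only the generated text and a quick sanity check of the token counts. The sketch below works on a trimmed copy of the response above; `summarize_completion` is a hypothetical helper, and it assumes the response contains at least one choice and a usage object.

```python
def summarize_completion(payload: dict) -> dict:
    """Extract the first choice and check that the usage counts add up."""
    choice = payload["choices"][0]
    usage = payload["usage"]
    return {
        "text": choice["text"].strip(),
        "finish_reason": choice["finish_reason"],
        "usage_consistent": (
            usage["prompt_tokens"] + usage["completion_tokens"]
            == usage["total_tokens"]
        ),
    }


# Trimmed copy of the expected output above (only the fields used here).
SAMPLE_COMPLETION = {
    "choices": [
        {
            "index": 0,
            "text": "\n\nThe capital of France is Paris.\n\n[Answer] The capital of France is Paris.",
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 7, "total_tokens": 31, "completion_tokens": 24},
}

summary = summarize_completion(SAMPLE_COMPLETION)
print(summary["finish_reason"], summary["usage_consistent"])
```

For the sample response, the check confirms that 7 prompt tokens plus 24 completion tokens equal the reported total of 31.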

Step 4: Clean Up

Remove the deployment:

helm uninstall vllm

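Conclusion

In this tutorial, we demonstrated how to:

- Deploy the vLLM Production Stack with sleep mode enabled on the vLLM engine
- Set up port forwarding to access the router
- Test the vLLM engine's sleep mode and the router's sleep-aware routing

Sleep-aware routing ensures that the router does not forward requests to sleeping engines. Requests are therefore not sent to an engine that cannot serve them, which improves the overall performance of the Production Stack.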