Prefix Aware Routing

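This tutorial demonstrates how to use Prefix Aware Routing in the vLLM Production Stack. Prefix Aware Routing sends subsequent requests that share a prompt prefix to the same instance, which maximizes KV cache utilization and improves performance.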

Table of Contents

Prerequisites
Step 1: Deploy with Prefix Aware Routing
Step 2: Port Forwarding
Step 3: Testing Prefix Aware Routing
Step 4: Clean Up

Prerequisites

Completion of the following tutorials:
- Prerequisites
- Quickstart
A Kubernetes environment with GPU support
Basic familiarity with Kubernetes and Helm

Step 1: Deploy with Prefix Aware Routing

We will use the predefined configuration file values-18-prefix-aware.yaml, which sets up two vLLM instances with Prefix Aware Routing enabled.

Deploy the Helm chart with this configuration:

helm install vllm helm/ -f tutorials/assets/values-18-prefix-aware.yaml

Wait for the deployment to complete:

kubectl get pods -w
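
The watch command above keeps running until interrupted, which is awkward in scripts. As a possible alternative, the sketch below wraps `kubectl wait` in a small helper that blocks until the pods report Ready or a timeout expires. The function name `wait_for_pods` and the default timeout of 600s are illustrative assumptions, and the helper assumes that every pod in the current namespace belongs to this deployment.

```shell
# Hypothetical helper (not part of the tutorial): block until all pods in the
# current namespace are Ready, or fail once the timeout expires.
# Assumes the namespace contains only the pods created by this deployment.
wait_for_pods() {
  # $1: optional timeout such as 120s or 10m (defaults to 600s, an assumed value)
  kubectl wait --for=condition=Ready pods --all --timeout="${1:-600s}"
}
```

Calling `wait_for_pods 900s` would allow more time, for example when model weights are still downloading.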

Step 2: Port Forwarding

Forward the router service port to your local machine:

kubectl port-forward svc/vllm-router-service 30080:80

Step 3: Testing Prefix Aware Routing

First, send a request to the router:

curl http://localhost:30080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "prompt": "What is the capital of France?",
    "max_tokens": 100
  }'

Then, send another request with the same prompt prefix:

curl http://localhost:30080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "prompt": "What is the capital of France? And what is its population?",
    "max_tokens": 100
  }'
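
To repeat this experiment with other prompts, the two requests can be wrapped in small shell functions. This is a minimal sketch: the names `build_payload`, `send_completion`, and `run_demo` are illustrative, and `ROUTER_URL` assumes the port-forward from Step 2 is listening on localhost:30080.

```shell
# Assumed defaults: the router is reachable through the Step 2 port-forward.
ROUTER_URL="${ROUTER_URL:-http://localhost:30080}"
MODEL="${MODEL:-meta-llama/Llama-3.2-1B-Instruct}"

# Build the JSON body for a completion request.
# $1: prompt, $2: optional max_tokens (defaults to 100).
# The prompt is inserted verbatim, so it must not contain double quotes
# or backslashes.
build_payload() {
  printf '{"model": "%s", "prompt": "%s", "max_tokens": %d}' "$MODEL" "$1" "${2:-100}"
}

# Send one completion request to the router and print the raw response.
send_completion() {
  curl -sS "$ROUTER_URL/v1/completions" \
    -H "Content-Type: application/json" \
    -d "$(build_payload "$1")"
}

# Send the two prompts from this step, the second extending the first.
run_demo() {
  send_completion "What is the capital of France?"
  echo
  send_completion "What is the capital of France? And what is its population?"
  echo
}
```

After loading these definitions into a shell, calling `run_demo` sends both requests in order.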

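You should observe that the second request is routed to the same instance as the first. The Prefix Aware router detects that the second request shares a prefix with the first and routes it to the same instance to maximize KV cache utilization.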
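
To make the idea of a shared prefix concrete, the sketch below counts how many leading characters two prompts have in common. It is only a character-level illustration: the granularity at which the router actually matches prefixes is not described in this tutorial, and `common_prefix_len` is a hypothetical helper name.

```shell
# Illustration only: count the leading characters shared by two strings.
# This is not the router's matching logic, just a way to see the overlap.
common_prefix_len() {
  a=$1
  b=$2
  n=0
  while [ -n "$a" ] && [ -n "$b" ]; do
    ca=${a%"${a#?}"}   # first character of a
    cb=${b%"${b#?}"}   # first character of b
    [ "$ca" = "$cb" ] || break
    a=${a#?}           # drop the first character
    b=${b#?}
    n=$((n + 1))
  done
  echo "$n"
}
```

For the two prompts used in this step, `common_prefix_len` prints 30, the full length of the first prompt, because the second prompt begins with the first one in its entirety.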

Specifically, you should see log lines similar to the following:

[2025-06-03 06:16:28,963] LMCache DEBUG: Scheduled to load 5 tokens for request cmpl-306538839e87480ca5604ecc5f75c847-0 (vllm_v1_adapter.py:299:lmcache.integration.vllm.vllm_v1_adapter)
[2025-06-03 06:16:28,966] LMCache DEBUG: Retrieved 6 out of 6 out of total 6 tokens (cache_engine.py:330:lmcache.experimental.cache_engine)
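
When the logs are long, it can help to filter for these retrieval lines. The sketch below reads log text on standard input and prints the first and last numbers from each "Retrieved ... tokens" line. It assumes the log format shown above; `summarize_lmcache_hits` is an illustrative name, and the pod name in the usage note is a placeholder.

```shell
# Read log lines on stdin. For each LMCache retrieval line, print the first
# number (tokens retrieved) and the last number (total tokens), separated by
# a space. Lines in any other format are ignored.
summarize_lmcache_hits() {
  sed -n 's/.*LMCache DEBUG: Retrieved \([0-9][0-9]*\) out of .* total \([0-9][0-9]*\) tokens.*/\1 \2/p'
}
```

One way to use it is `kubectl logs <pod-name> | summarize_lmcache_hits`, replacing `<pod-name>` with the name of a vLLM instance pod from `kubectl get pods`.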

Step 4: Clean Up

Uninstall the deployment:

helm uninstall vllm

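Conclusion

In this tutorial, we demonstrated how to:

Deploy the vLLM Production Stack with Prefix Aware Routing
Set up port forwarding to access the router
Test the Prefix Aware Routing functionality

Prefix Aware Routing helps improve performance by routing requests that share a prefix to the same instance, which maximizes KV cache utilization.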