Intelligent Semantic Routing


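This use case demonstrates how to integrate vLLM Semantic Router with vLLM Production Stack to build an intelligent Mixture-of-Models (MoM) system. Semantic Router runs as an Envoy external processor (ExtProc) and semantically routes OpenAI API-compatible requests to the most suitable backend model, using BERT-based or decoder-only LoRA classification, prompt guard, and semantic caching to improve both quality and cost efficiency.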

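What is vLLM Semantic Router?

vLLM Semantic Router provides:

Automatic model selection: routes math, creative writing, code, and general queries to the best-suited model
Security and privacy: PII detection, prompt guard, and safe routing for sensitive prompts
Performance optimization: semantic caching and better tool selection to reduce latency and token count
Architecture: tight integration with Envoy ExtProc, with dual Go and Rust implementations
Monitoring: console, Grafana dashboards, Prometheus metrics, and tracing for full visibility

Learn more: vLLM Semantic Router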

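Integration Benefits

vLLM Production Stack provides the deployment layer: it launches vLLM servers, routes traffic across models, performs service discovery and fault tolerance through the Kubernetes API, and supports round-robin, session-based, prefix-aware, KV-aware, and disaggregated-prefill routing, with native LMCache support.

Semantic Router adds a system-intelligence layer that:

Classifies each user request
Selects the most suitable model from the pool
Injects domain-specific system prompts
Performs semantic caching
Enforces enterprise-grade security checks such as PII and jailbreak detection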
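
The first three steps of this flow can be pictured with a toy shell sketch. It is only an illustration: the real router classifies with BERT or LoRA models rather than keywords, and the function names, categories, model names, and system prompts below are all hypothetical.

```shell
# Toy illustration of classify -> select model -> inject system prompt.
# Keyword matching stands in for the real classifier; every name is hypothetical.

# Map a prompt to a coarse category.
classify() {
  local text
  text=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$text" in
    *derivative*|*integral*|*equation*) echo "math" ;;
    *python*|*compile*|*stack\ trace*)  echo "code" ;;
    *)                                  echo "general" ;;
  esac
}

# Pick a backend model for a category (placeholder model names).
select_model() {
  case "$1" in
    math) echo "math-model" ;;
    code) echo "code-model" ;;
    *)    echo "general-model" ;;
  esac
}

# Choose a domain-specific system prompt for a category.
system_prompt() {
  case "$1" in
    math) echo "You are a careful math tutor." ;;
    code) echo "You are a senior software engineer." ;;
    *)    echo "You are a helpful assistant." ;;
  esac
}

# Print the routing decision for one prompt as category|model|system prompt.
route() {
  local category
  category=$(classify "$1")
  printf '%s|%s|%s\n' "$category" "$(select_model "$category")" "$(system_prompt "$category")"
}
```

Calling route 'What is the derivative of f(x) = x^3?' prints the math category together with its placeholder model and system prompt. Semantic caching and the security checks are left out to keep the sketch short.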

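Combining the two systems yields a unified inference stack: semantic routing ensures that each request is answered by the best available model, while Production Stack routing maximizes infrastructure and inference efficiency and exposes rich metrics.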

Prerequisites

kubectl
Helm
A Kubernetes cluster (kind, minikube, GKE, etc.)
Completion of the Prerequisites and Quick Start guides

Step 1: Deploy vLLM Production Stack

Deploy vLLM Production Stack using the provided Helm values file:

helm repo add vllm-production-stack https://vllm-project.github.io/production-stack
helm install vllm-stack vllm-production-stack/vllm-stack -f https://raw.githubusercontent.com/vllm-project/production-stack/main/tutorials/assets/values-23-SR.yaml

The example values file configures:

Model: Qwen/Qwen3-8B with 2 replicas
Router: round-robin routing logic with session key support
Resources: 8 CPUs, 16Gi memory, and 1 GPU per instance
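
As a quick sanity check, the router's OpenAI-compatible /v1/models endpoint can be queried to confirm that the model is being served. The sketch below assumes the router Service is named vllm-router-service and listens on port 80, and uses an arbitrary local port 30080; lists_model is a hypothetical helper, not part of either project.

```shell
# Succeeds if the /v1/models JSON read from stdin lists the given model id.
# (Hypothetical helper; a plain grep avoids a dependency on jq.)
lists_model() {
  grep -q "\"id\"[[:space:]]*:[[:space:]]*\"$1\""
}

# Usage, assuming the router Service is vllm-router-service on port 80:
#   kubectl port-forward svc/vllm-router-service 30080:80 &
#   curl -s http://localhost:30080/v1/models | lists_model "Qwen/Qwen3-8B" \
#     && echo "Qwen/Qwen3-8B is being served"
```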

Identify the router's ClusterIP and port:

kubectl get svc vllm-router-service
# Note the router service ClusterIP and port (e.g., 10.97.254.122:80)
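
To avoid copying the address by hand, the same value can be captured with a jsonpath query. This is a minimal sketch, assuming the Service exposes its serving port as the first entry; router_endpoint is a hypothetical helper name.

```shell
# Print "<ClusterIP>:<port>" for the Production Stack router Service.
# Assumes the first listed port is the serving port.
router_endpoint() {
  kubectl get svc "${1:-vllm-router-service}" \
    -o jsonpath='{.spec.clusterIP}:{.spec.ports[0].port}'
}

# Usage:
#   ROUTER_ENDPOINT=$(router_endpoint)
#   echo "Router endpoint: ${ROUTER_ENDPOINT}"
```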

Step 2: Deploy vLLM Semantic Router

Follow the official Install in Kubernetes guide, using the updated configuration.

# Deploy vLLM Semantic Router manifests
kubectl apply -k deploy/kubernetes/ai-gateway/semantic-router
kubectl wait --for=condition=Available deployment/semantic-router \
  -n vllm-semantic-router-system --timeout=600s

# Install Envoy Gateway
helm upgrade -i eg oci://docker.io/envoyproxy/gateway-helm \
  --version v0.0.0-latest \
  --namespace envoy-gateway-system \
  --create-namespace \
  -f https://raw.githubusercontent.com/envoyproxy/ai-gateway/main/manifests/envoy-gateway-values.yaml

# Install Envoy AI Gateway
helm upgrade -i aieg oci://docker.io/envoyproxy/ai-gateway-helm \
  --version v0.0.0-latest \
  --namespace envoy-ai-gateway-system \
  --create-namespace

# Install Envoy AI Gateway CRDs
helm upgrade -i aieg-crd oci://docker.io/envoyproxy/ai-gateway-crds-helm \
  --version v0.0.0-latest \
  --namespace envoy-ai-gateway-system

# Wait for AI Gateway to be ready
kubectl wait --timeout=300s -n envoy-ai-gateway-system \
  deployment/ai-gateway-controller --for=condition=Available
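
Before creating routes, it can help to confirm that every component is up. The sketch below waits on all Deployments in the three namespaces used above; wait_for_stack is a hypothetical helper and the 300-second default is an arbitrary choice.

```shell
# Wait until all Deployments in the namespaces used above report Available.
# Stops at the first namespace that does not become ready in time.
wait_for_stack() {
  local timeout="${1:-300s}" ns
  for ns in vllm-semantic-router-system envoy-gateway-system envoy-ai-gateway-system; do
    echo "Waiting for deployments in ${ns}..."
    kubectl wait --for=condition=Available deployment --all \
      -n "${ns}" --timeout="${timeout}" || return 1
  done
}

# Usage:
#   wait_for_stack        # default 300s per namespace
#   wait_for_stack 600s
```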

Create the LLM demo backends and AI Gateway routes:

# Apply LLM demo backends
kubectl apply -f deploy/kubernetes/ai-gateway/aigw-resources/base-model.yaml
# Apply AI Gateway routes
kubectl apply -f deploy/kubernetes/ai-gateway/aigw-resources/gwapi-resources.yaml

Step 3: Test the Deployment

Port-forward to the Envoy service:

export ENVOY_SERVICE=$(kubectl get svc -n envoy-gateway-system \
  --selector=gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=semantic-router \
  -o jsonpath='{.items[0].metadata.name}')

kubectl port-forward -n envoy-gateway-system svc/$ENVOY_SERVICE 8080:80

Send a chat completion request:

curl -i -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [
      {"role": "user", "content": "What is the derivative of f(x) = x^3?"}
    ]
  }'

Semantic Router analyzes the request, identifies it as a math query, and routes it to a suitable model through the vLLM Production Stack router.
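
One way to see which backend actually answered is to read the model field of the OpenAI-compatible response body. This is a sketch under the assumption that the backend reports its own model name there; response_model is a hypothetical helper that uses sed rather than jq.

```shell
# Print the value of the "model" field from a chat completion JSON on stdin.
response_model() {
  sed -n 's/.*"model"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage, with the port-forward to the Envoy service still running:
#   curl -s -X POST http://localhost:8080/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{"model": "MoM", "messages": [{"role": "user", "content": "What is the derivative of f(x) = x^3?"}]}' \
#     | response_model
```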

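Troubleshooting

Gateway unreachable: check the status of the Gateway and the Envoy service
Semantic Router not responding: check pod status and logs with kubectl logs -n vllm-semantic-router-system
Error codes returned: check the Production Stack router logs with kubectl logs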
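
These checks can be gathered into a single pass. The sketch assumes the Gateway is named semantic-router in the default namespace (as in the port-forward selector earlier) and that Production Stack was installed in the current namespace; collect_diagnostics is a hypothetical helper.

```shell
# Gather status and recent logs for the components involved in routing.
collect_diagnostics() {
  # Gateway and Envoy service status
  kubectl get gateway semantic-router -n default
  kubectl get svc,pods -n envoy-gateway-system

  # Semantic Router pod status and recent logs
  kubectl get pods -n vllm-semantic-router-system
  kubectl logs -n vllm-semantic-router-system deployment/semantic-router --tail=50

  # Production Stack pods; pick the router pod name from this list, then run
  #   kubectl logs <router-pod-name> --tail=50
  kubectl get pods
}
```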

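Conclusion

In this use case, we demonstrated how to:

Deploy vLLM Production Stack with its router service
Integrate vLLM Semantic Router with Production Stack
Configure Envoy Gateway and AI Gateway for intelligent routing
Test semantic routing end to end

This integration combines semantic intelligence with production-grade infrastructure, enabling efficient, secure, and intelligent model routing across diverse workloads.

Note

Preview release: this guide is based on a preview version of the vLLM Semantic Router integration. As the feature evolves, deployment steps, configuration options, and API interfaces may change in future releases. Refer to the latest documentation for updates.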