Configuration Options
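This section lists the most commonly used options for running vLLM.
Configuration is split into three main levels, listed from highest to lowest priority:
Request parameters and input parameters
Engine arguments
Environment variables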
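The priority ordering above can be illustrated with a small, generic sketch. This is not vLLM's implementation and does not call any vLLM API. The function name `resolve_settings` and every key and value below are hypothetical, chosen only to show what "highest priority wins" means when the same setting appears at more than one level.

```python
from collections import ChainMap


def resolve_settings(request_params, engine_args, env_vars):
    """Combine three hypothetical configuration levels, highest priority first.

    ChainMap searches its mappings from left to right and returns the first
    match, so a key set at the request level shadows the same key at the
    engine level, which in turn shadows the environment level.
    """
    return ChainMap(request_params, engine_args, env_vars)


# Illustrative values only; these keys are made up for the sketch.
env_vars = {"max_tokens": 16, "log_level": "INFO"}
engine_args = {"max_tokens": 256, "seed": 0}
request_params = {"max_tokens": 64}

settings = resolve_settings(request_params, engine_args, env_vars)

print(settings["max_tokens"])  # set at all three levels; the request value is used
print(settings["seed"])        # set only at the engine level
print(settings["log_level"])   # set only at the environment level
```

In practice each level tends to expose its own set of options, so the sketch only conveys the ordering and not how any particular setting is looked up.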