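CustomOp
CustomOp is an abstract class that dispatches the forward method of various operations to the appropriate backend. It also gives vLLM and OOT (out-of-tree) plugins a mechanism for registering their custom operations.
This document describes how CustomOp works in vLLM and how to implement a new CustomOp.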
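How CustomOp Works in vLLM
CustomOp keeps two class-level dictionaries, one for vLLM's custom ops and one for those of OOT plugins. Each maps a registered name to its op class.
An op class is registered with the CustomOp system through the @CustomOp.register("op_name") decorator, which adds op_name and its class to the op_registry dictionary. OOT ops are registered through @CustomOp.register_oot("op_name"); this mechanism is described in detail later.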
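The registration mechanism can be pictured with a minimal, self-contained sketch. ToyCustomOp and ToyReLU are hypothetical stand-ins and not vLLM's actual implementation; only the decorator names (register, register_oot) and dictionary names (op_registry, op_registry_oot) mirror the text above.

```python
class ToyCustomOp:
    """Hypothetical stand-in for CustomOp (not vLLM's real class)."""

    # Two class-level registries, each mapping a registered name to an op class.
    op_registry: dict[str, type] = {}
    op_registry_oot: dict[str, type] = {}

    @classmethod
    def register(cls, name: str):
        def decorator(op_cls: type) -> type:
            # Assumption of this sketch: duplicate in-tree names are rejected.
            assert name not in cls.op_registry, f"duplicate op name: {name}"
            op_cls.name = name
            cls.op_registry[name] = op_cls
            return op_cls

        return decorator

    @classmethod
    def register_oot(cls, name: str):
        def decorator(op_cls: type) -> type:
            cls.op_registry_oot[name] = op_cls
            return op_cls

        return decorator


@ToyCustomOp.register("toy_relu")
class ToyReLU(ToyCustomOp):
    def forward_native(self, x: float) -> float:
        return max(x, 0.0)
```

After the decorator runs, ToyCustomOp.op_registry["toy_relu"] refers to ToyReLU, while op_registry_oot stays empty until an OOT op is registered.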
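When a CustomOp is called (i.e., its forward() method is invoked) and it is enabled (e.g., with --compilation_config.custom_ops '["+op_name"]'), it automatically dispatches the forward method to the appropriate backend according to current_platform. Otherwise (i.e., it is disabled), it only calls forward_native(), the PyTorch-native implementation of the forward method. The dispatch targets per platform are: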
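- CPU platform: dispatches to forward_cpu().
- CUDA platform: dispatches to forward_cuda().
- ROCm platform: dispatches to forward_hip(). If forward_hip() is not implemented, forward_cuda() is used as a fallback.
- XPU platform: dispatches to forward_xpu().
- TPU platform: dispatches to forward_tpu().
- OOT platform: dispatches to forward_oot(). This is only called on OOT platforms.
- Default: dispatches to forward_native() as the final fallback for all platforms.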
Note
Because of class inheritance, this dispatch logic is not absolute: a derived class may override it.
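The dispatch rules above can be sketched as a small lookup from platform name to backend method. This is a toy model under stated assumptions: ToyOp, ToyCpuOp and the platform strings are hypothetical, and the real CustomOp reads current_platform rather than taking a string argument.

```python
class ToyOp:
    """Hypothetical op that mimics platform-based dispatch."""

    def __init__(self, platform: str, enabled: bool = True) -> None:
        # A disabled op always uses the native implementation.
        self._forward = self._dispatch(platform) if enabled else self.forward_native

    def _dispatch(self, platform: str):
        # Platform name -> bound backend method (names are illustrative).
        table = {
            "cpu": self.forward_cpu,
            "cuda": self.forward_cuda,
            "rocm": self.forward_hip,
            "xpu": self.forward_xpu,
            "tpu": self.forward_tpu,
            "oot": self.forward_oot,
        }
        # Unknown platforms fall back to the native implementation.
        return table.get(platform, self.forward_native)

    def forward(self, x):
        return self._forward(x)

    def forward_native(self, x):
        return ("native", x)

    def forward_cuda(self, x):
        return ("cuda", x)

    def forward_hip(self, x):
        # No dedicated HIP kernel in this toy op: reuse the CUDA path.
        return self.forward_cuda(x)

    # Backends this toy op does not specialise fall back to native.
    def forward_cpu(self, x):
        return self.forward_native(x)

    def forward_xpu(self, x):
        return self.forward_native(x)

    def forward_tpu(self, x):
        return self.forward_native(x)

    def forward_oot(self, x):
        return self.forward_native(x)


class ToyCpuOp(ToyOp):
    # A derived class can change what a given platform resolves to.
    def forward_cpu(self, x):
        return ("cpu", x)
```

Here ToyOp("rocm").forward(x) ends up on the CUDA path, while ToyCpuOp("cpu").forward(x) uses the overridden CPU method, which illustrates why the table above is a default rather than a guarantee.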
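In addition, vLLM decides whether to enable or disable a CustomOp based on compilation_config.custom_ops. Specifically, if a CustomOp is not explicitly listed in compilation_config.custom_ops (i.e., it uses the default configuration), it is enabled if compilation_config.custom_ops contains all and disabled if it contains none.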
Note
all and none cannot both be present in compilation_config.custom_ops.
By default, if compilation_config.backend == "inductor" and compilation_config.mode != CompilationMode.NONE, none is appended to compilation_config.custom_ops; otherwise all is appended. In other words, on platforms that use inductor as the default torch.compile backend, CustomOps are disabled when running in torch.compile mode. In that case, Inductor generates (fused) Triton kernels for the disabled custom ops.
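The default rule can be written down as a short function. This is a sketch, not vLLM's code: resolve_custom_ops and the compile_enabled flag (standing in for compilation_config.mode != CompilationMode.NONE) are hypothetical, and it assumes the default is appended only when the user supplied neither all nor none.

```python
def resolve_custom_ops(
    custom_ops: list[str], backend: str, compile_enabled: bool
) -> list[str]:
    """Return custom_ops with a default "all" or "none" appended if missing."""
    if "all" in custom_ops and "none" in custom_ops:
        raise ValueError("custom_ops cannot contain both 'all' and 'none'")

    resolved = list(custom_ops)
    if "all" not in resolved and "none" not in resolved:
        # Inductor generates its own fused kernels, so custom ops default to off.
        use_inductor = backend == "inductor" and compile_enabled
        resolved.append("none" if use_inductor else "all")
    return resolved
```

For instance, resolve_custom_ops(["+rms_norm"], "inductor", True) yields ["+rms_norm", "none"], so only rms_norm stays enabled.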
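Note
For multimodal models, vLLM force-enables some custom ops, such as MMEncoderAttention and ApplyRotaryEmb, so that the ViT part uses device-specific, deeply optimized kernels for better performance. You can also pass enforce_enable=True to a CustomOp's __init__() method to force-enable it at the object level.
This enforce_enable mechanism will be removed once a separate compilation_config is added for the multimodal part.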
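How to Customize the Configuration for CustomOp
vLLM also gives users fine-grained control over which custom ops are enabled or disabled: pass --compilation_config.custom_ops '["..."]' manually when launching the server.
For example: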
- Use --compilation_config.custom_ops '["all"]' to enable all custom ops.
- Use --compilation_config.custom_ops '["none"]' to disable all custom ops.
- Use --compilation_config.custom_ops '["all", "-op1"]' to enable all custom ops except op1 (the - prefix means "disable").
- Use --compilation_config.custom_ops '["none", "+op1", "+op2"]' to enable only op1 and op2 (the + prefix means "enable").
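To see how these lists resolve per op, here is a minimal sketch. The helper op_enabled is hypothetical, and it assumes every list entry is exactly one of all, none, +name or -name, with an explicit +name or -name taking precedence over the global all or none.

```python
def op_enabled(op_name: str, custom_ops: list[str]) -> bool:
    """Decide whether one op is enabled for a given custom_ops list."""
    enabled = f"+{op_name}" in custom_ops
    disabled = f"-{op_name}" in custom_ops
    if enabled and disabled:
        raise ValueError(f"{op_name} is both enabled and disabled")
    if enabled:
        return True
    if disabled:
        return False
    # No explicit entry: fall back to the global default ("all" or "none").
    return "all" in custom_ops


ops = ["op1", "op2", "op3"]
only_some = [name for name in ops if op_enabled(name, ["none", "+op1", "+op2"])]
all_but_one = [name for name in ops if op_enabled(name, ["all", "-op1"])]
```

With these inputs, only_some contains op1 and op2, and all_but_one contains op2 and op3.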
Types of Supported CustomOp in vLLM
1. Attention
@PluggableLayer.register("multi_head_latent_attention")
class MultiHeadLatentAttentionWrapper(PluggableLayer):
    """Pluggable MLA layer which allows OOT backends to add
    custom implementations of the outer MLA layer (including rope & o_proj).
    Note that currently oot platforms can still use CustomOp.register_oot to
    replace MLA layer entirely, although we use PluggableLayer to register
    this layer now.

    This class takes positions and hidden_states as input.
    The input tensors can either contain prefill tokens or decode tokens.
    The class does the following:

    1. MLA Preprocess.
    2. Perform multi-head attention to prefill tokens and
       multi-query attention to decode tokens separately.
    3. Return the output tensor.
    """
2. Activation
@CustomOp.register("silu_and_mul")
class SiluAndMul(CustomOp):
    """An activation function for SwiGLU.

    The function computes x -> silu(x[:d]) * x[d:] where d = x.shape[-1] // 2.

    Shapes:
        x: (num_tokens, 2 * d) or (batch_size, seq_len, 2 * d)
        return: (num_tokens, d) or (batch_size, seq_len, d)
    """

@CustomOp.register("mul_and_silu")
class MulAndSilu(CustomOp):
    """An activation function for SwiGLU.

    The function computes x -> x[:d] * silu(x[d:]) where d = x.shape[-1] // 2.

    Shapes:
        x: (num_tokens, 2 * d) or (batch_size, seq_len, 2 * d)
        return: (num_tokens, d) or (batch_size, seq_len, d)
    """

@CustomOp.register("gelu_new")
class NewGELU(CustomOp):
    ...

@CustomOp.register("gelu_fast")
class FastGELU(CustomOp):
    ...

@CustomOp.register("quick_gelu")
class QuickGELU(CustomOp):
    # https://github.com/huggingface/transformers/blob/main/src/transformers/activations.py#L90
    ...

@CustomOp.register("gelu_and_mul")
class GeluAndMul(CustomOp):
    """An activation function for GeGLU.

    The function computes x -> GELU(x[:d]) * x[d:] where d = x.shape[-1] // 2.

    Shapes:
        x: (batch_size, seq_len, 2 * d) or (num_tokens, 2 * d)
        return: (batch_size, seq_len, d) or (num_tokens, d)
    """

@CustomOp.register("gelu_and_mul_sparse")
class GeluAndMulSparse(CustomOp):
    """An activation function for GeluAndMulSparse.
    This activation function is used in Gemma3n. It computes:
        up_proj = self.up_proj(x)
        gate_proj = self.gate_proj(x)
        gate_proj = self._gaussian_topk(gate_proj) # sparsity
        activations = self.act_fn(gate_proj) # gelu
        down_proj = self.down_proj(activations * up_proj)

    Shapes:
        x: (num_tokens, 2 * d) or (batch_size, seq_len, 2 * d)
        return: (num_tokens, d) or (batch_size, seq_len, d)
    """

@CustomOp.register("relu2")
class ReLUSquaredActivation(CustomOp):
    """
    Applies the relu^2 activation introduced in https://arxiv.org/abs/2109.08668v2
    """

@CustomOp.register("xielu")
class XIELU(CustomOp):
    """
    Applies the xIELU activation function introduced in https://arxiv.org/abs/2411.13010
    If the user has installed the nickjbrowning/XIELU, we import xIELU CUDA
    Otherwise, we emit a single warning and use xIELU Python
    """

@CustomOp.register("swigluoai_and_mul")
class SwigluOAIAndMul(CustomOp):
    # https://github.com/huggingface/transformers/blob/v4.55.0/src/transformers/models/gpt_oss/modeling_gpt_oss.py#L106-L110
    ...

@CustomOp.register("fatrelu_and_mul")
class FatreluAndMul(CustomOp):
    """An activation function for FATReLU.

    The function computes x -> FATReLU(x[:d]) * x[d:] where
    d = x.shape[-1] // 2.
    This is used in openbmb/MiniCPM-S-1B-sft.

    Shapes:
        x: (num_tokens, 2 * d) or (batch_size, seq_len, 2 * d)
        return: (num_tokens, d) or (batch_size, seq_len, d)
    """
3. MM-Conv
@CustomOp.register("conv2d")
class Conv2dLayer(ConvLayerBase):
    """Conv layer with Conv2d."""

@CustomOp.register("conv3d")
class Conv3dLayer(ConvLayerBase):
    """Conv layer with Conv3d."""
4. Embedding
@PluggableLayer.register("vocab_parallel_embedding")
class VocabParallelEmbedding(PluggableLayer):
    """Embedding parallelized in the vocabulary dimension.

    Adapted from torch.nn.Embedding, note that we pad the vocabulary size to
    make sure it is divisible by the number of model parallel GPUs.

    In order to support various loading methods, we ensure that LoRA-added
    embeddings are always at the end of TP-sharded tensors. In other words,
    we shard base embeddings and LoRA embeddings separately (both padded),
    and place them in the same tensor.
    In this example, we will have the original vocab size = 1010,
    added vocab size = 16 and padding to 64. Therefore, the total
    vocab size with padding will be 1088 (because we first pad 1010 to
    1024, add 16, and then pad to 1088).
    Therefore, the tensor format looks like the following:
    TP1, rank 0 (no sharding):
                            |< --------BASE-------- >|< -BASE PADDING-- >|< -----LORA------ >|< -LORA PADDING-- >|
    corresponding token_id: |  0  |  1  | ... | 1009 |  -1  | ... |  -1  | 1010 | ... | 1025 |  -1  | ... |  -1  |
                     index: |  0  |  1  | ... | 1009 | 1010 | ... | 1023 | 1024 | ... | 1039 | 1040 | ... | 1087 |

    TP2, rank 0:
                            |< --------------------BASE--------------------- >|< -----LORA------ >|< -LORA PADDING- >|
    corresponding token_id: |  0  |  1  |  2  | ... | 497  | 498 | ... | 511  | 1010 | ... | 1025 |  -1  | ... | -1  |
                     index: |  0  |  1  |  2  | ... | 497  | 498 | ... | 511  | 512  | ... | 527  | 528  | ... | 543 |
    TP2, rank 1:
                            |< -----------BASE----------- >|< -BASE PADDING- >|< -----------LORA PADDING----------- >|
    corresponding token_id: | 512 | 513 | 514 | ... | 1009 |  -1  | ... | -1  |  -1  | ... |  -1  |  -1  | ... | -1  |
                     index: |  0  |  1  |  2  | ... | 497  | 498  | ... | 511 | 512  | ... | 527  | 528  | ... | 543 |

    Args:
        num_embeddings: vocabulary size.
        embedding_dim: size of hidden state.
        params_dtype: type of the parameters.
        org_num_embeddings: original vocabulary size (without LoRA).
        padding_size: padding size for the vocabulary.
        quant_config: quant config for the layer
        prefix: full name of the layer in the state dict
    """ # noqa: E501

@PluggableLayer.register("parallel_lm_head")
class ParallelLMHead(VocabParallelEmbedding):
    """Parallelized LM head.

    Output logits weight matrices used in the Sampler. The weight and bias
    tensors are padded to make sure they are divisible by the number of
    model parallel GPUs.

    Args:
        num_embeddings: vocabulary size.
        embedding_dim: size of hidden state.
        bias: whether to use bias.
        params_dtype: type of the parameters.
        org_num_embeddings: original vocabulary size (without LoRA).
        padding_size: padding size for the vocabulary.
    """
5. Linear
@PluggableLayer.register("row_parallel_linear")
class RowParallelLinear(LinearBase):
    """Linear layer with row parallelism.

    The linear layer is defined as Y = XA + b. A is parallelized along
    its first dimension and X along its second dimension as:
               -   -
              | A_1 |
              | .   |
          A = | .   |        X = [X_1, ..., X_p]
              | .   |
              | A_p |
               -   -
    Arguments:
        input_size: first dimension of matrix A.
        output_size: second dimension of matrix A.
        bias: If true, add bias. Note that bias is not parallelized.
        input_is_parallel: If true, we assume that the input is already
                           split across the GPUs and we do not split
                           again.
        skip_bias_add: This was added to enable performance optimization where
                       bias can be fused with other element-wise operations.
                       We skip adding bias but instead return it.
        params_dtype: Data type for the parameters.
        reduce_results: If true, call all-reduce on output and make Y available
                        to all GPUs, otherwise, every GPU will have its output
                        which is Y = X_iA_i
        quant_config: Quantization configure.
        prefix: The name of the layer in the state dict, including all parents
                (e.g. model.layers.0.down_proj)
        return_bias: If true, return bias together with outputs in forward pass.
        disable_tp: If true, weights matrix won't be sharded through tp rank.
    """

@PluggableLayer.register("column_parallel_linear")
class ColumnParallelLinear(LinearBase):
    """Linear layer with column parallelism.

    The linear layer is defined as Y = XA + b. A is parallelized along
    its second dimension as A = [A_1, ..., A_p].

    Args:
        input_size: first dimension of matrix A.
        output_size: second dimension of matrix A.
        bias: If true, add bias.
        gather_output: If true, call all-gather on output and make Y available
                       to all GPUs, otherwise, every GPU will have its output
                       which is Y_i = XA_i
        skip_bias_add: This was added to enable performance optimizations where
                       bias can be fused with other element-wise operations. we
                       skip adding bias but instead return it.
        params_dtype: Data type for the parameters.
        quant_config: Quantization configure.
        prefix: The name of the layer in the state dict, including all parents
                (e.g. model.layers.0.qkv_proj)
        return_bias: If true, return bias together with outputs in forward pass.
        disable_tp: If true, weights matrix won't be sharded through tp rank.
    """

@PluggableLayer.register("replicated_linear")
class ReplicatedLinear(LinearBase):
    """Replicated linear layer.

    Args:
        input_size: input dimension of the linear layer.
        output_size: output dimension of the linear layer.
        bias: If true, add bias.
        skip_bias_add: If true, skip adding bias but instead return it.
        params_dtype: Data type for the parameters.
        quant_config: Quantization configure.
        prefix: The name of the layer in the state dict, including all parents
                (e.g. model.layers.0.qkv_proj)
        return_bias: If true, return bias together with outputs in forward pass.
        disable_tp: Take no effect for replicated linear layers.
    """
6. Logits Processor
@PluggableLayer.register("logits_processor")
class LogitsProcessor(PluggableLayer):
    """Process logits and apply logits processors from sampling metadata.

    This layer does the following:
    1. Gather logits from model hidden_states.
    2. Scale logits if needed.
    3. Apply logits processors (if any).
    """
7. Mamba
@PluggableLayer.register("mamba_mixer")
class MambaMixer(MambaBase, PluggableLayer):
    """
    Compute ∆, A, B, C, and D the state space parameters and compute
    the `contextualized_states`. A, D are input independent
    (see Mamba paper [1] Section 3.5.2 "Interpretation of A"
    for why A isn't selective) ∆, B, C are input-dependent
    (this is a key difference between Mamba and the linear time
    invariant S4, and is why Mamba is called
    **selective** state spaces)
    """

@PluggableLayer.register("mamba_mixer2")
class MambaMixer2(MambaBase, PluggableLayer):
    """
    Compute ∆, A, B, C, and D the state space parameters and compute
    the `contextualized_states`. A, D are input independent
    (see Mamba paper [1] Section 3.5.2 "Interpretation of A"
    for why A isn't selective) ∆, B, C are input-dependent
    (this is a key difference between Mamba and the linear time
    invariant S4, and is why Mamba is called
    **selective** state spaces)
    """

@CustomOp.register("mixer2_gated_rms_norm")
class Mixer2RMSNormGated(CustomOp):
    ...

@PluggableLayer.register("plamo2_mamba_mixer")
class Plamo2MambaMixer(MambaBase, PluggableLayer):
    ...

@CustomOp.register("short_conv")
class ShortConv(MambaBase, CustomOp):
    ...
8. MoE
@CustomOp.register("fused_moe")
class FusedMoE(CustomOp):
    """FusedMoE layer for MoE models.

    This layer contains both MergedColumnParallel weights (gate_up_proj /
    w13) and RowParallelLinear weights (down_proj/ w2).

    Note: Mixtral uses w1, w2, and w3 for gate, up, and down_proj. We
    copy that naming convention here and handle any remapping in the
    load_weights function in each model implementation.

    Args:
        num_experts: Number of experts in the model
        top_k: Number of experts selected for each token
        hidden_size: Input hidden state size of the transformer
        intermediate_size: Intermediate size of the experts
        params_dtype: Data type for the parameters.
        reduce_results: Whether to all_reduce on the output of the layer
        renormalize: Whether to renormalize the logits in the fused_moe kernel
        quant_config: Quantization configure.
        enable_eplb: Whether to enable expert parallelism load balancer.
        router_logits_dtype: Data type for router logits buffers.
    """

@CustomOp.register("modular_fused_moe")
class FusedMoEModularMethod(FusedMoEMethodBase, CustomOp):
    ...

@CustomOp.register("unquantized_fused_moe")
class UnquantizedFusedMoEMethod(FusedMoEMethodBase, CustomOp):
    """MoE method without quantization."""

@CustomOp.register("transformers_fused_moe")
class TransformersFusedMoE(FusedMoE):
    """Custom FusedMoE for the Transformers modeling backend."""
9. Norm
@CustomOp.register("rms_norm")
class RMSNorm(CustomOp):
    """Root mean square normalization.

    Computes x -> w * x / sqrt(E[x^2] + eps) where w is the learned weight.
    Refer to https://arxiv.org/abs/1910.07467
    """

@CustomOp.register("rms_norm_gated")
class RMSNormGated(CustomOp):
    """RMS Normalization with optional gating.

    This is a native PyTorch implementation that supports:
    - Standard RMS normalization
    - Group RMS normalization
    - Optional gating with SiLU activation
    """

@CustomOp.register("gemma_rms_norm")
class GemmaRMSNorm(CustomOp):
    """RMS normalization for Gemma.

    Two differences from the above RMSNorm:
        1. x * (1 + w) instead of x * w.
        2. (x * w).to(orig_dtype) instead of x.to(orig_dtype) * w.
    """
10. Quantization
@CustomOp.register("quant_fp8")
class QuantFP8(CustomOp):
    """
    Quantize input tensor to FP8 (per-tensor, per-token, per-channel, or per-group).
    This CustomOp supports both static and dynamic quantization.
    """
11. Rope
@CustomOp.register("rotary_embedding")
class RotaryEmbeddingBase(CustomOp):
    """Original rotary positional embedding."""

@CustomOp.register("dual_chunk_rotary_embedding")
class DualChunkRotaryEmbedding(CustomOp):
    """Rotary positional embedding for Dual Chunk Attention."""

@CustomOp.register("apply_rotary_emb")
class ApplyRotaryEmb(CustomOp):
    ...
12. Encoder
@PluggableLayer.register("qwen2_decoder")
class CustomQwen2Decoder(PluggableLayer):
    """
    Qwen2 visual encoder
    non-causal attention + causal attention
    token_type_ids :0=non-causal, 1=causal
    """

@CustomOp.register("mm_encoder_attn")
class MMEncoderAttention(CustomOp):
    """Multi-headed attention without any cache, used for multimodal encoder."""

@PluggableLayer.register("rel_pos_attention")
class RelPosAttention(PluggableLayer):
    """Multi-head Attention block with relative position embeddings."""
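Guidelines for Implementing a New CustomOp
Implement a New CustomOp in vLLM
This part is a tutorial on how to implement a new CustomOp in vLLM.
Steps:
- Implement a new op class that inherits from the CustomOp base class.
- Add the @CustomOp.register("op_name") decorator to the op class to register it with the CustomOp system.
- Implement the forward_xxx() methods you need.
Taking MMEncoderAttention as an example:
Code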
import torch

from vllm.config import MultiModalConfig
from vllm.model_executor.custom_op import CustomOp


@CustomOp.register("mm_encoder_attn")
class MMEncoderAttention(CustomOp):
    def __init__(
        self,
        num_heads: int,
        head_size: int,
        scale: float | None = None,
        num_kv_heads: int | None = None,
        prefix: str = "",
        multimodal_config: MultiModalConfig | None = None,
    ) -> None:
        super().__init__()
        # Init...

    def forward_native(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        cu_seqlens: torch.Tensor | None = None,
        max_seqlen: torch.Tensor | None = None,  # Only used for Flash Attention
    ) -> torch.Tensor:
        # Call TORCH_SDPA implementation...
        ...

    def forward_cuda(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        cu_seqlens: torch.Tensor | None = None,
        max_seqlen: torch.Tensor | None = None,  # Only used for Flash Attention
    ) -> torch.Tensor:
        # Call FA or TORCH_SDPA implementation...
        ...

    def forward_cpu(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        cu_seqlens: torch.Tensor | None = None,
        max_seqlen: torch.Tensor | None = None,  # Only used for Flash Attention
    ) -> torch.Tensor:
        # Call TORCH_SDPA implementation...
        ...

    def forward_xpu(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        cu_seqlens: torch.Tensor | None = None,
        max_seqlen: torch.Tensor | None = None,  # Only used for Flash Attention
    ) -> torch.Tensor:
        # Call FA implementation...
        ...

    def forward_tpu(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        cu_seqlens: torch.Tensor | None = None,
        max_seqlen: torch.Tensor | None = None,  # Only used for Flash Attention
    ) -> torch.Tensor:
        # Call PALLAS implementation...
        ...
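Register a New CustomOp in OOT Device Plugins
Thanks to vLLM's hardware plugin mechanism, a variety of OOT device plugins have emerged that let vLLM run seamlessly on different hardware. For more details on this mechanism, see Introducing vLLM Hardware Plugin, Best Practice from Ascend NPU.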
- Official device plugins: vllm-ascend (Huawei Ascend NPU), vllm-spyre (Spyre), vllm-gaudi (Intel Gaudi), vllm-neuron (AWS Neuron), vllm-metal (Apple Silicon), etc.
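- Unofficial device plugins: vllm-metax (MetaX GPU), vllm-kunlun (Baidu Kunlun XPU), vllm-musa (Moore Threads GPU), etc.
In this setting, CustomOp lets hardware vendors seamlessly replace vLLM's ops at runtime with their own device-specific, deeply optimized kernels by registering an OOT CustomOp and implementing its forward_oot() method.
This section shows how to register an OOT CustomOp for a device plugin.
Taking MMEncoderAttention as an example:
- Implement a CustomMMEncoderAttention class that inherits from MMEncoderAttention, and implement its forward_oot() method.
- Register your CustomMMEncoderAttention with vLLM to replace MMEncoderAttention.
Code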
from vllm.model_executor.layers.attention import MMEncoderAttention
from vllm.model_executor.custom_op import CustomOp


@CustomOp.register_oot("MMEncoderAttention")
class CustomMMEncoderAttention(MMEncoderAttention):
    def __init__(...):
        super().__init__(...)

    def forward_oot(...):
        # Call optimized device-specific kernels.
        ...
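In this case, a new entry {"MMEncoderAttention": CustomMMEncoderAttention} is added to op_registry_oot. When an MMEncoderAttention op object is initialized, if its class name (i.e., MMEncoderAttention) is among the keys of op_registry_oot, vLLM replaces it with the registered class (i.e., CustomMMEncoderAttention) and instantiates that class instead.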
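One way to picture this replacement at instantiation time is a __new__ hook that looks up the class name in the OOT registry. The sketch below is a toy model: ToyCustomOp, ToyAttention and VendorAttention are hypothetical names, and whether vLLM implements the swap in exactly this way is an assumption.

```python
class ToyCustomOp:
    """Hypothetical base class; only the registry name mirrors the text."""

    op_registry_oot: dict[str, type] = {}

    def __new__(cls, *args, **kwargs):
        # Look up the class being instantiated by its class name and,
        # if an OOT replacement is registered, create that class instead.
        replacement = cls.op_registry_oot.get(cls.__name__, cls)
        return super().__new__(replacement)

    @classmethod
    def register_oot(cls, name: str):
        def decorator(op_cls: type) -> type:
            cls.op_registry_oot[name] = op_cls
            return op_cls

        return decorator


class ToyAttention(ToyCustomOp):
    def __init__(self, num_heads: int) -> None:
        self.num_heads = num_heads

    def forward_oot(self, x):
        raise NotImplementedError


@ToyCustomOp.register_oot("ToyAttention")
class VendorAttention(ToyAttention):
    def forward_oot(self, x):
        # A device plugin would call its optimized kernel here.
        return ("vendor-kernel", x)


# Model code keeps constructing the in-tree class...
layer = ToyAttention(num_heads=8)
# ...but receives an instance of the registered replacement.
```

Because VendorAttention inherits from ToyAttention, Python still runs __init__ with the original arguments, so model code that constructs ToyAttention does not need to change.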
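After that, when this MMEncoderAttention op is called, your forward_oot() is invoked if the op is enabled. You therefore get the expected performance on your hardware without modifying vLLM directly.
You can also register all of your CustomOps in one place for easier management.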