Fused MoE Kernel Features

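This document gives an overview of the various MoE kernels (both modular and non-modular) so that it is easier to select an appropriate set of kernels for a particular situation. This includes information about the all2all backends used by modular kernels.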

Fused MoE Modular All2All backends

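There are a number of all2all communication backends that are used to implement expert parallelism (EP) for the `FusedMoE` layer. The different `FusedMoEPrepareAndFinalize` subclasses provide an interface for each all2all backend.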

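The following table describes the relevant features of each backend, i.e. activation format, supported quantization schemes, and async support.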

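The output activation format (standard or batched) corresponds to the output of the prepare step of the `FusedMoEPrepareAndFinalize` subclass, and the finalize step requires the same format. All backend prepare methods expect activations in the standard format, and all finalize methods return activations in the standard format. More details on the formats can be found in the Fused MoE Modular Kernel document.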
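To make the two layouts concrete, the sketch below regroups a standard-format activation (one row per token) into a batched layout (a fixed-size, zero-padded block of rows per expert, plus a per-expert count of valid rows) and back again. It assumes that this is what the batched format looks like, as described in the Fused MoE Modular Kernel document, and it assumes `topk == 1`. `to_batched` and `to_standard` are hypothetical helpers written for this illustration, not vLLM functions; the real backends perform this regrouping across ranks during dispatch and combine.

```python
from typing import List, Tuple

Row = List[float]


def to_batched(
    tokens: List[Row], expert_ids: List[int], num_experts: int, max_tokens: int
) -> Tuple[List[List[Row]], List[int], List[Tuple[int, int]]]:
    """Regroup standard-format rows into a padded per-expert layout (topk == 1)."""
    hidden = len(tokens[0])
    # One zero-padded [max_tokens][hidden] block per expert.
    batched = [[[0.0] * hidden for _ in range(max_tokens)] for _ in range(num_experts)]
    counts = [0] * num_experts  # number of valid rows per expert
    slots = []                  # (expert, slot) chosen for each original token
    for row, expert in zip(tokens, expert_ids):
        slot = counts[expert]
        if slot >= max_tokens:
            raise ValueError(f"expert {expert} received more than {max_tokens} tokens")
        batched[expert][slot] = list(row)
        slots.append((expert, slot))
        counts[expert] += 1
    return batched, counts, slots


def to_standard(batched: List[List[Row]], slots: List[Tuple[int, int]]) -> List[Row]:
    """Gather rows back into the original token order."""
    return [list(batched[expert][slot]) for expert, slot in slots]


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
batched, counts, slots = to_batched(tokens, expert_ids=[1, 0, 1], num_experts=2, max_tokens=2)
restored = to_standard(batched, slots)
```

In this toy run expert 0 holds one valid row and one padding row, expert 1 holds two valid rows, and `restored` equals `tokens`, which mirrors the rule that prepare takes and finalize returns the standard format.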

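The quantization types and formats list the quantization schemes supported by each `FusedMoEPrepareAndFinalize` class. Quantization can happen before or after dispatch, depending on the formats the all2all backend supports. For example, `deepep_high_throughput` supports only the block-quantized fp8 format; any other format results in dispatching at higher precision and quantizing afterwards. The output of the prepare step for each backend is the quantized type. The finalize step generally requires the same input type as the original activations. For example, if the original input is bfloat16 and the quantization scheme is fp8 with per-tensor scales, prepare returns fp8 activations with per-tensor scales, and finalize takes bfloat16 activations. See the diagrams in Fused MoE Modular Kernel for more details on the types and formats of activations at each step of the MoE process. If no quantization type is specified, the kernel operates on float16 and/or bfloat16.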

Async backends support DBO (Dual Batch Overlap) and shared expert overlap (where shared experts are computed during the combine step).

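Certain models, e.g. Llama, require the topk weights to be applied to the input activations rather than the output activations when `topk==1`. For modular kernels, this feature is supported by the `FusedMoEPrepareAndFinalize` subclass. For non-modular kernels, it is up to the experts function to handle this flag.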
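The placement of the topk weight matters because an expert is nonlinear. The sketch below uses a toy one-feature expert (an up projection, `silu`, then a down projection) to show that scaling the input and scaling the output give different results. The expert, its weights, and the function names are assumptions made for illustration only and do not correspond to any vLLM kernel.

```python
import math


def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))


def expert(x: float, w_up: float = 2.0, w_down: float = 0.5) -> float:
    """Toy single-feature expert: down(silu(up(x)))."""
    return w_down * silu(w_up * x)


def weight_on_output(x: float, topk_weight: float) -> float:
    # Default behavior: run the expert, then scale its output.
    return topk_weight * expert(x)


def weight_on_input(x: float, topk_weight: float) -> float:
    # "Apply weight on input": scale the activation before the expert.
    return expert(topk_weight * x)


on_output = weight_on_output(1.0, 0.5)
on_input = weight_on_input(1.0, 0.5)
```

Here `on_output` and `on_input` differ because `silu` does not commute with scaling; the two only coincide when the weight is 1.0.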

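Unless otherwise specified, backends are controlled via the `--all2all-backend` command-line argument (or the `all2all_backend` parameter in `ParallelConfig`). All backends except flashinfer work only with EP+DP or EP+TP. Flashinfer can work with EP or with DP (without EP).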

| Backend | Output act. format | Quant. types | Quant. format | Async | Apply Weight On Input | Subclass |
|---|---|---|---|---|---|---|
| naive | standard | all¹ | G,A,T | N | ⁶ | layer.py |
| pplx | batched | fp8,int8 | G,A,T | Y | Y | `PplxPrepareAndFinalize` |
| deepep_high_throughput | standard | fp8 | G(128),A,T² | Y | Y | `DeepEPHTPrepareAndFinalize` |
| deepep_low_latency | batched | fp8 | G(128),A,T³ | Y | Y | `DeepEPLLPrepareAndFinalize` |
| flashinfer_all2allv | standard | nvfp4,fp8 | G,A,T | N | N | `FlashInferAllToAllMoEPrepareAndFinalize` |
| flashinfer⁴ | standard | nvfp4,fp8 | G,A,T | N | N | `FlashInferCutlassMoEPrepareAndFinalize` |
| MoEPrepareAndFinalizeNoEP⁵ | standard | fp8,int8 | G,A,T | N | Y | `MoEPrepareAndFinalizeNoEP` |
| BatchedPrepareAndFinalize⁵ | batched | fp8,int8 | G,A,T | N | Y | `BatchedPrepareAndFinalize` |

Table key

1. All types: mxfp4, nvfp4, int4, int8, fp8.
2. A,T quantization occurs after dispatch.
3. All quantization happens after dispatch.
4. Controlled by a different environment variable (`VLLM_FLASHINFER_MOE_BACKEND`, set to "throughput" or "latency").
5. This is a no-op dispatcher that can be paired with any modular experts to produce a modular kernel that runs without dispatch or combine. These cannot be selected via environment variable. They are generally used for testing or for adapting an experts subclass to the `fused_experts` API.
6. This depends on the experts implementation.

G - Grouped
G(N) - Grouped w/block size N
A - Per activation token
T - Per tensor
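The three quantization formats differ in how many scale factors accompany an activation tensor. The sketch below computes absmax-based fp8 scales at each granularity for a small `[tokens][hidden]` activation. It assumes that grouped scales cover contiguous blocks along the hidden dimension of each token and that scales are derived from the absolute maximum divided by 448 (the largest finite fp8 E4M3 value); `activation_scales` is a hypothetical helper for illustration, not a vLLM function, and the actual kernels may compute scales differently.

```python
from typing import List

FP8_E4M3_MAX = 448.0  # largest finite value of the fp8 E4M3 format


def absmax_scale(values: List[float]) -> float:
    """Scale that maps the largest magnitude in `values` onto the fp8 range."""
    return max(abs(v) for v in values) / FP8_E4M3_MAX


def activation_scales(x: List[List[float]], fmt: str, block: int = 128) -> List[float]:
    """Return the scales for a [tokens][hidden] activation under format T, A, or G."""
    if fmt == "T":  # per tensor: a single scale
        return [absmax_scale([v for row in x for v in row])]
    if fmt == "A":  # per activation token: one scale per row
        return [absmax_scale(row) for row in x]
    if fmt == "G":  # grouped: one scale per block of `block` values in each row
        return [
            absmax_scale(row[i:i + block])
            for row in x
            for i in range(0, len(row), block)
        ]
    raise ValueError(f"unknown quantization format: {fmt}")


x = [[1.0, -2.0, 3.0, 4.0], [0.5, 0.25, -8.0, 1.0]]
per_tensor = activation_scales(x, "T")
per_token = activation_scales(x, "A")
grouped = activation_scales(x, "G", block=2)
```

For this 2x4 input the per-tensor format yields one scale, the per-token format yields two, and grouping with a block size of 2 yields four, so finer granularity costs more scale storage in exchange for a tighter fit to local value ranges.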

Modular kernels are supported by the following FusedMoEMethodBase classes.

Fused Experts Kernels

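There are a number of MoE experts kernel implementations for different quantization types and architectures. Most follow the general API of the base Triton `fused_experts` function. Many have modular kernel adapters so that they can be used with compatible all2all backends. The following table lists each experts kernel and its particular properties.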

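Each kernel must be provided with one of its supported input activation formats. Some flavors of kernels support both the standard and batched formats through different entry points, e.g. `TritonExperts` and `BatchedTritonExperts`. Batched format kernels are currently only needed to match certain all2all backends, e.g. `pplx` and `DeepEPLLPrepareAndFinalize`.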

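Similar to the backend kernels, each experts kernel supports only certain quantization formats. For non-modular experts, the activations are in the original type and are quantized internally by the kernel. Modular experts expect the activations to already be in the quantized format. Both types of experts produce outputs in the original activation type.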

Each experts kernel supports one or more activation functions, e.g. silu or gelu, which are applied to the intermediate results.

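As with the backends, some experts support applying topk weights on the input activations. The entries in that column of this table apply only to the non-modular experts.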

Most experts flavors include an equivalent modular interface, which is a subclass of `FusedMoEPermuteExpertsUnpermute`.

To be used with a particular `FusedMoEPrepareAndFinalize` subclass, MoE kernels must have compatible activation formats, quantization types, and quantization formats.

| Kernel | Input act. format | Quant. types | Quant. format | Activation function | Apply Weight On Input | Modular | Source |
|---|---|---|---|---|---|---|---|
| triton | standard | all¹ | G,A,T | silu, gelu, swigluoai, silu_no_mul, gelu_no_mul | Y | Y | `fused_experts`, `TritonExperts` |
| triton (batched) | batched | all¹ | G,A,T | silu, gelu | ⁶ | Y | `BatchedTritonExperts` |
| deep gemm | standard, batched | fp8 | G(128),A,T | silu, gelu | ⁶ | Y | `deep_gemm_moe_fp8`, `DeepGemmExperts`, `BatchedDeepGemmExperts` |
| cutlass_fp4 | standard, batched | nvfp4 | A,T | silu | Y | Y | `cutlass_moe_fp4`, `CutlassExpertsFp4` |
| cutlass_fp8 | standard, batched | fp8 | A,T | silu, gelu | Y | Y | `cutlass_moe_fp8`, `CutlassExpertsFp8`, `CutlassBatchedExpertsFp8` |
| flashinfer | standard | nvfp4, fp8 | T | ⁵ | N | Y | `flashinfer_cutlass_moe_fp4`, `FlashInferExperts` |
| gpt oss triton | standard | N/A | N/A | ⁵ | Y | Y | `triton_kernel_fused_experts`, `OAITritonExperts` |
| marlin | standard, batched | ³ / N/A | ³ / N/A | silu, swigluoai | Y | Y | `fused_marlin_moe`, `MarlinExperts`, `BatchedMarlinExperts` |
| trtllm | standard | mxfp4, nvfp4 | G(16),G(32) | ⁵ | N | Y | `TrtLlmGenExperts` |
| pallas | standard | N/A | N/A | silu | N | N | `fused_moe` |
| iterative | standard | N/A | N/A | silu | N | N | `fused_moe` |
| rocm aiter moe | standard | fp8 | G(128),A,T | silu, gelu | Y | N | `rocm_aiter_fused_experts` |
| cpu_fused_moe | standard | N/A | N/A | silu | N | N | `CPUFusedMOE` |
| naive batched⁴ | batched | int8, fp8 | G,A,T | silu, gelu | ⁶ | Y | `NaiveBatchedExperts` |

Table key

1. All types: mxfp4, nvfp4, int4, int8, fp8.
2. A dispatcher wrapper around triton and deep gemm experts. Will select based on type + shape + quantization params.
3. uint4, uint8, fp8, fp4.
4. This is a naive implementation of experts that supports batched format. Mainly used for testing.
5. The activation parameter is ignored and SwiGlu is used by default instead.
6. Only handled by or supported when used with modular kernels.
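The compatibility rule between a backend and an experts kernel can be sketched as a lookup over a few rows of the two tables. The example below checks only the activation format and the quantization type; a fuller check would also compare the quantization format (G, A, T). The `Caps` dataclass, the dictionaries, and `compatible` are hypothetical names introduced for this illustration and are not part of vLLM, and the standard/batched assignment for each experts class is assumed from its name.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

ALL_TYPES = frozenset({"mxfp4", "nvfp4", "int4", "int8", "fp8"})


@dataclass(frozen=True)
class Caps:
    act_format: str              # "standard" or "batched"
    quant_types: FrozenSet[str]  # supported quantization types


# A few rows copied from the backend table (output activation format).
BACKENDS = {
    "pplx": Caps("batched", frozenset({"fp8", "int8"})),
    "deepep_high_throughput": Caps("standard", frozenset({"fp8"})),
    "deepep_low_latency": Caps("batched", frozenset({"fp8"})),
}

# A few rows copied from the experts table (input activation format).
EXPERTS = {
    "TritonExperts": Caps("standard", ALL_TYPES),
    "BatchedTritonExperts": Caps("batched", ALL_TYPES),
    "DeepGemmExperts": Caps("standard", frozenset({"fp8"})),
    "BatchedDeepGemmExperts": Caps("batched", frozenset({"fp8"})),
}


def compatible(backend: str, experts: str, quant_type: Optional[str] = None) -> bool:
    """Return True if the backend's output can feed the experts kernel."""
    b, e = BACKENDS[backend], EXPERTS[experts]
    if b.act_format != e.act_format:
        return False
    if quant_type is None:  # unquantized: float16/bfloat16 activations
        return True
    return quant_type in (b.quant_types & e.quant_types)
```

For example, `compatible("pplx", "BatchedTritonExperts", "int8")` holds because both sides use the batched format and list int8, while `compatible("pplx", "TritonExperts", "fp8")` fails on the activation format alone.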

Modular Kernel "families"

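The following table shows "families" of modular kernels that are intended to work together. Some combinations may work but have not yet been tested, e.g. flashinfer with other fp8 experts. Note that the "naive" backend works with any non-modular experts.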

| Backend | `FusedMoEPrepareAndFinalize` subclasses | `FusedMoEPermuteExpertsUnpermute` subclasses |
|---|---|---|
| deepep_high_throughput | `DeepEPHTPrepareAndFinalize` | `DeepGemmExperts`, `TritonExperts`, `TritonOrDeepGemmExperts`, `CutlassExpertsFp8`, `MarlinExperts` |
| deepep_low_latency, pplx | `DeepEPLLPrepareAndFinalize`, `PplxPrepareAndFinalize` | `BatchedDeepGemmExperts`, `BatchedTritonExperts`, `CutlassBatchedExpertsFp8`, `BatchedMarlinExperts` |
| flashinfer | `FlashInferCutlassMoEPrepareAndFinalize` | `FlashInferExperts` |