TeaCache Configuration Guide

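TeaCache accelerates diffusion model inference by caching Transformer computations and reusing them when consecutive timesteps are similar. It typically gives a 1.5x-2.0x speedup with almost no quality loss.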
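The caching idea can be illustrated with a small, self-contained sketch. This is not vLLM-Omni's implementation: the class `ThresholdCache`, the helper `relative_l1`, and the toy `block` function are hypothetical names, and the sketch leaves out details a real implementation would need (for example, any rescaling of the distance or forced computation on the first and last steps). It only shows the general pattern of accumulating a relative L1 change between consecutive inputs and reusing a cached residual while that accumulated change stays below `rel_l1_thresh`.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def relative_l1(current: Sequence[float], previous: Sequence[float]) -> float:
    """Summed absolute change divided by the summed magnitude of the previous input."""
    diff = sum(abs(c - p) for c, p in zip(current, previous))
    norm = sum(abs(p) for p in previous)
    return diff / norm if norm > 0 else float("inf")


class ThresholdCache:
    """Illustrative only: skip the expensive block while inputs change little."""

    def __init__(self, rel_l1_thresh: float = 0.2) -> None:
        self.rel_l1_thresh = rel_l1_thresh
        self.accumulated = 0.0
        self.previous_input: Optional[Vector] = None
        self.cached_residual: Optional[Vector] = None
        self.computed_steps = 0

    def step(self, x: Vector, block: Callable[[Vector], Vector]) -> Vector:
        reuse = False
        if self.previous_input is not None and self.cached_residual is not None:
            # Accumulate the change since the last step that was fully computed.
            self.accumulated += relative_l1(x, self.previous_input)
            reuse = self.accumulated < self.rel_l1_thresh
        self.previous_input = list(x)

        if reuse:
            # Cheap path: apply the cached residual (output - input) to the new input.
            return [xi + ri for xi, ri in zip(x, self.cached_residual)]

        # Expensive path: run the block, refresh the cache, reset the accumulator.
        out = block(x)
        self.cached_residual = [oi - xi for oi, xi in zip(out, x)]
        self.accumulated = 0.0
        self.computed_steps += 1
        return out


# Toy usage: the "Transformer" is a stand-in that doubles its input.
cache = ThresholdCache(rel_l1_thresh=0.2)
inputs = [[1.0, 1.0], [1.05, 1.0], [1.1, 1.0], [2.0, 2.0]]
outputs = [cache.step(x, lambda v: [2.0 * vi for vi in v]) for x in inputs]
print(cache.computed_steps, "of", len(inputs), "steps were fully computed")
```

In this toy run the two middle inputs differ only slightly from their predecessors, so they take the cheap path. A higher threshold would skip more steps at the cost of larger approximation error.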

Quick Start

Enable TeaCache by setting cache_backend to "tea_cache":

from vllm_omni import Omni

# Simple configuration - model_type is automatically extracted from pipeline.__class__.__name__
omni = Omni(
    model="Qwen/Qwen-Image",
    cache_backend="tea_cache",
    cache_config={
        "rel_l1_thresh": 0.2  # Optional, defaults to 0.2
    }
)
outputs = omni.generate(prompt="A cat sitting on a windowsill", num_inference_steps=50)

Using Environment Variables

You can also enable TeaCache through an environment variable:

export DIFFUSION_CACHE_BACKEND=tea_cache

Then initialize without setting cache_backend explicitly:

from vllm_omni import Omni

omni = Omni(
    model="Qwen/Qwen-Image",
    cache_config={"rel_l1_thresh": 0.2}  # Optional
)
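If exporting the variable from a shell is inconvenient (for example in a notebook), the same effect can probably be achieved from Python. The sketch below assumes that `DIFFUSION_CACHE_BACKEND` is read when `Omni` is constructed in the same process. The guide only shows the shell `export` form, so treat this as an untested assumption. The helper name `build_omni` is hypothetical.

```python
import os

# Assumption: the variable is read at Omni construction time, so it must be
# set before the Omni object is created.
os.environ["DIFFUSION_CACHE_BACKEND"] = "tea_cache"


def build_omni():
    # Imported lazily so that setting the variable above always happens first.
    from vllm_omni import Omni

    return Omni(
        model="Qwen/Qwen-Image",
        cache_config={"rel_l1_thresh": 0.2},  # Optional
    )
```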

Online Serving (OpenAI-Compatible)

To enable TeaCache for online serving, pass --cache-backend tea_cache when starting the server:

vllm serve Qwen/Qwen-Image --omni --port 8091 \
  --cache-backend tea_cache \
  --cache-config '{"rel_l1_thresh": 0.2}'
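When the server is launched from a script, the JSON passed to `--cache-config` is easy to misquote. The following sketch builds the same command line from a Python dict. The helper `build_serve_command` is a hypothetical name introduced for illustration. It uses only the flags shown in the command above.

```python
import json
import shlex
from typing import Any, Dict, List


def build_serve_command(model: str, port: int, cache_config: Dict[str, Any]) -> List[str]:
    """Return the argv list for the serve command shown above."""
    return [
        "vllm", "serve", model, "--omni",
        "--port", str(port),
        "--cache-backend", "tea_cache",
        # json.dumps produces valid JSON; quoting is handled when the list is joined.
        "--cache-config", json.dumps(cache_config),
    ]


argv = build_serve_command("Qwen/Qwen-Image", 8091, {"rel_l1_thresh": 0.2})
# shlex.join quotes each argument for a POSIX shell.
print(shlex.join(argv))
```

The list form can be passed directly to `subprocess.run(argv)`, which avoids shell quoting altogether.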

Configuration Parameters

`rel_l1_thresh` (float, default: `0.2`)

Controls the trade-off between speed and quality. Lower values favor quality; higher values favor speed.

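Recommended values:

0.2 - about 1.5x speedup, almost no quality loss (recommended)
0.4 - about 1.8x speedup, slight quality loss
0.6 - about 2.0x speedup, noticeable quality loss
0.8 - about 2.25x speedup, significant quality loss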
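The list above can be treated as a small lookup table. The sketch below encodes those four rows and picks the lowest threshold whose approximate speedup meets a target. The names `RECOMMENDED` and `lowest_threshold_for` are hypothetical, and the speedup figures are the approximate values quoted above, not measurements.

```python
from typing import List, Tuple

# (rel_l1_thresh, approximate speedup, quality note), copied from the list above.
RECOMMENDED: List[Tuple[float, float, str]] = [
    (0.2, 1.5, "almost no quality loss"),
    (0.4, 1.8, "slight quality loss"),
    (0.6, 2.0, "noticeable quality loss"),
    (0.8, 2.25, "significant quality loss"),
]


def lowest_threshold_for(target_speedup: float) -> Tuple[float, float, str]:
    """Return the most conservative row whose quoted speedup meets the target."""
    for row in RECOMMENDED:  # rows are ordered from lowest to highest threshold
        if row[1] >= target_speedup:
            return row
    raise ValueError(f"no listed threshold reaches {target_speedup}x")


thresh, speedup, note = lowest_threshold_for(1.7)
cache_config = {"rel_l1_thresh": thresh}
print(cache_config, f"~{speedup}x,", note)
```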

Examples

Python API¶

from vllm_omni import Omni

omni = Omni(
    model="Qwen/Qwen-Image",
    cache_backend="tea_cache",
    cache_config={"rel_l1_thresh": 0.2}
)
outputs = omni.generate(prompt="A cat sitting on a windowsill", num_inference_steps=50)

Performance Tuning

Start with the default rel_l1_thresh=0.2 and adjust it to your needs:

Highest quality: use 0.1-0.2
Balanced: use 0.2-0.4 (recommended)
Highest speed: use 0.6-0.8 (may reduce quality)
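One way to choose among these ranges is to time the same prompt at several thresholds and inspect the outputs side by side. The sketch below shows only the timing loop. `time_thresholds` and `fake_run` are hypothetical names, and `fake_run` is a stand-in workload so the example runs without a GPU. In a real sweep, the callable would build an `Omni` with the given `rel_l1_thresh` and call `generate`, with model loading kept outside the timed region.

```python
import time
from typing import Callable, Dict, Iterable


def time_thresholds(
    run_once: Callable[[float], object],
    thresholds: Iterable[float],
) -> Dict[float, float]:
    """Call run_once(thresh) for each threshold and record wall-clock seconds."""
    timings: Dict[float, float] = {}
    for thresh in thresholds:
        start = time.perf_counter()
        run_once(thresh)
        timings[thresh] = time.perf_counter() - start
    return timings


def fake_run(thresh: float) -> None:
    # Stand-in for a generation call; it ignores thresh and just sleeps briefly.
    time.sleep(0.001)


timings = time_thresholds(fake_run, [0.2, 0.4, 0.6])
for thresh, seconds in timings.items():
    print(f"rel_l1_thresh={thresh}: {seconds:.4f}s")
```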

Troubleshooting

Quality Degradation

If you notice quality problems, lower the threshold:

cache_config={"rel_l1_thresh": 0.1}  # More conservative caching

Supported Models

Architecture	Model	Example HF Model
`QwenImagePipeline`	Qwen-Image	`Qwen/Qwen-Image`
`QwenImageEditPipeline`	Qwen-Image-Edit	`Qwen/Qwen-Image-Edit`
`QwenImageEditPlusPipeline`	Qwen-Image-Edit	`Qwen/Qwen-Image-Edit-2509`

Coming Soon

Architecture	Model	Example HF Model
`FluxPipeline`	Flux	-
`CogVideoXPipeline`	CogVideoX	-