Bases: Modifier, QuantizationMixin
Enables post-training quantization (PTQ) and quantization-aware training (QAT) for a given module or its submodules. After calibration (PTQ) or the start epoch (QAT), the forward passes of the specified modules emulate quantized execution, and the modifier remains enabled until training completes.
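A minimal PTQ usage sketch, assuming a recent llmcompressor release that exports oneshot at the package top level; the model name and the FP8_DYNAMIC preset scheme are illustrative choices, not the only valid ones:

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Quantize every Linear layer except the output head to dynamic FP8
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# FP8_DYNAMIC computes activation scales at runtime, so no calibration dataset is needed here
oneshot(model="meta-llama/Meta-Llama-3-8B-Instruct", recipe=recipe)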
Parameters

- config_groups – dictionary specifying quantization schemes to apply to target modules. Modules that do not match a scheme's targets are not quantized.
- targets – list of layer names to quantize if a scheme is provided. Defaults to Linear layers.
- ignore – optional list of module class names or submodule names to not quantize even when they match a target in config_groups. Defaults to an empty list.
- scheme – a single quantization scheme to apply to the model. This is a dictionary that supports all keys of QuantizationScheme except targets, which is instead set by the modifier-level targets argument. Can also be set to a dictionary of the form preset_scheme_name: targets, for example W8A8: ['Linear'] for 8-bit quantization of weights and activations. See the configuration sketch after this list.
- kv_cache_scheme – optional QuantizationArgs specifying quantization of the kv cache. If None, the kv cache is not quantized. When kv cache quantization is applied to a transformers AutoModelForCausalLM, the kv_cache_scheme is converted into a QuantizationScheme that:
  - targets the k_proj and v_proj modules of the model; the outputs of those modules are the keys and values that may be cached
  - quantizes the outputs of those layers, so that keys and values are compressed before being stored in the cache

  There is an explicit assumption that the model contains modules with k_proj and v_proj in their names. If this is not the case and kv_cache_scheme != None, quantization of the kv cache will fail.
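A hedged sketch of configuring the modifier with an explicit config group and kv cache quantization; the group name and argument values follow compressed-tensors conventions but are illustrative assumptions, not the only valid settings:

from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme
from llmcompressor.modifiers.quantization import QuantizationModifier

# Illustrative W8A8 group: int8 channelwise weights, dynamic per-token activations
config_groups = {
    "group_0": QuantizationScheme(
        targets=["Linear"],
        weights=QuantizationArgs(num_bits=8, type="int", symmetric=True, strategy="channel"),
        input_activations=QuantizationArgs(num_bits=8, type="int", dynamic=True, strategy="token"),
    ),
}

modifier = QuantizationModifier(
    config_groups=config_groups,
    ignore=["lm_head"],
    # FP8 quantization of cached keys/values (assumes k_proj/v_proj modules exist)
    kv_cache_scheme=QuantizationArgs(num_bits=8, type="float"),
)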
Methods
on_end
on_end(state: State, event: Event, **kwargs)
Finish calibrating by removing observers and calibration hooks.
Source code in llmcompressor/modifiers/quantization/quantization/base.py
def on_end(self, state: State, event: Event, **kwargs):
    """
    Finish calibrating by removing observers and calibration hooks
    """
    self.ended_ = True
    QuantizationMixin.end_calibration(
        self, state.model
    )  # keep quantization enabled
on_initialize
on_initialize(state: State, **kwargs) -> bool
Prepare to calibrate activations and weights.
According to the quantization config, a quantization scheme is attached to each targeted module. The module's forward call is also overwritten to perform quantization of inputs, weights, and outputs.
Then, according to the module's quantization scheme, observers and calibration hooks are added. These hooks are disabled until the modifier starts.
Source code in llmcompressor/modifiers/quantization/quantization/base.py
def on_initialize(self, state: State, **kwargs) -> bool:
    """
    Prepare to calibrate activations and weights

    According to the quantization config, a quantization scheme is attached to each
    targeted module. The module's forward call is also overwritten to perform
    quantization to inputs, weights, and outputs.

    Then, according to the module's quantization scheme, observers and calibration
    hooks are added. These hooks are disabled until the modifier starts.
    """
    if not QuantizationMixin.has_config(self):
        raise ValueError(
            "QuantizationModifier requires that quantization fields be specified"
        )

    QuantizationMixin.initialize_quantization(self, state.model)

    return True
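As an illustration of what the overwritten forward call simulates, here is a minimal quantize-dequantize round trip; this is a conceptual sketch, not the library's implementation (the actual wrapping is handled by compressed-tensors):

import torch

def fake_quantize(x: torch.Tensor, scale: torch.Tensor, zero_point: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    # Round onto the integer grid, clamp to the representable range, then map back
    # to floats: the result carries the quantization error while staying in float math
    q_min, q_max = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    q = torch.clamp(torch.round(x / scale + zero_point), q_min, q_max)
    return (q - zero_point) * scale

Applying this round trip to inputs, weights, and outputs is what lets the module emulate quantized execution during calibration and QAT without actually running integer kernels.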
on_start
on_start(state: State, event: Event, **kwargs)
Begin calibrating activations and weights. Weights are calibrated only once, on start.
Source code in llmcompressor/modifiers/quantization/quantization/base.py
def on_start(self, state: State, event: Event, **kwargs):
    """
    Begin calibrating activations and weights. Calibrate weights only once on start
    """
    self.started_ = True
    QuantizationMixin.start_calibration(self, state.model)

    named_modules = list(
        match_named_modules(state.model, self.resolved_targets, self.ignore)
    )

    # TODO: this step can be combined with update_weight_zp_scale
    # once update_fused_layer_weight_global_scales is removed
    # and not required by vLLM
    for _, module in tqdm.tqdm(named_modules, desc="Updating global scales"):
        update_weight_global_scale(module)

    # NOTE: update_fused_layer_weight_global_scales operates on Attention
    # and MLP layers, not quantizable Linear layers. Rather than running
    # on targeted modules, we need to run on all modules.
    # Because this call is idempotent, setting all global_scales to the
    # min value, it is ok to run potentially multiple times for all modules
    for module in tqdm.tqdm(state.model.modules(), desc="Fusing global scales"):
        update_fused_layer_weight_global_scales(module)

    for _, module in tqdm.tqdm(named_modules, desc="Calibrating weights"):
        update_weight_zp_scale(module)
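To make the calibration step concrete, here is a minimal min-max observer in the spirit of what the weight and activation calibration relies on; the class name and methods are hypothetical, not llmcompressor's observer API:

import torch

class MinMaxObserver:
    """Hypothetical sketch: track a tensor's running min/max, then derive qparams."""

    def __init__(self, num_bits: int = 8):
        self.num_bits = num_bits
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, x: torch.Tensor) -> None:
        # Calibration hooks would call this on each observed tensor
        self.min_val = min(self.min_val, x.min().item())
        self.max_val = max(self.max_val, x.max().item())

    def calculate_qparams(self) -> tuple[float, int]:
        # Asymmetric quantization: map [min_val, max_val] onto [0, 2^bits - 1]
        q_max = 2 ** self.num_bits - 1
        scale = max(self.max_val - self.min_val, 1e-8) / q_max
        zero_point = int(round(-self.min_val / scale))
        return scale, zero_point

Weights are static, so their scales and zero points can be computed once on start, as above; activation observers instead run under the calibration hooks during forward passes, which is why those hooks stay disabled until the modifier starts.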