
llmcompressor.modifiers.quantization

Modules

GPTQModifier

Bases: Modifier, QuantizationMixin

Implements the GPTQ algorithm from https://arxiv.org/abs/2210.17323. This modifier uses activations to calibrate a Hessian matrix, which is then used to determine the optimal quantized values and ordering for the model weights.

Sample yaml

test_stage:
  obcq_modifiers:
    GPTQModifier:
      block_size: 128
      dampening_frac: 0.001
      offload_hessians: False
      actorder: static
      config_groups:
        group_0:
          targets:
            - "Linear"
          input_activations: null
          output_activations: null
          weights:
            num_bits: 8
            type: "int"
            symmetric: true
            strategy: group
            group_size: 128
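
In practice this modifier is usually applied through a one-shot calibration run rather than hand-written yaml; the sketch below assumes the llmcompressor oneshot entrypoint (exported from llmcompressor.transformers in older releases) and uses an illustrative model and dataset name.

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# One-shot GPTQ run: quantize all Linear layers except the LM head
recipe = GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"])

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative model id
    dataset="open_platypus",                     # illustrative calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)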

Lifecycle

  • on_initialize
    • apply config to model
  • on_start
    • add activation calibration hooks
    • add gptq weight calibration hooks
  • on_sequential_epoch_end
    • quantize_weight
  • on_finalize
    • remove_hooks()
    • model.apply(freeze_module_quantization)

Parameters

  • sequential_targets

    List of layer names to compress during GPTQ, or 'ALL' to compress every layer in the model

  • block_size

    Used to determine the number of columns to compress in one pass

  • dampening_frac

    Amount of dampening to apply to H, as a fraction of the diagonal norm

  • actorder

    Order in which weight columns are quantized. Defaults to "static" activation ordering, which achieves the best accuracy recovery with no runtime overhead. For more information, see https://github.com/vllm-project/vllm/pull/8135

  • offload_hessians

    Set to True to reduce memory usage at the cost of increased runtime

  • config_groups

    Dictionary specifying quantization schemes to apply to target modules. Modules not matching a scheme target will not be quantized.

  • targets

    List of layer names to quantize if a scheme is provided. Defaults to Linear layers.

  • ignore

    Optional list of module class names or submodule names to not quantize even if they match a target in config_groups. Defaults to an empty list.

  • scheme

    A single quantization scheme to apply to the model. This is a dictionary that supports all keys from QuantizationScheme except targets, which will be set to the targets parameter set at the modifier level. Can also be set to a dictionary of the form preset_scheme_name: targets, for example: W8A8: ['Linear'] for 8-bit quantization of weights and activations.

  • kv_cache_scheme

    Optional QuantizationArgs that specify the quantization of the kv cache. If None, the kv cache is not quantized. When kv cache quantization is applied to a transformers AutoModelForCausalLM, the kv_cache_scheme is converted into a QuantizationScheme that: - targets the q_proj and k_proj modules of the model, whose outputs are the keys and values that may be cached - quantizes the outputs of those layers so that keys and values are compressed before being stored in the cache. There is an explicit assumption that the model contains modules with k_proj and v_proj in their names; if this is not the case and kv_cache_scheme != None, quantization of the kv cache will fail. A constructor-level sketch of scheme and kv_cache_scheme follows this list.
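
For illustration, a minimal sketch of passing scheme and kv_cache_scheme to the modifier constructor; the specific values below are assumptions chosen for the example, not recommended defaults.

from llmcompressor.modifiers.quantization import GPTQModifier

# Preset scheme name plus modifier-level targets
w8a8 = GPTQModifier(targets=["Linear"], scheme="W8A8", ignore=["lm_head"])

# Equivalent "preset_scheme_name: targets" mapping
w8a8_mapped = GPTQModifier(scheme={"W8A8": ["Linear"]}, ignore=["lm_head"])

# kv cache quantization expressed with QuantizationArgs-style keys
w8a8_kv = GPTQModifier(
    targets=["Linear"],
    scheme="W8A8",
    kv_cache_scheme={
        "num_bits": 8,
        "type": "float",
        "strategy": "tensor",
        "dynamic": False,
        "symmetric": True,
    },
)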

Methods

  • calibrate_module

    Calibration hook used to accumulate the Hessian of the module input

  • compress_modules

    Quantize modules which have been calibrated

  • on_end

    Finish calibration by removing observers and calibration hooks

  • on_finalize

    Disable the quantization observers used by the OBCQ algorithm

  • on_initialize

    Initialize and run the GPTQ algorithm on the current state

calibrate_module

calibrate_module(
    module: Module,
    args: Tuple[Tensor, ...],
    _output: Tensor,
)

Calibration hook used to accumulate the Hessian of the module input

Parameters

  • module

    (Module) –

    The module being calibrated

  • args

    (Tuple[Tensor, ...]) –

    Inputs to the module, the first element of which is the canonical input

  • _output

    (Tensor) –

    The uncompressed module output, unused

Source code in llmcompressor/modifiers/quantization/gptq/base.py
def calibrate_module(
    self,
    module: torch.nn.Module,
    args: Tuple[torch.Tensor, ...],
    _output: torch.Tensor,
):
    """
    Calibration hook used to accumulate the hessian of the input to the module

    :param module: module being calibrated
    :param args: inputs to the module, the first element of which is the
        canonical input
    :param _output: uncompressed module output, unused
    """
    # Assume that first argument is the input
    inp = args[0]

    # Initialize hessian if not present
    if module not in self._num_samples:
        init_device = (
            "cpu" if self.offload_hessians else get_execution_device(module)
        )
        self._hessians[module] = make_empty_hessian(module, device=init_device)
        self._num_samples[module] = 0

    # Accumulate hessian with input with optional offloading
    with self._maybe_onload_hessian(module):
        self._hessians[module], self._num_samples[module] = accumulate_hessian(
            inp,
            module,
            self._hessians[module],
            self._num_samples[module],
        )
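
For intuition, this hook maintains a running estimate of the GPTQ Hessian of the layer inputs, H ≈ (2 / N) · Σ x·xᵀ over the N calibration tokens seen so far. The snippet below is a simplified, hypothetical rendering of that accumulation; accumulate_hessian_sketch is not the library function, and the real accumulate_hessian additionally handles shapes, dtypes, and device placement.

import math
from typing import Tuple

import torch

def accumulate_hessian_sketch(
    inp: torch.Tensor,  # inputs captured by the hook, shape (..., in_features)
    H: torch.Tensor,    # running Hessian estimate, shape (in_features, in_features)
    num_samples: int,
) -> Tuple[torch.Tensor, int]:
    # Flatten batch/sequence dims so each column is one token's input vector
    x = inp.reshape(-1, inp.shape[-1]).t().to(H.dtype)  # (in_features, tokens)
    num_added = x.shape[1]

    # Downweight the old estimate so every token contributes equally,
    # then add the new outer-product term; overall H ≈ (2 / N) · Σ x·xᵀ
    H = H * (num_samples / (num_samples + num_added))
    num_samples += num_added
    x = math.sqrt(2 / num_samples) * x
    H = H + x @ x.t()
    return H, num_samples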

compress_modules

compress_modules()

Quantize modules which have been calibrated

Source code in llmcompressor/modifiers/quantization/gptq/base.py
def compress_modules(self):
    """
    Quantize modules which have been calibrated
    """
    for module in list(self._num_samples.keys()):
        name = self._module_names[module]
        num_samples = self._num_samples[module]
        quant_args = getattr_chain(module, "quantization_scheme.weights")

        logger.info(f"Quantizing {name} using {num_samples} samples")
        with torch.no_grad(), align_module_device(
            module
        ), self._maybe_onload_hessian(module), CompressionLogger(
            module
        ) as comp_logger:
            loss, quantized_weight, scale, zero_point, g_idx = quantize_weight(
                module=module,
                quant_args=quant_args,
                hessians_dict=self._hessians,
                blocksize=self.block_size,
                percdamp=self.dampening_frac,
            )
            comp_logger.set_loss(loss)

        update_offload_parameter(module, "weight", quantized_weight)
        update_offload_parameter(module, "weight_scale", scale)
        update_offload_parameter(module, "weight_zero_point", zero_point)
        if g_idx is not None:
            update_offload_parameter(module, "weight_g_idx", g_idx)

        # self._hessians[module] already deleted by quantize_weight
        del self._num_samples[module]

on_end

on_end(state: State, event: Event, **kwargs)

Finish calibration by removing observers and calibration hooks

Source code in llmcompressor/modifiers/quantization/gptq/base.py
def on_end(self, state: State, event: Event, **kwargs):
    """
    Finish calibrating by removing observers and calibration hooks
    """
    self.ended_ = True
    QuantizationMixin.end_calibration(self, state.model)
    self.remove_hooks()  # remove gptq hooks

on_finalize

on_finalize(state: State, **kwargs) -> bool

Disable the quantization observers used by the OBCQ algorithm

Parameters

  • state

    (State) –

    Session state storing the input model and calibration data

Source code in llmcompressor/modifiers/quantization/gptq/base.py
def on_finalize(self, state: State, **kwargs) -> bool:
    """
    disable the quantization observers used by the OBCQ algorithm

    :param state: session state storing input model and calibration data
    """
    if not self.ended_:
        self.on_end(state, None)

    if len(self._num_samples) > 0:
        raise ValueError(f"Failed to compress {len(self._num_samples)} modules")

    self._hessians = dict()
    self._num_samples = dict()

    return True

on_initialize

on_initialize(state: State, **kwargs) -> bool

Initialize and run the GPTQ algorithm on the current state

Parameters

  • state

    (State) –

    Session state storing the input model and calibration data

Source code in llmcompressor/modifiers/quantization/gptq/base.py
def on_initialize(self, state: State, **kwargs) -> bool:
    """
    Initialize and run the GPTQ algorithm on the current state

    :param state: session state storing input model and calibration data
    """
    # apply config to model and prepare calibration hooks
    if QuantizationMixin.has_config(self):
        QuantizationMixin.initialize_quantization(self, state.model)

    # prepare module names
    self._module_names = {
        m: name
        for name, m in match_named_modules(
            state.model, self.resolved_targets, self.ignore
        )
    }

    return True

QuantizationMixin

Bases: HooksMixin

Mixin that enables a Modifier to act as a quantization config, attaching observers, calibration hooks, and compression wrappers to the modifier.

Lifecycle

  • on_initialize: QuantizationMixin.initialize_quantization
    • attach schemes to modules
    • attach observers to modules
    • disable quantization until calibration starts/ends
  • on_start: QuantizationMixin.start_calibration
    • attach calibration hooks
    • apply calibration status
    • enable quantization during calibration
  • on_end: QuantizationMixin.end_calibration
    • remove calibration hooks
    • apply frozen status
    • keep quantization enabled for future steps

Note: QuantizationMixin does not update scales and zero-points on its own, since not every Modifier that inherits from it needs to do so. The Modifier must explicitly call update_weight_zp_scale. See the QuantizationModifier.on_start method for an example.

Parameters

  • config_groups

    Dictionary specifying quantization schemes to apply to target modules. Modules not matching a scheme target will not be quantized.

  • targets

    List of layer names to quantize if a scheme is provided. If unset, it will contain all targets listed in config_groups. If config_groups is also unset, it defaults to ["Linear"] (i.e. all Linear layers are targeted). This field is not the source of truth for finding all matching target layers in the model, since additional information may be stored in config_groups. Use self.resolved_targets instead.

  • ignore

    Optional list of module class names or submodule names to not quantize even if they match a target in config_groups. Defaults to an empty list.

  • scheme

    A single quantization scheme to apply to the model. This is a dictionary that supports all keys from QuantizationScheme except targets, which will be set to the targets parameter set at the modifier level. Can also be set to a dictionary of the form preset_scheme_name: targets, for example: W8A8: ['Linear'] for 8-bit quantization of weights and activations.

  • kv_cache_scheme

    Optional QuantizationArgs that specify the quantization of the kv cache. If None, the kv cache is not quantized. When kv cache quantization is applied to a transformers AutoModelForCausalLM, the kv_cache_scheme is converted into a QuantizationScheme that: - targets the q_proj and k_proj modules of the model, whose outputs are the keys and values that may be cached - quantizes the outputs of those layers so that keys and values are compressed before being stored in the cache. There is an explicit assumption that the model contains modules with k_proj and v_proj in their names; if this is not the case and kv_cache_scheme != None, quantization of the kv cache will fail.

  • weight_observer

    Optional observer name for weight quantization. Overrides the default observer specified in the scheme. Valid values include "minmax", "mse", "static_minmax", "memoryless_minmax", "memoryless_mse".

  • input_observer

    Optional observer name for input activation quantization. Overrides the default observer specified in the scheme. Valid values include "minmax", "mse", "static_minmax", "memoryless_minmax", "memoryless_mse".

  • output_observer

    Optional observer name for output activation quantization. Overrides the default observer specified in the scheme. Valid values include "minmax", "mse", "static_minmax", "memoryless_minmax", "memoryless_mse".

  • observer

    Optional dictionary for specifying observers for multiple quantization types at once. Keys may be "weights", "input", or "output"; values are observer names. Example: {"weights": "MSE", "input": "MSE"}. If both the individual observer parameters (weight_observer, input_observer, output_observer) and the observer dictionary are provided, the observer dictionary takes precedence. A constructor-level sketch of these overrides follows this list.
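
For illustration, a minimal sketch of the observer overrides, using a modifier that inherits this mixin (GPTQModifier); the observer names and scheme are example choices, not defaults.

from llmcompressor.modifiers.quantization import GPTQModifier

# Override only the weight observer
mse_weights = GPTQModifier(targets=["Linear"], scheme="W8A8", weight_observer="mse")

# Dictionary form covering several quantization types at once;
# it takes precedence over the individual *_observer fields if both are given
mse_overrides = GPTQModifier(
    targets=["Linear"],
    scheme="W8A8",
    observer={"weights": "mse", "input": "minmax"},
)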

Methods

  • end_calibration

    Remove calibration hooks and observers, and set the model status to frozen

  • has_config

    Determine whether the user has specified a quantization config on this modifier

  • initialize_quantization

    Attach quantization schemes to modules in the model according to the quantization config specified on this modifier

  • resolve_quantization_config

    Return the quantization config specified by this modifier

  • start_calibration

    Attach observers, register activation calibration hooks (including kv_cache quantization), and enable quantization as we calibrate

  • validate_observer

    Validate the observer dictionary format. Accepted keys: 'weights', 'input', 'output'

Attributes

  • resolved_config (QuantizationConfig) –

    The quantization config, resolved once from the scheme and config_groups inputs

  • resolved_targets (Set[str]) –

    The set of all resolved targets, i.e. all unique targets listed in the resolved quantization config

resolved_config property

resolved_config: QuantizationConfig

The quantization config only needs to be resolved once from the scheme and config_groups inputs.

resolved_targets property

resolved_targets: Set[str]

The set of all resolved targets, i.e. all unique targets listed in the resolved quantization config. Use this property instead of the targets field, since targets can also come from config_groups depending on how the recipe is configured.

end_calibration

end_calibration(model: Module)

Remove calibration hooks and observers, and set the model status to frozen. Keep quantization enabled for future operations

Parameters

  • model

    (Module) –

    The model to end calibration for

Source code in llmcompressor/modifiers/quantization/quantization/mixin.py
def end_calibration(self, model: torch.nn.Module):
    """
    Remove calibration hooks and observers, and set the model status to frozen.
    Keep quantization enabled for future operations

    :param model: model to end calibration for
    """
    self.remove_hooks(self._calibration_hooks)
    for _, module in match_named_modules(model, self.resolved_targets, self.ignore):
        freeze_module_quantization(module)  # remove observers

    model.apply(enable_quantization)  # keep quantization enabled

has_config

has_config() -> bool

Determine whether the user has specified a quantization config on this modifier

Source code in llmcompressor/modifiers/quantization/quantization/mixin.py
def has_config(self) -> bool:
    """
    Determine if the user has specified a quantization config on this modifier
    """
    return not (
        self.config_groups is None
        and self.targets == ["Linear"]
        and self.ignore == []
        and self.scheme is None
        and self.kv_cache_scheme is None
    )

initialize_quantization

initialize_quantization(model: Module)

Attach quantization schemes to modules in the model according to the quantization config specified on this modifier

Parameters

  • model

    (Module) –

    The model to attach schemes and observers to

Source code in llmcompressor/modifiers/quantization/quantization/mixin.py
def initialize_quantization(self, model: torch.nn.Module):
    """
    Attach quantization schemes to modules in the model according to
    the quantization config specified on this modifier

    :param model: model to attach schemes and observers to
    """

    for _, module in match_named_modules(model, self.resolved_targets, self.ignore):
        reset_quantization_status(module)  # reset any previously applied qconfigs

    apply_quantization_config(model, self.resolved_config)

    # disable quantization until calibration
    model.apply(disable_quantization)

resolve_quantization_config

resolve_quantization_config() -> QuantizationConfig

Return the quantization config specified by this modifier

Source code in llmcompressor/modifiers/quantization/quantization/mixin.py
def resolve_quantization_config(self) -> QuantizationConfig:
    """
    Returns the quantization config specified by this modifier
    """
    scheme = self.scheme
    targets = self.targets
    config_groups = self.config_groups
    kv_cache_scheme = self.kv_cache_scheme
    ignore = self.ignore

    if scheme is not None and config_groups is not None:
        raise ValueError("Please specify either `scheme` or `config_groups`")

    if scheme is not None:
        # takes precedence over config_groups

        if isinstance(scheme, str) and is_preset_scheme(scheme):
            # attach targets to scheme
            scheme = {scheme: targets}

        config_groups = {}
        for idx, key in enumerate(scheme.keys()):
            if is_preset_scheme(key):
                scheme_obj = preset_name_to_scheme(key, scheme[key])
            else:
                scheme_obj = QuantizationScheme.model_validate(
                    {"targets": scheme[key], **scheme}
                )

            # Apply observer overrides if specified
            scheme_obj = self._apply_observer_overrides(scheme_obj)

            group_name = f"group_{idx}"
            config_groups[group_name] = scheme_obj

    if config_groups is None or len(config_groups) == 0:
        default_quant_scheme = QuantizationScheme(targets=targets)
        # Apply observer overrides to default scheme as well
        default_quant_scheme = self._apply_observer_overrides(default_quant_scheme)
        config_groups = {"group_0": default_quant_scheme}
    elif scheme is None:
        # Apply observer overrides to all config groups when config_groups
        # was provided directly (not derived from scheme)
        for scheme_obj in config_groups.values():
            self._apply_observer_overrides(scheme_obj)

    return QuantizationConfig(
        config_groups=config_groups,
        kv_cache_scheme=kv_cache_scheme,
        quantization_status=QuantizationStatus.INITIALIZED,
        ignore=ignore,
    )

start_calibration

start_calibration(model: Module)

Attach observers, register activation calibration hooks (including kv_cache quantization), and enable quantization as we calibrate

Parameters

  • model

    (Module) –

    The model to prepare for calibration

Source code in llmcompressor/modifiers/quantization/quantization/mixin.py
def start_calibration(self, model: torch.nn.Module):
    """
    Attach observers, register activation calibration hooks (including
    kv_cache quantization) and enable quantization as we calibrate

    :param model: model to prepare for calibration
    """
    targets = match_named_modules(model, self.resolved_targets, self.ignore)
    if targets_embeddings(model, targets):
        untie_word_embeddings(model)

    for _, module in match_named_modules(model, self.resolved_targets, self.ignore):
        self._initialize_observers(module)
        self._calibration_hooks |= self._initialize_hooks(module)
        apply_calibration_status(module)

    model.apply(enable_quantization)  # quantize at the same time as calibrate

validate_observer

validate_observer(value: Any) -> Optional[Dict[str, str]]

Validate the observer dictionary format. Accepted keys: 'weights', 'input', 'output'

Source code in llmcompressor/modifiers/quantization/quantization/mixin.py
@field_validator("observer", mode="before")
def validate_observer(cls, value: Any) -> Optional[Dict[str, str]]:
    """
    Validate observer dictionary format. Accepts keys: 'weights', 'input', 'output'
    """
    if value is None:
        return value

    if not isinstance(value, dict):
        raise ValueError("`observer` must be a dictionary")

    valid_keys = {"weights", "input", "output"}
    for key in value.keys():
        if key not in valid_keys:
            raise ValueError(
                f"Invalid observer key '{key}'. Valid keys are: {valid_keys}"
            )
        if not isinstance(value[key], str):
            raise ValueError(f"Observer value for '{key}' must be a string")

    return value

QuantizationModifier

Bases: Modifier, QuantizationMixin

Enables post-training quantization (PTQ) and quantization-aware training (QAT) for a given module or its submodules. After calibration (PTQ) or the start epoch (QAT), the specified module's forward pass emulates quantized execution, and the modifier remains enabled until training is completed.

Parameters

  • config_groups

    Dictionary specifying quantization schemes to apply to target modules. Modules not matching a scheme target will not be quantized.

  • targets

    List of layer names to quantize if a scheme is provided. Defaults to Linear layers.

  • ignore

    Optional list of module class names or submodule names to not quantize even if they match a target in config_groups. Defaults to an empty list.

  • scheme

    A single quantization scheme to apply to the model. This is a dictionary that supports all keys from QuantizationScheme except targets, which will be set to the targets parameter set at the modifier level. Can also be set to a dictionary of the form preset_scheme_name: targets, for example: W8A8: ['Linear'] for 8-bit quantization of weights and activations.

  • kv_cache_scheme

    Optional QuantizationArgs that specify the quantization of the kv cache. If None, the kv cache is not quantized. When kv cache quantization is applied to a transformers AutoModelForCausalLM, the kv_cache_scheme is converted into a QuantizationScheme that: - targets the q_proj and k_proj modules of the model, whose outputs are the keys and values that may be cached - quantizes the outputs of those layers so that keys and values are compressed before being stored in the cache. There is an explicit assumption that the model contains modules with k_proj and v_proj in their names; if this is not the case and kv_cache_scheme != None, quantization of the kv cache will fail.
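
A typical PTQ use of this modifier is a one-shot run with a preset dynamic-activation scheme; the sketch below assumes the llmcompressor oneshot entrypoint (exported from llmcompressor.transformers in older releases) and uses an illustrative model id and output path.

from transformers import AutoModelForCausalLM

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model id
    torch_dtype="auto",
)

# FP8 dynamic activation quantization requires no calibration dataset
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

model.save_pretrained("Meta-Llama-3-8B-Instruct-FP8-Dynamic")  # illustrative path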

Methods

  • on_end

    Finish calibration by removing observers and calibration hooks

  • on_initialize

    Prepare to calibrate activations and weights

  • on_start

    Begin calibrating activations and weights. Calibrate weights only once on start

on_end

on_end(state: State, event: Event, **kwargs)

Finish calibration by removing observers and calibration hooks

Source code in llmcompressor/modifiers/quantization/quantization/base.py
def on_end(self, state: State, event: Event, **kwargs):
    """
    Finish calibrating by removing observers and calibration hooks
    """
    self.ended_ = True
    QuantizationMixin.end_calibration(
        self, state.model
    )  # keep quantization enabled

on_initialize

on_initialize(state: State, **kwargs) -> bool

Prepare to calibrate activations and weights

According to the quantization config, a quantization scheme is attached to each targeted module. The module's forward call is also overwritten to perform quantization of inputs, weights, and outputs.

Then, according to the module's quantization scheme, observers and calibration hooks are added. These hooks are disabled until the modifier starts.

Source code in llmcompressor/modifiers/quantization/quantization/base.py
def on_initialize(self, state: State, **kwargs) -> bool:
    """
    Prepare to calibrate activations and weights

    According to the quantization config, a quantization scheme is attached to each
    targeted module. The module's forward call is also overwritten to perform
    quantization to inputs, weights, and outputs.

    Then, according to the module's quantization scheme, observers and calibration
    hooks are added. These hooks are disabled until the modifier starts.
    """
    if not QuantizationMixin.has_config(self):
        raise ValueError(
            "QuantizationModifier requires that quantization fields be specified"
        )
    QuantizationMixin.initialize_quantization(self, state.model)

    return True

on_start

on_start(state: State, event: Event, **kwargs)

Begin calibrating activations and weights. Calibrate weights only once on start

Source code in llmcompressor/modifiers/quantization/quantization/base.py
def on_start(self, state: State, event: Event, **kwargs):
    """
    Begin calibrating activations and weights. Calibrate weights only once on start
    """
    self.started_ = True
    QuantizationMixin.start_calibration(self, state.model)

    named_modules = list(
        match_named_modules(state.model, self.resolved_targets, self.ignore)
    )
    # TODO: this step can be combined with update_weight_zp_scale
    # once update_fused_layer_weight_global_scales is removed
    # and not required by vLLM
    for _, module in tqdm.tqdm(named_modules, desc="Updating global scales"):
        update_weight_global_scale(module)

    # NOTE: update_fused_layer_weight_global_scales operates on Attention
    # and MLP layers, not quantizable Linear layers. Rather than running
    # on targeted modules, we need to run on all modules.
    # Because this call is idempotent, setting all global_scales to the
    # min value, it is ok to run potentially multiple times for all modules
    for module in tqdm.tqdm(state.model.modules(), desc="Fusing global scales"):
        update_fused_layer_weight_global_scales(module)

    for _, module in tqdm.tqdm(named_modules, desc="Calibrating weights"):
        update_weight_zp_scale(module)