llmcompressor.modifiers.awq

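Modules

base –
mappings –

Classes

AWQMapping –

Dataclass storing the configuration of an activation mapping to be smoothed.
AWQModifier –

Implements the AWQ (Activation-aware Weight Quantization) algorithm, as described in https://arxiv.org/pdf/2306.00978.

Functions

get_layer_mappings_from_architecture –

Returns the layer mappings for the given model architecture.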

AWQMapping `dataclass`

AWQMapping(smooth_layer: str, balance_layers: list[str])

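Dataclass storing the configuration of an activation mapping to be smoothed. The output activations of `smooth_layer` are the input activations of `balance_layers`.

AWQMappings are resolved into ResolvedMappings, which retain pointers to the actual torch.nn.Modules and additional metadata at runtime.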
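To make the relationship between the two fields concrete, here is a minimal sketch of how a mapping with `re:` patterns might be matched against module names. `MappingSketch`, `matches`, and `resolve` are hypothetical stand-ins, not part of llmcompressor, and the matching rule (a `re:` prefix marks a regular expression, anything else is an exact name) is an assumption based on the example recipe.

```python
import re
from dataclasses import dataclass


@dataclass
class MappingSketch:
    """Hypothetical stand-in mirroring the two fields of AWQMapping."""
    smooth_layer: str
    balance_layers: list[str]


def matches(pattern: str, name: str) -> bool:
    # Assumption: "re:" marks a regular expression; otherwise compare names exactly.
    if pattern.startswith("re:"):
        return re.match(pattern[3:], name) is not None
    return pattern == name


def resolve(mapping: MappingSketch, module_names: list[str]):
    """Return the module names selected as smooth layers and as balance layers."""
    smooth = [n for n in module_names if matches(mapping.smooth_layer, n)]
    balance = [
        n for n in module_names
        if any(matches(p, n) for p in mapping.balance_layers)
    ]
    return smooth, balance


# Illustrative module names for a single decoder block.
names = [
    "model.layers.0.self_attn_layer_norm",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.fc1",
    "lm_head",
]
mapping = MappingSketch(
    smooth_layer="re:.*self_attn_layer_norm",
    balance_layers=["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"],
)
smooth, balance = resolve(mapping, names)
```

The real resolution step also has to pair each smooth layer with the balance layers of the same block; this sketch only shows the name matching.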

AWQModifier

Bases: Modifier, QuantizationMixin

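Implements the AWQ (Activation-aware Weight Quantization) algorithm, as described in https://arxiv.org/pdf/2306.00978. The algorithm significantly reduces quantization error by protecting only the 1% most salient weight channels.

Instead of relying on raw weight values, AWQ identifies important channels by analyzing activation patterns, focusing on the channels in the weight tensor that are most responsive to the input. To reduce quantization error, it scales these channels in a way that preserves the model's original behavior, using scaling factors computed offline from activation statistics.

Because this modifier manipulates the weights of the model, it can only be used in one-shot runs and not during training. Activation ranges are determined by running a small set of calibration data through the model.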
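The scaling step preserves the model's behavior because multiplying an input channel of the weight by a factor and dividing the matching activation channel by the same factor cancels out. The sketch below illustrates that identity in plain Python for one linear layer; the weights, inputs, and scale values are made up for illustration and are not taken from the library.

```python
def linear(weight, x):
    """Compute weight @ x for a weight given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


weight = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
x = [4.0, 0.5, -2.0]
scales = [2.0, 1.0, 0.5]  # hypothetical per-input-channel scaling factors

# Scale each input channel (column) of the weight up by s ...
scaled_weight = [[w * s for w, s in zip(row, scales)] for row in weight]
# ... and scale the matching activation channel down by s.
scaled_x = [xi / s for xi, s in zip(x, scales)]

original = linear(weight, x)
rescaled = linear(scaled_weight, scaled_x)
```

Before quantization, `original` and `rescaled` agree. The benefit appears only once the scaled weight is quantized, because the salient channels then occupy more of the quantization range.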

Example recipe

AWQModifier:
  mappings:
    - smooth_layer: "re:.*self_attn_layer_norm"
      balance_layers: ["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"]
    - smooth_layer: "re:.*final_layer_norm"
      balance_layers: ["re:.*fc1"]
  ignore: ["lm_head"]
  config_groups:
    group_0:
      targets:
        - "Linear"
      input_activations: null
      output_activations: null
      weights:
        num_bits: 4
        type: int
        symmetric: false
        strategy: group
        group_size: 128

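Lifecycle

on_initialize
- Resolve mappings
- Capture the kwargs needed for forward passes into modules
on_start
- Set up activation cache hooks to capture input activations to balance layers
on sequential epoch end
- Apply smoothing to each smoothing layer
  - Consume cached activations across all batches
    - Clear cached activations as they are used
  - Find the best smoothing scale for each smoothing layer
  - Apply to model weights
  - Raise an error if any unused activations remain
on_end
- Re-run the sequential-epoch-end logic (in case of the basic pipeline)
- Set scales and zero-points
- Remove activation hooks
on_finalize
- Clear resolved mappings and captured activations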
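The cache-then-consume pattern in the lifecycle above can be sketched with a small stand-in class. `ActivationCacheSketch` and its method names are hypothetical and are not the library's internals; the sketch only illustrates why leftover cached activations are treated as an error.

```python
class ActivationCacheSketch:
    """Hypothetical stand-in for caching activations and consuming them later."""

    def __init__(self):
        self.cache = {}

    def record(self, layer, batch):
        # Plays the role of the hook installed in on_start.
        self.cache.setdefault(layer, []).append(batch)

    def consume(self, layer):
        # Plays the role of the sequential-epoch-end step:
        # popping the entry clears it as it is used.
        return self.cache.pop(layer)

    def assert_all_consumed(self):
        # Leftover entries mean some mapping was never smoothed.
        if self.cache:
            raise RuntimeError(f"Unused cached activations: {sorted(self.cache)}")


cache = ActivationCacheSketch()
cache.record("layer_norm_0", [0.1, 0.2])
cache.record("layer_norm_1", [0.3, 0.4])

consumed = cache.consume("layer_norm_0")
try:
    cache.assert_all_consumed()
    leftover_detected = False
except RuntimeError:
    leftover_detected = True  # layer_norm_1 has not been consumed yet

cache.consume("layer_norm_1")
cache.assert_all_consumed()  # passes: nothing is left in the cache
```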

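Parameters

sequential_targets
–

List of module names to compress in the same calibration pass
mappings
–

List of activation layers to smooth, and the layers whose outputs are scaled so that the activations are smoothed. Each entry of the mapping list should itself be a list, in which the first entry is a list of layers that share the same input activation (the one to be smoothed) and the second entry is the layer whose output is scaled to achieve the smoothing. If a regex is used, it matches the layers with the largest overlap in module name.
ignore
–

List of layers to ignore, even if they match a regex in mappings. It should match the names of layers whose outputs are scaled to achieve smoothing (the second entry of the mappings list).
offload_device
–

Offload cached args to this device, which reduces memory requirements but requires more time to move data between the CPU and the execution device. Defaults to None, so cached args are not offloaded. Consider setting this to torch.device("cpu") if you encounter OOM errors
duo_scaling
–

Whether to use duo scaling, which uses both input activations and weights to determine the scaling factor. Defaults to True. If True, both activations and weights are used. If False, only activations are used. If "both", half of the grid search is performed with duo_scaling=False and the other half with duo_scaling=True.
n_grid
–

Number of grid points to use when performing the best-scales grid search for each mapping. Decrease this value to reduce runtime, at the possible cost of slightly worse scales. Defaults to 20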
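The interaction between `duo_scaling` and `n_grid` can be illustrated with a toy grid search in plain Python. This is a sketch under stated assumptions, not the library's implementation: the candidate-scale formula follows the general form described in the AWQ paper, the quantizer is a simplified symmetric per-row one rather than the grouped asymmetric scheme in the recipe, the even split for "both" is assumed from the parameter description, and all helper names are hypothetical.

```python
import random


def fake_quantize_row(row, num_bits=4):
    """Symmetric round-to-nearest quantization of one weight row (illustrative)."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(w) for w in row)
    if max_abs == 0:
        return list(row)
    step = max_abs / qmax
    return [round(w / step) * step for w in row]


def matvec(weight, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def candidate_scales(x_mean, w_mean, ratio, use_duo):
    """Per-input-channel scales for one grid point, normalized around 1."""
    if use_duo:
        # Duo scaling: combine activation and weight magnitudes.
        raw = [xm ** ratio / (wm ** (1 - ratio) + 1e-4)
               for xm, wm in zip(x_mean, w_mean)]
    else:
        # Activation-only scaling.
        raw = [xm ** ratio for xm in x_mean]
    raw = [max(s, 1e-4) for s in raw]
    norm = (max(raw) * min(raw)) ** 0.5
    return [s / norm for s in raw]


def output_error(weight, samples, scales):
    """Squared output error after scaling and quantizing the weight."""
    quantized = [fake_quantize_row([w * s for w, s in zip(row, scales)])
                 for row in weight]
    error = 0.0
    for x in samples:
        reference = matvec(weight, x)
        approx = matvec(quantized, [xi / s for xi, s in zip(x, scales)])
        error += sum((r - a) ** 2 for r, a in zip(reference, approx))
    return error


def search_scales(weight, samples, n_grid=20, duo_scaling=True):
    n_in = len(weight[0])
    x_mean = [sum(abs(x[j]) for x in samples) / len(samples) for j in range(n_in)]
    w_mean = [sum(abs(row[j]) for row in weight) / len(weight) for j in range(n_in)]
    if duo_scaling == "both":
        # Assumption: the grid budget is split evenly between the two modes.
        points, modes = n_grid // 2, [False, True]
    else:
        points, modes = n_grid, [duo_scaling]
    best_error, best_scales = float("inf"), None
    for use_duo in modes:
        for i in range(points):
            scales = candidate_scales(x_mean, w_mean, i / points, use_duo)
            error = output_error(weight, samples, scales)
            if error < best_error:
                best_error, best_scales = error, scales
    return best_error, best_scales


random.seed(0)
weight = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(4)]
# Input channel 0 carries much larger activations than the other channels.
samples = [[random.gauss(0, 10 if j == 0 else 1) for j in range(8)]
           for _ in range(32)]

baseline = output_error(weight, samples, [1.0] * 8)  # no scaling at all
best_error, best_scales = search_scales(weight, samples, duo_scaling=False)
both_error, _ = search_scales(weight, samples, duo_scaling="both")
```

With activation-only scaling, the first grid point (ratio 0) yields scales of exactly 1, so the search can never do worse than the unscaled baseline. A smaller `n_grid` evaluates fewer candidates, which is faster but may miss a better ratio.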

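Methods

on_end –

Finish calibrating by setting scales and zero-points, removing observers and calibration hooks
on_finalize –

Clean up by clearing the activations and mapping data
on_initialize –

Initialize AWQ on the given state
validate_duo_scaling –

Validate that duo_scaling is either True, False, or 'both' (lowercase)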

on_end

on_end(state: State, event: Event, **kwargs)

Finish calibrating by setting scales and zero-points, removing observers and calibration hooks

Source code in llmcompressor/modifiers/awq/base.py

def on_end(self, state: State, event: Event, **kwargs):
    """
    Finish calibrating by setting scales and zero-points,
     removing observers and calibration hooks
    """
    self._assert_all_activations_consumed()

    self.ended_ = True

    for _, module in tqdm(
        match_named_modules(state.model, self.resolved_targets, self.ignore),
        desc="Calibrating weights",
    ):
        update_weight_zp_scale(module)

    QuantizationMixin.end_calibration(self, state.model)

    # remove activation hooks
    self.remove_hooks()

on_finalize

on_finalize(state: State, **kwargs) -> bool

Clean up by clearing the activations and mapping data

Parameters

state
(State) –

unused

Returns

bool –

True

Source code in llmcompressor/modifiers/awq/base.py

def on_finalize(self, state: State, **kwargs) -> bool:
    """
    Clean up by clearing the activations and mapping data

    :param state: unused
    :return: True
    """
    if not self.ended_:
        self.on_end(state, None)

    self._parent_args_cache.clear()
    self._smooth_activation_means.clear()
    self._resolved_mappings.clear()

    return True

on_initialize

on_initialize(state: State, **kwargs) -> bool

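Initialize AWQ on the given state. Initialize quantization, resolve mappings, and cache module kwargs.

Parameters

state
(State) –

state to run AWQ on

Returns

bool –

True on a successful run, False otherwise

Source code in llmcompressor/modifiers/awq/base.py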

def on_initialize(self, state: State, **kwargs) -> bool:
    """
    Initialize AWQ on the given state
    Initialize quantization, resolve mappings, cache module kwargs

    :param state: state to run AWQ on
    :return: True on a successful run, False otherwise
    """

    # apply config to model and prepare calibration hooks
    if QuantizationMixin.has_config(self):
        QuantizationMixin.initialize_quantization(self, state.model)

    # Validate that duo_scaling is only used with per-channel quantization
    if self.duo_scaling is not False:
        for _, module in match_named_modules(
            state.model, self.resolved_targets, self.ignore
        ):
            if (
                hasattr(module, "quantization_scheme")
                and hasattr(module.quantization_scheme, "weights")
                and module.quantization_scheme.weights.strategy
                == QuantizationStrategy.TENSOR
            ):
                raise ValueError(
                    "duo_scaling is only supported with per-channel quantization "
                    "strategies (group or channel), but found TENSOR strategy. "
                    "Please set duo_scaling=False or use a per-channel "
                    "quantization strategy."
                )

    if self.mappings is None:
        logger.info("No AWQModifier.mappings provided, inferring from model...")
        self.mappings = get_layer_mappings_from_architecture(
            architecture=state.model.__class__.__name__
        )

    self._set_resolved_mappings(state.model)

    return True

validate_duo_scaling `classmethod`

validate_duo_scaling(v)

Validate that duo_scaling is either True, False, or 'both' (lowercase)

Source code in llmcompressor/modifiers/awq/base.py

@field_validator("duo_scaling")
@classmethod
def validate_duo_scaling(cls, v):
    """Validate that duo_scaling is either True, False, or 'both' (lowercase)"""
    if v not in (True, False, "both"):
        raise ValueError(f"duo_scaling must be True, False, or 'both', got {v!r}")
    return v

get_layer_mappings_from_architecture

get_layer_mappings_from_architecture(
    architecture: str,
) -> list[AWQMapping]

Parameters

architecture
(str) –

The architecture of the model

Returns

list[AWQMapping] –

The layer mappings for the given architecture

Source code in llmcompressor/modifiers/awq/mappings.py

def get_layer_mappings_from_architecture(architecture: str) -> list[AWQMapping]:
    """
    :param architecture: str: The architecture of the model
    :return: list: The layer mappings for the given architecture
    """

    if architecture not in AWQ_MAPPING_REGISTRY:
        logger.info(
            f"Architecture {architecture} not found in mappings. "
            f"Using default mappings: {_default_mappings}"
        )

    return AWQ_MAPPING_REGISTRY.get(architecture, _default_mappings)
