llmcompressor.modifiers.quantization.gptq.base

Classes

GPTQModifier –

Implements the GPTQ algorithm from https://arxiv.org/abs/2210.17323.

GPTQModifier

Bases: Modifier, QuantizationMixin

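Implements the GPTQ algorithm from https://arxiv.org/abs/2210.17323. This modifier uses activations to calibrate a Hessian matrix, which is then used to determine the optimal quantization values and ordering for the model weights.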

Sample yaml

test_stage:
  obcq_modifiers:
    GPTQModifier:
      block_size: 128
      dampening_frac: 0.001
      offload_hessians: False
      actorder: static
      config_groups:
        group_0:
          targets:
            - "Linear"
          input_activations: null
          output_activations: null
          weights:
            num_bits: 8
            type: "int"
            symmetric: true
            strategy: group
            group_size: 128
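
To make the `weights` block above more concrete, here is a minimal sketch of what 8-bit symmetric group quantization means numerically. It uses plain round-to-nearest rather than GPTQ, and the helper names `quantize_row_symmetric` and `dequantize_row` are hypothetical, not part of llmcompressor. The scale formula (`max_abs / q_max`) is an assumption chosen for illustration; the library's exact formula may differ.

```python
from typing import List, Tuple


def quantize_row_symmetric(
    row: List[float], num_bits: int = 8, group_size: int = 128
) -> Tuple[List[int], List[float]]:
    """Round-to-nearest symmetric quantization with one scale per group.

    Illustrative only: this is not the GPTQ update, just the number format
    that `num_bits`, `symmetric`, `strategy: group` and `group_size` describe.
    """
    q_max = 2 ** (num_bits - 1) - 1   # 127 for 8 bits
    q_min = -(2 ** (num_bits - 1))    # -128 for 8 bits
    q_values: List[int] = []
    scales: List[float] = []
    for start in range(0, len(row), group_size):
        group = row[start : start + group_size]
        max_abs = max(abs(w) for w in group)
        # Guard against an all-zero group to avoid dividing by zero.
        scale = max_abs / q_max if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = round(w / scale)
            q_values.append(max(q_min, min(q_max, q)))
    return q_values, scales


def dequantize_row(
    q_values: List[int], scales: List[float], group_size: int = 128
) -> List[float]:
    """Map integers back to floats using the scale of each value's group."""
    return [q * scales[i // group_size] for i, q in enumerate(q_values)]


# A row of 256 weights yields two groups, hence two scales.
row = [0.5, -1.27] + [0.0] * 126 + [2.54] + [0.1] * 127
q_values, scales = quantize_row_symmetric(row)
restored = dequantize_row(q_values, scales)
```

Each group gets its own scale, so an outlier in one group of 128 columns does not degrade the resolution of the others. Symmetric quantization needs no zero point, which is why the sketch stores only integers and scales.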

Lifecycle

on_initialize
- apply the config to the model
on_start
- add activation calibration hooks
- add gptq weight calibration hooks
on_sequential_epoch_end
- quantize_weight
on_finalize
- remove_hooks()
- model.apply(freeze_module_quantization)

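Parameters

sequential_targets
–

List of layer names to compress during GPTQ, or 'ALL' to compress every layer in the model.
block_size
–

Determines the number of columns to compress in one pass.
dampening_frac
–

Amount of dampening to apply to H, as a fraction of the diagonal norm.
actorder
–

Order in which weight columns are quantized. Defaults to "static" activation ordering, which achieves the best accuracy recovery with no runtime cost. For more information, see https://github.com/vllm-project/vllm/pull/8135
offload_hessians
–

Set to True for decreased memory usage at the cost of increased runtime.
config_groups
–

Dictionary specifying the quantization schemes to apply to target modules. Modules that do not match a scheme target will not be quantized.
targets
–

List of layer names to quantize if a scheme is provided. Defaults to Linear layers.
ignore
–

Optional list of module class names or submodule names to leave unquantized even if they match a target in config_groups. Defaults to an empty list.
scheme
–

A single quantization scheme to apply to the model. This is a dictionary that supports all keys of QuantizationScheme except targets, which is set to the targets parameter given at the modifier level. It can also be a dictionary of the form preset_scheme_name: targets, for example W8A8: ['Linear'] for 8-bit weight and activation quantization.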
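kv_cache_scheme
–

Optional QuantizationArgs that specify the quantization of the kv cache. If None, the kv cache is not quantized. When kv cache quantization is applied to a transformers AutoModelForCausalLM, kv_cache_scheme is converted into a QuantizationScheme that:
- targets the k_proj and v_proj modules of the model. The outputs of those modules are the keys and values that might be cached
- quantizes the outputs of those layers, so that keys and values are compressed before being stored in the cache
There is an explicit assumption that the model contains modules with k_proj and v_proj in their names. If this is not the case and kv_cache_scheme != None, quantization of the kv cache will fail.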
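
As an illustration of the dampening parameter described above, the following sketch adds a damping term to the diagonal of a Hessian. It assumes the convention of the reference GPTQ implementation, where the term is the dampening fraction multiplied by the mean of the Hessian diagonal; `dampen_hessian` is a hypothetical helper, not a function from llmcompressor, and the matrix is a plain nested list to keep the example dependency-free.

```python
from typing import List

Matrix = List[List[float]]


def dampen_hessian(hessian: Matrix, dampening_frac: float) -> Matrix:
    """Return a copy of `hessian` with damping added to its diagonal.

    Assumption: damping = dampening_frac * mean(diag(H)), as in the
    reference GPTQ implementation. Off-diagonal entries are unchanged.
    """
    n = len(hessian)
    mean_diag = sum(hessian[i][i] for i in range(n)) / n
    damp = dampening_frac * mean_diag
    return [
        [value + damp if i == j else value for j, value in enumerate(row)]
        for i, row in enumerate(hessian)
    ]


# A singular Hessian: the second input feature was never activated.
singular = [[2.0, 0.0], [0.0, 0.0]]
damped = dampen_hessian(singular, dampening_frac=0.5)
```

Dampening makes the matrix safely invertible when some input features are rarely or never active during calibration. The large fraction here is only for readability; the sample recipe uses 0.001.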

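Methods

calibrate_module –

Calibration hook used to accumulate the Hessian of the input to the module
compress_modules –

Quantize modules which have been calibrated
on_end –

Finish calibrating by removing observers and calibration hooks
on_finalize –

Disable the quantization observers used by the OBCQ algorithm
on_initialize –

Initialize and run the GPTQ algorithm on the current state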

calibrate_module

calibrate_module(
    module: Module,
    args: Tuple[Tensor, ...],
    _output: Tensor,
)

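Calibration hook used to accumulate the Hessian of the input to the module

Parameters

module
(Module) –

Module being calibrated
args
(Tuple[Tensor, ...]) –

Inputs to the module, the first element of which is the canonical input
_output
(Tensor) –

Uncompressed module output, unused

Source code in llmcompressor/modifiers/quantization/gptq/base.py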

def calibrate_module(
    self,
    module: torch.nn.Module,
    args: Tuple[torch.Tensor, ...],
    _output: torch.Tensor,
):
    """
    Calibration hook used to accumulate the hessian of the input to the module

    :param module: module being calibrated
    :param args: inputs to the module, the first element of which is the
        canonical input
    :param _output: uncompressed module output, unused
    """
    # Assume that first argument is the input
    inp = args[0]

    # Initialize hessian if not present
    if module not in self._num_samples:
        init_device = (
            "cpu" if self.offload_hessians else get_execution_device(module)
        )
        self._hessians[module] = make_empty_hessian(module, device=init_device)
        self._num_samples[module] = 0

    # Accumulate hessian with input with optional offloading
    with self._maybe_onload_hessian(module):
        self._hessians[module], self._num_samples[module] = accumulate_hessian(
            inp,
            module,
            self._hessians[module],
            self._num_samples[module],
        )
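
The hook above delegates the arithmetic to `accumulate_hessian`, whose body is not shown here. The sketch below illustrates one way such a running update can work, assuming the GPTQ convention H = (2/N) Σ x xᵀ over N input rows. `accumulate_hessian_sketch` is a hypothetical stand-in written with plain lists; it treats every input row as one sample, which may differ from how the library counts samples.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def accumulate_hessian_sketch(
    inputs: Matrix,      # shape (num_new_samples, in_features)
    hessian: Matrix,     # shape (in_features, in_features)
    num_samples: int,    # samples already folded into `hessian`
) -> Tuple[Matrix, int]:
    """Fold a new batch into a running average H = (2 / N) * sum(x x^T)."""
    total = num_samples + len(inputs)
    n = len(hessian)
    # Rescale the old average so it is weighted by its share of all samples.
    keep = num_samples / total
    updated = [[hessian[i][j] * keep for j in range(n)] for i in range(n)]
    # Add the outer product of each new input row, normalized by the new total.
    for x in inputs:
        for i in range(n):
            for j in range(n):
                updated[i][j] += (2.0 / total) * x[i] * x[j]
    return updated, total


# Two batches folded in one after the other.
hessian, count = [[0.0, 0.0], [0.0, 0.0]], 0
hessian, count = accumulate_hessian_sketch([[1.0, 2.0]], hessian, count)
hessian, count = accumulate_hessian_sketch([[3.0, 0.0]], hessian, count)
```

Because the update is a running average, the result does not depend on how calibration data is split into batches. Only the matrix and a sample counter need to be kept per module, which matches the `_hessians` and `_num_samples` dictionaries used by the hook.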

compress_modules

compress_modules()

Quantize modules which have been calibrated

Source code in llmcompressor/modifiers/quantization/gptq/base.py

def compress_modules(self):
    """
    Quantize modules which have been calibrated
    """
    for module in list(self._num_samples.keys()):
        name = self._module_names[module]
        num_samples = self._num_samples[module]
        quant_args = getattr_chain(module, "quantization_scheme.weights")

        logger.info(f"Quantizing {name} using {num_samples} samples")
        with torch.no_grad(), align_module_device(
            module
        ), self._maybe_onload_hessian(module), CompressionLogger(
            module
        ) as comp_logger:
            loss, quantized_weight, scale, zero_point, g_idx = quantize_weight(
                module=module,
                quant_args=quant_args,
                hessians_dict=self._hessians,
                blocksize=self.block_size,
                percdamp=self.dampening_frac,
            )
            comp_logger.set_loss(loss)

        update_offload_parameter(module, "weight", quantized_weight)
        update_offload_parameter(module, "weight_scale", scale)
        update_offload_parameter(module, "weight_zero_point", zero_point)
        if g_idx is not None:
            update_offload_parameter(module, "weight_g_idx", g_idx)

        # self._hessians[module] already deleted by quantize_weight
        del self._num_samples[module]
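
`compress_modules` passes `block_size` to `quantize_weight` as `blocksize`, and the weight's quantization args may carry an activation ordering. The internals of `quantize_weight` are not shown here, so the following is only a sketch of the two ideas under stated assumptions: columns are visited in contiguous blocks of `block_size`, and activation ordering sorts columns by decreasing Hessian diagonal, as in the reference GPTQ implementation. Both helper names are hypothetical.

```python
from typing import Iterator, List, Tuple


def activation_order(hessian_diag: List[float]) -> List[int]:
    """Column indices sorted by decreasing Hessian diagonal.

    Assumption: columns with larger diagonal entries (more strongly
    activated inputs) are quantized first. Ties keep their original order.
    """
    return sorted(
        range(len(hessian_diag)), key=lambda i: hessian_diag[i], reverse=True
    )


def column_blocks(num_columns: int, block_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) column ranges of at most `block_size`."""
    for start in range(0, num_columns, block_size):
        yield start, min(start + block_size, num_columns)


perm = activation_order([0.5, 3.0, 1.0, 2.0])
blocks = list(column_blocks(num_columns=300, block_size=128))
```

With 300 columns and a block size of 128, the last block is shorter than the others. When an ordering is applied, a group index (returned as `g_idx` by `quantize_weight` above) can record which quantization group each original column belongs to.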

on_end

on_end(state: State, event: Event, **kwargs)

Finish calibrating by removing observers and calibration hooks

Source code in llmcompressor/modifiers/quantization/gptq/base.py

def on_end(self, state: State, event: Event, **kwargs):
    """
    Finish calibrating by removing observers and calibration hooks
    """
    self.ended_ = True
    QuantizationMixin.end_calibration(self, state.model)
    self.remove_hooks()  # remove gptq hooks

on_finalize

on_finalize(state: State, **kwargs) -> bool

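Disable the quantization observers used by the OBCQ algorithm

Parameters

state
(State) –

Session state storing the input model and calibration data

Source code in llmcompressor/modifiers/quantization/gptq/base.py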

def on_finalize(self, state: State, **kwargs) -> bool:
    """
    disable the quantization observers used by the OBCQ algorithm

    :param state: session state storing input model and calibration data
    """
    if not self.ended_:
        self.on_end(state, None)

    if len(self._num_samples) > 0:
        raise ValueError(f"Failed to compress {len(self._num_samples)} modules")

    self._hessians = dict()
    self._num_samples = dict()

    return True

on_initialize

on_initialize(state: State, **kwargs) -> bool

Initialize and run the GPTQ algorithm on the current state

Parameters

state
(State) –

Session state storing the input model and calibration data

Source code in llmcompressor/modifiers/quantization/gptq/base.py

def on_initialize(self, state: State, **kwargs) -> bool:
    """
    Initialize and run the GPTQ algorithm on the current state

    :param state: session state storing input model and calibration data
    """
    # apply config to model and prepare calibration hooks
    if QuantizationMixin.has_config(self):
        QuantizationMixin.initialize_quantization(self, state.model)

    # prepare module names
    self._module_names = {
        m: name
        for name, m in match_named_modules(
            state.model, self.resolved_targets, self.ignore
        )
    }

    return True
