llmcompressor.modifiers.autoround

Module

AutoRoundModifier

Bases: Modifier, QuantizationMixin

Implements the AutoRound algorithm from https://aclanthology.org/2024.findings-emnlp.662.pdf. This modifier leverages the signed gradient descent (SignSGD) optimizer and a block-wise loss to optimize rounding values and weight clipping in a few steps.

Example yaml

test_stage:
  modifiers:
    AutoRoundModifier:
      iters: 200
      config_groups:
        group_0:
          targets:
            - "Linear"
          input_activations: null
          output_activations: null
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: group
            group_size: 128
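
A recipe like the one above can be passed to llmcompressor's oneshot entrypoint. The sketch below is a hedged usage example, not taken from this page; the model name, dataset, and sample counts are illustrative placeholders.

from llmcompressor import oneshot

oneshot(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder HF causal LM
    dataset="open_platypus",             # placeholder calibration dataset
    recipe="recipe.yaml",                # the test_stage recipe shown above
    max_seq_length=2048,
    num_calibration_samples=128,
)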

Lifecycle

  • on_initialize
    • apply config to model
  • on_start
    • add input capture hooks to the decoding layers
  • on_sequential_epoch_end
    • apply_autoround
    • post_autoround_cleanup
  • on_finalize
    • remove_hooks()
    • model.apply(freeze_module_quantization)
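
Taken together, the hooks above fire in the order outlined below. This is an illustrative sketch only; the driver function and its arguments are stand-ins, not llmcompressor internals.

def run_lifecycle(modifier, state, sequential_subgraphs):
    # on_initialize: apply the quantization config to the model
    modifier.on_initialize(state)
    # on_start: input capture hooks are added to the decoding layers
    for subgraph in sequential_subgraphs:
        # on_sequential_epoch_end: tune one decoding layer, then clean up
        modifier.apply_autoround(state, subgraph)
    # on_finalize: remove_hooks() and freeze_module_quantization
    modifier.on_finalize(state)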

Parameters

  • config_groups

    Dictionary specifying quantization schemes to apply to target modules. Modules not matching a scheme target will not be quantized.

  • targets

    List of layer names to quantize if a scheme is provided. Defaults to Linear layers.

  • ignore

    Optional list of module class names or submodule names not to quantize even if they match a target in config_groups. Defaults to an empty list.

  • scheme

    A single quantization scheme to apply to the model. This is a dictionary that supports all keys of QuantizationScheme except targets, which will be set to the targets parameter set at the modifier level. A construction sketch follows this list.
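
The snippet below sketches constructing the modifier in Python with the scheme shorthand. The import path and the "W4A16" preset name are assumptions based on this page's module name and common compressed-tensors scheme naming, not confirmed by this page.

from llmcompressor.modifiers.autoround import AutoRoundModifier

# equivalent in spirit to the yaml recipe above, using a preset scheme
modifier = AutoRoundModifier(targets=["Linear"], scheme="W4A16", iters=200)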

Methods

  • apply_autoround

    Applies AutoRound quantization tuning on the current decoding layer.

  • on_end

    Finish calibrating by removing observers and calibration hooks.

  • on_finalize

    Disable the quantization observers used by the AutoRound algorithm.

  • on_initialize

    Initialize the model state for quantization and calibration.

  • start_calibration

    Register activation calibration hooks and enable quantization as we calibrate.

apply_autoround

apply_autoround(state, subgraph)

Applies AutoRound quantization tuning on the current decoding layer.

The tuning logic is as follows:

for iter in range(iters):
    quant_output = forward(layer, cached_inputs)
    loss = mse_loss(quant_output, original_output)
    loss.backward()
    optimizer.step()
    if loss < best_loss:
        best_params = update_params(layer)

For more details, please refer to the AutoRound repository: https://github.com/intel/auto-round/

Source code in llmcompressor/modifiers/autoround/base.py
def apply_autoround(self, state, subgraph):
    """
    Applies AutoRound quantization tuning on the current decoding layer.

    The tuning logic is as follows:
    for iter in range(iters):
        quant_output = forward(layer, cached_inputs)
        loss = mse_loss(quant_output, original_output)
        loss.backward()
        optimizer.step()
        if loss < best_loss:
            best_params = update_params(layer)

    For more details, please refer to the AutoRound repository:
    https://github.com/intel/auto-round/
    """
    modules = list(subgraph.submodules(model=state.model))

    decoding_layers = [m for m in modules if self._is_decoding_layer(m)]
    if len(decoding_layers) == 0:
        return
    assert len(decoding_layers) == 1, (
        "Only one decoding layer is expected in the subgraph, "
        f"found {len(decoding_layers)}."
    )
    decoding_layer = decoding_layers[0]

    logger.info("Applying AutoRound on layer {}", decoding_layer._tmp_name)

    wrapped_model = _wrap_decoding_layer(decoding_layer)
    wrapped_model.name_or_path = state.model.name_or_path

    with torch.enable_grad(), align_module_device(decoding_layer):
        ar_quant_scheme = self._mapping_config_to_autoround()
        ar = AutoRound(
            model=wrapped_model,
            tokenizer="",
            scheme=ar_quant_scheme,
            iters=self.iters,
            enable_torch_compile=self.enable_torch_compile,
            batch_size=self.batch_size,
        )
        # TODO: configure layer-wise config based on self.resolved_config
        ar.configure_layer_config(enable_gguf_official_mixed=False)
        ar.batch_dim = 0
        first_param = next(decoding_layer.parameters())
        device = first_param.device
        cur_inputs = self._all_module_input[decoding_layer._tmp_name]
        decoding_layer.tuning_device = device

        q_input, _ = ar.quantize_block(
            block=decoding_layer,
            inputs=cur_inputs,
            q_input=self._q_input,
            device=str(device),
            # Leave offload for LLMC
            auto_offload=False,
        )
        self._q_input = q_input
        # Update offload parameters and remove temporary attributes
        for _, module in decoding_layer.named_modules():
            if hasattr(module, "weight_scale") and hasattr(
                module, "weight_zero_point"
            ):
                # Note: The model's weight is already q-dq in-place by auto-round.
                weight_scale = module.scale
                del module.scale
                # TODO: update zero_point after supporting asymmetric quantization
                update_offload_parameter(module, "weight_scale", weight_scale)
    decoding_layer.eval()
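
To make the tuning loop above concrete, here is a minimal, self-contained sketch of the learnable-rounding idea on a single Linear layer. It is not the library implementation: the 4-bit symmetric quantizer, the straight-through estimator, and the SignSGD step are illustrative, and all names are local to this example.

import torch

torch.manual_seed(0)
layer = torch.nn.Linear(64, 64, bias=False)
calib = torch.randn(32, 64)                  # cached calibration inputs
with torch.no_grad():
    target = layer(calib)                    # original full-precision output

num_bits, iters, lr = 4, 200, 5e-3
qmax = 2 ** (num_bits - 1) - 1
w = layer.weight.detach()
scale = w.abs().amax(dim=1, keepdim=True) / qmax   # per-row symmetric scale

v = torch.zeros_like(w, requires_grad=True)  # learnable rounding perturbation
best_loss, best_v = float("inf"), v.detach().clone()

for _ in range(iters):
    x = w / scale + v
    q = x + (torch.round(x) - x).detach()    # straight-through estimator
    w_qdq = torch.clamp(q, -qmax - 1, qmax) * scale
    loss = torch.nn.functional.mse_loss(calib @ w_qdq.T, target)
    loss.backward()
    if loss.item() < best_loss:              # keep the best rounding seen so far
        best_loss, best_v = loss.item(), v.detach().clone()
    with torch.no_grad():
        v -= lr * v.grad.sign()              # SignSGD update
        v.grad = None
        v.clamp_(-0.5, 0.5)                  # keep the offset within half a bin

# write the best quantized-dequantized weight back in place,
# mirroring how auto-round leaves the layer's weight q-dq'd
with torch.no_grad():
    x = w / scale + best_v
    layer.weight.copy_(torch.clamp(torch.round(x), -qmax - 1, qmax) * scale)

In the actual modifier, auto-round's quantize_block performs this tuning once per decoding layer, with the loss computed over the block's outputs rather than a single Linear layer.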

on_end

on_end(state: State, event: Event, **kwargs)

Finish calibrating by removing observers and calibration hooks.

Source code in llmcompressor/modifiers/autoround/base.py
def on_end(self, state: State, event: Event, **kwargs):
    """
    Finish calibrating by removing observers and calibration hooks
    """
    self.ended_ = True
    QuantizationMixin.end_calibration(self, state.model)
    self._remove_temporary_names(state.model)
    self.remove_hooks()
    self._q_input = None

on_finalize

on_finalize(state: State, **kwargs) -> bool

Disable the quantization observers used by the AutoRound algorithm.

Parameters

  • state

    (State) –

    session state storing input model and calibration data

Source code in llmcompressor/modifiers/autoround/base.py
def on_finalize(self, state: State, **kwargs) -> bool:
    """
    disable the quantization observers used by the AutoRound algorithm

    :param state: session state storing input model and calibration data
    """
    if not self.ended_:
        self.on_end(state, None)

    return True

on_initialize

on_initialize(state: State, **kwargs) -> bool

Initialize the model state for quantization and calibration.

Parameters

  • state

    (State) –

    session state storing input model and calibration data

Source code in llmcompressor/modifiers/autoround/base.py
def on_initialize(self, state: State, **kwargs) -> bool:
    """
    Initialize the model state for quantization and calibration.

    :param state: session state storing input model and calibration data
    """
    # apply config to model and prepare calibration hooks
    if QuantizationMixin.has_config(self):
        QuantizationMixin.initialize_quantization(self, state.model)

    # prepare module names
    self._add_temporary_names(state.model)
    # freeze all model parameters
    for _, param in state.model.named_parameters():
        param.requires_grad_(False)

    self.sequential_targets = self._infer_sequential_targets(state.model)
    return True

start_calibration

start_calibration(model: Module)

Register activation calibration hooks and enable quantization as we calibrate.

Parameters

  • model

    (Module) –

    model to prepare for calibration

Source code in llmcompressor/modifiers/autoround/base.py
def start_calibration(self, model: torch.nn.Module):
    """
    Register activation calibration hooks and enable quantization as we calibrate

    :param model: model to prepare for calibration
    """
    targets = match_named_modules(model, self.targets, self.ignore)
    if targets_embeddings(model, targets):
        untie_word_embeddings(model)

    for _, module in match_named_modules(model, self.targets, self.ignore):
        # Note: No need to register observers for auto-round
        self._calibration_hooks |= self._initialize_hooks(module)
        apply_calibration_status(module)

    model.apply(enable_quantization)  # quantize at the same time as calibrate