llmcompressor.observers

A framework for monitoring and analyzing model behavior during compression.

Provides observers for tracking tensor statistics, activation ranges, and model behavior during compression workflows. Includes min-max observers, MSE observers, and helpers for quantization and other compression techniques.

Modules

Functions

MemorylessMinMaxObserver

MemorylessMinMaxObserver(
    base_name: str,
    args: QuantizationArgs,
    module: Optional[Module] = None,
    **observer_kwargs,
)

Bases: Observer

Computes quantization parameters from the min/max of the observed value

Parameters

  • base_name

    (str) –

    String used to name observer attributes

  • args

    (QuantizationArgs) –

    Quantization args used to calibrate and quantize the observed value

  • module

    (Optional[Module], default: None) –

    Optional module with attached quantization parameters. This argument is required in order to leverage existing qparams (e.g. global_scale or g_idx)

  • **observer_kwargs

    Observer initialization keyword arguments

Source code in llmcompressor/observers/base.py
def __init__(
    self,
    base_name: str,
    args: QuantizationArgs,
    module: Optional[torch.nn.Module] = None,
    **observer_kwargs,
):
    super().__init__()
    self.module = ref(module) if module is not None else None
    self.base_name = base_name
    self.args = args

    # populate observer kwargs
    self.args.observer_kwargs = self.args.observer_kwargs or {}
    self.args.observer_kwargs.update(observer_kwargs)

MinMaxObserver

MinMaxObserver(
    base_name: str,
    args: QuantizationArgs,
    module: Optional[Module] = None,
    **observer_kwargs,
)

Bases: MovingAverageObserverBase

Computes quantization parameters via a moving average of all min/max values

Parameters

  • base_name

    (str) –

    String used to name observer attributes

  • args

    (QuantizationArgs) –

    Quantization args used to calibrate and quantize the observed value

  • module

    (Optional[Module], default: None) –

    Optional module with attached quantization parameters. This argument is required in order to leverage existing qparams (e.g. global_scale or g_idx)

  • **observer_kwargs

    Observer initialization keyword arguments

Source code in llmcompressor/observers/moving_base.py
def __init__(
    self,
    base_name: str,
    args: QuantizationArgs,
    module: Optional[torch.nn.Module] = None,
    **observer_kwargs,
):
    super().__init__(base_name, args, module, **observer_kwargs)
    self.avg_constant = self.args.observer_kwargs.get("averaging_constant", 0.01)

    self.past_min_vals = None
    self.past_max_vals = None
    self.past_global_min_vals = None
    self.past_global_max_vals = None

MovingAverageMSEObserver

MovingAverageMSEObserver(*args, **kwargs)

Bases: MovingAverageObserverBase

Computes quantization parameters by finding the min/max values that minimize the mean squared quantization error.

mse_quant_error := mean((x - fake_quant(x))**2)
global_scale <- min[min_vals, max_vals, global_scale](mse_quant_error(x))
scale, zp <- min[min_vals, max_vals](mse_quant_error(x, global_scale))

Parameters

  • base_name

    String used to name observer attributes

  • args

    Quantization args used to calibrate and quantize the observed value

  • module

    Optional module with attached quantization parameters. This argument is required in order to leverage existing qparams (e.g. global_scale or g_idx)

  • **observer_kwargs

    Keyword arguments for observer initialization:

      • maxshrink: maximum shrink amount, in "grid steps". The number of search steps is int(maxshrink * grid)
      • patience: number of consecutive search steps without improvement before stopping early
      • grid: granularity of the shrink search. Larger values give finer-grained shrink factors
      • norm: exponent used when computing the error. norm = 2 approximates MSE
      • global_scale: precomputed global scale used for quantization. Ignored if optimize_global_scale is True
      • optimize_global_scale: if True, recompute global_scale from the candidate min/max at each search step

Source code in llmcompressor/observers/mse.py
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    observer_kwargs = self.args.observer_kwargs
    self.maxshrink = observer_kwargs.get("maxshrink", 0.20)
    self.patience = observer_kwargs.get("patience", 5)
    self.grid = observer_kwargs.get("grid", 100.0)
    self.norm = observer_kwargs.get("norm", 2.4)
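The maxshrink/grid/patience search above can be sketched without torch. This is a simplified stand-in (scalar values, a hypothetical asymmetric int8 fake-quant, no global scale), not the observer's real code, but it shows how candidate ranges shrink and the lowest-error pair wins:

```python
def mse_search(xs, maxshrink=0.20, grid=100.0, patience=5, norm=2.4,
               qmin=-128, qmax=127):
    """Try progressively tighter min/max candidates and keep the pair
    whose quantization error (raised to `norm`) is lowest."""
    lo, hi = min(xs), max(xs)
    best_err = float("inf")
    best_lo, best_hi = lo, hi
    misses = 0  # consecutive steps without improvement
    for i in range(int(maxshrink * grid)):
        shrink = 1.0 - i / grid          # 1.0, 0.99, 0.98, ...
        c_lo, c_hi = lo * shrink, hi * shrink
        scale = (c_hi - c_lo) / (qmax - qmin)
        if scale == 0.0:
            scale = 1.0
        zp = round(qmin - c_lo / scale)
        # fake-quantize each value, then accumulate |x - dq(q(x))| ** norm
        err = sum(
            abs(x - (max(qmin, min(qmax, round(x / scale) + zp)) - zp) * scale)
            ** norm
            for x in xs
        )
        if err < best_err:
            best_err, best_lo, best_hi, misses = err, c_lo, c_hi, 0
        else:
            misses += 1
            if misses >= patience:  # early stop, mirroring `patience`
                break
    return best_lo, best_hi
```

With an outlier-heavy input, the search may select a range tighter than the raw min/max, trading clipping error for finer resolution on the bulk of the values.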

MovingAverageObserverBase

MovingAverageObserverBase(
    base_name: str,
    args: QuantizationArgs,
    module: Optional[Module] = None,
    **observer_kwargs,
)

Bases: Observer

Computes quantization parameters via a moving average of min/max values

Parameters

  • base_name

    (str) –

    String used to name observer attributes

  • args

    (QuantizationArgs) –

    Quantization args used to calibrate and quantize the observed value

  • module

    (Optional[Module], default: None) –

    Optional module with attached quantization parameters. This argument is required in order to leverage existing qparams (e.g. global_scale or g_idx)

  • **observer_kwargs

    Observer initialization keyword arguments

Methods

Source code in llmcompressor/observers/moving_base.py
def __init__(
    self,
    base_name: str,
    args: QuantizationArgs,
    module: Optional[torch.nn.Module] = None,
    **observer_kwargs,
):
    super().__init__(base_name, args, module, **observer_kwargs)
    self.avg_constant = self.args.observer_kwargs.get("averaging_constant", 0.01)

    self.past_min_vals = None
    self.past_max_vals = None
    self.past_global_min_vals = None
    self.past_global_max_vals = None

get_current_global_min_max abstractmethod

get_current_global_min_max(observed: Tensor) -> MinMaxTuple

Calculate the min and max of the observed value (without moving average) for the purposes of global scale calculation

Source code in llmcompressor/observers/moving_base.py
@abstractmethod
def get_current_global_min_max(self, observed: torch.Tensor) -> MinMaxTuple:
    """
    Calculate the min and max value of the observed value (without moving average)
    for the purposes of global scale calculation
    """
    raise NotImplementedError()

get_current_min_max abstractmethod

get_current_min_max(observed: Tensor) -> MinMaxTuple

Calculate the min and max of the observed value (without moving average)

Source code in llmcompressor/observers/moving_base.py
@abstractmethod
def get_current_min_max(self, observed: torch.Tensor) -> MinMaxTuple:
    """
    Calculate the min and max value of the observed value (without moving average)
    """
    raise NotImplementedError()

get_global_min_max

get_global_min_max(observed: Tensor) -> MinMaxTuple

Calculate a moving average of the min and max of the observed value for the purposes of global scale calculation

Parameters

  • observed

    (Tensor) –

    value being observed, whose shape is (num_observations, 1, group_size)

Returns

  • MinMaxTuple

    minimum and maximum values, whose shapes are (1, )

Source code in llmcompressor/observers/moving_base.py
def get_global_min_max(self, observed: torch.Tensor) -> MinMaxTuple:
    """
    Calculate moving average of min and max values from observed value
    for the purposes of global scale calculation

    :param observed: value being observed whose shape is
        (num_observations, 1, group_size)
    :return: minimum value and maximum value whose shapes are (1, )
    """
    min_vals, max_vals = self.get_current_global_min_max(observed)

    if self.past_global_min_vals is not None and self.avg_constant != 1.0:
        # FUTURE: consider scaling by num observations (first dim)
        #         rather than reducing by first dim
        min_vals = self._lerp(
            self.past_global_min_vals, min_vals, self.avg_constant
        )
        max_vals = self._lerp(
            self.past_global_max_vals, max_vals, self.avg_constant
        )

    self.past_global_min_vals = min_vals
    self.past_global_max_vals = max_vals

    return min_vals, max_vals

get_min_max

get_min_max(observed: Tensor) -> MinMaxTuple

Calculate a moving average of the min and max of the observed value

Parameters

  • observed

    (Tensor) –

    value being observed, whose shape is (num_observations, *qparam_shape, group_size)

Returns

  • MinMaxTuple

    minimum and maximum values, whose shapes are (*qparam_shape, )

Source code in llmcompressor/observers/moving_base.py
def get_min_max(self, observed: torch.Tensor) -> MinMaxTuple:
    """
    Calculate moving average of min and max values from observed value

    :param observed: value being observed whose shape is
        (num_observations, *qparam_shape, group_size)
    :return: minimum value and maximum value whose shapes are (*qparam_shape, )
    """
    min_vals, max_vals = self.get_current_min_max(observed)

    if self.past_min_vals is not None and self.avg_constant != 1.0:
        # FUTURE: consider scaling by num observations (first dim)
        #         rather than reducing by first dim
        min_vals = self._lerp(self.past_min_vals, min_vals, self.avg_constant)
        max_vals = self._lerp(self.past_max_vals, max_vals, self.avg_constant)

    self.past_min_vals = min_vals
    self.past_max_vals = max_vals

    return min_vals, max_vals

Observer

Observer(
    base_name: str,
    args: QuantizationArgs,
    module: Optional[Module] = None,
    **observer_kwargs,
)

Bases: InternalModule, RegistryMixin

Base class for observers, which calculate quantization parameters from observed values of weights, activations, or attention states.

Example

module = ...
observer = Observer.load_from_registry(observer, base_name="weight", args=...)
module.global_scale = observer.get_global_scale(module.weight)
scales, zero_points = observer(module.weight)

Parameters

  • base_name

    (str) –

    String used to name observer attributes

  • args

    (QuantizationArgs) –

    Quantization args used to calibrate and quantize the observed value

  • module

    (Optional[Module], default: None) –

    Optional module with attached quantization parameters. This argument is required in order to leverage existing qparams (e.g. global_scale or g_idx)

  • **observer_kwargs

    Observer initialization keyword arguments

Methods

Source code in llmcompressor/observers/base.py
def __init__(
    self,
    base_name: str,
    args: QuantizationArgs,
    module: Optional[torch.nn.Module] = None,
    **observer_kwargs,
):
    super().__init__()
    self.module = ref(module) if module is not None else None
    self.base_name = base_name
    self.args = args

    # populate observer kwargs
    self.args.observer_kwargs = self.args.observer_kwargs or {}
    self.args.observer_kwargs.update(observer_kwargs)

forward

forward(observed: Tensor) -> ScaleZpTuple

Calculate updated scales and zero points from the observed value (weight, activation, or attention state).

Parameters

  • observed

    (Tensor) –

    value being observed

Returns

  • ScaleZpTuple

    calibrated scale and zero point

Source code in llmcompressor/observers/base.py
@torch.no_grad
def forward(self, observed: torch.Tensor) -> ScaleZpTuple:
    """
    Calculate updated scales and zero points from observed value
    (weight, activation, or attention state).

    :param observed: value being observed
    :return: calibrated scale and zero point
    """
    scales, zero_points, _min, _max = self._forward_with_minmax(observed)
    return (scales, zero_points)

get_global_min_max abstractmethod

get_global_min_max(observed: Tensor) -> MinMaxTuple

Calculate min and max values from the observed value for the purposes of global scale calculation

Parameters

  • observed

    (Tensor) –

    value of shape (num_observations, 1, group_size)

Returns

  • MinMaxTuple

    minimum and maximum values, whose shapes are (1, )

Source code in llmcompressor/observers/base.py
@abstractmethod
def get_global_min_max(self, observed: torch.Tensor) -> MinMaxTuple:
    """
    Calculate min and max values from observed value for the purposes of
    global scale calculation

    :param observed: value of shape (num_observations, 1, group_size)
    :return: minimum value and maximum value whose shapes are (1, )
    """
    raise NotImplementedError()

get_global_scale

get_global_scale(observed: Tensor) -> torch.Tensor

Calculate an updated global scale from the observed value (weight, activation, or attention state).

Parameters

  • observed

    (Tensor) –

    value being observed

Returns

  • Tensor

    calibrated global parameter

Source code in llmcompressor/observers/base.py
@torch.no_grad
def get_global_scale(self, observed: torch.Tensor) -> torch.Tensor:
    """
    Calculate updated global scale from observed value
    (weight, activation, or attention state).

    :param observed: value being observed
    :return: calibrated global parameter
    """
    global_scale, _min, _max = self._get_global_scale_with_minmax(observed)
    return global_scale

get_min_max abstractmethod

get_min_max(observed: Tensor) -> MinMaxTuple

Calculate min and max values from the observed value

Parameters

  • observed

    (Tensor) –

    value of shape (num_observations, *qparam_shape, group_size)

Returns

  • MinMaxTuple

    minimum and maximum values, whose shapes are (*qparam_shape, )

Source code in llmcompressor/observers/base.py
@abstractmethod
def get_min_max(self, observed: torch.Tensor) -> MinMaxTuple:
    """
    Calculate min and max values from observed value

    :param observed: value of shape (num_observations, *qparam_shape, group_size)
    :return: minimum value and maximum value whose shapes are (*qparam_shape, )
    """
    raise NotImplementedError()

StaticMinMaxObserver

StaticMinMaxObserver(*args, **kwargs)

Bases: Observer

Computes quantization parameters from the min/max of all observed values

Parameters

  • base_name

    String used to name observer attributes

  • args

    Quantization args used to calibrate and quantize the observed value

  • module

    Optional module with attached quantization parameters. This argument is required in order to leverage existing qparams (e.g. global_scale or g_idx)

  • **observer_kwargs

    Observer initialization keyword arguments

Source code in llmcompressor/observers/min_max.py
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.past_min_vals = None
    self.past_max_vals = None
    self.past_global_min_vals = None
    self.past_global_max_vals = None

flatten_for_calibration

flatten_for_calibration(
    value: Tensor,
    base_name: str,
    args: QuantizationArgs,
    g_idx: Optional[Tensor] = None,
) -> torch.Tensor

Reshapes the value according to the quantization strategy for the purposes of scale/zp calibration. The flattened value has the following shape:

(num_observations, *qparam_shape, group_size)

The first dim is the number of observations (usually the batch size times the number of tokens), the middle dims are the dimensions of the scales, and the last dim is the number of elements quantized per group.

Parameters

  • value

    (Tensor) –

    value being flattened

  • base_name

    (str) –

    weight, input, output, q/k/v. Used to characterize the value as a weight, activation, or attention state

  • args

    (QuantizationArgs) –

    quantization args for determining how the value is flattened

  • g_idx

    (Optional[Tensor], default: None) –

    optional gidx for weight activation ordering

Returns

  • Tensor

    value which has been reshaped for calibration

Source code in llmcompressor/observers/helpers.py
def flatten_for_calibration(
    value: torch.Tensor,
    base_name: str,
    args: QuantizationArgs,
    g_idx: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    """
    Reshapes the value according to the quantization strategy for the purposes of
    scale/zp calibration. The value after flattening has the following shape:

    `(num_observations, *qparam_shape, group_size)`

    The first dim is the number of observations (usually the batch size times number of
    tokens), the middle dims are the dimension of the scales, and the last dim is the
    number of elements being quantized per group.

    :param value: value being flattened
    :param base_name: weight, input, output, q/k/v. Used to characterize the value as
        being a weight, activation, or attention state
    :param args: quantization args for determining how the value is flattened
    :param g_idx: optional gidx for weight activation ordering
    :return: value which has been reshaped for calibration
    """
    if base_name == "weight":
        return _flatten_weight(value, args, g_idx)
    elif base_name in ("input", "output"):
        return _flatten_activation(value, args)
    elif base_name in ("q", "k", "v"):
        return _flatten_attention(value, args)
    else:
        raise ValueError(f"Unknown quantization base name: {base_name}")
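The target layout can be illustrated with a hypothetical toy example (plain Python lists standing in for tensors, not the helper's internals): for group quantization of a (rows, cols) weight with group size g, flattening yields (1, rows, cols // g, g), so each last-dim slice holds exactly the elements that share one scale.

```python
def flatten_weight_for_groups(weight, group_size):
    """Illustrative grouping: `weight` is a list of rows; returns nesting
    shaped (num_observations=1, rows, cols // group_size, group_size)."""
    cols = len(weight[0])
    assert cols % group_size == 0, "columns must divide evenly into groups"
    grouped = [
        [row[g:g + group_size] for g in range(0, cols, group_size)]
        for row in weight
    ]
    return [grouped]  # leading num_observations dim of size 1

w = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
flat = flatten_weight_for_groups(w, group_size=2)
# flat[0][r][g] is the g-th group of row r; min/max over it
# calibrates that group's scale
```

An observer then reduces over the last dim to get one min/max per qparam position, matching the (num_observations, *qparam_shape, group_size) contract described above.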