`int4` Weight Quantization of a 2:4 Sparse Model

llm-compressor supports quantizing weights while preserving the sparsity pattern, which saves memory and speeds up inference with vLLM.
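To make the memory saving concrete, here is a rough back-of-the-envelope sketch. It assumes roughly 6.74 billion parameters for a 7B-class model and counts only the weight storage; quantization scales, layers left unquantized, and any sparsity metadata are ignored, so real checkpoints will differ.

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> float:
    """Bytes needed to store num_params weights at the given bit width."""
    return num_params * bits_per_weight / 8


# Assumption: approximate parameter count of a 7B-class model.
NUM_PARAMS = 6_740_000_000

dense_bf16_bytes = weight_bytes(NUM_PARAMS, 16)  # bfloat16 baseline
int4_bytes = weight_bytes(NUM_PARAMS, 4)         # 4-bit weights

print(f"bf16 weights: {dense_bf16_bytes / 1e9:.2f} GB")
print(f"int4 weights: {int4_bytes / 1e9:.2f} GB")
print(f"reduction:    {dense_bf16_bytes / int4_bytes:.0f}x")
```

The 4x figure is an upper bound on the weight-storage reduction from quantization alone under these assumptions.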

vLLM supports 2:4 sparsity + int4/int8 mixed-precision computation on NVIDIA GPUs with compute capability 8.0 or higher (Ampere, Ada Lovelace, Hopper).
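For readers new to the term, a 2:4 sparsity pattern means that in every contiguous group of four weights, at most two are nonzero. The sketch below illustrates the pattern with plain magnitude pruning on a Python list. This is a deliberately simplified stand-in, not the pruning algorithm that the recipe in this example applies, and the helper names are hypothetical.

```python
def prune_2of4(row):
    """Zero the two smallest-magnitude values in every group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


def is_2of4_sparse(row):
    """True if every group of four holds at most two nonzero values."""
    return len(row) % 4 == 0 and all(
        sum(1 for v in row[s:s + 4] if v != 0) <= 2
        for s in range(0, len(row), 4)
    )


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.0]
sparse_row = prune_2of4(row)
print(sparse_row)
print(is_2of4_sparse(row), is_2of4_sparse(sparse_row))
```

Because the zeros fall in a fixed, hardware-friendly layout, half of the values in each group can be skipped during matrix multiplication on supporting GPUs.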

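Note: The following example no longer includes fine-tuning, because training support was deprecated as of v0.9.0. To fine-tune your sparse model, see the Axolotl integration blog post for fine-tuning best practices: https://developers.redhat.com/articles/2025/06/17/axolotl-meets-llm-compressor-fast-sparse-open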

Installation

To get started, install llm-compressor from source:

git clone https://github.com/vllm-project/llm-compressor.git
cd llm-compressor
pip install -e .

Quickstart

The example includes an end-to-end script for applying the quantization algorithm.

python3 llama7b_sparse_w4a16.py

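Creating a Sparse Quantized Llama7b Model

This example uses LLMCompressor and Compressed-Tensors to create a 2:4 sparse and quantized Llama2-7b model. The model is calibrated and trained with the ultrachat200k dataset. Running this example requires at least 75GB of GPU memory.

Follow the steps below, or run the example directly with python examples/quantization_2of4_sparse_w4a16/llama7b_sparse_w4a16.py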

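Step 1: Select a model, dataset, and recipe

In this step, we select the model to use as the baseline for sparsification, the dataset to use for calibration and fine-tuning, and the recipe.

Models can reference a local directory or a model on the Hugging Face Hub.

Datasets can come from a compatible local directory or from the Hugging Face Hub.

Recipes are YAML files that describe how a model should be optimized during or after training. The recipe used for this flow is located in 2of4_w4a16_recipe.yaml. It contains instructions to prune the model to 2:4 sparsity, run one epoch of recovery fine-tuning, and quantize to 4 bits in one shot using GPTQ.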

from pathlib import Path

import torch
from loguru import logger
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot

# load the model in as bfloat16 to save on memory and compute
model_stub = "neuralmagic/Llama-2-7b-ultrachat200k"
model = AutoModelForCausalLM.from_pretrained(model_stub, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_stub)

# uses LLM Compressor's built-in preprocessing for ultra chat
dataset = "ultrachat-200k"

# Select the recipe for 2 of 4 sparsity and 4-bit weight quantization (w4a16)
recipe = "2of4_w4a16_recipe.yaml"

# save location of quantized model
output_dir = "output_llama7b_2of4_w4a16_channel"
output_path = Path(output_dir)

# set dataset config parameters
splits = {"calibration": "train_gen[:5%]", "train": "train_gen"}
max_seq_length = 512
num_calibration_samples = 512
preprocessing_num_workers = 8

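Step 2: Run `sparsification` and `quantization`

The compression process now runs in two stages: sparsification and quantization. Each stage saves its intermediate model output to the output_llama7b_2of4_w4a16_channel directory.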

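from pathlib import Path

from llmcompressor import oneshot

output_dir = "output_llama7b_2of4_w4a16_channel"
output_path = Path(output_dir)

# oneshot_kwargs is not defined elsewhere in this README. This definition is an
# assumption: it bundles the Step 1 settings that both stages share.
oneshot_kwargs = dict(
    dataset=dataset,
    recipe=recipe,
    num_calibration_samples=num_calibration_samples,
    preprocessing_num_workers=preprocessing_num_workers,
    splits=splits,
)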

# 1. Oneshot sparsification: apply pruning
oneshot(
    model=model,
    **oneshot_kwargs,
    output_dir=output_dir,
    stage="sparsity_stage",
)


# 2. Oneshot quantization: compress model weights to lower precision
quantized_model = oneshot(
    model=(output_path / "sparsity_stage"),
    **oneshot_kwargs,
    stage="quantization_stage",
)

# skip_sparsity_compression_stats is set to False
# to account for sparsity in the model when compressing
quantized_model.save_pretrained(
    f"{output_dir}/quantization_stage", skip_sparsity_compression_stats=False
)
tokenizer.save_pretrained(f"{output_dir}/quantization_stage")
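After the quantization stage finishes, it can be useful to confirm that the saved checkpoint carries compression metadata. The sketch below assumes the metadata is recorded under a `quantization_config` key in the checkpoint's config.json; that key name and the helper `read_compression_info` are assumptions for illustration, so adjust them if your checkpoint is laid out differently.

```python
import json
from pathlib import Path


def read_compression_info(checkpoint_dir):
    """Return the compression metadata from config.json, or None if absent.

    Assumption: the metadata lives under the "quantization_config" key.
    """
    config_path = Path(checkpoint_dir) / "config.json"
    with config_path.open() as f:
        config = json.load(f)
    return config.get("quantization_config")
```

For this example, calling `read_compression_info("output_llama7b_2of4_w4a16_channel/quantization_stage")` and printing the result would show whatever metadata was written at save time, or `None` if none was found.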

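Custom Quantization

This repository supports multiple quantization techniques, configured through the recipe. The supported strategies are tensor, group, and channel.

The recipe (2of4_w4a16_recipe.yaml) uses channel-wise quantization (strategy: "channel"). To change the quantization strategy, edit the recipe file accordingly.

Use tensor for per-tensor quantization. Use group for group-wise quantization, and specify the group_size parameter (for example, 128). See 2of4_w4a16_group-128_recipe.yaml for an example that sets the group size.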
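The three strategies differ mainly in how many scale factors are computed for a weight matrix. The following sketch illustrates this with a simplified symmetric int4 scale (largest magnitude divided by 7). The helper names are hypothetical, and the scale formula is an assumption chosen for clarity; the library's exact formula may differ.

```python
def symmetric_scale(values, num_bits=4):
    """Scale that maps the largest magnitude onto the top of the signed range."""
    qmax = 2 ** (num_bits - 1) - 1  # 7 for int4
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def scales_for(weight, strategy, group_size=None):
    """Compute scales for a weight given as a list of rows (one per output channel)."""
    if strategy == "tensor":
        # One scale shared by the whole matrix.
        return [symmetric_scale([v for row in weight for v in row])]
    if strategy == "channel":
        # One scale per output channel (row).
        return [symmetric_scale(row) for row in weight]
    if strategy == "group":
        if not group_size:
            raise ValueError("group strategy requires group_size")
        # One scale per block of group_size consecutive values within each row.
        return [
            symmetric_scale(row[s:s + group_size])
            for row in weight
            for s in range(0, len(row), group_size)
        ]
    raise ValueError(f"unknown strategy: {strategy}")


weight = [[7.0, -1.0, 0.5, 3.5], [14.0, 0.0, -7.0, 2.0]]
for strategy in ("tensor", "channel", "group"):
    print(strategy, scales_for(weight, strategy, group_size=2))
```

Finer granularity (group) tracks local weight ranges more closely at the cost of storing more scales, while tensor stores a single scale and is the coarsest option.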
