Processing Large Models with Sequential Onloading

What is Sequential Onloading?

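Sequential onloading is a memory-efficient approach for compressing large language models (LLMs) using only a single GPU. Instead of loading the entire model into GPU memory, which can easily require hundreds of gigabytes, this method loads and compresses one layer at a time. Each layer's outputs are offloaded before the next layer is processed, which dramatically reduces peak memory usage while maintaining high compression fidelity.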
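
The idea can be illustrated with a toy simulation. This is only a sketch of the general pattern, not LLM Compressor's implementation: `ToyLayer`, `compress`, and `sequential_onload` are hypothetical names, a single float stands in for a layer's weights, and rounding stands in for quantization. What it shows is that peak "device" memory is bounded by the largest single layer rather than by the sum of all layers.

```python
from dataclasses import dataclass


@dataclass
class ToyLayer:
    name: str
    size_gb: float  # pretend memory footprint of this layer
    scale: float    # a single float standing in for the layer's weights

    def forward(self, xs):
        return [x * self.scale for x in xs]


def compress(layer):
    # Hypothetical stand-in for quantization: round the weight to 1 decimal.
    layer.scale = round(layer.scale, 1)


def sequential_onload(layers, calibration_batch):
    """Compress layers one at a time and track peak resident memory."""
    resident_gb = 0.0
    peak_gb = 0.0
    activations = calibration_batch  # held off-device between layers
    for layer in layers:
        resident_gb += layer.size_gb             # onload only this layer
        peak_gb = max(peak_gb, resident_gb)
        compress(layer)                          # compress it while resident
        activations = layer.forward(activations) # outputs feed the next layer
        resident_gb -= layer.size_gb             # offload before moving on
    return activations, peak_gb


layers = [
    ToyLayer("layer0", 2.0, 1.26),
    ToyLayer("layer1", 2.0, 0.5),
    ToyLayer("layer2", 2.0, 2.04),
]
outputs, peak = sequential_onload(layers, [1.0, 2.0])
total = sum(layer.size_gb for layer in layers)
print(f"total model size: {total} GB, peak resident: {peak} GB")
```

In this toy setup the model totals 6 GB, but no more than 2 GB is ever resident at once.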

For more information, see the RedHat AI blog post or the LLM Compressor Office Hours recording.

Using Sequential Onloading

Sequential onloading is enabled by default in LLM Compressor. To disable it, add the pipeline="basic" argument to the call to LLM Compressor's oneshot function.
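
As a small illustration of that switch, the helper below builds the keyword arguments for a oneshot call. The helper name `build_oneshot_kwargs` and its `sequential` flag are hypothetical and exist only for this sketch; the only library detail it relies on is the pipeline="basic" argument described above.

```python
def build_oneshot_kwargs(model, dataset, recipe, sequential=True):
    """Collect keyword arguments for oneshot (hypothetical helper)."""
    kwargs = {"model": model, "dataset": dataset, "recipe": recipe}
    if not sequential:
        # Opt out of the default sequential onloading behavior.
        kwargs["pipeline"] = "basic"
    return kwargs


# Placeholder strings stand in for a real model, dataset, and recipe.
default_kwargs = build_oneshot_kwargs("model", "ds", "recipe")
basic_kwargs = build_oneshot_kwargs("model", "ds", "recipe", sequential=False)
print("pipeline" in default_kwargs)
print(basic_kwargs["pipeline"])
```

With real objects in place of the placeholders, the result would be passed along as oneshot(**basic_kwargs).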

Running Llama 3.3 70b

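The Llama 3.3 70b model is larger than 80 GB, which exceeds the capacity of a single A100 GPU. With sequential onloading, however, the model can still be quantized seamlessly using a single GPU.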
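
A back-of-the-envelope calculation makes the gap concrete. The figures below are assumptions for illustration: roughly 70 billion parameters, 2 bytes per parameter (16-bit weights), and an 80 GB A100. Only the weights are counted; activations and calibration state would add to this.

```python
def weight_footprint_gb(num_params: float, bytes_per_param: float) -> float:
    """Return the memory needed for the weights alone, in decimal GB."""
    return num_params * bytes_per_param / 1e9


# Assumed values, not measured ones.
NUM_PARAMS = 70e9        # approximate parameter count
BYTES_PER_PARAM = 2      # 16-bit weights
GPU_CAPACITY_GB = 80     # one 80 GB A100

full_model_gb = weight_footprint_gb(NUM_PARAMS, BYTES_PER_PARAM)
print(f"16-bit weights: {full_model_gb:.0f} GB")
print(f"fits on one GPU: {full_model_gb <= GPU_CAPACITY_GB}")
```

Under these assumptions the weights alone come to about 140 GB, so they cannot all sit on the GPU at once.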

Code Walkthrough

from transformers import AutoModelForCausalLM

model_id = "meta-llama/Llama-3.3-70B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map=None)

The model is first loaded onto the CPU, as indicated by passing None for the device_map argument of the from_pretrained method.

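from llmcompressor import oneshot

# ds, recipe, MAX_SEQUENCE_LENGTH, and NUM_CALIBRATION_SAMPLES are not shown
# in this excerpt and must be defined before this call.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)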

During oneshot, only one GPU is required. It is used to onload each layer in sequence for calibration.

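from llmcompressor.utils import dispatch_for_generation

# tokenizer is not shown in this excerpt and must be loaded beforehand.
dispatch_for_generation(model)
sample = tokenizer("Hello my name is", return_tensors="pt")
# Move the tokenized inputs to the same device as the model.
sample = {key: value.to(model.device) for key, value in sample.items()}
output = model.generate(**sample, max_new_tokens=100)
print(tokenizer.decode(output[0]))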

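Finally, we call dispatch_for_generation to load the model evenly across the available devices (offloading parts of it if required) and run a sample generation on the newly quantized model.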