BitsAndBytes

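vLLM now supports BitsAndBytes for more efficient model inference. BitsAndBytes quantizes models to reduce memory usage and improve performance without significantly sacrificing accuracy. Unlike other quantization methods, BitsAndBytes does not require calibrating the quantized model with input data.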

Below are the steps to use BitsAndBytes with vLLM.

pip install "bitsandbytes>=0.45.3"

vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoints.

You can find bitsandbytes-quantized models on Hugging Face. These repositories usually include a config.json file that contains a quantization_config section.
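
As a rough illustration of what that section looks like, the sketch below checks a parsed config.json for a bitsandbytes entry. The helper name `detect_bnb_quantization` and the sample config are hypothetical, and this is not how vLLM performs the check internally; the field names follow the layout that Hugging Face Transformers typically writes for bitsandbytes checkpoints.

```python
import json


def detect_bnb_quantization(config: dict) -> bool:
    """Return True if a parsed config.json declares bitsandbytes quantization.

    Hypothetical helper for illustration only; vLLM has its own logic.
    """
    # The section may be absent (or null) for unquantized checkpoints.
    quant_cfg = config.get("quantization_config") or {}
    return quant_cfg.get("quant_method") == "bitsandbytes"


# Hypothetical excerpt of a config.json from a pre-quantized repository.
example_config = json.loads("""
{
  "model_type": "llama",
  "quantization_config": {
    "quant_method": "bitsandbytes",
    "load_in_4bit": true,
    "bnb_4bit_quant_type": "nf4"
  }
}
""")

print(detect_bnb_quantization(example_config))            # pre-quantized checkpoint
print(detect_bnb_quantization({"model_type": "llama"}))   # no quantization_config
```

If the section is missing, the checkpoint is not pre-quantized and the quantization method has to be requested explicitly when loading.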

Read quantized checkpoint

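For pre-quantized checkpoints, vLLM tries to infer the quantization method from the config file, so you do not need to specify the quantization argument explicitly.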

from vllm import LLM
import torch
# unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
model_id = "unsloth/tinyllama-bnb-4bit"
llm = LLM(
    model=model_id,
    dtype=torch.bfloat16,
    trust_remote_code=True
)

In-flight quantization: load as 4-bit quantization

For in-flight 4-bit quantization with BitsAndBytes, you need to specify the quantization argument explicitly.

from vllm import LLM
import torch
model_id = "huggyllama/llama-7b"
llm = LLM(
    model=model_id,
    dtype=torch.bfloat16,
    trust_remote_code=True,
    quantization="bitsandbytes"
)
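
To get a feel for what 4-bit loading saves, here is a back-of-the-envelope estimate of weight memory. It is a sketch under stated assumptions: a nominal 7 billion parameters, weights only, and no accounting for the per-block quantization constants that BitsAndBytes stores, the KV cache, or activations. The helper `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: a nominal 7B-parameter model; real checkpoints differ slightly.
NUM_PARAMS = 7e9

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 2 bytes per weight
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # half a byte per weight

print(f"bf16 weights:  {bf16_gb:.1f} GB")
print(f"4-bit weights: {int4_gb:.1f} GB")
print(f"reduction:     {bf16_gb / int4_gb:.0f}x")
```

Actual usage will be somewhat higher than the 4-bit figure because of quantization metadata and any layers left unquantized.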

OpenAI Compatible Server

Append the following to your model arguments for 4-bit in-flight quantization:

--quantization bitsandbytes
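
For context, the sketch below assembles a complete server command line in Python. It assumes the server is started with the `vllm serve` entry point and reuses the model from the in-flight example; the helper `build_serve_command` is hypothetical and only builds the argument list, it does not launch anything.

```python
import shlex
from typing import List, Optional


def build_serve_command(model: str, quantization: Optional[str] = None) -> List[str]:
    """Build an argv list for the server (hypothetical helper)."""
    cmd = ["vllm", "serve", model]
    if quantization:
        # The flag and value described above for 4-bit in-flight quantization.
        cmd += ["--quantization", quantization]
    return cmd


cmd = build_serve_command("huggyllama/llama-7b", quantization="bitsandbytes")
print(shlex.join(cmd))
# prints: vllm serve huggyllama/llama-7b --quantization bitsandbytes
```

The resulting list can be passed to `subprocess.run(cmd)` on a machine where vLLM is installed.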