llmcompressor.args.dataset_arguments

Dataset argument classes for LLM compression workflows.

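This module defines dataclass-based argument containers for configuring dataset loading, preprocessing, and calibration parameters across different dataset sources and processing pipelines. It supports a variety of input formats, including HuggingFace datasets, custom JSON/CSV files, and DVC-managed datasets.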

CustomDatasetArguments dataclass

CustomDatasetArguments(
    dvc_data_repository: str | None = None,
    dataset_path: str | None = None,
    text_column: str = "text",
    remove_columns: None | str | list[str] = None,
    preprocessing_func: None | str | Callable = None,
    batch_size: int = 1,
    data_collator: str | Callable = "truncation",
)

Bases: DVCDatasetArguments

Arguments for calibration using custom datasets.
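
Because this class derives from DVCDatasetArguments, its constructor accepts the inherited dvc_data_repository field ahead of its own fields. The sketch below illustrates that dataclass inheritance pattern only. DVCArgsSketch and CustomArgsSketch are hypothetical stand-ins that mirror a subset of the signatures shown here, not the library's classes, and the dataset path is a made-up value.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DVCArgsSketch:
    # Hypothetical stand-in for DVCDatasetArguments
    dvc_data_repository: Optional[str] = None


@dataclass
class CustomArgsSketch(DVCArgsSketch):
    # Hypothetical stand-in for CustomDatasetArguments (subset of its fields)
    dataset_path: Optional[str] = None
    text_column: str = "text"
    batch_size: int = 1


# Only the fields that differ from their defaults need to be passed.
args = CustomArgsSketch(dataset_path="data/calibration.json")

# Fields of the base class come first, followed by the subclass fields.
field_names = [f.name for f in fields(args)]
print(field_names)
```

The same ordering rule explains why DatasetArguments, which in turn derives from CustomDatasetArguments, lists the DVC and custom-dataset parameters before its own.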

DVCDatasetArguments dataclass

DVCDatasetArguments(dvc_data_repository: str | None = None)

Arguments for calibration using DVC.

DatasetArguments dataclass

DatasetArguments(
    dvc_data_repository: str | None = None,
    dataset_path: str | None = None,
    text_column: str = "text",
    remove_columns: None | str | list[str] = None,
    preprocessing_func: None | str | Callable = None,
    batch_size: int = 1,
    data_collator: str | Callable = "truncation",
    dataset: str | None = None,
    dataset_config_name: str | None = None,
    max_seq_length: int = 384,
    concatenate_data: bool = False,
    raw_kwargs: dict = dict(),
    splits: None | str | list[str] | dict[str, str] = None,
    num_calibration_samples: int | None = 512,
    shuffle_calibration_samples: bool = False,
    streaming: bool | None = False,
    overwrite_cache: bool = False,
    preprocessing_num_workers: int | None = None,
    pad_to_max_length: bool = True,
    min_tokens_per_module: float | None = None,
    moe_calibrate_all_experts: bool = True,
    pipeline: str | None = "independent",
    tracing_ignore: list[str] = (
        lambda: [
            "_update_causal_mask",
            "create_causal_mask",
            "_update_mamba_mask",
            "make_causal_mask",
            "get_causal_mask",
            "mask_interface",
            "mask_function",
            "_prepare_4d_causal_attention_mask",
            "_prepare_fsmt_decoder_inputs",
            "_prepare_4d_causal_attention_mask_with_cache_position",
            "_update_linear_attn_mask",
            "project_per_layer_inputs",
        ]
    )(),
    sequential_targets: list[str] | None = None,
    sequential_offload_device: str = "cpu",
    quantization_aware_calibration: bool = True,
)

Bases: CustomDatasetArguments

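Arguments pertaining to the data used for calibration.

Using HfArgumentParser, this class can be turned into argparse arguments so that they can be specified on the command line.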
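
To make the dataclass-to-argparse idea concrete, here is a minimal sketch that uses only the standard library. It is a simplified stand-in for what HfArgumentParser does, not its implementation. CalibrationArgsSketch, build_parser, and parse_into are hypothetical names, the class copies only a few fields and defaults from the signature above, and the dataset name is an illustrative value.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class CalibrationArgsSketch:
    # Hypothetical subset of DatasetArguments fields, using the documented defaults.
    # Optional types are simplified to plain str/int to keep the sketch short.
    dataset: str = None
    max_seq_length: int = 384
    num_calibration_samples: int = 512
    text_column: str = "text"


def build_parser(cls):
    """Create one --flag per dataclass field, typed and defaulted from the field."""
    parser = argparse.ArgumentParser()
    for f in fields(cls):
        parser.add_argument(f"--{f.name}", type=f.type, default=f.default)
    return parser


def parse_into(cls, argv):
    """Parse a list of command-line tokens into an instance of the dataclass."""
    namespace = build_parser(cls).parse_args(argv)
    return cls(**vars(namespace))


# Equivalent to running: script.py --dataset open_platypus --max_seq_length 2048
args = parse_into(
    CalibrationArgsSketch,
    ["--dataset", "open_platypus", "--max_seq_length", "2048"],
)
print(args)
```

Flags that are not supplied keep the dataclass defaults. With transformers installed, the analogous call would presumably be HfArgumentParser(DatasetArguments).parse_args_into_dataclasses(), which also handles optional, boolean, and list-typed fields that this sketch leaves out.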