
llmcompressor.transformers.data.peoples_speech

PeoplesSpeech

PeoplesSpeech(
    dataset_args: DataTrainingArguments,
    split: str,
    processor: Processor,
)

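Bases: TextGenerationDataset

ML Commons People's Speech audio dataset

Unfortunately, because preprocessing for audio models is so specialized, some model-specific code must be defined here. This dataset has been tested with the WhisperForConditionalGeneration and Qwen2AudioForConditionalGeneration model classes.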
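As a rough illustration of what "model-specific code" can mean here, the sketch below branches on the processor's class name, which is the value the constructor stores in processor_type. The two stand-in processor classes, the build_example helper, and the returned dictionary keys are hypothetical and only illustrate the pattern; they are not the library's actual mapping functions.

```python
# Hypothetical stand-ins: these are NOT the transformers processor classes.
class WhisperProcessor:
    pass


class Qwen2AudioProcessor:
    pass


def build_example(processor, audio, sampling_rate, text):
    """Shape one sample differently depending on the processor class name.

    The key names below are assumptions chosen for illustration.
    """
    processor_type = processor.__class__.__name__

    if processor_type == "Qwen2AudioProcessor":
        # Assumed: this processor family expects a list of audio clips.
        return {"audios": [audio], "sampling_rate": sampling_rate, "text": text}

    # Default branch: a single clip plus a lightly normalized transcript.
    return {
        "audio": audio,
        "sampling_rate": sampling_rate,
        "text": text.strip().capitalize(),
    }


whisper_sample = build_example(WhisperProcessor(), [0.0, 0.1], 16000, " hello there ")
qwen_sample = build_example(Qwen2AudioProcessor(), [0.0, 0.1], 16000, "hello there")
```

Keying on the class name keeps the dataset free of hard imports of each processor class, at the cost of silently falling through to the default branch for any unrecognized processor.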

Parameters

  • dataset_args

    Configuration settings for dataset loading

  • split

    (str) –

    Split to load from the dataset, for example test or train[:5%]

  • processor

    (Processor) –

    Processor or tokenizer to use on the dataset

Source code in llmcompressor/transformers/data/peoples_speech.py
def __init__(self, dataset_args: "DataArgs", split: str, processor: Processor):
    dataset_args = deepcopy(dataset_args)
    dataset_args.dataset = "MLCommons/peoples_speech"
    dataset_args.dataset_config_name = "test"
    if not dataset_args.overwrite_cache:
        logger.warning(
            "Because audio processors are more complex, dataset mapping functions "
            "vary with model architecture and their results cannot be cached. "
            "Setting overwrite_cache=True"
        )
        dataset_args.overwrite_cache = True
    self.processor_type = processor.__class__.__name__

    super().__init__(dataset_args=dataset_args, split=split, processor=processor)
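
The constructor above works on a deep copy of the arguments, so the caller's object is left untouched while the dataset name, config name, and overwrite_cache are forced to fixed values. A minimal sketch of that pattern follows. ExampleArgs and prepare_args are hypothetical stand-ins written for this illustration, not part of llmcompressor, and the warning log is omitted.

```python
from copy import deepcopy
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExampleArgs:
    """Hypothetical stand-in for the real dataset arguments object."""

    dataset: Optional[str] = None
    dataset_config_name: Optional[str] = None
    overwrite_cache: bool = False


def prepare_args(dataset_args: ExampleArgs) -> ExampleArgs:
    """Return a modified copy, leaving the caller's arguments unchanged."""
    args = deepcopy(dataset_args)
    args.dataset = "MLCommons/peoples_speech"
    args.dataset_config_name = "test"
    # Mapping results depend on the model architecture, so caching is disabled.
    args.overwrite_cache = True
    return args


caller_args = ExampleArgs(dataset="my/dataset")
prepared = prepare_args(caller_args)
```

Because of the copy, the same arguments object can be reused to build other datasets afterwards without inheriting the People's Speech overrides.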