llmcompressor.transformers.data.flickr_30k

  • Flickr30K

    :param dataset_args: configuration settings for dataset loading

Flickr30K

Flickr30K(
    dataset_args: DatasetArguments,
    split: str,
    processor: Processor,
)

Bases: TextGenerationDataset

Parameters

  • dataset_args

    (DatasetArguments) –

    configuration settings for dataset loading

  • split

    (str) –

    split to load from the dataset, for example test or train[:5%]

  • processor

    (Processor) –

    processor or tokenizer to use on the dataset

Source code in llmcompressor/transformers/data/flickr_30k.py
def __init__(
    self, dataset_args: "DatasetArguments", split: str, processor: Processor
):
    dataset_args = deepcopy(dataset_args)
    dataset_args.dataset = "lmms-lab/flickr30k"

    super().__init__(dataset_args=dataset_args, split=split, processor=processor)

    if (
        self.tokenizer is not None
        and getattr(self.tokenizer, "chat_template", None) is None
    ):
        # note that since tokenizer is a member of processor,
        # this change affects processor.apply_chat_template
        self.tokenizer.chat_template = self.DEFAULT_CHAT_TEMPLATE
        logger.warning(
            "tokenizer.chat_template is not set, using default chat template for "
            f"{self.__class__.__name__}"
        )
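
The constructor above does two things that are easy to miss: it copies dataset_args before overriding the dataset name, and it installs a default chat template on the tokenizer only when none is set. The sketch below imitates both behaviors with stand-in objects so that it runs without llmcompressor installed. FakeDatasetArguments, FakeTokenizer, FakeProcessor, init_like_flickr30k, and the placeholder template string are all hypothetical names made up for illustration; how the real base class derives self.tokenizer from the processor is also an assumption here.

```python
from copy import deepcopy
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeDatasetArguments:
    """Hypothetical stand-in for DatasetArguments (only the field used here)."""

    dataset: Optional[str] = None


class FakeTokenizer:
    """Hypothetical stand-in for a tokenizer that has no chat template yet."""

    chat_template: Optional[str] = None


class FakeProcessor:
    """Hypothetical stand-in for a processor that wraps a tokenizer."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer


# Placeholder text only; the real DEFAULT_CHAT_TEMPLATE lives on the class.
DEFAULT_CHAT_TEMPLATE = "{{ messages }}"


def init_like_flickr30k(dataset_args, processor):
    """Imitate the two side effects of the constructor shown above."""
    # Copy first, so the caller's arguments object is left untouched.
    dataset_args = deepcopy(dataset_args)
    dataset_args.dataset = "lmms-lab/flickr30k"

    # Assumption: the tokenizer is reachable as an attribute of the processor.
    tokenizer = getattr(processor, "tokenizer", None)
    if tokenizer is not None and getattr(tokenizer, "chat_template", None) is None:
        # The tokenizer object is shared with the processor, so the
        # processor sees the new template as well.
        tokenizer.chat_template = DEFAULT_CHAT_TEMPLATE
    return dataset_args


caller_args = FakeDatasetArguments(dataset="some/other-dataset")
processor = FakeProcessor(FakeTokenizer())
resolved = init_like_flickr30k(caller_args, processor)

print(caller_args.dataset)  # the caller's object keeps its original value
print(resolved.dataset)     # the copy carries the Flickr30K dataset name
print(processor.tokenizer.chat_template == DEFAULT_CHAT_TEMPLATE)
```

A tokenizer that already carries a chat template is left alone by this logic, which matches the `is None` check in the source: the default is a fallback, not an override.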