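Multi-Modal Support

This document walks you through the steps to extend a basic model so that it accepts multi-modal inputs.

1. Update the base vLLM model

It is assumed that you have already implemented the model in vLLM according to the basic model implementation steps. Further update the model as follows:

  • Reserve a keyword parameter in forward() for each input tensor that corresponds to a multi-modal input, as shown in the following example: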

      def forward(
          self,
          input_ids: torch.Tensor,
          positions: torch.Tensor,
          kv_caches: List[torch.Tensor],
          attn_metadata: AttentionMetadata,
    +     pixel_values: torch.Tensor,
      ) -> SamplerOutput:
    

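    More conveniently, you can simply pass **kwargs to the forward() method and retrieve the keyword parameters for multimodal inputs from it.

  • Implement get_multimodal_embeddings(), which returns the embeddings obtained by running the multimodal inputs through the model's multimodal tokenizer. Below is boilerplate for a typical implementation pattern; feel free to adjust it to your own needs.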

    class YourModelForImage2Seq(nn.Module):
        ...
    
        def _process_image_input(self, image_input: YourModelImageInputs) -> torch.Tensor:
    
            assert self.vision_encoder is not None
            image_features = self.vision_encoder(image_input)
            return self.multi_modal_projector(image_features)
    
        def get_multimodal_embeddings(self, **kwargs: object) -> Optional[NestedTensors]:
    
            # Validate the multimodal input keyword arguments
            image_input = self._parse_and_validate_image_input(**kwargs)
            if image_input is None:
                return None
    
            # Run multimodal inputs through encoder and projector
            vision_embeddings = self._process_image_input(image_input)
            return vision_embeddings
    

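    Important

    The returned multimodal_embeddings must be either a 3D torch.Tensor of shape (num_items, feature_size, hidden_size), or a list/tuple of 2D torch.Tensor objects of shape (feature_size, hidden_size), so that multimodal_embeddings[i] retrieves the embeddings generated from the i-th multimodal data item (e.g., image) of the request.

  • Implement get_input_embeddings() to merge multimodal_embeddings with the text embeddings from input_ids. If input processing for the model is implemented correctly (see the sections below), you can use the utility function we provide to merge the embeddings easily.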

    from .utils import merge_multimodal_embeddings
    
    class YourModelForImage2Seq(nn.Module):
        ...
    
        def get_input_embeddings(
            self,
            input_ids: torch.Tensor,
            multimodal_embeddings: Optional[NestedTensors] = None,
        ) -> torch.Tensor:
    
            # `get_input_embeddings` should already be implemented for the language 
            # model as one of the requirements of basic vLLM model implementation.
            inputs_embeds = self.language_model.get_input_embeddings(input_ids)
    
            if multimodal_embeddings is not None:
                inputs_embeds = merge_multimodal_embeddings(
                    input_ids=input_ids, 
                    inputs_embeds=inputs_embeds, 
                    multimodal_embeddings=multimodal_embeddings,
                    placeholder_token_id=self.config.image_token_index)
    
            return inputs_embeds
    
  • Once the above steps are done, update the model class with the SupportsMultiModal interface.

    + from vllm.model_executor.models.interfaces import SupportsMultiModal
    
    - class YourModelForImage2Seq(nn.Module):
    + class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
    

    Note

    The model class does not have to be named *ForCausalLM. Check out the HuggingFace Transformers documentation for some examples.
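To make the merge step above concrete, here is a minimal pure-Python sketch of the idea: every placeholder position in input_ids is overwritten, in order, by one row of the flattened multimodal embeddings. The function name merge_embeddings_sketch and the toy values are assumptions made for illustration; this is not the implementation of merge_multimodal_embeddings, which operates on tensors.

```python
from typing import List, Sequence

Vector = List[float]


def merge_embeddings_sketch(
    input_ids: Sequence[int],
    text_embeds: Sequence[Vector],
    multimodal_embeddings: Sequence[Sequence[Vector]],
    placeholder_token_id: int,
) -> List[Vector]:
    """Overwrite placeholder positions with multimodal embeddings, in order."""
    # Flatten (num_items, feature_size, hidden_size) into one row per feature.
    flat = [row for item in multimodal_embeddings for row in item]
    num_placeholders = sum(1 for t in input_ids if t == placeholder_token_id)
    if num_placeholders != len(flat):
        raise ValueError(
            f"expected {num_placeholders} multimodal rows, got {len(flat)}")

    rows = iter(flat)
    return [
        next(rows) if token == placeholder_token_id else list(embed)
        for token, embed in zip(input_ids, text_embeds)
    ]


# Hypothetical values: token 9 is the placeholder, hidden_size is 2.
IMAGE_TOKEN = 9
ids = [1, IMAGE_TOKEN, IMAGE_TOKEN, 5]
text = [[0.1, 0.1], [0.0, 0.0], [0.0, 0.0], [0.5, 0.5]]
image_item = [[7.0, 7.0], [8.0, 8.0]]  # one image with feature_size == 2
merged = merge_embeddings_sketch(ids, text, [image_item], IMAGE_TOKEN)
```

The length check mirrors the requirement that the number of placeholder tokens matches the number of embedding rows; a mismatch usually points at the input processing rather than at the model.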

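2. Specify processing information

Next, create a subclass of BaseProcessingInfo to provide basic information related to HF processing.

Maximum number of input items

You need to override the abstract method get_supported_mm_limits() to return the maximum number of input items for each modality supported by the model.

For example, if the model supports any number of images but only one video per prompt: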

def get_supported_mm_limits(self) -> Mapping[str, Optional[int]]:
    return {"image": None, "video": 1}
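As a rough illustration of how such a mapping can be interpreted, the sketch below checks per-prompt item counts against the limits, treating None as "no limit". The helper check_mm_limits is hypothetical and only shows the semantics of the returned mapping; it is not part of vLLM.

```python
from typing import Mapping, Optional


def check_mm_limits(
    counts: Mapping[str, int],
    limits: Mapping[str, Optional[int]],
) -> None:
    """Raise ValueError if a prompt exceeds the per-modality item limits."""
    for modality, count in counts.items():
        if modality not in limits:
            raise ValueError(f"unsupported modality: {modality!r}")
        limit = limits[modality]
        # `None` means the model accepts any number of items.
        if limit is not None and count > limit:
            raise ValueError(
                f"at most {limit} {modality} item(s) allowed, got {count}")


limits = {"image": None, "video": 1}
check_mm_limits({"image": 12, "video": 1}, limits)  # passes silently

try:
    check_mm_limits({"image": 1, "video": 2}, limits)
    rejected = False
except ValueError:
    rejected = True
```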

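Maximum number of placeholder feature tokens

Also, override the abstract method get_mm_max_tokens_per_item() to return the maximum number of placeholder feature tokens per input item for each modality.

When the model is called, the output embeddings from the visual encoder are assigned to the input positions that contain placeholder feature tokens. Therefore, the number of placeholder feature tokens should equal the size of the output embeddings.

For LLaVA, start with the code of HF's LlavaForConditionalGeneration: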

# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L530-L544
n_image_tokens = (input_ids == self.config.image_token_index).sum().item()
n_image_features = image_features.shape[0] * image_features.shape[1]

if n_image_tokens != n_image_features:
    raise ValueError(
        f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}"
    )
special_image_mask = (
    (input_ids == self.config.image_token_index)
    .unsqueeze(-1)
    .expand_as(inputs_embeds)
    .to(inputs_embeds.device)
)
image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)

The number of placeholder feature tokens per image is image_features.shape[1]. image_features is calculated inside the get_image_features method:

# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L290-L300
image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)

selected_image_feature = image_outputs.hidden_states[vision_feature_layer]
if vision_feature_select_strategy == "default":
    selected_image_feature = selected_image_feature[:, 1:]
elif vision_feature_select_strategy == "full":
    selected_image_feature = selected_image_feature
else:
    raise ValueError(f"Unexpected select feature strategy: {self.config.vision_feature_select_strategy}")
image_features = self.multi_modal_projector(selected_image_feature)
return image_features

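We can infer that image_features.shape[1] is based on the second dimension of the hidden states in image_outputs.hidden_states, which come from the vision tower (CLIPVisionModel for the llava-hf/llava-1.5-7b-hf model). Moreover, we only need the sequence length (the second dimension of the tensor) to get image_features.shape[1]. The sequence length is determined by the initial hidden states in CLIPVisionTransformer, since the attention mechanism does not change the sequence length of the output hidden states.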

# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L1094-L1102
hidden_states = self.embeddings(pixel_values, interpolate_pos_encoding=interpolate_pos_encoding)
hidden_states = self.pre_layrnorm(hidden_states)

encoder_outputs = self.encoder(
    inputs_embeds=hidden_states,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)

To find the sequence length, we turn to the code of CLIPVisionEmbeddings:

# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L247-L257
target_dtype = self.patch_embedding.weight.dtype
patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype))  # shape = [*, width, grid, grid]
patch_embeds = patch_embeds.flatten(2).transpose(1, 2)

class_embeds = self.class_embedding.expand(batch_size, 1, -1)
embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
if interpolate_pos_encoding:
    embeddings = embeddings + self.interpolate_pos_encoding(embeddings, height, width)
else:
    embeddings = embeddings + self.position_embedding(self.position_ids)
return embeddings

We can infer that embeddings.shape[1] == self.num_positions, where

# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L195-L196
self.num_patches = (self.image_size // self.patch_size) ** 2
self.num_positions = self.num_patches + 1

Overall, the number of placeholder feature tokens for an image can be calculated as:

def get_num_image_tokens(
    self,
    *,
    image_width: int,
    image_height: int,
) -> int:
    hf_config = self.get_hf_config()
    hf_processor = self.get_hf_processor()

    image_size = hf_config.vision_config.image_size
    patch_size = hf_config.vision_config.patch_size

    num_image_tokens = (image_size // patch_size) ** 2 + 1
    if hf_processor.vision_feature_select_strategy == "default":
        num_image_tokens -= 1

    return num_image_tokens
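As a worked example of this formula, the standalone sketch below plugs in concrete numbers. The values image_size=336 and patch_size=14 are assumptions (they are the usual settings for a CLIP ViT-L/14 tower at 336 px); check your model's vision_config for the real ones. The function name num_image_tokens is also just for illustration.

```python
def num_image_tokens(image_size: int, patch_size: int, strategy: str) -> int:
    """Placeholder tokens per image for a CLIP-style vision tower."""
    num_patches = (image_size // patch_size) ** 2
    num_positions = num_patches + 1  # patches plus the class embedding
    if strategy == "default":
        return num_positions - 1  # the class embedding is dropped ([:, 1:])
    if strategy == "full":
        return num_positions
    raise ValueError(f"Unexpected select feature strategy: {strategy}")


# Assumed config values; not read from any real checkpoint.
default_tokens = num_image_tokens(336, 14, "default")
full_tokens = num_image_tokens(336, 14, "full")
print(default_tokens, full_tokens)
```

With these assumed values the grid is 24 x 24 patches, so the "default" strategy yields 576 tokens and "full" yields 577.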

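Notice that the number of image tokens does not depend on the image width and height. We can therefore use any image size to calculate the maximum number of image tokens: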

def get_image_size_with_most_features(self) -> ImageSize:
    hf_config = self.get_hf_config()
    width = height = hf_config.vision_config.image_size
    return ImageSize(width=width, height=height)

def get_max_image_tokens(self) -> int:
    target_width, target_height = self.get_image_size_with_most_features()

    return self.get_num_image_tokens(
        image_width=target_width,
        image_height=target_height,
    )

And thus, we can override the method as:

def get_mm_max_tokens_per_item(
    self,
    seq_len: int,
    mm_counts: Mapping[str, int],
) -> Mapping[str, int]:
    return {"image": self.get_max_image_tokens()}

Note

Our actual code is more abstract, to support vision encoders other than CLIP.

Next, consider Fuyu. Looking at the code of HF's FuyuForCausalLM:

# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/modeling_fuyu.py#L311-L322
if image_patches is not None and past_key_values is None:
    patch_embeddings = [
        self.vision_embed_tokens(patch.to(self.vision_embed_tokens.weight.dtype))
        .squeeze(0)
        .to(inputs_embeds.device)
        for patch in image_patches
    ]
    inputs_embeds = self.gather_continuous_embeddings(
        word_embeddings=inputs_embeds,
        continuous_embeddings=patch_embeddings,
        image_patch_input_indices=image_patches_indices,
    )

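The number of placeholder feature tokens for the i-th item in the batch is patch_embeddings[i].shape[0], which is the same as image_patches[i].shape[0], i.e. num_total_patches.

Unlike LLaVA, Fuyu does not define the number of patches inside the modeling file. Where can we get more information? Since the model inputs come from the output of FuyuProcessor, let's look at the preprocessing files.

The image outputs are obtained by calling FuyuImageProcessor.preprocess and then FuyuImageProcessor.preprocess_with_tokenizer_info inside FuyuProcessor.

In FuyuImageProcessor.preprocess, the images are resized and padded to the target FuyuImageProcessor.size, and the dimensions after resizing (but before padding) are returned as metadata.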

# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L541-L544
image_encoding = self.image_processor.preprocess(images, **output_kwargs["images_kwargs"])
batch_images = image_encoding["images"]
image_unpadded_heights = image_encoding["image_unpadded_heights"]
image_unpadded_widths = image_encoding["image_unpadded_widths"]

# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L480-L
if do_resize:
    batch_images = [
        [self.resize(image, size=size, input_data_format=input_data_format) for image in images]
        for images in batch_images
    ]

image_sizes = [get_image_size(images[0], channel_dim=input_data_format) for images in batch_images]
image_unpadded_heights = [[image_size[0]] for image_size in image_sizes]
image_unpadded_widths = [[image_size[1]] for image_size in image_sizes]

if do_pad:
    batch_images = [
        [
            self.pad_image(
                image,
                size=size,
                mode=padding_mode,
                constant_values=padding_value,
                input_data_format=input_data_format,
            )
            for image in images
        ]
        for images in batch_images
    ]

In FuyuImageProcessor.preprocess_with_tokenizer_info, the images are split into patches based on this metadata:

# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L425
model_image_input = self.image_processor.preprocess_with_tokenizer_info(
    image_input=tensor_batch_images,
    image_present=image_present,
    image_unpadded_h=image_unpadded_heights,
    image_unpadded_w=image_unpadded_widths,
    image_placeholder_id=image_placeholder_id,
    image_newline_id=image_newline_id,
    variable_sized=True,
)

# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L638-L658
image_height, image_width = image.shape[1], image.shape[2]
if variable_sized:  # variable_sized=True
    new_h = min(
        image_height,
        math.ceil(image_unpadded_h[batch_index, subseq_index] / patch_height) * patch_height,
    )
    new_w = min(
        image_width,
        math.ceil(image_unpadded_w[batch_index, subseq_index] / patch_width) * patch_width,
    )
    image = image[:, :new_h, :new_w]
    image_height, image_width = new_h, new_w

num_patches = self.get_num_patches(image_height=image_height, image_width=image_width)
tensor_of_image_ids = torch.full(
    [num_patches], image_placeholder_id, dtype=torch.int32, device=image_input.device
)
patches = self.patchify_image(image=image.unsqueeze(0)).squeeze(0)
assert num_patches == patches.shape[0]

The number of patches is in turn defined by FuyuImageProcessor.get_num_patches:

# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L552-L562
patch_size = patch_size if patch_size is not None else self.patch_size
patch_height, patch_width = self.patch_size["height"], self.patch_size["width"]

if image_height % patch_height != 0:
    raise ValueError(f"{image_height=} must be divisible by {patch_height}")
if image_width % patch_width != 0:
    raise ValueError(f"{image_width=} must be divisible by {patch_width}")

num_patches_per_dim_h = image_height // patch_height
num_patches_per_dim_w = image_width // patch_width
num_patches = num_patches_per_dim_h * num_patches_per_dim_w
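The two HF snippets above fit together as follows: the padded image is cropped to the unpadded size rounded up to a whole patch, so the divisibility check in get_num_patches always passes and the patch count equals a product of two ceilings. Here is a small sketch with assumed numbers (a 1080 x 1920 padded image, 30 x 30 patches, a 200 x 300 unpadded image); the helper names are made up for this example.

```python
import math


def cropped_size(padded: int, unpadded: int, patch: int) -> int:
    """Round the unpadded extent up to a whole patch, capped at the padded size."""
    return min(padded, math.ceil(unpadded / patch) * patch)


def num_patches(image_height: int, image_width: int,
                patch_height: int, patch_width: int) -> int:
    """Count patches, requiring the image to be a whole number of patches."""
    if image_height % patch_height or image_width % patch_width:
        raise ValueError("image size must be divisible by the patch size")
    return (image_height // patch_height) * (image_width // patch_width)


# Assumed values, for illustration only.
new_h = cropped_size(padded=1080, unpadded=200, patch=30)
new_w = cropped_size(padded=1920, unpadded=300, patch=30)
patches = num_patches(new_h, new_w, 30, 30)
```

Here the height is rounded up from 200 to 210, giving 7 rows of 10 patches.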

We can calculate this in vLLM using this code:

def get_num_image_patches(
    self,
    *,
    image_width: int,
    image_height: int,
) -> int:
    image_processor = self.get_image_processor()
    target_width = image_processor.size["width"]
    target_height = image_processor.size["height"]
    patch_width = image_processor.patch_size["width"]
    patch_height = image_processor.patch_size["height"]

    if not (image_width <= target_width and image_height <= target_height):
        height_scale_factor = target_height / image_height
        width_scale_factor = target_width / image_width
        optimal_scale_factor = min(height_scale_factor, width_scale_factor)

        image_height = int(image_height * optimal_scale_factor)
        image_width = int(image_width * optimal_scale_factor)

    ncols = math.ceil(image_width / patch_width)
    nrows = math.ceil(image_height / patch_height)
    return ncols * nrows

These image patches correspond to placeholder tokens (|SPEAKER|). However, the processor also inserts newline tokens (|NEWLINE|), as shown here:

# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L654-L670
tensor_of_image_ids = torch.full(
    [num_patches], image_placeholder_id, dtype=torch.int32, device=image_input.device
)
patches = self.patchify_image(image=image.unsqueeze(0)).squeeze(0)
assert num_patches == patches.shape[0]

if variable_sized:
    # Now terminate each line with |NEWLINE|.
    tensor_of_image_ids = tensor_of_image_ids.reshape(-1, image_width // patch_width)
    newline_ids = torch.full(
        [tensor_of_image_ids.shape[0], 1],
        image_newline_id,
        dtype=torch.int32,
        device=image_input.device,
    )
    tensor_of_image_ids = torch.cat([tensor_of_image_ids, newline_ids], dim=1)
    tensor_of_image_ids = tensor_of_image_ids.reshape(-1)

So, the layout of tokens for an image is:

|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
...
|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|

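This makes the placeholder tokens non-consecutive in the prompt. Since vLLM requires the feature tokens to be consecutive, we also treat the newline tokens as feature tokens.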
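A short sketch can make the contiguity argument visible. It builds the layout with plain strings for a small, made-up grid and lists the positions of the |SPEAKER| tokens; token_layout is a hypothetical helper, not part of vLLM or Transformers.

```python
from typing import List


def token_layout(ncols: int, nrows: int) -> List[str]:
    """One row of placeholders per patch row, each ending with a newline token."""
    return (["|SPEAKER|"] * ncols + ["|NEWLINE|"]) * nrows


layout = token_layout(ncols=3, nrows=2)
# Positions of the placeholder tokens alone are not one contiguous run ...
speaker_positions = [i for i, t in enumerate(layout) if t == "|SPEAKER|"]
# ... but the whole layout, newline tokens included, is.
```

For this 3 x 2 grid the placeholders sit at positions 0, 1, 2, 4, 5, 6, with newline tokens at positions 3 and 7, which is why the newline tokens are counted as feature tokens too.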

So overall, the total number of feature tokens is:

def get_num_image_tokens(
    self,
    *,
    image_width: int,
    image_height: int,
) -> int:
    image_processor = self.get_image_processor()
    target_width = image_processor.size["width"]
    target_height = image_processor.size["height"]
    patch_width = image_processor.patch_size["width"]
    patch_height = image_processor.patch_size["height"]

    if not (image_width <= target_width and image_height <= target_height):
        height_scale_factor = target_height / image_height
        width_scale_factor = target_width / image_width
        optimal_scale_factor = min(height_scale_factor, width_scale_factor)

        image_height = int(image_height * optimal_scale_factor)
        image_width = int(image_width * optimal_scale_factor)

    ncols = math.ceil(image_width / patch_width)
    nrows = math.ceil(image_height / patch_height)
    return (ncols + 1) * nrows
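For a quick sanity check, the same arithmetic can be run as a standalone function. The default sizes used here (a 1920 x 1080 target and 30 x 30 patches) are assumptions standing in for image_processor.size and image_processor.patch_size; read the real values from your image processor.

```python
import math


def fuyu_num_image_tokens(
    image_width: int,
    image_height: int,
    *,
    target_width: int = 1920,   # assumed image_processor.size["width"]
    target_height: int = 1080,  # assumed image_processor.size["height"]
    patch_width: int = 30,      # assumed image_processor.patch_size["width"]
    patch_height: int = 30,     # assumed image_processor.patch_size["height"]
) -> int:
    # Images larger than the target are scaled down, keeping the aspect ratio.
    if not (image_width <= target_width and image_height <= target_height):
        scale = min(target_height / image_height, target_width / image_width)
        image_height = int(image_height * scale)
        image_width = int(image_width * scale)

    ncols = math.ceil(image_width / patch_width)
    nrows = math.ceil(image_height / patch_height)
    return (ncols + 1) * nrows  # one extra newline token per row


small = fuyu_num_image_tokens(300, 200)    # 10 columns, 7 rows
full = fuyu_num_image_tokens(1920, 1080)   # 64 columns, 36 rows
huge = fuyu_num_image_tokens(3840, 2160)   # scaled down to 1920 x 1080
```

Under these assumptions a 300 x 200 image needs 77 tokens, while any image at or above the target size needs 2340, so larger inputs never exceed the count for the target size.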

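To calculate the maximum number of image tokens, recall that input images are first resized to fit within image_processor.size. The maximum possible dimensions of the image before it is converted into patches are therefore equal to image_processor.size.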

def get_image_size_with_most_features(self) -> ImageSize:
    image_processor = self.get_image_processor()
    return ImageSize(width=image_processor.size["width"],
                     height=image_processor.size["height"])

def get_max_image_tokens(self) -> int:
    target_width, target_height = self.get_image_size_with_most_features()

    return self.get_num_image_tokens(
        image_width=target_width,
        image_height=target_height,
    )

And thus, we can override the method as:

def get_mm_max_tokens_per_item(
    self,
    seq_len: int,
    mm_counts: Mapping[str, int],
) -> Mapping[str, int]:
    return {"image": self.get_max_image_tokens()}

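Note

Our actual code returns ncols and nrows directly instead of the total token count. This is because ncols and nrows are used to specify the layout of the feature tokens (as shown in Step 4 of this guide).

3. Specify dummy inputs

Then, inherit BaseDummyInputsBuilder to construct dummy inputs for HF processing as well as memory profiling.

For memory profiling

Override the abstract method get_dummy_processor_inputs() to construct dummy inputs for memory profiling. These dummy inputs should result in the worst-case memory usage of the model, so that vLLM can reserve the correct amount of memory for it.

Assuming that memory usage increases with the number of tokens, the dummy inputs can be constructed based on the code for get_mm_max_tokens_per_item().

For LLaVA, make use of the get_image_size_with_most_features method implemented in Step 2: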

def get_dummy_processor_inputs(
    self,
    seq_len: int,
    mm_counts: Mapping[str, int],
) -> ProcessorInputs:
    num_images = mm_counts.get("image", 0)

    processor = self.info.get_hf_processor()
    image_token = processor.image_token

    target_width, target_height = self.info.get_image_size_with_most_features()

    mm_data = {
        "image":
        self._get_dummy_images(width=target_width,
                               height=target_height,
                               num_images=num_images)
    }

    return ProcessorInputs(
        prompt_text=image_token * num_images,
        mm_data=mm_data,
    )

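Fuyu does not expect image placeholders in the inputs to the HF processor, so the dummy prompt text is empty regardless of the number of images. Otherwise, the logic of this method is very similar to LLaVA: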

def get_dummy_processor_inputs(
    self,
    seq_len: int,
    mm_counts: Mapping[str, int],
) -> ProcessorInputs:
    target_width, target_height = \
        self.info.get_image_size_with_most_features()
    num_images = mm_counts.get("image", 0)

    mm_data = {
        "image":
        self._get_dummy_images(width=target_width,
                               height=target_height,
                               num_images=num_images)
    }

    return ProcessorInputs(
        prompt_text="",
        mm_data=mm_data,
    )
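The difference between the two builders comes down to the prompt text. The sketch below contrasts them using plain Python values instead of real images; dummy_inputs_sketch, the "<image>" token string and the image sizes are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple


def dummy_inputs_sketch(
    num_images: int,
    image_size: Tuple[int, int],
    image_token: str,
) -> Dict[str, object]:
    """Build a dummy prompt plus the sizes of the dummy images to generate."""
    sizes: List[Tuple[int, int]] = [image_size] * num_images
    # An empty `image_token` models a processor that expects no placeholders.
    return {"prompt_text": image_token * num_images, "image_sizes": sizes}


# Assumed values: one placeholder per image for the LLaVA-like case,
# no placeholders at all for the Fuyu-like case.
llava_like = dummy_inputs_sketch(2, (336, 336), "<image>")
fuyu_like = dummy_inputs_sketch(2, (1920, 1080), "")
```

In both cases the dummy images use the size with the most features, so the profiled memory usage corresponds to the worst case.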

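4. Specify processing details

Afterwards, create a subclass of BaseMultiModalProcessor to fill in the missing details about HF processing.

See also

Multi-Modal Data Processing

Multi-modal fields

Override _get_mm_fields_config() to return a schema of the tensors output by the HF processor that are related to the input multi-modal items.

For LLaVA, the output of CLIPImageProcessor is a simple tensor with shape (num_images, num_channels, image_height, image_width):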

# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/image_processing_clip.py#L339-L345
images = [
    to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
    for image in all_images
]

data = {"pixel_values": images}
return BatchFeature(data=data, tensor_type=return_tensors)

So, we override _get_mm_fields_config() as follows:

def _get_mm_fields_config(
    self,
    hf_inputs: BatchFeature,
    hf_processor_mm_kwargs: Mapping[str, object],
) -> Mapping[str, MultiModalFieldConfig]:
    return dict(
        pixel_values=MultiModalFieldConfig.batched("image"),
    )

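Note

Our actual code additionally supports pre-computed image embeddings, which can be passed to the model via the image_embeds argument.

For Fuyu, the image_patches output of FuyuImageProcessor.preprocess_with_tokenizer_info concatenates the patches from each image belonging to an item in the batch: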

# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L673-L679
        image_input_ids.append(tensor_of_image_ids)
        image_patches.append(patches)
    else:
        image_input_ids.append(torch.tensor([], dtype=torch.int32, device=image_input.device))

batch_image_input_ids.append(image_input_ids)
batch_image_patches.append(image_patches)

The shape of image_patches output by FuyuImageProcessor is therefore (1, num_images, num_patches, patch_width * patch_height * num_channels).

In order to support the use of MultiModalFieldConfig.batched() like in LLaVA, we remove the extra batch dimension by overriding BaseMultiModalProcessor._call_hf_processor():

def _call_hf_processor(
    self,
    prompt: str,
    mm_data: Mapping[str, object],
    mm_kwargs: Mapping[str, object],
) -> BatchFeature:
    processed_outputs = super()._call_hf_processor(
        prompt=prompt,
        mm_data=mm_data,
        mm_kwargs=mm_kwargs,
    )

    image_patches = processed_outputs.get("image_patches")
    if image_patches is not None:
        images = mm_data["images"]
        assert isinstance(images, list)

        # Original output: (1, num_images, Pn, Px * Py * C)
        # New output: (num_images, Pn, Px * Py * C)
        assert (isinstance(image_patches, list)
                and len(image_patches) == 1)
        assert (isinstance(image_patches[0], torch.Tensor)
                and len(image_patches[0]) == len(images))

        processed_outputs["image_patches"] = image_patches[0]

    return processed_outputs

Note

Our actual code has special handling for text-only inputs to avoid unnecessary warnings from the HF processor.

This lets us override _get_mm_fields_config() as follows:

def _get_mm_fields_config(
    self,
    hf_inputs: BatchFeature,
    hf_processor_mm_kwargs: Mapping[str, object],
) -> Mapping[str, MultiModalFieldConfig]:
    return dict(image_patches=MultiModalFieldConfig.batched("image"))
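To illustrate why the extra dimension has to go, the sketch below uses nested lists in place of tensors. Once the outer batch dimension is dropped, indexing the field by item index yields exactly the data of that image, which is the property a batched field relies on. The helpers drop_batch_dim and item_data are hypothetical and only mimic the idea.

```python
from typing import List, Sequence

Patch = List[float]


def drop_batch_dim(image_patches: Sequence[Sequence[Sequence[Patch]]]):
    """(1, num_images, Pn, D) -> (num_images, Pn, D)."""
    if len(image_patches) != 1:
        raise ValueError("expected a single batch entry")
    return image_patches[0]


def item_data(batched_field: Sequence, item_idx: int):
    """With a batched field, item i is simply the i-th slice."""
    return batched_field[item_idx]


# Two images with 2 and 1 patches; each patch has D == 3 values.
hf_output = [[
    [[0.0, 0.0, 0.0], [0.1, 0.1, 0.1]],
    [[0.9, 0.9, 0.9]],
]]
per_image = drop_batch_dim(hf_output)
```

Without the unwrapping step, index 0 would return the patches of every image at once and index 1 would be out of range.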

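Prompt replacements

Override _get_prompt_replacements() to return a list of PromptReplacement instances.

Each PromptReplacement instance specifies a find-and-replace operation performed by the HF processor.

Looking at HF's LlavaProcessor: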

# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/processing_llava.py#L167-L170
prompt_strings = []
for sample in text:
    sample = sample.replace(self.image_token, self.image_token * num_image_tokens)
    prompt_strings.append(sample)

It simply repeats each input image_token a number of times equal to the number of placeholder feature tokens (num_image_tokens). Based on this, we override _get_prompt_replacements() as follows:

def _get_prompt_replacements(
    self,
    mm_items: MultiModalDataItems,
    hf_processor_mm_kwargs: Mapping[str, object],
    out_mm_kwargs: MultiModalKwargs,
) -> list[PromptReplacement]:
    hf_config = self.info.get_hf_config()
    image_token_id = hf_config.image_token_index

    def get_replacement(item_idx: int):
        images = mm_items.get_items("image", ImageProcessorItems)

        image_size = images.get_image_size(item_idx)
        num_image_tokens = self.info.get_num_image_tokens(
            image_width=image_size.width,
            image_height=image_size.height,
        )

        return [image_token_id] * num_image_tokens

    return [
        PromptReplacement(
            modality="image",
            target=[image_token_id],
            replacement=get_replacement,
        ),
    ]
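The effect of such a replacement on a tokenized prompt can be sketched in a few lines of plain Python: the i-th occurrence of the target token is swapped for the replacement computed for item i. The helper apply_replacement_sketch and the token IDs are assumptions made for this example; vLLM's own matching logic is more general.

```python
from typing import Callable, List, Sequence


def apply_replacement_sketch(
    prompt_ids: Sequence[int],
    target_id: int,
    get_replacement: Callable[[int], List[int]],
) -> List[int]:
    """Replace the i-th occurrence of `target_id` with `get_replacement(i)`."""
    out: List[int] = []
    item_idx = 0
    for token in prompt_ids:
        if token == target_id:
            out.extend(get_replacement(item_idx))
            item_idx += 1
        else:
            out.append(token)
    return out


IMAGE_TOKEN_ID = 32000      # hypothetical placeholder ID
tokens_per_image = [3, 2]   # hypothetical per-image token counts
prompt = [1, IMAGE_TOKEN_ID, 15, IMAGE_TOKEN_ID, 16]
expanded = apply_replacement_sketch(
    prompt,
    IMAGE_TOKEN_ID,
    lambda i: [IMAGE_TOKEN_ID] * tokens_per_image[i],
)
```

Passing a callable rather than a fixed list is what allows the replacement length to depend on the size of each individual image.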

For Fuyu, recall the layout of feature tokens from Step 2:

|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
...
|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|

We define a helper function to return ncols and nrows directly:

def get_image_feature_grid_size(
    self,
    *,
    image_width: int,
    image_height: int,
) -> tuple[int, int]:
    image_processor = self.get_image_processor()
    target_width = image_processor.size["width"]
    target_height = image_processor.size["height"]
    patch_width = image_processor.patch_size["width"]
    patch_height = image_processor.patch_size["height"]

    if not (image_width <= target_width and image_height <= target_height):
        height_scale_factor = target_height / image_height
        width_scale_factor = target_width / image_width
        optimal_scale_factor = min(height_scale_factor, width_scale_factor)

        image_height = int(image_height * optimal_scale_factor)
        image_width = int(image_width * optimal_scale_factor)

    ncols = math.ceil(image_width / patch_width)
    nrows = math.ceil(image_height / patch_height)
    return ncols, nrows

Based on this, we can initially define our replacement tokens as:

def get_replacement(item_idx: int):
    images = mm_items.get_items("image", ImageProcessorItems)
    image_size = images.get_image_size(item_idx)

    ncols, nrows = self.info.get_image_feature_grid_size(
        image_width=image_size.width,
        image_height=image_size.height,
    )

    # `_IMAGE_TOKEN_ID` corresponds to `|SPEAKER|`
    # `_NEWLINE_TOKEN_ID` corresponds to `|NEWLINE|`
    return ([_IMAGE_TOKEN_ID] * ncols + [_NEWLINE_TOKEN_ID]) * nrows

However, this is not entirely correct. After FuyuImageProcessor.preprocess_with_tokenizer_info is called, a BOS token (<s>) is also added to the prompt:

# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L435
model_image_input = self.image_processor.preprocess_with_tokenizer_info(
    image_input=tensor_batch_images,
    image_present=image_present,
    image_unpadded_h=image_unpadded_heights,
    image_unpadded_w=image_unpadded_widths,
    image_placeholder_id=image_placeholder_id,
    image_newline_id=image_newline_id,
    variable_sized=True,
)
prompt_tokens, prompts_length = _tokenize_prompts_with_image_and_batch(
    tokenizer=self.tokenizer,
    prompts=prompts,
    scale_factors=scale_factors,
    max_tokens_to_generate=self.max_tokens_to_generate,
    max_position_embeddings=self.max_position_embeddings,
    add_BOS=True,
    add_beginning_of_answer_token=True,
)

To accommodate this, instead of a plain list of tokens you can return an instance of PromptReplacementDetails with different full and features attributes:

hf_config = self.info.get_hf_config()
bos_token_id = hf_config.bos_token_id  # `<s>`
assert isinstance(bos_token_id, int)

def get_replacement_fuyu(item_idx: int):
    images = mm_items.get_items("image", ImageProcessorItems)
    image_size = images.get_image_size(item_idx)

    ncols, nrows = self.info.get_image_feature_grid_size(
        image_width=image_size.width,
        image_height=image_size.height,
    )
    image_tokens = ([_IMAGE_TOKEN_ID] * ncols +
                    [_NEWLINE_TOKEN_ID]) * nrows

    return PromptReplacementDetails(
        full=image_tokens + [bos_token_id],
        features=image_tokens,
    )
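The distinction between the two attributes can be shown with a small stand-in class: full is everything inserted into the prompt, while features is the part that should receive vision embeddings. ReplacementDetailsSketch and the token IDs below are hypothetical and only mirror the two fields used above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplacementDetailsSketch:
    """Hypothetical stand-in for PromptReplacementDetails."""
    full: List[int]      # everything inserted into the prompt
    features: List[int]  # the part that receives vision embeddings


# Hypothetical token IDs, chosen only for illustration.
IMAGE_ID, NEWLINE_ID, BOS_ID = 100, 101, 1

ncols, nrows = 2, 2
image_tokens = ([IMAGE_ID] * ncols + [NEWLINE_ID]) * nrows
details = ReplacementDetailsSketch(
    full=image_tokens + [BOS_ID],
    features=image_tokens,
)
# BOS is inserted into the prompt but is not a feature token.
extra_tokens = details.full[len(details.features):]
```

Keeping the BOS token out of features means no vision embedding is assigned to it, even though it is part of the inserted text.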

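Finally, noticing that the HF processor removes the |ENDOFTEXT| token from the tokenized prompt, we can search for it to perform the replacement at the start of the string: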

def _get_prompt_replacements(
    self,
    mm_items: MultiModalDataItems,
    hf_processor_mm_kwargs: Mapping[str, object],
    out_mm_kwargs: MultiModalKwargs,
) -> list[PromptReplacement]:
    hf_config = self.info.get_hf_config()
    bos_token_id = hf_config.bos_token_id
    assert isinstance(bos_token_id, int)

    tokenizer = self.info.get_tokenizer()
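    # ID of the token searched for as the replacement target (|ENDOFTEXT|);
    # this code reads it from the tokenizer's BOS token.
    eot_token_id = tokenizer.bos_token_id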
    assert isinstance(eot_token_id, int)

    def get_replacement_fuyu(item_idx: int):
        images = mm_items.get_items("image", ImageProcessorItems)
        image_size = images.get_image_size(item_idx)

        ncols, nrows = self.info.get_image_feature_grid_size(
            image_width=image_size.width,
            image_height=image_size.height,
        )
        image_tokens = ([_IMAGE_TOKEN_ID] * ncols +
                        [_NEWLINE_TOKEN_ID]) * nrows

        return PromptReplacementDetails(
            full=image_tokens + [bos_token_id],
            features=image_tokens,
        )

    return [
        PromptReplacement(
            modality="image",
            target=[eot_token_id],
            replacement=get_replacement_fuyu,
        )
    ]