NVIDIA Nemotron-Nano-12B-v2-VL User Guide
This guide describes how to run the Nemotron-Nano-12B-v2-VL family on the target acceleration stack.
Install vLLM
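The examples below assume a recent vLLM release with Nemotron VL support; as a minimal sketch, a standard pip install is usually enough (pin a specific version if your stack requires it):

pip install -U vllm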
Deploy Nemotron-Nano-12B-v2-VL
Server:
The following command launches an inference server on 1 GPU.
Note:
- The example uses the model in BF16 precision. We encourage you to try FP8 and NVFP4!
- You can set --max-model-len <length> (docs) to save memory. The model was trained at a context length of about 131K, but a smaller context also works unless the use case is long-context video.
- You can set --allowed-local-media-path <root directory> (docs) to restrict which local files are accessible.
Efficient Video Sampling (EVS)
- You can set --video-pruning-rate <ratio> to adjust video compression; for example, a rate of 0.75 prunes roughly 75% of the video tokens. Read more about EVS on arXiv.
export VLLM_VIDEO_LOADER_BACKEND=opencv
export CHECKPOINT_PATH="nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16"
export CUDA_VISIBLE_DEVICES=0
python3 -m vllm.entrypoints.openai.api_server \
--model ${CHECKPOINT_PATH} \
--trust-remote-code \
--media-io-kwargs '{"video": {"fps": 2, "num_frames": 128} }' \
--max-model-len 131072 \
--data-parallel-size 1 \
--port 5566 \
--allowed-local-media-path / \
--video-pruning-rate 0.75 \
--served-model-name "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16"
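Once the server is running, a quick sanity check (not part of the original guide) is to list the served models through the standard OpenAI-compatible endpoint:

curl http://127.0.0.1:5566/v1/models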
Client (bash):
curl -X 'POST' \
'http://127.0.0.1:5566/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
"messages": [{"role": "user", "content": [{"type": "text", "text": "Describe the video."}, {"type": "video_url", "video_url": {"url": "file:///path/to/video.mp4"}}]}]
}'
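Images go through the same endpoint with an image_url content part, mirroring the Python client below; a minimal sketch, with the file path as a placeholder:

curl -X 'POST' \
'http://127.0.0.1:5566/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
"messages": [{"role": "user", "content": [{"type": "text", "text": "Describe the image."}, {"type": "image_url", "image_url": {"url": "file:///path/to/image.jpg"}}]}]
}'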
Client (Python):
from openai import OpenAI
client = OpenAI(
base_url="https://:5566/v1",
api_key="<ignored>",
)
completion = client.chat.completions.create(
model="nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe the video."
},
{
"type": "video_url",
"video_url": {
"url": "file:///path/to/video.mp4"
}
}
]
}
],
)
print(completion.choices[0].message.content)
completion = client.chat.completions.create(
model="nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe the image."
},
{
"type": "image_url",
"image_url": {
"url": "file:///path/to/image.jpg"
}
}
]
}
],
)
print(completion.choices[0].message.content)
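If you would rather not expose the server's filesystem through --allowed-local-media-path, the OpenAI-compatible API also accepts inline base64 data URLs. A minimal sketch, not from the original guide (the image path is a placeholder):

import base64

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:5566/v1", api_key="<ignored>")

# Encode a local image so it travels inside the request body instead of
# being read from the server's local disk.
with open("/path/to/image.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the image."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                },
            ],
        }
    ],
)
print(completion.choices[0].message.content)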
vLLM LLM API
Note:
- The example uses the model in BF16 precision. We encourage you to try FP8 and NVFP4!
- You can set max_model_len=<length> (docs) to save memory. The model was trained at a context length of about 131K, but a smaller context also works unless the use case is long-context video.
- You can set allowed_local_media_path=<root directory> (docs) to restrict which local files are accessible.
Efficient Video Sampling (EVS)
- You can set video_pruning_rate=<ratio> to adjust video compression. Read more about EVS on arXiv.
Using an image path
from vllm import LLM, SamplingParams
model_path = "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16"
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe the image.",
},
{
"type": "image_url",
"image_url": {"url": f"file:///path/to/image.jpg"},
},
],
},
]
llm = LLM(
model_path,
trust_remote_code=True,
max_model_len=2**17, # 131,072
# '/' is too permissive and used for example; use a specific directory instead
allowed_local_media_path="/",
)
outputs = llm.chat(
messages,
sampling_params=SamplingParams(temperature=0, max_tokens=1024),
# configure the number of tiles from 1 to (default) 12
# note: for videos, the number of tiles must be 1
mm_processor_kwargs=dict(max_num_tiles=12),
)
for o in outputs:
print(o.outputs[0].text)
Using a video path
- See Efficient Video Sampling (EVS): it affects videos only and defines the fraction of video tokens to prune.
import os

os.environ["VLLM_VIDEO_LOADER_BACKEND"] = "opencv"

from vllm import LLM, SamplingParams

model_path = "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Describe the video.",
            },
            {
                "type": "video_url",
                "video_url": {"url": "file:///path/to/video.mp4"},
            },
        ],
    },
]

# Efficient Video Sampling (EVS): affects videos only, defines how much of the video tokens to prune
# To turn EVS off, use `video_pruning_rate = 0`
video_pruning_rate = 0.75

llm = LLM(
    model_path,
    trust_remote_code=True,
    video_pruning_rate=video_pruning_rate,
    max_model_len=2**17,  # 131,072
    # '/' is too permissive and used for example; use a specific directory instead
    allowed_local_media_path="/",
    media_io_kwargs=dict(video=dict(fps=2, num_frames=128)),
)

outputs = llm.chat(
    messages, sampling_params=SamplingParams(temperature=0, max_tokens=1024)
)

for o in outputs:
    print(o.outputs[0].text)
Using video tensors and custom sampling
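Note: this example reads frames with the decord library, which is not bundled with vLLM (typically installed via pip install decord; availability depends on your platform).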
from vllm import LLM, SamplingParams
import decord
import numpy as np
from transformers.video_utils import VideoMetadata
from transformers import AutoTokenizer
def sample_video_frames(video_path_local, fps=0, nframe=0, nframe_max=-1):
"""
Sample frames from a video and return them as a numpy array along with metadata.
Args:
video_path_local: Path to the video file
fps: Target frames per second for sampling (if > 0, uses fps-based sampling)
nframe: Number of frames to sample (used if fps <= 0)
nframe_max: Maximum number of frames to sample
Returns:
tuple: (images, metadata)
- images: A numpy array of the sampled frame images.
- metadata: VideoMetadata dataclass containing info about the sampled frames:
- total_num_frames: Number of sampled frames
- fps: Effective frame rate of the sampled frames
- duration: Duration covered by the sampled frames (in seconds)
- video_backend: Backend used for video processing ('opencv_dynamic')
"""
vid = decord.VideoReader(video_path_local)
total_frames = len(vid)
video_fps = vid.get_avg_fps()
total_duration = total_frames / max(1e-6, video_fps)
    # Choose how many frames to sample, either fps-based or as a fixed count
    if fps > 0:
        desired_frames = max(1, int(total_duration * fps))
    else:
        desired_frames = max(1, int(nframe) if nframe and nframe > 0 else 8)
    if nframe_max > 0 and desired_frames > nframe_max:
        desired_frames = nframe_max
    if desired_frames >= total_frames:
        indices = list(range(total_frames))
    elif desired_frames == 1:
        indices = [0]  # Always use first frame for single frame sampling
    else:
        # Generate evenly spaced indices and ensure uniqueness
        raw_indices = np.linspace(0, total_frames - 1, desired_frames)
        indices = list(np.unique(np.round(raw_indices).astype(int)))
images = vid.get_batch(indices).asnumpy()
# Calculate timestamps for each sampled frame
timestamps = [float(idx) / video_fps for idx in indices]
# Calculate metadata for the sampled frames
sampled_num_frames = len(indices)
# Duration is the time span from first to last frame
if len(timestamps) > 1:
sampled_duration = timestamps[-1] - timestamps[0]
sampled_fps = (
(sampled_num_frames - 1) / sampled_duration if sampled_duration > 0 else 1.0
)
else:
# Single frame case
sampled_duration = None
sampled_fps = None
metadata = VideoMetadata(
total_num_frames=sampled_num_frames,
fps=sampled_fps,
duration=sampled_duration,
frames_indices=indices,
video_backend="opencv_dynamic",
)
return images, metadata
def main():
model_path = "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16"
video_path = "/path/to/video.mp4"
examples = {
"8fps_max128frames": dict(fps=8, nframe_max=128),
"2fps": dict(fps=2),
"16frames": dict(nframe=16),
}
    examples = {
        k: sample_video_frames(video_path, **kwargs)
        for k, kwargs in examples.items()
    }
for k, (vid, meta) in examples.items():
print(f"key={k}, {vid.shape=}, {vid.max().item()=}, {vid.min().item()=}")
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe the video.",
},
# Note: we add a placeholder of type 'video' so that the tokenizer will insert <video> when it is tokenizing the prompt
{
"type": "video",
"text": None,
}
],
},
]
# Efficient Video Sampling (EVS): affects videos only, defines how much of the video tokens to prune
# To turn EVS off, use `video_pruning_rate = 0`
video_pruning_rate = 0.75
llm = LLM(
model_path,
trust_remote_code=True,
video_pruning_rate=video_pruning_rate,
max_model_len=2**17, # 131,072
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
prompt = tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
print(f"Prompt: {prompt}")
outputs = llm.generate(
[
{
"prompt": prompt,
"multi_modal_data": {"video": (vid, metadata)},
}
for (vid, metadata) in examples.values()
],
sampling_params=SamplingParams(temperature=0, max_tokens=1024),
)
for k, o in zip(examples.keys(), outputs):
print(k)
print(o.outputs[0].text)
print("-" * 10)
if __name__ == "__main__":
main()