
Ministral-3 Reasoning Usage Guide

This guide explains how to run the Ministral-3 Reasoning models, which ship with BF16 weights in 3 different sizes:

  • 3B: shared embedding and output layers.
  • 8B and 14B: separate embedding and output layers.

All of these models support vision and come with a large context window of up to 256k tokens.

Smaller models give faster inference at the cost of some performance. Pick the cost/performance trade-off that best fits your needs.

Install vLLM

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
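
To confirm the environment is ready, you can do a quick version check (a minimal sketch; any recent vLLM release that supports Ministral-3 should work):

# Print the installed vLLM version to confirm the package imports correctly
python -c "import vllm; print(vllm.__version__)"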

Run Ministral-3 Reasoning 3B or 8B on 1xH200

Due to their size, Ministral-3-3B-Reasoning-2512 and Ministral-3-8B-Reasoning-2512 can run on a single H200 GPU.

However, for users without this generation of GPUs, vLLM falls back to Marlin FP4, which lets you run the model quantized in NVFP4. You won't see a speedup compared to FP8 quantization, but you will still benefit from the memory savings.

Regarding performance on GB200, we observed a significant speedup, but a slight drop on vision datasets, likely because calibration was done mostly on text data.

A simple launch command is:

# For 3B use `vllm serve mistralai/Ministral-3-3B-Reasoning-2512`
vllm serve mistralai/Ministral-3-8B-Reasoning-2512 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral \
  --enable-auto-tool-choice --tool-call-parser mistral \
  --reasoning-parser mistral

Notes on key parameters

  • enable-auto-tool-choice: required when enabling tool use.
  • tool-call-parser mistral: required when enabling tool use.
  • reasoning-parser mistral: required when enabling reasoning.

Additional flags

  • You can set --max-model-len to save memory. It defaults to 262144, which is quite large but not needed for most scenarios.
  • You can set --max-num-batched-tokens to trade throughput against latency: higher values mean higher throughput but also higher latency (see the sketch after this list).
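
For example, a launch command that caps the context window and tunes the batching budget might look like the following sketch (the 32768 and 8192 values are purely illustrative, not recommendations):

# Illustrative values: adjust them to your memory budget and latency targets
vllm serve mistralai/Ministral-3-8B-Reasoning-2512 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral \
  --enable-auto-tool-choice --tool-call-parser mistral \
  --reasoning-parser mistral \
  --max-model-len 32768 \
  --max-num-batched-tokens 8192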

Run Ministral-3 Reasoning 14B on 2xH200

To take full advantage of Ministral-3-14B-Reasoning-2512's large-context capabilities, we recommend deploying it on 2xH200 GPUs. However, if you do not need the large context, you can fall back to a single GPU (a sketch of this fallback follows the flags below).

A simple launch command is:

vllm serve mistralai/Ministral-3-14B-Reasoning-2512 \
  --tensor-parallel-size 2 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral \
  --enable-auto-tool-choice --tool-call-parser mistral \
  --reasoning-parser mistral

Notes on key parameters

  • enable-auto-tool-choice: required when enabling tool use.
  • tool-call-parser mistral: required when enabling tool use.
  • reasoning-parser mistral: required when enabling reasoning.

Additional flags

  • You can set --max-model-len to save memory. It defaults to 262144, which is quite large but not needed for most scenarios.
  • You can set --max-num-batched-tokens to trade throughput against latency: higher values mean higher throughput but also higher latency.
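
If you do not need the full 256k context, a single-GPU fallback might look like the sketch below (the 65536 context length is an illustrative value; make sure it fits in your GPU's memory):

# Single-GPU fallback with a reduced context window (illustrative value)
vllm serve mistralai/Ministral-3-14B-Reasoning-2512 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral \
  --enable-auto-tool-choice --tool-call-parser mistral \
  --reasoning-parser mistral \
  --max-model-len 65536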

Model usage

Here we assume that the model mistralai/Ministral-3-14B-Reasoning-2512 has been deployed and is reachable at localhost on port 8000 (vLLM's default port).
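
Before running the examples below, you can check that the server is reachable; this minimal sketch simply lists the models exposed by vLLM's OpenAI-compatible API:

# List the served models to confirm the deployment at localhost:8000 is reachable
curl http://localhost:8000/v1/models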

Vision reasoning

Let's see if the Ministral-3 model knows when it should pick a fight!

from typing import Any

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.7
TOP_P = 0.95
MAX_TOK = 262144
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> dict[str, Any]:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()

    index_begin_think = system_prompt.find("[THINK]")
    index_end_think = system_prompt.find("[/THINK]")

    return {
        "role": "system",
        "content": [
            {"type": "text", "text": system_prompt[:index_begin_think]},
            {
                "type": "thinking",
                "thinking": system_prompt[
                    index_begin_think + len("[THINK]") : index_end_think
                ],
                "closed": True,
            },
            {
                "type": "text",
                "text": system_prompt[index_end_think + len("[/THINK]") :],
            },
        ],
    }


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

messages = [
    SYSTEM_PROMPT,
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]


stream = client.chat.completions.create(
    model=model,
    messages=messages,
    stream=True,
    temperature=TEMP,
    top_p=TOP_P,
    max_tokens=MAX_TOK,
)

print("client: Start streaming chat completions...:\n")
printed_reasoning_content = False
answer = []

for chunk in stream:
    reasoning_content = None
    content = None
    # Check the content is reasoning_content or content
    if hasattr(chunk.choices[0].delta, "reasoning_content"):
        reasoning_content = chunk.choices[0].delta.reasoning_content
    if hasattr(chunk.choices[0].delta, "content"):
        content = chunk.choices[0].delta.content

    if reasoning_content is not None:
        if not printed_reasoning_content:
            printed_reasoning_content = True
            print("Start reasoning:\n", end="", flush=True)
        print(reasoning_content, end="", flush=True)
    elif content is not None:
        # Extract and print the content
        if not reasoning_content and printed_reasoning_content:
            answer.extend(content)
        print(content, end="", flush=True)

if answer:
    print("\n\n=============\nAnswer\n=============\n")
    print("".join(answer))
else:
    print("\n\n=============\nNo Answer\n=============\n")
    print(
        "No answer was generated by the model, probably because the maximum number of tokens was reached."
    )

Now, let's have it do some math!

from typing import Any

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.7
TOP_P = 0.95
MAX_TOK = 262144
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> dict[str, Any]:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()

    index_begin_think = system_prompt.find("[THINK]")
    index_end_think = system_prompt.find("[/THINK]")

    return {
        "role": "system",
        "content": [
            {"type": "text", "text": system_prompt[:index_begin_think]},
            {
                "type": "thinking",
                "thinking": system_prompt[
                    index_begin_think + len("[THINK]") : index_end_think
                ],
                "closed": True,
            },
            {
                "type": "text",
                "text": system_prompt[index_end_think + len("[/THINK]") :],
            },
        ],
    }


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

image_url = "https://i.ytimg.com/vi/5Y3xLHeyKZU/hqdefault.jpg"

messages = [
    SYSTEM_PROMPT,
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Solve the equations. If they contain only numbers, use your calculator, else only think. Answer in the language of the image.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

stream = client.chat.completions.create(
    model=model,
    messages=messages,
    stream=True,
    temperature=TEMP,
    top_p=TOP_P,
    max_tokens=MAX_TOK,
)

print("client: Start streaming chat completions...:\n")
printed_reasoning_content = False
answer = []

for chunk in stream:
    reasoning_content = None
    content = None
    # Check the content is reasoning_content or content
    if hasattr(chunk.choices[0].delta, "reasoning_content"):
        reasoning_content = chunk.choices[0].delta.reasoning_content
    if hasattr(chunk.choices[0].delta, "content"):
        content = chunk.choices[0].delta.content

    if reasoning_content is not None:
        if not printed_reasoning_content:
            printed_reasoning_content = True
            print("Start reasoning:\n", end="", flush=True)
        print(reasoning_content, end="", flush=True)
    if content is not None:
        # Extract and print the content
        if not reasoning_content and printed_reasoning_content:
            answer.extend(content)
        print(content, end="", flush=True)

if answer:
    print("\n\n=============\nAnswer\n=============\n")
    print("".join(answer))
else:
    print("\n\n=============\nNo Answer\n=============\n")
    print(
        "No answer was generated by the model, probably because the maximum number of tokens was reached."
    )

Text-only requests

Let's do some more math, this time letting the model decide on its own how to arrive at the result.

from typing import Any
from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.7
TOP_P = 0.95
MAX_TOK = 262144
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> dict[str, Any]:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()

    index_begin_think = system_prompt.find("[THINK]")
    index_end_think = system_prompt.find("[/THINK]")

    return {
        "role": "system",
        "content": [
            {"type": "text", "text": system_prompt[:index_begin_think]},
            {
                "type": "thinking",
                "thinking": system_prompt[
                    index_begin_think + len("[THINK]") : index_end_think
                ],
                "closed": True,
            },
            {
                "type": "text",
                "text": system_prompt[index_end_think + len("[/THINK]") :],
            },
        ],
    }


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

query = "Use each number in 2,5,6,3 exactly once, along with any combination of +, -, ×, ÷ (and parentheses for grouping), to make the number 24."

messages = [
    SYSTEM_PROMPT,
    {"role": "user", "content": query}
]
stream = client.chat.completions.create(
  model=model,
  messages=messages,
  stream=True,
  temperature=TEMP,
  top_p=TOP_P,
  max_tokens=MAX_TOK,
)

print("client: Start streaming chat completions...:\n")
printed_reasoning_content = False
answer = []

for chunk in stream:
    reasoning_content = None
    content = None
    # Check the content is reasoning_content or content
    if hasattr(chunk.choices[0].delta, "reasoning_content"):
        reasoning_content = chunk.choices[0].delta.reasoning_content
    if hasattr(chunk.choices[0].delta, "content"):
        content = chunk.choices[0].delta.content

    if reasoning_content is not None:
        if not printed_reasoning_content:
            printed_reasoning_content = True
            print("Start reasoning:\n", end="", flush=True)
        print(reasoning_content, end="", flush=True)
    if content is not None:
        # Extract and print the content
        if not reasoning_content and printed_reasoning_content:
            answer.extend(content)
        print(content, end="", flush=True)

if answer:
    print("\n\n=============\nAnswer\n=============\n")
    print("".join(answer))
else:
    print("\n\n=============\nNo Answer\n=============\n")
    print("No answer was generated by the model, probably because the maximum number of tokens was reached.")