Ministral-3 Reasoning Usage Guide
This guide explains how to run the Ministral-3 Reasoning models, which come with BF16 weights in 3 different sizes.
- 3B: shared embedding and output layers.
- 8B and 14B: separate embedding and output layers.
All of these models support vision and have a large context window of up to 256k.
Smaller models give faster inference, possibly at some cost in quality. Pick the cost/performance trade-off that best fits your needs.
Install vLLM
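A minimal install sketch, assuming a recent vLLM release with Ministral-3 support (check the model cards on Hugging Face for the exact minimum version); the openai and huggingface_hub packages are used by the client examples later in this guide:

pip install --upgrade vllm
pip install openai huggingface_hub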
Running Ministral-3 Reasoning 3B or 8B on 1xH200
Given their size, Ministral-3-3B-Reasoning-2512 and Ministral-3-8B-Reasoning-2512 can run on a single H200 GPU (1xH200).
However, for users without this generation of GPUs, vLLM falls back to Marlin FP4, which still lets you run the model quantized in NVFP4. Compared to FP8 quantization you will not see a speedup, but you still benefit from the memory savings.
On GB200 we observe a significant speedup, but a slight quality drop on vision datasets, likely because calibration was done mostly on text data.
A simple launch command is:
# For 3B use `vllm serve mistralai/Ministral-3-3B-Reasoning-2512`
vllm serve mistralai/Ministral-3-8B-Reasoning-2512 \
--tokenizer_mode mistral --config_format mistral --load_format mistral \
--enable-auto-tool-choice --tool-call-parser mistral \
--reasoning-parser mistral
Notes on key parameters
- enable-auto-tool-choice: required to enable tool use.
- tool-call-parser mistral: required to enable tool use.
- reasoning-parser mistral: required to enable reasoning.
Additional flags
- You can set --max-model-len to save memory. By default it is set to 262144, which is quite large but not needed for most scenarios.
- You can set --max-num-batched-tokens to trade off throughput against latency: a higher value means higher throughput but also higher latency. See the example launch just below.
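For example, a sketch of the 8B launch with both flags set; the 131072 context length and 16384 batched-token budget are illustrative values, not recommendations:

vllm serve mistralai/Ministral-3-8B-Reasoning-2512 \
    --tokenizer_mode mistral --config_format mistral --load_format mistral \
    --enable-auto-tool-choice --tool-call-parser mistral \
    --reasoning-parser mistral \
    --max-model-len 131072 \
    --max-num-batched-tokens 16384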
Running Ministral-3 Reasoning 14B on 2xH200
To take full advantage of the large context capabilities of Ministral-3-14B-Reasoning-2512, we recommend deploying it on 2xH200 GPUs. However, if you do not need the large context, you can also fall back to a single GPU (see the sketch after the flag notes below).
A simple launch command is:
vllm serve mistralai/Ministral-3-14B-Reasoning-2512 \
--tensor-parallel-size 2 \
--tokenizer_mode mistral --config_format mistral --load_format mistral \
--enable-auto-tool-choice --tool-call-parser mistral \
--reasoning-parser mistral
Notes on key parameters
- enable-auto-tool-choice: required to enable tool use.
- tool-call-parser mistral: required to enable tool use.
- reasoning-parser mistral: required to enable reasoning.
Additional flags
- You can set --max-model-len to save memory. By default it is set to 262144, which is quite large but not needed for most scenarios.
- You can set --max-num-batched-tokens to trade off throughput against latency: a higher value means higher throughput but also higher latency.
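As mentioned above, if you do not need the full 256k context, the 14B model can also be served on a single GPU by dropping --tensor-parallel-size and lowering the context length. A sketch, assuming the reduced KV cache fits in a single H200's memory (the 65536 value is illustrative):

vllm serve mistralai/Ministral-3-14B-Reasoning-2512 \
    --tokenizer_mode mistral --config_format mistral --load_format mistral \
    --enable-auto-tool-choice --tool-call-parser mistral \
    --reasoning-parser mistral \
    --max-model-len 65536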
Model usage
Here we assume that the model mistralai/Ministral-3-14B-Reasoning-2512 has been deployed and is reachable at localhost on port 8000 (vLLM's default port).
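Before running the examples, a quick sanity check that the server is reachable and which model id it serves (assuming the default port and no API key configured):

curl http://localhost:8000/v1/models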
Vision reasoning
Let's see whether the Ministral-3 model knows when to pick a fight!
from typing import Any
from openai import OpenAI
from huggingface_hub import hf_hub_download
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
TEMP = 0.7
TOP_P = 0.95
MAX_TOK = 262144
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id
def load_system_prompt(repo_id: str, filename: str) -> dict[str, Any]:
file_path = hf_hub_download(repo_id=repo_id, filename=filename)
with open(file_path, "r") as file:
system_prompt = file.read()
index_begin_think = system_prompt.find("[THINK]")
index_end_think = system_prompt.find("[/THINK]")
return {
"role": "system",
"content": [
{"type": "text", "text": system_prompt[:index_begin_think]},
{
"type": "thinking",
"thinking": system_prompt[
index_begin_think + len("[THINK]") : index_end_think
],
"closed": True,
},
{
"type": "text",
"text": system_prompt[index_end_think + len("[/THINK]") :],
},
],
}
SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")
image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"
messages = [
SYSTEM_PROMPT,
{
"role": "user",
"content": [
{
"type": "text",
"text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
},
{"type": "image_url", "image_url": {"url": image_url}},
],
},
]
stream = client.chat.completions.create(
model=model,
messages=messages,
stream=True,
temperature=TEMP,
top_p=TOP_P,
max_tokens=MAX_TOK,
)
print("client: Start streaming chat completions...:\n")
printed_reasoning_content = False
answer = []
for chunk in stream:
reasoning_content = None
content = None
# Check the content is reasoning_content or content
if hasattr(chunk.choices[0].delta, "reasoning_content"):
reasoning_content = chunk.choices[0].delta.reasoning_content
if hasattr(chunk.choices[0].delta, "content"):
content = chunk.choices[0].delta.content
if reasoning_content is not None:
if not printed_reasoning_content:
printed_reasoning_content = True
print("Start reasoning:\n", end="", flush=True)
print(reasoning_content, end="", flush=True)
elif content is not None:
# Extract and print the content
if not reasoning_content and printed_reasoning_content:
answer.extend(content)
print(content, end="", flush=True)
if answer:
print("\n\n=============\nAnswer\n=============\n")
print("".join(answer))
else:
print("\n\n=============\nNo Answer\n=============\n")
print(
"No answer was generated by the model, probably because the maximum number of tokens was reached."
)
Now, let's have it do some math!
from typing import Any
from openai import OpenAI
from huggingface_hub import hf_hub_download
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
TEMP = 0.7
TOP_P = 0.95
MAX_TOK = 262144
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id
def load_system_prompt(repo_id: str, filename: str) -> dict[str, Any]:
file_path = hf_hub_download(repo_id=repo_id, filename=filename)
with open(file_path, "r") as file:
system_prompt = file.read()
index_begin_think = system_prompt.find("[THINK]")
index_end_think = system_prompt.find("[/THINK]")
return {
"role": "system",
"content": [
{"type": "text", "text": system_prompt[:index_begin_think]},
{
"type": "thinking",
"thinking": system_prompt[
index_begin_think + len("[THINK]") : index_end_think
],
"closed": True,
},
{
"type": "text",
"text": system_prompt[index_end_think + len("[/THINK]") :],
},
],
}
SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")
image_url = "https://i.ytimg.com/vi/5Y3xLHeyKZU/hqdefault.jpg"
messages = [
SYSTEM_PROMPT,
{
"role": "user",
"content": [
{
"type": "text",
"text": "Solve the equations. If they contain only numbers, use your calculator, else only think. Answer in the language of the image.",
},
{"type": "image_url", "image_url": {"url": image_url}},
],
},
]
stream = client.chat.completions.create(
model=model,
messages=messages,
stream=True,
temperature=TEMP,
top_p=TOP_P,
max_tokens=MAX_TOK,
)
print("client: Start streaming chat completions...:\n")
printed_reasoning_content = False
answer = []
for chunk in stream:
reasoning_content = None
content = None
# Check the content is reasoning_content or content
if hasattr(chunk.choices[0].delta, "reasoning_content"):
reasoning_content = chunk.choices[0].delta.reasoning_content
if hasattr(chunk.choices[0].delta, "content"):
content = chunk.choices[0].delta.content
if reasoning_content is not None:
if not printed_reasoning_content:
printed_reasoning_content = True
print("Start reasoning:\n", end="", flush=True)
print(reasoning_content, end="", flush=True)
if content is not None:
# Extract and print the content
if not reasoning_content and printed_reasoning_content:
answer.extend(content)
print(content, end="", flush=True)
if answer:
print("\n\n=============\nAnswer\n=============\n")
print("".join(answer))
else:
print("\n\n=============\nNo Answer\n=============\n")
print(
"No answer was generated by the model, probably because the maximum number of tokens was reached."
)
Text-only requests
Let's do some more math, and let the model decide on its own how to reach the result.
from typing import Any
from openai import OpenAI
from huggingface_hub import hf_hub_download
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
TEMP = 0.7
TOP_P = 0.95
MAX_TOK = 262144
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id
def load_system_prompt(repo_id: str, filename: str) -> dict[str, Any]:
file_path = hf_hub_download(repo_id=repo_id, filename=filename)
with open(file_path, "r") as file:
system_prompt = file.read()
index_begin_think = system_prompt.find("[THINK]")
index_end_think = system_prompt.find("[/THINK]")
return {
"role": "system",
"content": [
{"type": "text", "text": system_prompt[:index_begin_think]},
{
"type": "thinking",
"thinking": system_prompt[
index_begin_think + len("[THINK]") : index_end_think
],
"closed": True,
},
{
"type": "text",
"text": system_prompt[index_end_think + len("[/THINK]") :],
},
],
}
SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")
query = "Use each number in 2,5,6,3 exactly once, along with any combination of +, -, ×, ÷ (and parentheses for grouping), to make the number 24."
messages = [
SYSTEM_PROMPT,
{"role": "user", "content": query}
]
stream = client.chat.completions.create(
model=model,
messages=messages,
stream=True,
temperature=TEMP,
top_p=TOP_P,
max_tokens=MAX_TOK,
)
print("client: Start streaming chat completions...:\n")
printed_reasoning_content = False
answer = []
for chunk in stream:
reasoning_content = None
content = None
# Check the content is reasoning_content or content
if hasattr(chunk.choices[0].delta, "reasoning_content"):
reasoning_content = chunk.choices[0].delta.reasoning_content
if hasattr(chunk.choices[0].delta, "content"):
content = chunk.choices[0].delta.content
if reasoning_content is not None:
if not printed_reasoning_content:
printed_reasoning_content = True
print("Start reasoning:\n", end="", flush=True)
print(reasoning_content, end="", flush=True)
if content is not None:
# Extract and print the content
if not reasoning_content and printed_reasoning_content:
answer.extend(content)
print(content, end="", flush=True)
if answer:
print("\n\n=============\nAnswer\n=============\n")
print("".join(answer))
else:
print("\n\n=============\nNo Answer\n=============\n")
print("No answer was generated by the model, probably because the maximum number of tokens was reached.")