Mistral-Large-3 Instruct Usage Guide
This guide explains how to run Mistral-Large-3-675B-Instruct-2512 with NVFP4 and FP8 weights.

Links to the different formats:

- FP8 (context lengths up to 256k): mistralai/Mistral-Large-3-675B-Instruct-2512
- NVFP4 (for < 64k context lengths): mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4
Install vLLM
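The examples below assume a recent vLLM release with the OpenAI-compatible server. A typical installation into a fresh virtual environment looks like the following; upgrade if vLLM is already installed, since support for new models lands in recent versions:

pip install --upgrade vllm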
Running the Model
Running Mistral-Large-3-Instruct FP8 on 8xH200
The FP8 format of Mistral-Large-3-Instruct fits on a single 8xH200 node. If you plan to fine-tune the model, we recommend this format, as it can be more accurate than NVFP4 in some cases.

A simple launch command is:
vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
--tensor-parallel-size 8 \
--tokenizer_mode mistral --config_format mistral --load_format mistral \
--enable-auto-tool-choice --tool-call-parser mistral
Notes on key flags

- enable-auto-tool-choice: required to enable tool use.
- tool-call-parser mistral: required to enable tool use.
Additional flags

- You can set --max-model-len to save memory. By default it is set to 262144, which is quite large and not needed for most scenarios.
- You can set --max-num-batched-tokens to trade throughput against latency: higher values mean higher throughput but also higher latency. Both flags are combined in the example below.
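As an illustration, here is the same launch with both flags set; the specific values (a 64k context and 8192 batched tokens) are placeholders to adapt to your workload, not tuned recommendations:

vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512 \
    --tensor-parallel-size 8 \
    --tokenizer_mode mistral --config_format mistral --load_format mistral \
    --enable-auto-tool-choice --tool-call-parser mistral \
    --max-model-len 65536 \
    --max-num-batched-tokens 8192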
Running Mistral-Large-3-Instruct NVFP4 on 4xB200
If you plan to deploy Mistral-Large-3, we recommend this format, as it achieves performance similar to FP8 with a smaller memory footprint. Note, however, that we observed a performance drop for large contexts (> 64k); in that case, use the FP8 weights. Otherwise, on B200 (Blackwell 200) we observed significant speedups, along with a slight performance drop on vision datasets, likely because calibration was performed mostly on text data.

To take full advantage of the NVFP4 format, consider using B200 GPUs, whose dedicated architecture maximizes its performance. For users without access to this GPU generation, vLLM can gracefully fall back to Marlin FP4, letting you run the quantized model on older generations (A100, H100). You will not see a speedup over FP8 quantization, but you still get the memory savings.

A simple launch command is:
vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4 \
--tensor-parallel-size 4 \
--tokenizer_mode mistral --config_format mistral --load_format mistral \
--enable-auto-tool-choice --tool-call-parser mistral
Notes on key flags

- enable-auto-tool-choice: required to enable tool use.
- tool-call-parser mistral: required to enable tool use.
Additional flags

- You can set --max-model-len to save memory. By default it is set to 262144, which is quite large and not needed for most scenarios.
- You can set --max-num-batched-tokens to trade throughput against latency: higher values mean higher throughput but also higher latency.
- You can set --limit-mm-per-prompt.image 0 to skip loading the vision encoder, freeing extra KV-cache space when the model serves text-only tasks. All three flags are combined in the example below.
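As an illustration, a text-only deployment combining all three flags might look like this; the numeric values are placeholders, not tuned recommendations:

vllm serve mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4 \
    --tensor-parallel-size 4 \
    --tokenizer_mode mistral --config_format mistral --load_format mistral \
    --enable-auto-tool-choice --tool-call-parser mistral \
    --max-model-len 32768 \
    --max-num-batched-tokens 8192 \
    --limit-mm-per-prompt.image 0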
Model Usage

Here we assume that the model mistralai/Mistral-Large-3-675B-Instruct-2512 is being served and reachable at localhost on port 8000, vLLM's default port.
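As a quick sanity check, a minimal sketch like the one below (assuming the default endpoint above) confirms the server is reachable and shows the served model id:

from openai import OpenAI

# vLLM's OpenAI-compatible server ignores the API key by default, so any placeholder works.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# The served model id should appear in this list.
print([model.id for model in client.models.list().data])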
Vision Reasoning

Let's see whether the ML3 model knows when to jump into a battle!
from datetime import datetime, timedelta
from openai import OpenAI
from huggingface_hub import hf_hub_download
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "https://:8000/v1"
TEMP = 0.15
MAX_TOK = 262144
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id
def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)
SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")
image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]
response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=TEMP,
    max_tokens=MAX_TOK,
)
print(response.choices[0].message.content)
Function Calling

Let's solve a few equations with the help of our simple Python calculator tool.
import json
from openai import OpenAI
from huggingface_hub import hf_hub_download
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "https://:8000/v1"
TEMP = 0.15
MAX_TOK = 262144
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id
def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    return system_prompt
SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")
image_url = "https://math-coaching.com/img/fiche/46/expressions-mathematiques.jpg"
def my_calculator(expression: str) -> str:
    # WARNING: Using eval() with untrusted input is a security risk. For production, use a safer expression evaluator.
    return str(eval(expression))
tools = [
    {
        "type": "function",
        "function": {
            "name": "my_calculator",
            "description": "A calculator that can evaluate a mathematical equation and compute its results.",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "The mathematical expression to evaluate.",
                    },
                },
                "required": ["expression"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "rewrite",
            "description": "Rewrite a given text for improved clarity",
            "parameters": {
                "type": "object",
                "properties": {
                    "text": {
                        "type": "string",
                        "description": "The input text to rewrite",
                    }
                },
            },
        },
    },
]
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Thanks to your calculator, compute the results for the equations that involve numbers displayed in the image.",
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": image_url,
                },
            },
        ],
    },
]
response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=TEMP,
    max_tokens=MAX_TOK,
    tools=tools,
    tool_choice="auto",
)
tool_calls = response.choices[0].message.tool_calls
results = []
for tool_call in tool_calls:
    function_name = tool_call.function.name
    function_args = tool_call.function.arguments
    # Only my_calculator is executed locally; note that if the model calls
    # another tool, the zip below would pair results with the wrong calls.
    if function_name == "my_calculator":
        result = my_calculator(**json.loads(function_args))
        results.append(result)
messages.append({"role": "assistant", "tool_calls": tool_calls})
for tool_call, result in zip(tool_calls, results):
    messages.append(
        {
            "role": "tool",
            "tool_call_id": tool_call.id,
            "name": tool_call.function.name,
            "content": result,
        }
    )
response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=TEMP,
    max_tokens=MAX_TOK,
)
print(response.choices[0].message.content)
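As the warning in my_calculator notes, calling eval() on model-generated input is risky. One hedged alternative is a minimal arithmetic-only evaluator built on Python's ast module; the names safe_calculator and _eval_node below are ours for illustration, not part of the guide:

import ast
import operator

# Whitelisted operators: anything outside this table is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def _eval_node(node: ast.AST) -> float:
    # Accept plain numeric literals.
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    # Accept binary operations built from whitelisted operators.
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    # Accept unary negation.
    if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError(f"Unsupported expression: {ast.dump(node)}")

def safe_calculator(expression: str) -> str:
    # Parse in "eval" mode so only a single expression is accepted.
    tree = ast.parse(expression, mode="eval")
    return str(_eval_node(tree.body))

Swapping safe_calculator in for my_calculator keeps the tool's interface identical while rejecting anything that is not plain arithmetic.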
Text-only Requests

Mistral-Large-3 can follow your instructions to the letter.
from openai import OpenAI
from huggingface_hub import hf_hub_download
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "https://:8000/v1"
TEMP = 0.15
MAX_TOK = 262144
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id
def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    return system_prompt
SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": "Write me a sentence where every word starts with the next letter in the alphabet - start with 'a' and end with 'z'.",
    },
]
response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=TEMP,
    max_tokens=MAX_TOK,
)
assistant_message = response.choices[0].message.content
print(assistant_message)