Seed-OSS-36B Usage Guide

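This guide describes how to run the Seed-OSS-36B model with vLLM in native BF16 precision. Seed-OSS offers a distinctive "thinking budget" feature for controlled reasoning and supports context lengths of up to 512K.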

Installing vLLM

Seed-OSS support was recently added to the vLLM main branch but is not yet included in any official release.

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

You may also need to install a recent version of transformers to ensure compatibility.

uv pip install git+https://github.com/huggingface/transformers.git@56d68c6706ee052b445e1e476056ed92ac5eb383

Running Seed-OSS-36B with BF16

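There are two ways to parallelize the model across multiple GPUs: (1) tensor parallelism or (2) data parallelism. Each has its own advantages: tensor parallelism usually suits low-latency, low-load scenarios, while data parallelism works better under heavy load with large volumes of requests.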

Run with tensor parallelism as follows:

vllm serve ByteDance-Seed/Seed-OSS-36B-Instruct \
    --host localhost \
    --port 8000 \
    --tensor-parallel-size 8 \
    --enable-auto-tool-choice \
    --tool-call-parser seed_oss

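You can set --max-model-len to save memory. --max-model-len=65536 works well for most scenarios; the maximum is 512k.
You can set --max-num-batched-tokens to balance throughput and latency: higher values give higher throughput but also higher latency. --max-num-batched-tokens=32768 usually works well for prompt-heavy workloads, but you can reduce it to 16k or 8k to cut activation memory usage and lower latency.
vLLM conservatively uses 90% of GPU memory; you can set --gpu-memory-utilization=0.95 to maximize the KV cache.
Be sure to follow the command-line instructions above so that tool calling is enabled correctly.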
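
The effect of --gpu-memory-utilization can be sketched with some back-of-the-envelope arithmetic. The snippet below is only a rough estimate under stated assumptions: the 36e9 parameter count is the nominal figure implied by the model name, the 24 GiB per-GPU memory is a hypothetical value, and the calculation ignores everything except weights, so it does not reproduce vLLM's actual memory accounting.

```python
GIB = 1024 ** 3  # bytes per GiB

# Assumptions for illustration only (not stated in this guide):
NUM_PARAMS = 36e9          # nominal parameter count implied by "36B"
GPU_MEMORY_GIB = 24.0      # hypothetical per-GPU memory
TENSOR_PARALLEL_SIZE = 8   # matches --tensor-parallel-size in the command above
BYTES_PER_PARAM = 2        # BF16 stores each parameter in 2 bytes


def weights_per_gpu_gib(num_params, tp_size, bytes_per_param=BYTES_PER_PARAM):
    """Approximate weight memory per GPU, assuming weights shard evenly."""
    return num_params * bytes_per_param / tp_size / GIB


def remaining_budget_gib(gpu_memory_gib, utilization, num_params, tp_size):
    """Memory vLLM may use on one GPU, minus the sharded weights."""
    usable = gpu_memory_gib * utilization
    return usable - weights_per_gpu_gib(num_params, tp_size)


for utilization in (0.90, 0.95):
    budget = remaining_budget_gib(
        GPU_MEMORY_GIB, utilization, NUM_PARAMS, TENSOR_PARALLEL_SIZE
    )
    print(
        f"--gpu-memory-utilization={utilization:.2f}: "
        f"~{budget:.1f} GiB left for KV cache and activations"
    )
```

Under these assumptions, raising the setting from 0.90 to 0.95 frees an extra 5% of each GPU's memory, and lowering --max-num-batched-tokens leaves more of that remainder for the KV cache.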

Thinking Budget Feature

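Users can flexibly specify the model's thinking budget. For simpler tasks (such as IFEval), the model's chain of thought (CoT) is shorter, and scores fluctuate as the thinking budget increases. For more challenging tasks (such as AIME and LiveCodeBench), the CoT is longer, and scores improve as the thinking budget increases.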

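If no thinking budget is set (the default mode), Seed-OSS thinks with unlimited length. If a thinking budget is specified, prefer integer multiples of 512 (for example, 512, 1K, 2K, 4K, 8K, or 16K), because the model has been extensively trained on these intervals. With a thinking budget of 0, the model is instructed to respond directly, and we recommend setting any budget below 512 to 0.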
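
The recommendation above can be applied on the client side before a request is sent. The helper below is a minimal sketch, not part of vLLM or Seed-OSS: the name normalize_thinking_budget is hypothetical, and rounding down to the nearest multiple of 512 is an assumption made here (the guidance only says to prefer multiples of 512 and to use 0 for anything below 512).

```python
from typing import Optional

BUDGET_STEP = 512  # the model is trained on budgets that are multiples of 512


def normalize_thinking_budget(requested: Optional[int]) -> Optional[int]:
    """Map a requested budget onto a recommended value.

    None means "do not set a budget" (default mode, unlimited thinking).
    Values below 512 become 0 (direct answer); larger values are rounded
    down to a multiple of 512. Rounding down is this sketch's choice.
    """
    if requested is None:
        return None
    if requested < 0:
        raise ValueError("thinking budget must be non-negative")
    if requested < BUDGET_STEP:
        return 0
    return (requested // BUDGET_STEP) * BUDGET_STEP


for value in (None, 100, 512, 1000, 4096):
    print(value, "->", normalize_thinking_budget(value))
```

When the result is None, omit thinking_budget from the request entirely; otherwise pass it as shown in the usage examples.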

Usage Examples

OpenAI Client Usage

You can use the OpenAI client as follows. To control the thinking budget, pass thinking_budget inside chat_template_kwargs via extra_body.

from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?"}
]
extra_body = {"chat_template_kwargs": {"thinking_budget": 512}}
response = client.chat.completions.create(
    model=model, messages=messages, extra_body=extra_body
)
content = response.choices[0].message.content
print("content:\n", content)

Example Output

thinking_budget = 512:

content:
 <seed:think>
Got it, let's try to figure out this problem step by step. First, the question is about Janet's ducks laying eggs, and we need to find out how much money she makes at the farmers' market each day. 
<seed:cot_budget_reflect>I have used 138 tokens, and there are 374 tokens remaining for use.</seed:cot_budget_reflect>
 Let's start by listing out the information given. 

First, her ducks lay 16 eggs per day. That's the total number of eggs she has each day, right? Then, she does a few things with these eggs: she eats three for breakfast every morning, bakes muffins with four every day, and sells the remainder at the farmers' market. Each of those sold eggs is $2, so we need to find the remainder first and then multiply by 2 to get the daily earnings.
<seed:cot_budget_reflect>I have used 260 tokens, and there are 252 tokens remaining for use.</seed:cot_budget_reflect>

Let me write that down. Total eggs: 16. Eggs used: eaten (3) plus muffins (4). So first, let's add up how many eggs she uses each day. 3 + 4 = 7 eggs used. Then the remainder is total eggs minus used eggs, so 16 - 7 = 9 eggs left to sell. Wait, is that right? Let me check again. 16 total, subtract 3 eaten, that's 13 left, then subtract 4 for muffins, that's 13 - 4 = 9.
<seed:cot_budget_reflect>I have used 395 tokens, and there are 117 tokens remaining for use.</seed:cot_budget_reflect>
 Yep, that's 9 eggs. Then she sells each for $2, so 9 times 2 is $18. That seems straightforward. Let me make sure I didn't miss anything. The problem says "daily," so we don't have to worry about anything over multiple days. Just one day: 16 eggs, use 3+4=7, sell 9, 9*2=18.
<seed:cot_budget_reflect>I have exhausted my token budget, and now I will start answering the question.</seed:cot_budget_reflect>
</seed:think>To determine how much Janet makes at the farmers' market daily, follow these steps:

### Step 1: Calculate total eggs laid daily  
Janet’s ducks lay **16 eggs per day**.

### Step 2: Calculate eggs used daily  
- She eats 3 eggs for breakfast.  
- She uses 4 eggs for muffins.  
Total eggs used = \(3 + 4 = 7\) eggs.

### Step 3: Find the number of eggs sold  
Remaining eggs = Total eggs - Eggs used = \(16 - 7 = 9\) eggs.

### Step 4: Calculate daily earnings  
She sells each egg for $2, so total earnings = \(9 \times 2 = 18\) dollars.

**Answer:** 18

thinking_budget = 0:

content:
 The current thinking budget is 0, so I will directly start answering the question.</seed:cot_budget_reflect>
</seed:think>To determine how much Janet makes daily at the farmers' market, follow these steps:

### Step 1: Calculate total eggs laid  
Janet’s ducks lay **16 eggs per day**.

### Step 2: Calculate eggs used  
- She eats 3 eggs for breakfast.  
- She uses 4 eggs for muffins.  
- Total eggs used: \(3 + 4 = 7\) eggs.  

### Step 3: Find remaining eggs for sale  
Subtract used eggs from total eggs:  
\(16 - 7 = 9\) eggs.  

### Step 4: Calculate daily earnings  
She sells each remaining egg for $2:  
\(9 \times 2 = 18\) dollars.  

**Answer:** 18
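
As the outputs above show, the returned content can include the reasoning wrapped in seed:think tags, interleaved with seed:cot_budget_reflect progress markers. The sketch below separates the reasoning from the final answer on the client side. The function name split_thinking is hypothetical, and the parsing is based only on the two sample outputs in this guide, so other output shapes may need extra handling.

```python
import re
from typing import Tuple

THINK_OPEN = "<seed:think>"
THINK_CLOSE = "</seed:think>"
# Complete budget-reflection markers, including their text.
REFLECT_BLOCK = re.compile(
    r"<seed:cot_budget_reflect>.*?</seed:cot_budget_reflect>", re.DOTALL
)
# Any leftover unpaired reflection tag (seen in the budget-0 sample).
REFLECT_TAG = re.compile(r"</?seed:cot_budget_reflect>")


def split_thinking(content: str) -> Tuple[str, str]:
    """Return (thinking, answer) extracted from a response string."""
    head, closing_tag, tail = content.partition(THINK_CLOSE)
    if not closing_tag:
        # No reasoning section found: treat everything as the answer.
        return "", content.strip()
    thinking = head.replace(THINK_OPEN, "", 1)
    thinking = REFLECT_BLOCK.sub("", thinking)
    thinking = REFLECT_TAG.sub("", thinking)
    return thinking.strip(), tail.strip()


sample = (
    "<seed:think>\nLet me think.\n"
    "<seed:cot_budget_reflect>I have used 10 tokens, and there are 502 "
    "tokens remaining for use.</seed:cot_budget_reflect>\n"
    "16 - 7 = 9, and 9 * 2 = 18.\n</seed:think>**Answer:** 18"
)
thinking, answer = split_thinking(sample)
print("thinking:", thinking)
print("answer:", answer)
```

Using str.partition on the closing tag keeps the helper working for the thinking_budget = 0 sample, where the content has a closing tag but no opening one.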

curl Usage

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ByteDance-Seed/Seed-OSS-36B-Instruct",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "chat_template_kwargs": {
        "thinking_budget": 512
    }
  }'
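
The same request can be sent from Python without extra dependencies. This is a minimal sketch using only the standard library: the helper names build_chat_request and send_chat_request are hypothetical, and the base URL assumes the server was started with --host localhost --port 8000 as in the serve command. Unlike the OpenAI client, a raw HTTP request places chat_template_kwargs at the top level of the JSON body, mirroring the curl call.

```python
import json
import urllib.request

# Assumed defaults, taken from the serve command in this guide.
BASE_URL = "http://localhost:8000/v1"
MODEL = "ByteDance-Seed/Seed-OSS-36B-Instruct"


def build_chat_request(prompt, thinking_budget, base_url=BASE_URL, model=MODEL):
    """Build a POST request equivalent to the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"thinking_budget": thinking_budget},
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=600.0):
    """Send the request and return the assistant message content."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Explain quantum computing", thinking_budget=512)
print(request.full_url)
print(request.data.decode("utf-8"))
```

With the server running, send_chat_request(request) returns the generated text; the snippet itself only builds and prints the request, so it can be inspected without a live server.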

Benchmarking

We benchmarked ByteDance-Seed/Seed-OSS-36B-Instruct on RTX 3090 GPU hardware using the following script.

vllm bench serve \
    --backend vllm \
    --model ByteDance-Seed/Seed-OSS-36B-Instruct \
    --endpoint /v1/completions \
    --host localhost \
    --port 8000 \
    --dataset-name random \
    --random-input-len 800 \
    --random-output-len 100 \
    --request-rate 2 \
    --num-prompts 100

Example output

============ Serving Benchmark Result ============
Successful requests:                     100       
Request rate configured (RPS):           2.00      
Benchmark duration (s):                  54.08     
Total input tokens:                      79934     
Total generated tokens:                  10000     
Request throughput (req/s):              1.85      
Output token throughput (tok/s):         184.92    
Total Token throughput (tok/s):          1663.06   
---------------Time to First Token----------------
Mean TTFT (ms):                          97.96     
Median TTFT (ms):                        99.71     
P99 TTFT (ms):                           128.60    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          44.39     
Median TPOT (ms):                        43.74     
P99 TPOT (ms):                           49.19     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.39     
Median ITL (ms):                         46.18     
P99 ITL (ms):                            64.52     
==================================================
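
The throughput figures in the report follow from the totals and the benchmark duration, which the short sketch below recomputes from the numbers printed above. Small differences from the reported values are expected because the printed duration is rounded. The end-to-end latency estimate at the end is an approximation added here (mean TTFT plus mean TPOT for each remaining output token); it is not a metric the benchmark prints.

```python
# Values copied from the benchmark report above.
successful_requests = 100
duration_s = 54.08
total_input_tokens = 79934
total_generated_tokens = 10000
mean_ttft_ms = 97.96
mean_tpot_ms = 44.39

# Throughput = work done / wall-clock duration.
request_throughput = successful_requests / duration_s
output_token_throughput = total_generated_tokens / duration_s
total_token_throughput = (
    total_input_tokens + total_generated_tokens
) / duration_s

# Rough per-request latency: first token, then one TPOT per later token.
output_tokens_per_request = total_generated_tokens / successful_requests
approx_latency_ms = mean_ttft_ms + (output_tokens_per_request - 1) * mean_tpot_ms

print(f"Request throughput (req/s):      {request_throughput:.2f}")
print(f"Output token throughput (tok/s): {output_token_throughput:.2f}")
print(f"Total token throughput (tok/s):  {total_token_throughput:.2f}")
print(f"Approx. latency per request (ms): {approx_latency_ms:.2f}")
```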