Distributed DP Server with Large-Scale Expert Parallelism#
Getting Started#
vLLM-Ascend now supports prefill-decode (PD) disaggregation in large-scale expert parallelism (EP) scenarios. To achieve better performance, vLLM-Ascend adopts a distributed DP server. In the PD disaggregation scenario, different optimization strategies can be applied according to the characteristics of the prefill and decode nodes, enabling more flexible model deployment.
Taking the DeepSeek model as an example, we deploy the model on 8 Atlas 800T A3 servers. Suppose the server IPs range from 192.0.0.1 to 192.0.0.8. The first 4 servers are used as prefill nodes and the last 4 as decode nodes. Each prefill node is deployed independently as its own master node, while the decode nodes use 192.0.0.5 as their master node.
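For reference, the node roles used throughout this guide can be written down as shell variables. This is only an illustrative sketch; the variable names are hypothetical and the IPs are the placeholder addresses above.
# Hypothetical topology variables for this guide (placeholder IPs)
PREFILL_NODES="192.0.0.1 192.0.0.2 192.0.0.3 192.0.0.4"  # each prefill node is its own master
DECODE_NODES="192.0.0.5 192.0.0.6 192.0.0.7 192.0.0.8"
DECODE_MASTER="192.0.0.5"  # all decode nodes share this master node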
Verify Multi-Node Communication Environment#
Physical Layer Requirements:#
The physical machines must be on the same local area network, with network connectivity between them.
All NPUs must be interconnected. For the Atlas A2 series, intra-node connections go through HCCS and inter-node connections through RDMA. For the Atlas A3 series, both intra-node and inter-node connections go through HCCS.
Verification Process:#
Single-node verification (Atlas A3 series)
Execute the following commands on each node in turn. All results must be success and all link statuses must be UP.
# Check the remote switch ports
for i in {0..15}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..15}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..15}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..15}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..15}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
Get the NPU IP addresses
for i in {0..15}; do hccn_tool -i $i -vnic -g;done
Get the superpodid and SDID
for i in {0..15}; do npu-smi info -t spod-info -i $i -c 0;npu-smi info -t spod-info -i $i -c 1;done
Cross-node ping test
# Execute on the target node (replace 'x.x.x.x' with actual npu ip address)
for i in {0..15}; do hccn_tool -i $i -hccs_ping -g address x.x.x.x;done
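To ping every remote NPU in one pass, you can loop over a list of peer NPU IPs. A minimal sketch, assuming the remote NPU IPs collected above are stored one per line in a hypothetical file npu_ips.txt:
# npu_ips.txt is a hypothetical file containing one remote NPU IP per line
while read -r ip; do
    for i in {0..15}; do hccn_tool -i $i -hccs_ping -g address "$ip"; done
done < npu_ips.txt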
Single-node verification (Atlas A2 series)
Execute the following commands on each node in turn. All results must be success and all link statuses must be UP.
# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
Get the NPU IP addresses
for i in {0..7}; do hccn_tool -i $i -ip -g;done
Cross-node ping test
# Execute on the target node (replace 'x.x.x.x' with actual npu ip address)
for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
Large-Scale EP Model Deployment#
Generate Scripts with Configurations#
In the PD disaggregation scenario, we provide an optimized configuration. You can use the following shell scripts to configure the prefill and decode nodes separately. The first script is for the prefill nodes and the second for the decode nodes.
#!/bin/sh
# run_dp_template.sh (for prefill nodes)
# nic_name and local_ip can be obtained through ifconfig;
# nic_name is the network interface name corresponding to local_ip
nic_name="xxxx"
local_ip="xxxx"
# basic configuration for HCCL and connection
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=256
# obtain parameters from distributed DP server
export VLLM_DP_SIZE=$1
export VLLM_DP_MASTER_IP=$2
export VLLM_DP_MASTER_PORT=$3
export VLLM_DP_RANK_LOCAL=$4
export VLLM_DP_RANK=$5
export VLLM_DP_SIZE_LOCAL=$7
# pytorch_npu settings and vllm settings
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export TASK_QUEUE_ENABLE=1
export VLLM_USE_MODELSCOPE="True"
# enable the distributed DP server
export VLLM_WORKER_MULTIPROC_METHOD="fork"
export VLLM_ASCEND_EXTERNAL_DP_LB_ENABLED=1
# The W8A8 weights can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8
# "--additional-config" is used to enable vllm-ascend-specific features
vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
--host 0.0.0.0 \
--port $6 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name deepseek_r1 \
--max-model-len 17000 \
--max-num-batched-tokens 16384 \
--trust-remote-code \
--max-num-seqs 4 \
--gpu-memory-utilization 0.9 \
--quantization ascend \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--enforce-eager \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_buffer_device": "npu",
"kv_role": "kv_producer",
"kv_parallel_size": "1",
"kv_port": "20001",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector"
}' \
--additional-config '{"enable_weight_nz_layout":true,"enable_prefill_optimizations":true}'
#!/bin/sh
# run_dp_template.sh (for decode nodes)
# nic_name and local_ip can be obtained through ifconfig;
# nic_name is the network interface name corresponding to local_ip
nic_name="xxxx"
local_ip="xxxx"
# basic configuration for HCCL and connection
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=1024
# obtain parameters from distributed DP server
export VLLM_DP_SIZE=$1
export VLLM_DP_MASTER_IP=$2
export VLLM_DP_MASTER_PORT=$3
export VLLM_DP_RANK_LOCAL=$4
export VLLM_DP_RANK=$5
export VLLM_DP_SIZE_LOCAL=$7
# pytorch_npu settings and vllm settings
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export TASK_QUEUE_ENABLE=1
export VLLM_USE_MODELSCOPE="True"
# enable the distributed DP server
export VLLM_WORKER_MULTIPROC_METHOD="fork"
export VLLM_ASCEND_EXTERNAL_DP_LB_ENABLED=1
# The W8A8 weights can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8
# "--additional-config" is used to enable vllm-ascend-specific features
vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
--host 0.0.0.0 \
--port $6 \
--tensor-parallel-size 1 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name deepseek_r1 \
--max-model-len 17000 \
--max-num-batched-tokens 256 \
--trust-remote-code \
--max-num-seqs 28 \
--gpu-memory-utilization 0.9 \
--quantization ascend \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_buffer_device": "npu",
"kv_role": "kv_consumer",
"kv_parallel_size": "1",
"kv_port": "20001",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector"
}' \
--additional-config '{"enable_weight_nz_layout":true}'
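Both templates take seven positional arguments, which the launcher in the next section fills in: dp_size, dp_master_ip, dp_master_port, dp_rank_local, dp_rank, engine_port, and dp_size_local. For reference, a manual invocation on a prefill node would look like the following sketch (values are illustrative):
# dp_size=2, master 192.0.0.1:13395, local rank 0, global rank 0, engine port 9000, dp_size_local=2
bash ./run_dp_template.sh 2 192.0.0.1 13395 0 0 9000 2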
Launch the Distributed DP Server for Prefill-Decode Disaggregation#
Execute the following Python file on all nodes to use the distributed DP server. (We recommend using this feature with the official v0.9.1 release.) The first launcher below is for the prefill nodes and the second for the decode nodes.
import multiprocessing
import os
import sys

dp_size = 2          # total number of DP engines for decode/prefill
dp_size_local = 2    # number of DP engines on the current node
dp_rank_start = 0    # starting DP rank for the current node
# dp_ip is different on prefiller nodes in this example
dp_ip = "192.0.0.1"  # master node ip for DP communication
dp_port = 13395      # port used for DP communication
engine_port = 9000   # starting port for all DP groups on the current node
template_path = "./run_dp_template.sh"

if not os.path.exists(template_path):
    print(f"Template file {template_path} does not exist.")
    sys.exit(1)

def run_command(dp_rank_local, dp_rank, engine_port_):
    command = f"bash ./run_dp_template.sh {dp_size} {dp_ip} {dp_port} {dp_rank_local} {dp_rank} {engine_port_} {dp_size_local}"
    os.system(command)

processes = []
for i in range(dp_size_local):
    dp_rank = dp_rank_start + i
    dp_rank_local = i
    engine_port_ = engine_port + i
    process = multiprocessing.Process(target=run_command, args=(dp_rank_local, dp_rank, engine_port_))
    processes.append(process)
    process.start()

for process in processes:
    process.join()
import multiprocessing
import os
import sys

dp_size = 64         # total number of DP engines for decode/prefill
dp_size_local = 16   # number of DP engines on the current node
dp_rank_start = 0    # starting DP rank for the current node, e.g. 0/16/32/48
# dp_ip is the same on decoder nodes in this example
dp_ip = "192.0.0.5"  # master node ip for DP communication
dp_port = 13395      # port used for DP communication
engine_port = 9000   # starting port for all DP groups on the current node
template_path = "./run_dp_template.sh"

if not os.path.exists(template_path):
    print(f"Template file {template_path} does not exist.")
    sys.exit(1)

def run_command(dp_rank_local, dp_rank, engine_port_):
    command = f"bash ./run_dp_template.sh {dp_size} {dp_ip} {dp_port} {dp_rank_local} {dp_rank} {engine_port_} {dp_size_local}"
    os.system(command)

processes = []
for i in range(dp_size_local):
    dp_rank = dp_rank_start + i
    dp_rank_local = i
    engine_port_ = engine_port + i
    process = multiprocessing.Process(target=run_command, args=(dp_rank_local, dp_rank, engine_port_))
    processes.append(process)
    process.start()

for process in processes:
    process.join()
Note that the prefill nodes and decode nodes may have different configurations. In this example, each prefill node is deployed independently as its own master node, while all decode nodes treat the first decode node as the master. Hence the differences in dp_size_local and dp_rank_start, as sketched below.
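On the decode nodes, dp_rank_start can be derived from the node's position among the decode nodes. A minimal sketch, where NODE_INDEX is a hypothetical per-node variable (0 to 3 in this example):
# NODE_INDEX (0..3) is a hypothetical variable identifying the decode node
DP_SIZE_LOCAL=16
DP_RANK_START=$((NODE_INDEX * DP_SIZE_LOCAL))  # yields 0 / 16 / 32 / 48
echo "dp_rank_start=$DP_RANK_START"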
Example Proxy for the Distributed DP Server#
In the PD disaggregation scenario, a proxy is needed to distribute requests. Execute the following command to start the example proxy.
python load_balance_proxy_server_example.py \
--port 8000 \
--host 0.0.0.0 \
--prefiller-hosts \
192.0.0.1 \
192.0.0.2 \
192.0.0.3 \
192.0.0.4 \
--prefiller-hosts-num \
2 2 2 2 \
--prefiller-ports \
9000 9000 9000 9000 \
--prefiller-ports-inc \
2 2 2 2 \
--decoder-hosts \
192.0.0.5 \
192.0.0.6 \
192.0.0.7 \
192.0.0.8 \
--decoder-hosts-num \
16 16 16 16 \
--decoder-ports \
9000 9000 9000 9000 \
--decoder-ports-inc \
16 16 16 16
| Parameter | Meaning |
|---|---|
| --port | Proxy service port |
| --host | Proxy service host IP |
| --prefiller-hosts | Hosts of the prefill nodes |
| --prefiller-hosts-num | Repeat count for each prefill host |
| --prefiller-ports | Ports of the prefill nodes |
| --prefiller-ports-inc | Port increment for the prefill nodes |
| --decoder-hosts | Hosts of the decode nodes |
| --decoder-hosts-num | Repeat count for each decode host |
| --decoder-ports | Ports of the decode nodes |
| --decoder-ports-inc | Port increment for the decode nodes |
You can find the proxy program, load_balance_proxy_server_example.py, in the examples directory of the repository.
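Once the proxy is running, you can verify the whole pipeline end to end with a single request. A sketch, assuming the example proxy exposes the OpenAI-compatible completions endpoint; deepseek_r1 is the served model name from the serve commands above:
# send one request through the proxy (port 8000 as configured above)
curl -s http://127.0.0.1:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek_r1", "prompt": "Hello", "max_tokens": 16}'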
Benchmark#
We recommend using the aisbench tool to evaluate performance. Execute the following commands to install aisbench.
git clone https://gitee.com/aisbench/benchmark.git
cd benchmark/
pip3 install -e ./
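You can quickly confirm that the installation succeeded by checking that the CLI is available (assuming the package installs the ais_bench entry point used later in this section):
# verify the aisbench CLI is on the PATH
ais_bench -h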
Before evaluating performance, you need to unset the HTTP proxy, as shown below.
# unset proxy
unset http_proxy
unset https_proxy
You can place the datasets in the directory benchmark/ais_bench/datasets. You can change the configuration in the directory benchmark/ais_bench/benchmark/configs/models/vllm_api. Take vllm_api_stream_chat.py as an example.
models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChatStream,
        abbr='vllm-api-stream-chat',
        path="vllm-ascend/DeepSeek-R1-W8A8",
        model="dsr1",
        request_rate=28,
        retry=2,
        host_ip="192.0.0.1",  # Proxy service host IP
        host_port=8000,       # Proxy service port
        max_out_len=10,
        batch_size=1536,
        trust_remote_code=True,
        generation_kwargs=dict(
            temperature=0,
            seed=1024,
            ignore_eos=False,
        )
    )
]
Taking the gsm8k dataset as an example, execute the following command to evaluate performance.
ais_bench --models vllm_api_stream_chat --datasets gsm8k_gen_0_shot_cot_str_perf --debug --mode perf
For more details about aisbench commands and parameters, refer to aisbench.
Prefill and Decode Configuration Details#
In the PD disaggregation scenario, we provide an optimized configuration.
Prefill nodes
Set HCCL_BUFFSIZE=256
Add --enforce-eager to the vllm serve command
An example --kv-transfer-config is shown below
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_buffer_device": "npu",
"kv_role": "kv_producer",
"kv_parallel_size": "1",
"kv_port": "20001",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector"
}'
An example --additional-config is shown below
--additional-config '{"enable_weight_nz_layout":true,"enable_prefill_optimizations":true}'
Decode nodes
Set HCCL_BUFFSIZE=1024
An example --kv-transfer-config is shown below
--kv-transfer-config \
'{"kv_connector": "MooncakeConnectorV1",
"kv_buffer_device": "npu",
"kv_role": "kv_consumer",
"kv_parallel_size": "1",
"kv_port": "20001",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector"
}'
An example --additional-config is shown below
--additional-config '{"enable_weight_nz_layout":true}'
Parameter Description#
1. Introduction to the --additional-config parameters
"enable_weight_nz_layout": whether to convert quantized weights to NZ format to accelerate matrix multiplication.
"enable_prefill_optimizations": whether to enable prefill optimizations for the DeepSeek model.
2. Enable MTP by adding the following option to your configuration.
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}'
Recommended Configuration Example#
For example, suppose the average input length is 3.5k, the output length is 1.1k, the context length is 16k, and the maximum length in the input dataset is 7k. For this scenario, we provide a recommended configuration for the distributed DP server with large-scale EP. Here we use 4 nodes for prefill and 4 nodes for decode.
| Node | DP | TP | EP | Max Model Len | Max Num Batched Tokens | Max Num Seqs | GPU Memory Utilization |
|---|---|---|---|---|---|---|---|
| Prefill | 2 | 8 | 16 | 17000 | 16384 | 4 | 0.9 |
| Decode | 64 | 1 | 64 | 17000 | 256 | 28 | 0.9 |
Note
Note that these configurations are independent of the optimizations above. You need to adjust these parameters according to your actual scenario.
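As a sanity check, the EP column follows from EP = DP × TP for each role, a relation you can verify with simple shell arithmetic:
# EP equals DP * TP for each role in the table above
echo "prefill EP: $((2 * 8))"   # 16
echo "decode  EP: $((64 * 1))"  # 64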
FAQ#
1. Prefill nodes need warm-up#
Since some NPU operators need several warm-up rounds before they reach peak performance, we recommend warming up the service with some requests before running performance tests, so as to achieve the best end-to-end throughput.
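A minimal warm-up sketch, assuming the example proxy from above listens on port 8000 and deepseek_r1 is the served model name:
# send a few short requests through the proxy to warm up the NPU operators
for i in $(seq 1 8); do
    curl -s http://127.0.0.1:8000/v1/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "deepseek_r1", "prompt": "warm-up request", "max_tokens": 32}' > /dev/null
done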