Distributed DP Server With Large-Scale Expert Parallelism#

Getting Started#

vLLM-Ascend now supports prefill-decode (PD) disaggregation in large-scale expert parallelism (EP) scenarios. For better performance, vLLM-Ascend adopts a distributed DP server. In a PD-disaggregated deployment, different optimization strategies can be applied to match the distinct characteristics of the prefill and decode nodes, allowing more flexible model deployment.
Take the DeepSeek model as an example, deployed on 8 Atlas 800T A3 servers. Assume the server IPs run from 192.0.0.1 through 192.0.0.8. The first 4 servers act as prefill nodes and the last 4 as decode nodes. Each prefill node is deployed independently as its own master node, while the decode nodes use 192.0.0.5 as their master node.
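
To keep the roles straight, here is the example topology written out as plain Python data. This is purely illustrative; the IPs and role split come from the description above.

# Illustrative only: the example topology described above.
PREFILL_NODES = ["192.0.0.1", "192.0.0.2", "192.0.0.3", "192.0.0.4"]
DECODE_NODES = ["192.0.0.5", "192.0.0.6", "192.0.0.7", "192.0.0.8"]
DECODE_MASTER = DECODE_NODES[0]  # all decode nodes point to 192.0.0.5

for ip in PREFILL_NODES:
    print(f"{ip}: role=prefill, dp_master={ip}")  # each prefill node is its own master
for ip in DECODE_NODES:
    print(f"{ip}: role=decode, dp_master={DECODE_MASTER}")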

Verify Multi-Node Communication Environment#

Physical-layer requirements:#

  • The physical machines must be on the same local area network, with network connectivity between them.

  • All NPUs must be interconnected. For the Atlas A2 series, intra-node connections go over HCCS and inter-node connections over RDMA. For the Atlas A3 series, both intra-node and inter-node connections go over HCCS.

Verification process:#

Atlas A3 series (16 NPUs per node):

  1. Single-node verification

Run the following commands in sequence on each node. Every result must be success, and every link status must be UP.

# Check the remote switch ports
for i in {0..15}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..15}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..15}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..15}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..15}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
  2. Get the NPU IP addresses

for i in {0..15}; do hccn_tool -i $i -vnic -g;done
  3. Get the superpodid and SDID

for i in {0..15}; do npu-smi info -t spod-info -i $i -c 0;npu-smi info -t spod-info -i $i -c 1;done
  4. Inter-node PING test

# Execute on the target node (replace 'x.x.x.x' with actual npu ip address)
for i in {0..15}; do hccn_tool -i $i -hccs_ping -g address x.x.x.x;done
Atlas A2 series (8 NPUs per node):

  1. Single-node verification

Run the following commands in sequence on each node. Every result must be success, and every link status must be UP.

# Check the remote switch ports
for i in {0..7}; do hccn_tool -i $i -lldp -g | grep Ifname; done
# Get the link status of the Ethernet ports (UP or DOWN)
for i in {0..7}; do hccn_tool -i $i -link -g ; done
# Check the network health status
for i in {0..7}; do hccn_tool -i $i -net_health -g ; done
# View the network detected IP configuration
for i in {0..7}; do hccn_tool -i $i -netdetect -g ; done
# View gateway configuration
for i in {0..7}; do hccn_tool -i $i -gateway -g ; done
# View NPU network configuration
cat /etc/hccn.conf
  2. Get the NPU IP addresses

for i in {0..7}; do hccn_tool -i $i -ip -g;done
  3. Inter-node PING test

# Execute on the target node (replace 'x.x.x.x' with actual npu ip address)
for i in {0..7}; do hccn_tool -i $i -ping -g address x.x.x.x;done
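
When many nodes have to be checked repeatedly, the per-device output above is tedious to eyeball. The following minimal sketch (assuming hccn_tool is on PATH; set NUM_DEVICES to 16 on A3 machines) automates the link and health checks listed above and flags failures:

import subprocess

# Scan hccn_tool output for link/health failures on each device.
NUM_DEVICES = 8  # use 16 on Atlas A3 machines

def tool_output(device, args):
    cmd = ["hccn_tool", "-i", str(device)] + args
    return subprocess.run(cmd, capture_output=True, text=True).stdout

for dev in range(NUM_DEVICES):
    link_up = "UP" in tool_output(dev, ["-link", "-g"])
    healthy = "success" in tool_output(dev, ["-net_health", "-g"])
    print(f"device {dev}: link={'UP' if link_up else 'DOWN'}, "
          f"net_health={'success' if healthy else 'FAIL'}")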

Large-Scale EP Model Deployment#

Generate scripts with the configuration#

For PD-disaggregated scenarios, we provide an optimized configuration. Use the following shell scripts to configure the prefill and decode nodes respectively: the first script is for prefill nodes, the second for decode nodes.

# run_dp_template.sh (prefill node)
#!/bin/sh

# these values are obtained through ifconfig;
# nic_name is the network interface name corresponding to local_ip
nic_name="xxxx"
local_ip="xxxx"

# basic configuration for HCCL and connection
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=256

# obtain parameters from distributed DP server
export VLLM_DP_SIZE=$1
export VLLM_DP_MASTER_IP=$2
export VLLM_DP_MASTER_PORT=$3
export VLLM_DP_RANK_LOCAL=$4
export VLLM_DP_RANK=$5
export VLLM_DP_SIZE_LOCAL=$7

#pytorch_npu settings and vllm settings
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export TASK_QUEUE_ENABLE=1
export VLLM_USE_MODELSCOPE="True"

# enable the distributed DP server
export VLLM_WORKER_MULTIPROC_METHOD="fork"
export VLLM_ASCEND_EXTERNAL_DP_LB_ENABLED=1

# The w8a8 weight can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8
# "--additional-config" is used to enable characteristics from vllm-ascend
vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
    --host 0.0.0.0 \
    --port $6 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --seed 1024 \
    --served-model-name deepseek_r1 \
    --max-model-len 17000 \
    --max-num-batched-tokens 16384 \
    --trust-remote-code \
    --max-num-seqs 4 \
    --gpu-memory-utilization 0.9 \
    --quantization ascend \
    --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
    --enforce-eager \
    --kv-transfer-config \
    '{"kv_connector": "MooncakeConnectorV1",
      "kv_buffer_device": "npu",
      "kv_role": "kv_producer",
      "kv_parallel_size": "1",
      "kv_port": "20001",
      "engine_id": "0",
      "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector"
    }' \
    --additional-config '{"enable_weight_nz_layout":true,"enable_prefill_optimizations":true}'

# run_dp_template.sh (decode node)
#!/bin/sh

# these values are obtained through ifconfig;
# nic_name is the network interface name corresponding to local_ip
nic_name="xxxx"
local_ip="xxxx"

# basic configuration for HCCL and connection
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=1024

# obtain parameters from distributed DP server
export VLLM_DP_SIZE=$1
export VLLM_DP_MASTER_IP=$2
export VLLM_DP_MASTER_PORT=$3
export VLLM_DP_RANK_LOCAL=$4
export VLLM_DP_RANK=$5
export VLLM_DP_SIZE_LOCAL=$7

#pytorch_npu settings and vllm settings
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export TASK_QUEUE_ENABLE=1
export VLLM_USE_MODELSCOPE="True"

# enable the distributed DP server
export VLLM_WORKER_MULTIPROC_METHOD="fork"
export VLLM_ASCEND_EXTERNAL_DP_LB_ENABLED=1

# The w8a8 weight can be obtained from https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8
# "--additional-config" is used to enable characteristics from vllm-ascend
vllm serve vllm-ascend/DeepSeek-R1-W8A8 \
    --host 0.0.0.0 \
    --port $6 \
    --tensor-parallel-size 1 \
    --enable-expert-parallel \
    --seed 1024 \
    --served-model-name deepseek_r1 \
    --max-model-len 17000 \
    --max-num-batched-tokens 256 \
    --trust-remote-code \
    --max-num-seqs 28 \
    --gpu-memory-utilization 0.9 \
    --quantization ascend \
    --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
    --kv-transfer-config \
        '{"kv_connector": "MooncakeConnectorV1",
        "kv_buffer_device": "npu",
        "kv_role": "kv_consumer",
        "kv_parallel_size": "1",
        "kv_port": "20001",
        "engine_id": "0",
        "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector"
        }' \
    --additional-config '{"enable_weight_nz_layout":true}'

Launch the Distributed DP Server for Prefill-Decode Disaggregation#

Run the following Python launcher on every node to use the distributed DP server. (We recommend using this feature with the official v0.9.1 release.) The first launcher below is for prefill nodes; the second is for decode nodes.

import multiprocessing
import os
import sys
dp_size = 2 # total number of DP engines for decode/prefill
dp_size_local = 2 # number of DP engines on the current node
dp_rank_start = 0 # starting DP rank for the current node
# dp_ip is different on prefiller nodes in this example
dp_ip = "192.0.0.1" # master node ip for DP communication
dp_port = 13395 # port used for DP communication
engine_port = 9000 # starting port for all DP groups on the current node
template_path = "./run_dp_template.sh"
if not os.path.exists(template_path):
  print(f"Template file {template_path} does not exist.")
  sys.exit(1)
def run_command(dp_rank_local, dp_rank, engine_port_):
  command = f"bash ./run_dp_template.sh {dp_size} {dp_ip} {dp_port} {dp_rank_local} {dp_rank} {engine_port_} {dp_size_local}"
  os.system(command)
processes = []
for i in range(dp_size_local):
  dp_rank = dp_rank_start + i
  dp_rank_local = i
  engine_port_ = engine_port + i
  process = multiprocessing.Process(target=run_command, args=(dp_rank_local, dp_rank, engine_port_))
  processes.append(process)
  process.start()
for process in processes:
  process.join()

The corresponding launcher for decode nodes:

import multiprocessing
import os
import sys
dp_size = 64 # total number of DP engines for decode/prefill
dp_size_local = 16 # number of DP engines on the current node
dp_rank_start = 0 # starting DP rank for the current node. e.g. 0/16/32/48
# dp_ip is the same on decoder nodes in this example
dp_ip = "192.0.0.5" # master node ip for DP communication.
dp_port = 13395 # port used for DP communication
engine_port = 9000 # starting port for all DP groups on the current node
template_path = "./run_dp_template.sh"
if not os.path.exists(template_path):
  print(f"Template file {template_path} does not exist.")
  sys.exit(1)
def run_command(dp_rank_local, dp_rank, engine_port_):
  command = f"bash ./run_dp_template.sh {dp_size} {dp_ip} {dp_port} {dp_rank_local} {dp_rank} {engine_port_} {dp_size_local}"
  os.system(command)
processes = []
for i in range(dp_size_local):
  dp_rank = dp_rank_start + i
  dp_rank_local = i
  engine_port_ = engine_port + i
  process = multiprocessing.Process(target=run_command, args=(dp_rank_local, dp_rank, engine_port_))
  processes.append(process)
  process.start()
for process in processes:
  process.join()

Note that prefill nodes and decode nodes may use different configurations. In this example, each prefill node is deployed independently as its own master node, while all decode nodes treat the first decode node as the master; hence the differences in dp_size_local and dp_rank_start.
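
For concreteness, the launcher parameters on each node of this example work out as follows. This mapping is derived from the two launchers above, not an extra requirement.

# Per-node launcher settings in this example (derived from the scripts above).
NODE_SETTINGS = {
    # prefill nodes: each node is its own master, 2 DP engines per node
    "192.0.0.1": dict(dp_size=2, dp_size_local=2, dp_rank_start=0, dp_ip="192.0.0.1"),
    "192.0.0.2": dict(dp_size=2, dp_size_local=2, dp_rank_start=0, dp_ip="192.0.0.2"),
    "192.0.0.3": dict(dp_size=2, dp_size_local=2, dp_rank_start=0, dp_ip="192.0.0.3"),
    "192.0.0.4": dict(dp_size=2, dp_size_local=2, dp_rank_start=0, dp_ip="192.0.0.4"),
    # decode nodes: shared master 192.0.0.5, 64 DP engines in total, 16 per node
    "192.0.0.5": dict(dp_size=64, dp_size_local=16, dp_rank_start=0, dp_ip="192.0.0.5"),
    "192.0.0.6": dict(dp_size=64, dp_size_local=16, dp_rank_start=16, dp_ip="192.0.0.5"),
    "192.0.0.7": dict(dp_size=64, dp_size_local=16, dp_rank_start=32, dp_ip="192.0.0.5"),
    "192.0.0.8": dict(dp_size=64, dp_size_local=16, dp_rank_start=48, dp_ip="192.0.0.5"),
}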

Example Proxy for the Distributed DP Server#

In PD-disaggregated scenarios, a proxy is needed to distribute requests. Run the following command to start the example proxy.

python load_balance_proxy_server_example.py \
  --port 8000 \
  --host 0.0.0.0 \
  --prefiller-hosts \
    192.0.0.1 \
    192.0.0.2 \
    192.0.0.3 \
    192.0.0.4 \
  --prefiller-hosts-num \
    2 2 2 2 \
  --prefiller-ports \
    9000 9000 9000 9000 \
  --prefiller-ports-inc \
    2 2 2 2 \
  --decoder-hosts \
    192.0.0.5 \
    192.0.0.6 \
    192.0.0.7 \
    192.0.0.8 \
  --decoder-hosts-num \
    16 16 16 16 \
  --decoder-ports  \
    9000 9000 9000 9000 \
  --decoder-ports-inc \
    16 16 16 16

| Parameter | Meaning |
| --- | --- |
| --port | Proxy service port |
| --host | Proxy service host IP |
| --prefiller-hosts | Hosts of the prefill nodes |
| --prefiller-hosts-num | Repeat count for each prefill host |
| --prefiller-ports | Ports of the prefill nodes |
| --prefiller-ports-inc | Port increment for each prefill host |
| --decoder-hosts | Hosts of the decode nodes |
| --decoder-hosts-num | Repeat count for each decode host |
| --decoder-ports | Ports of the decode nodes |
| --decoder-ports-inc | Port increment for each decode host |
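
Taken together, these arguments enumerate one endpoint list per role. Below is a minimal sketch of that expansion, under the assumption that each host runs hosts-num engines on consecutive ports starting at the base port (hosts-num and ports-inc coincide in this example); the authoritative logic lives in load_balance_proxy_server_example.py.

def expand(hosts, nums, base_ports):
    # each host appears nums[i] times, on ports base, base + 1, ...
    return [(host, port + i)
            for host, n, port in zip(hosts, nums, base_ports)
            for i in range(n)]

prefill_endpoints = expand(["192.0.0.1", "192.0.0.2", "192.0.0.3", "192.0.0.4"],
                           [2, 2, 2, 2], [9000] * 4)
decode_endpoints = expand(["192.0.0.5", "192.0.0.6", "192.0.0.7", "192.0.0.8"],
                          [16, 16, 16, 16], [9000] * 4)
print(prefill_endpoints[:3])  # [('192.0.0.1', 9000), ('192.0.0.1', 9001), ('192.0.0.2', 9000)]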

You can find the proxy script, load_balance_proxy_server_example.py, in the examples directory of the repository.

Benchmarking#

We recommend the aisbench tool for performance evaluation. Run the following commands to install aisbench.

git clone https://gitee.com/aisbench/benchmark.git
cd benchmark/
pip3 install -e ./

Before evaluating performance, unset the HTTP proxy as shown below.

# unset proxy
unset http_proxy
unset https_proxy
  • You can place datasets in the directory benchmark/ais_bench/datasets.

  • You can modify the configurations under benchmark/ais_bench/benchmark/configs/models/vllm_api. Take vllm_api_stream_chat.py as an example.

# Note: VLLMCustomAPIChatStream is a model type provided by the aisbench framework.
models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChatStream,
        abbr='vllm-api-stream-chat',
        path="vllm-ascend/DeepSeek-R1-W8A8",
        model="dsr1",
        request_rate = 28,
        retry = 2,
        host_ip = "192.0.0.1", # Proxy service host IP
        host_port = 8000,  # Proxy service Port
        max_out_len = 10,
        batch_size=1536,
        trust_remote_code=True,
        generation_kwargs = dict(
            temperature = 0,
            seed = 1024,
            ignore_eos=False,
        )
    )
]
  • Take the gsm8k dataset as an example; run the following command to evaluate performance.

ais_bench --models vllm_api_stream_chat --datasets gsm8k_gen_0_shot_cot_str_perf  --debug  --mode perf
  • For more details on aisbench commands and parameters, refer to the aisbench documentation.

Prefill and Decode Configuration Details#

For PD-disaggregated scenarios, we provide an optimized configuration.

  • Prefill nodes

  1. Set HCCL_BUFFSIZE=256

  2. Add the --enforce-eager flag to vllm serve

  3. Set --kv-transfer-config as follows:

--kv-transfer-config \
    '{"kv_connector": "MooncakeConnectorV1",
      "kv_buffer_device": "npu",
      "kv_role": "kv_producer",
      "kv_parallel_size": "1",
      "kv_port": "20001",
      "engine_id": "0",
      "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector"
    }'
  4. Set --additional-config as follows:

--additional-config '{"enable_weight_nz_layout":true,"enable_prefill_optimizations":true}'
  • Decode nodes

  1. Set HCCL_BUFFSIZE=1024

  2. Set --kv-transfer-config as follows:

--kv-transfer-config \
    '{"kv_connector": "MooncakeConnectorV1",
      "kv_buffer_device": "npu",
      "kv_role": "kv_consumer",
      "kv_parallel_size": "1",
      "kv_port": "20001",
      "engine_id": "0",
      "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector"
    }'
  3. Set --additional-config as follows:

--additional-config '{"enable_weight_nz_layout":true}'

Parameter Description#

1. Options in --additional-config

  • "enable_weight_nz_layout": whether to convert quantized weights to NZ format to accelerate matrix multiplication.

  • "enable_prefill_optimizations": whether to enable prefill optimizations for DeepSeek models.

2. Enabling MTP: add the following option to your configuration.

--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}'

FAQ#

1. Prefill nodes need warm-up#

Because some NPU operators need several rounds of warm-up before reaching peak performance, we recommend warming up the service with a batch of requests before running performance tests, so as to reach the best end-to-end throughput.
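
As a minimal warm-up sketch (assuming the example proxy above is listening on 192.0.0.1:8000 and forwards OpenAI-compatible requests for the served model name deepseek_r1), a handful of requests can be sent like this before benchmarking:

import json
import urllib.request

# Send a few requests through the proxy to warm up the NPU operators.
URL = "http://192.0.0.1:8000/v1/completions"  # adjust to your proxy host/port
payload = {"model": "deepseek_r1", "prompt": "Hello, world!", "max_tokens": 32}

for i in range(8):  # a handful of requests is usually enough
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()
    print(f"warm-up request {i + 1} done")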