Ascend Store Deployment Guide#

Environment Dependencies#

  • Software

    • Python >= 3.10, < 3.12

    • CANN == 8.3.rc2

    • PyTorch == 2.8.0, torch-npu == 2.8.0

    • vLLM: main branch

    • vLLM-Ascend: main branch

KV Pooling Parameters#

kv_connector_extra_config: extra configurable parameters for pooling.
lookup_rpc_port: the RPC communication port between the pooling scheduler process and the worker processes; each instance must be assigned its own unique port.
load_async: whether to enable asynchronous loading. Defaults to false.
backend: the storage backend of the KV pool. Defaults to mooncake.
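Taken together, these parameters live under kv_connector_extra_config inside the kv_transfer_config passed to vLLM. A minimal illustrative fragment (the values are examples; the full scripts later in this guide show complete configurations):

```json
{
    "kv_connector": "AscendStoreConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "lookup_rpc_port": "1",
        "load_async": false,
        "backend": "mooncake"
    }
}
```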

Example: Using Mooncake as the KVCache Pooling Backend#

  • Software

    • Check the NPU network configuration

      Make sure the hccn.conf file exists in your environment. If you are using Docker, mount it into the container.

      cat /etc/hccn.conf
      
    • Install Mooncake

      Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. Build and installation guide: https://github.com/kvcache-ai/Mooncake?tab=readme-ov-file#build-and-use-binaries. First, fetch the Mooncake project with the following command:

      git clone -b v0.3.7.post2 --depth 1 https://github.com/kvcache-ai/Mooncake.git
      

      (Optional) If your network connection is poor, replace the Go download URL:

      cd Mooncake
      sed -i 's|https://golang.ac.cn/dl/|https://golang.google.cn/dl/|g' dependencies.sh
      

      Install MPI:

      apt-get install mpich libmpich-dev -y
      

      Install the related dependencies. There is no need to install Go separately:

      bash dependencies.sh -y
      

      Build and install:

      mkdir build
      cd build
      cmake .. -DUSE_ASCEND_DIRECT=ON
      make -j
      make install
      

      Set the environment variables

      Note

      • Adjust the Python path according to your specific Python installation

      • Make sure both /usr/local/lib and /usr/local/lib64 are in your LD_LIBRARY_PATH

      export LD_LIBRARY_PATH=/usr/local/lib64/python3.11/site-packages/mooncake:$LD_LIBRARY_PATH
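If you are unsure where pip placed the mooncake package, the Python path in the export above can be derived from the interpreter itself. A small sketch (it assumes mooncake was installed into the default package directory):

```python
import os
import sysconfig

# Locate the interpreter's package directory (site-packages / dist-packages),
# then point at the "mooncake" folder inside it for LD_LIBRARY_PATH.
purelib = sysconfig.get_paths()["purelib"]
mooncake_dir = os.path.join(purelib, "mooncake")
print(f"export LD_LIBRARY_PATH={mooncake_dir}:$LD_LIBRARY_PATH")
```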
      

Run the Mooncake Master#

1. Configure mooncake.json#

Set the environment variable MOONCAKE_CONFIG_PATH to the full path of mooncake.json.

{
    "local_hostname": "xx.xx.xx.xx",
    "metadata_server": "P2PHANDSHAKE",
    "protocol": "ascend",
    "device_name": "",
    "alloc_in_same_node": true,
    "master_server_address": "xx.xx.xx.xx:50088",
    "global_segment_size": "1GB"
}

local_hostname: the IP address of the current master node.
metadata_server: set to P2PHANDSHAKE.
protocol: set to ascend so that Mooncake uses HCCL communication on Ascend.
device_name: leave as an empty string "".
alloc_in_same_node: whether to prefer the local buffer allocation strategy.
master_server_address: the IP and port of the master service.
global_segment_size: the size of the KV cache segment that each P/D node registers with the master. Equivalent spellings are accepted, e.g. "1GB", "1024MB", "1048576KB", or "1073741824" (bytes).
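The equivalence of those size spellings can be checked with a small helper (hypothetical, not part of Mooncake):

```python
# Convert a global_segment_size string ("1GB", "1024MB", "1073741824B",
# or a bare byte count) into bytes. Illustrative helper, not Mooncake API.
UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}

def to_bytes(size: str) -> int:
    size = size.strip().upper()
    for suffix in ("GB", "MB", "KB", "B"):
        if size.endswith(suffix):
            return int(size[: -len(suffix)]) * UNITS[suffix]
    return int(size)  # plain number of bytes

# All spellings from the example denote the same segment size:
assert to_bytes("1GB") == to_bytes("1024MB") == to_bytes("1048576KB") == 1073741824
```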

2. Start mooncake_master#

In the Mooncake directory:

mooncake_master --port 50088 --eviction_high_watermark_ratio 0.95 --eviction_ratio 0.05

eviction_high_watermark_ratio sets the usage threshold at which Mooncake Store starts evicting, and eviction_ratio sets the fraction of stored objects to evict.
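The interaction of the two flags can be sketched as follows (a simplified model of the eviction policy, not Mooncake's actual implementation):

```python
def maybe_evict(used_bytes: int, capacity_bytes: int, objects: list,
                high_watermark_ratio: float = 0.95,
                eviction_ratio: float = 0.05) -> list:
    """Evict eviction_ratio of objects (oldest first) once usage
    crosses the high watermark; otherwise leave the store untouched."""
    if used_bytes < capacity_bytes * high_watermark_ratio:
        return objects  # below the threshold, nothing to do
    n_evict = max(1, int(len(objects) * eviction_ratio))
    return objects[n_evict:]  # drop the n_evict oldest entries
```

With the flags above, eviction starts once usage reaches 95% of capacity and each pass removes 5% of stored objects.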

Pooling with Prefill/Decode Disaggregation#

1. Run the prefill node and the decode node#

Use MultiConnector to combine the p2p connector and the pooled connector: P2P performs the kv_transfer, while pooling builds a larger prefix cache.

prefill node:

bash multi_producer.sh

Contents of multi_producer.sh:

export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTHONPATH=$PYTHONPATH:/xxxxx/vllm
export MOONCAKE_CONFIG_PATH="/xxxxxx/mooncake.json"
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
export ACL_OP_INIT_MODE=1

# ASCEND_BUFFER_POOL configures the number and size of the buffers on the NPU device used for aggregation and KV transfer. The value 4:8 means we allocate 4 buffers of 8 MB each.
export ASCEND_BUFFER_POOL=4:8

# Unit: ms. The timeout for one-sided communication connection establishment is set to 10 seconds by default (see PR: https://github.com/kvcache-ai/Mooncake/pull/1039). Users can adjust this value based on their specific setup.
# The recommended formula is: ASCEND_CONNECT_TIMEOUT = connection_time_per_card (typically within 500ms) × total_number_of_Decode_cards.
# This ensures that even in the worst case, where all Decode cards simultaneously attempt to connect to the same Prefill card, the connection will not time out.
export ASCEND_CONNECT_TIMEOUT=10000

# Unit: ms. The timeout for one-sided communication transfer is set to 10 seconds by default (see PR: https://github.com/kvcache-ai/Mooncake/pull/1039).
export ASCEND_TRANSFER_TIMEOUT=10000

python3 -m vllm.entrypoints.openai.api_server \
    --model /xxxxx/Qwen2.5-7B-Instruct \
    --port 8100 \
    --trust-remote-code \
    --enforce-eager \
    --no_enable_prefix_caching \
    --tensor-parallel-size 1 \
    --data-parallel-size 1 \
    --max-model-len 10000 \
    --block-size 128 \
    --max-num-batched-tokens 4096 \
    --kv-transfer-config \
    '{
    "kv_connector": "MultiConnector",
    "kv_role": "kv_producer",
    "kv_connector_extra_config": {
        "use_layerwise": false,
        "connectors": [
            {
                "kv_connector": "MooncakeConnectorV1",
                "kv_role": "kv_producer",
                "kv_port": "20001",
                "kv_connector_extra_config": {
                    "use_ascend_direct": true,
                    "prefill": {
                        "dp_size": 1,
                        "tp_size": 1
                    },
                    "decode": {
                        "dp_size": 1,
                        "tp_size": 1
                    }
                }
            },
            {
                "kv_connector": "AscendStoreConnector",
                "kv_role": "kv_producer",
                "lookup_rpc_port": "0",
                "backend": "mooncake"
            }
        ]
    }
    }'
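The ASCEND_BUFFER_POOL format and the recommended ASCEND_CONNECT_TIMEOUT formula from the comments in the script above can be sketched in Python (illustrative helpers, not part of vLLM or Mooncake):

```python
def parse_buffer_pool(spec: str) -> tuple[int, int]:
    """Parse an ASCEND_BUFFER_POOL spec "count:size_mb";
    e.g. "4:8" means 4 buffers of 8 MB each."""
    count, size_mb = spec.split(":")
    return int(count), int(size_mb)

def recommended_connect_timeout(ms_per_card: int, num_decode_cards: int) -> int:
    """ASCEND_CONNECT_TIMEOUT (ms) = connect time per card x total Decode cards."""
    return ms_per_card * num_decode_cards

# 500 ms per card and 20 Decode cards matches the 10000 ms value used above.
```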

decode node:

bash multi_consumer.sh

Contents of multi_consumer.sh:

export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTHONPATH=$PYTHONPATH:/xxxxx/vllm
export MOONCAKE_CONFIG_PATH="/xxxxx/mooncake.json"
export ASCEND_RT_VISIBLE_DEVICES=4,5,6,7
export ACL_OP_INIT_MODE=1
export ASCEND_BUFFER_POOL=4:8
export ASCEND_CONNECT_TIMEOUT=10000
export ASCEND_TRANSFER_TIMEOUT=10000

python3 -m vllm.entrypoints.openai.api_server \
    --model /xxxxx/Qwen2.5-7B-Instruct \
    --port 8200 \
    --trust-remote-code \
    --enforce-eager \
    --no_enable_prefix_caching \
    --tensor-parallel-size 1 \
    --data-parallel-size 1 \
    --max-model-len 10000 \
    --block-size 128 \
    --max-num-batched-tokens 4096 \
    --kv-transfer-config \
    '{
    "kv_connector": "MultiConnector",
    "kv_role": "kv_consumer",
    "kv_connector_extra_config": {
        "use_layerwise": false,
        "connectors": [
            {
                "kv_connector": "MooncakeConnectorV1",
                "kv_role": "kv_consumer",
                "kv_port": "20002",
                "kv_connector_extra_config": {
                    "prefill": {
                        "dp_size": 1,
                        "tp_size": 1
                    },
                    "decode": {
                        "dp_size": 1,
                        "tp_size": 1
                    }
                }
            },
            {
                "kv_connector": "AscendStoreConnector",
                "kv_role": "kv_consumer",
                "lookup_rpc_port": "1",
                "backend": "mooncake"
            }
        ]
    }
    }'

2. Start the proxy_server#

bash proxy.sh

Contents of proxy.sh (replace localhost with your actual IP address):

python vllm-ascend/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py \
    --host localhost \
    --prefiller-hosts localhost \
    --prefiller-ports 8100 \
    --decoder-hosts localhost \
    --decoder-ports 8200

3. Run inference#

Configure localhost, the port, and the model weight path in the commands to match your own setup.

Short prompt

curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_tokens": 200, "temperature":0.0 }'

Long prompt

curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_tokens": 256, "temperature":0.0 }'

Pooling with Mixed Deployment#

1. Run the Mixed Deployment Script#

The mixed-deployment script is essentially a pure pooling scenario running on the P node.

bash mixed_department.sh

Contents of mixed_department.sh:

export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTHONPATH=$PYTHONPATH:/xxxxx/vllm
export MOONCAKE_CONFIG_PATH="/xxxxxx/mooncake.json"
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
export ACL_OP_INIT_MODE=1
export ASCEND_BUFFER_POOL=4:8
export ASCEND_CONNECT_TIMEOUT=10000
export ASCEND_TRANSFER_TIMEOUT=10000

python3 -m vllm.entrypoints.openai.api_server \
    --model /xxxxx/Qwen2.5-7B-Instruct \
    --port 8100 \
    --trust-remote-code \
    --enforce-eager \
    --no_enable_prefix_caching \
    --tensor-parallel-size 1 \
    --data-parallel-size 1 \
    --max-model-len 10000 \
    --block-size 128 \
    --max-num-batched-tokens 4096 \
    --kv-transfer-config \
    '{
    "kv_connector": "AscendStoreConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "use_layerwise": false,
        "lookup_rpc_port": "1",
        "backend": "mooncake"
    }
}' > mix.log 2>&1

2. Run inference#

Configure localhost, the port, and the model weight path in the commands to match your own setup. Requests are routed directly to the port where the mixed-deployment script is running; there is no need to start a separate proxy.

Short prompt

curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Hello. I have a question. The president of the United States is", "max_tokens": 200, "temperature":0.0 }'

Long prompt

curl -s http://localhost:8100/v1/completions -H "Content-Type: application/json" -d '{ "model": "/xxxxx/Qwen2.5-7B-Instruct", "prompt": "Given the accelerating impacts of climate change—including rising sea levels, increasing frequency of extreme weather events, loss of biodiversity, and adverse effects on agriculture and human health—there is an urgent need for a robust, globally coordinated response. However, international efforts are complicated by a range of factors: economic disparities between high-income and low-income countries, differing levels of industrialization, varying access to clean energy technologies, and divergent political systems that influence climate policy implementation. In this context, how can global agreements like the Paris Accord be redesigned or strengthened to not only encourage but effectively enforce emission reduction targets? Furthermore, what mechanisms can be introduced to promote fair and transparent technology transfer, provide adequate financial support for climate adaptation in vulnerable regions, and hold nations accountable without exacerbating existing geopolitical tensions or disproportionately burdening those with historically lower emissions?", "max_tokens": 256, "temperature":0.0 }'