Ascend Plugin

Overview

References:

The vllm-project/vllm-ascend project is the Ascend plugin for vLLM; it lets vLLM run on NPU devices.

Supported models: https://docs.vllm.ai/projects/ascend/en/latest/user_guide/support_matrix/supported_models.html

Deployment

Containerized Deployment

Available image locations

If you are in mainland China, you can use DaoCloud or another mirror site to speed up the download:

TAG=v0.18.0
docker pull m.daocloud.io/quay.io/ascend/vllm-ascend:$TAG
# or
docker pull quay.nju.edu.cn/ascend/vllm-ascend:$TAG
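
Note that pulling through a mirror leaves the image under the mirror's name; if you then reference the quay.io name as below, retag it first, e.g.:

# Retag the mirrored image to the canonical quay.io name used below.
docker tag m.daocloud.io/quay.io/ascend/vllm-ascend:$TAG quay.io/ascend/vllm-ascend:$TAG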

export IMAGE=quay.io/ascend/vllm-ascend:v0.18.0
# Run the container
docker run --rm \
--name vllm-ascend \
--shm-size=10g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash

Using vLLM

References:

[!Attention] The Ascend plugin is somewhat special: due to certain defects in the underlying hardware, vllm serve needs extra arguments to start correctly.

This section records only the special cases; a few more are covered under Best Practices. In general, serving a model with vLLM is simple: vllm serve ${Model} is usually all it takes.

e.g. the --enforce-eager and --dtype float16 flags must be used on Atlas 300I Duo devices.
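
For example, a minimal sketch of such an invocation (Qwen/Qwen3-0.6B is just a small placeholder model here):

# Minimal sketch: serve a small model on an Atlas 300I Duo.
# --enforce-eager skips graph mode; --dtype float16 overrides the default
# dtype, which (per this guide) the card cannot load models with.
vllm serve Qwen/Qwen3-0.6B \
  --enforce-eager --dtype float16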

Best Practices

Usage example on an Ascend Atlas 300I Duo

Since the device is an Atlas 300I Duo (chip: 310P NPU), some parameter defaults must be changed, as described here, before the model can be loaded.

Prerequisite: use HF_ENDPOINT=https://hf-mirror.com hf download Qwen/Qwen3-0.6B --local-dir /root/models/qwen3-0.6B to download the model to the local models directory.
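
Before starting the container, a quick sanity check that the weights are where the mounts below expect them (paths as in the download command above):

# Confirm the downloaded model directory exists and has a plausible size.
ls -lh /root/models/qwen3-0.6B
du -sh /root/models/qwen3-0.6B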

1. Start vLLM

# Support for the Atlas 300 inference series landed relatively late; use an rc image.
export IMAGE=quay.io/ascend/vllm-ascend:v0.18.0rc1-310p-openeuler
docker run --rm \
--name vllm-ascend \
--network=host \
--shm-size=10g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-v /root/models:/root/models \
-it $IMAGE bash

2. Load the model and serve inference

[!Attention] A 32B model does not fit in the memory of a single card; with --tensor-parallel-size 1 it fails with: RuntimeError: NPU out of memory. Tried to allocate 502.00 MiB (NPU 0; 43.24 GiB total capacity; 41.70 GiB already allocated; 41.70 GiB current active; 349.23 MiB free; 41.72 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
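
A back-of-the-envelope check makes this unsurprising: 32B parameters in float16 take about 32e9 × 2 bytes ≈ 64 GiB for the weights alone, well beyond the 43.24 GiB of a single card. With --tensor-parallel-size 4 each card holds roughly 16 GiB of weights, leaving headroom for activations and the KV cache.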

vllm serve \
  --enforce-eager --dtype float16 \
  --model /root/models/qwen3-32B --served-model-name qwen3-32b \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --tensor-parallel-size 4

Attention: --enforce-eager and --dtype are adaptation flags for the Atlas 300I Duo card; without them the model will not load.

3. Verify that the inference service responds

# List the available models
curl -s http://localhost:8000/v1/models | python3 -m json.tool
# Test text completion
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32b",
    "prompt": "你好",
    "max_tokens": 64,
    "top_p": 0.95,
    "top_k": 50,
    "temperature": 0.6
  }' | jq .

4. Start a chat session and talk to the model

docker exec -it vllm-ascend vllm chat

Because the service exposes a web API, other programs can obtain inference results through the OpenAI-compatible endpoint at http://localhost:8000/v1.
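
As a minimal sketch of that interface (qwen3-32b matches the --served-model-name used above):

# Call the OpenAI-compatible chat endpoint from any HTTP client.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32b",
    "messages": [{"role": "user", "content": "你好"}],
    "max_tokens": 64
  }' | jq .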

5. Once verification succeeds, bring everything up with the single command below; this also lets you inspect logs via docker logs

export IMAGE=quay.io/ascend/vllm-ascend:v0.18.0rc1-310p-openeuler
docker run -d \
--name vllm-ascend \
--network=host \
--shm-size=10g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-e VLLM_USE_MODELSCOPE=True \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-v /root/models:/root/models \
$IMAGE vllm serve \
  --enforce-eager --dtype float16 \
  --model /root/models/qwen3-32B --served-model-name qwen3-32b \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --tensor-parallel-size 4
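
Once the container is running, two quick checks (container and model names as above):

# Follow the startup logs until the API server reports it is serving.
docker logs -f vllm-ascend
# After loading finishes, the model list should include qwen3-32b.
curl -s http://localhost:8000/v1/models | python3 -m json.tool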

Major Changes

issue 7394 # Makes the Qwen3.5 series deployable on the Atlas 300I Duo. Created 2025-03-17

Benchmarks

Atlas 300I Duo

Offline throughput benchmark

qwen3-0.6B

vllm bench throughput \
  --enforce-eager --dtype float16 \
  --model Qwen/Qwen3-0.6B \
  --input-len 256 --output-len 128 \
  --num-prompts 200 \
  --tensor-parallel-size 4

Results:

# Run 1
Throughput: 1.84 requests/s, 2124.55 total tokens/s, 236.06 output tokens/s
Total num prompt tokens:  204800
Total num output tokens:  25600
# Run 2
Throughput: 2.13 requests/s, 2458.73 total tokens/s, 273.19 output tokens/s
Total num prompt tokens:  204800
Total num output tokens:  25600
# Run 3
Throughput: 2.16 requests/s, 2489.63 total tokens/s, 276.63 output tokens/s
Total num prompt tokens:  204800
Total num output tokens:  25600

Results with --tensor-parallel-size 1:

# Run 1
Throughput: 1.64 requests/s, 1887.75 total tokens/s, 209.75 output tokens/s
Total num prompt tokens:  204800
Total num output tokens:  25600
# Run 2
Throughput: 1.64 requests/s, 1883.84 total tokens/s, 209.32 output tokens/s
Total num prompt tokens:  204800
Total num output tokens:  25600

qwen3-32B

vllm bench throughput \
  --enforce-eager --dtype float16 \
  --model /root/models/qwen3-32B \
  --input-len 256 --output-len 128 --num-prompts 200 \
  --tensor-parallel-size 4

Results:

# Run 1
Throughput: 0.55 requests/s, 638.73 total tokens/s, 70.97 output tokens/s
Total num prompt tokens:  204800
Total num output tokens:  25600
# Run 2
Throughput: 0.56 requests/s, 640.16 total tokens/s, 71.13 output tokens/s
Total num prompt tokens:  204800
Total num output tokens:  25600
# Run 3
Throughput: 0.55 requests/s, 631.02 total tokens/s, 70.11 output tokens/s
Total num prompt tokens:  204800
Total num output tokens:  25600

Online serving benchmark

qwen3-0.6B

# Start the inference service
vllm serve \
  --enforce-eager --dtype float16 \
  --model Qwen/Qwen3-0.6B \
  --tensor-parallel-size 4
# Benchmark the chat endpoint
export bench_method="--num-prompts 20 --max-concurrency 1"
export model_id=$(curl -s http://localhost:8000/v1/models | python3 -c "import sys, json; print(json.load(sys.stdin)['data'][0]['id'])")
vllm bench serve \
  --model ${model_id} \
  --backend openai-chat --endpoint /v1/chat/completions \
  --dataset-name random \
  --random-input-len 256 --random-output-len 128 \
  ${bench_method}
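
The results below cover two setups. To reproduce the second one, change the variable and rerun the same vllm bench serve command:

# Second setup shown in the results below.
export bench_method="--num-prompts 200 --max-concurrency 8"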

Results:

Test setup: --num-prompts 20 --max-concurrency 1

============ Serving Benchmark Result ============
Successful requests:                     20        
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  155.71    
Total input tokens:                      5120      
Total generated tokens:                  2560      
Request throughput (req/s):              0.13      
Output token throughput (tok/s):         16.44     
Peak output token throughput (tok/s):    18.00     
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          49.32     
---------------Time to First Token----------------
Mean TTFT (ms):                          123.72    
Median TTFT (ms):                        88.58     
P99 TTFT (ms):                           656.70    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          60.33     
Median TPOT (ms):                        58.51     
P99 TPOT (ms):                           85.91     
---------------Inter-token Latency----------------
Mean ITL (ms):                           59.85     
Median ITL (ms):                         58.04     
P99 ITL (ms):                            149.20    
==================================================

Test setup: --num-prompts 200 --max-concurrency 8

============ Serving Benchmark Result ============
Successful requests:                     200       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  182.42    
Total input tokens:                      51200     
Total generated tokens:                  25600     
Request throughput (req/s):              1.10      
Output token throughput (tok/s):         140.34    
Peak output token throughput (tok/s):    160.00    
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          421.02    
---------------Time to First Token----------------
Mean TTFT (ms):                          705.70    
Median TTFT (ms):                        661.14    
P99 TTFT (ms):                           1847.60   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          51.89     
Median TPOT (ms):                        52.02     
P99 TPOT (ms):                           52.91     
---------------Inter-token Latency----------------
Mean ITL (ms):                           51.89     
Median ITL (ms):                         52.17     
P99 ITL (ms):                            53.87     
==================================================

qwen3-32B

# Start the inference service
vllm serve \
  --enforce-eager --dtype float16 \
  --model /root/models/qwen3-32B \
  --served-model-name qwen3-32b \
  --tensor-parallel-size 4
# Benchmark the chat endpoint
export bench_method="--num-prompts 20 --max-concurrency 1"
export model_id=$(curl -s http://localhost:8000/v1/models | python3 -c "import sys, json; print(json.load(sys.stdin)['data'][0]['id'])")
vllm bench serve \
  --model ${model_id} \
  --tokenizer /root/models/qwen3-32B \
  --backend openai-chat --endpoint /v1/chat/completions \
  --dataset-name random \
  --random-input-len 256 --random-output-len 128 \
  ${bench_method}

[!Attention] Due to a flaw in bench serve, the --tokenizer argument must be added manually. The reason is that when vllm bench serve initializes the tokenizer, it treats --model qwen3-32b as a HuggingFace repo_id; it is not found locally, so it tries to download it, but qwen3-32b is not in namespace/name format, so the command errors out.

Results:

Test setup: --num-prompts 20 --max-concurrency 1

============ Serving Benchmark Result ============
Successful requests:                     20        
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  334.72    
Total input tokens:                      5120      
Total generated tokens:                  2560      
Request throughput (req/s):              0.06      
Output token throughput (tok/s):         7.65      
Peak output token throughput (tok/s):    9.00      
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          22.94     
---------------Time to First Token----------------
Mean TTFT (ms):                          328.02    
Median TTFT (ms):                        293.29    
P99 TTFT (ms):                           852.37    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          129.19    
Median TPOT (ms):                        127.83    
P99 TPOT (ms):                           157.93    
---------------Inter-token Latency----------------
Mean ITL (ms):                           128.18    
Median ITL (ms):                         127.89    
P99 ITL (ms):                            140.57    
==================================================

Test setup: --num-prompts 200 --max-concurrency 8

============ Serving Benchmark Result ============
Successful requests:                     200       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  541.15    
Total input tokens:                      51200     
Total generated tokens:                  25600     
Request throughput (req/s):              0.37      
Output token throughput (tok/s):         47.31     
Peak output token throughput (tok/s):    58.00     
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          141.92    
---------------Time to First Token----------------
Mean TTFT (ms):                          2070.00   
Median TTFT (ms):                        2212.61   
P99 TTFT (ms):                           2743.95   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          154.09    
Median TPOT (ms):                        149.52    
P99 TPOT (ms):                           203.62    
---------------Inter-token Latency----------------
Mean ITL (ms):                           152.89    
Median ITL (ms):                         144.33    
P99 ITL (ms):                            450.19    
==================================================
