Merged
68 changes: 36 additions & 32 deletions README.md
@@ -2,13 +2,13 @@

## Introduction

This Ascend device plugin is implemented for [HAMi](https://github.com/Project-HAMi/HAMi) scheduling.
This Ascend device plugin is implemented for [HAMi](https://github.com/Project-HAMi/HAMi) and [volcano](https://github.com/volcano-sh/volcano) scheduling.

Memory slicing is supported based on virtualization template, lease available template is automatically used. For detailed information, check [templeate](./config.yaml)
Memory slicing is supported through virtualization templates: the smallest template that satisfies the requested memory is used automatically. For details, see the [template configuration](./ascend-device-configmap.yaml).

## Prerequisites

[ascend-docker-runtime](https://gitee.com/ascend/ascend-docker-runtime)
[ascend-docker-runtime](https://gitcode.com/Ascend/mind-cluster/tree/master/component/ascend-docker-runtime)

## Compile

@@ -24,51 +24,32 @@ docker buildx build -t $IMAGE_NAME .

## Deployment

Due to dependencies with HAMi, you need to set
### Label the Node with `ascend=on`

```
devices.ascend.enabled=true
```

during HAMi installation. For more details, see 'devices' section in values.yaml.

```yaml
devices:
  ascend:
    enabled: true
    image: "ascend-device-plugin:master"
    imagePullPolicy: IfNotPresent
    extraArgs: []
    nodeSelector:
      ascend: "on"
    tolerations: []
    resources:
      - huawei.com/Ascend910A
      - huawei.com/Ascend910A-memory
      - huawei.com/Ascend910B
      - huawei.com/Ascend910B-memory
      - huawei.com/Ascend310P
      - huawei.com/Ascend310P-memory
```
kubectl label node {ascend-node} ascend=on
```

Note that resources here(hawei.com/Ascend910A,huawei.com/Ascend910B,...) is managed in hami-scheduler-device configMap. It defines three different templates(910A,910B,310P).

label your NPU nodes with 'ascend=on'
### Deploy ConfigMap

```
kubectl label node {ascend-node} ascend=on
kubectl apply -f ascend-device-configmap.yaml
```

Deploy ascend-device-plugin by running
### Deploy `ascend-device-plugin`

```bash
kubectl apply -f ascend-device-plugin.yaml
```

To schedule Ascend devices with HAMi, set `devices.ascend.enabled` to `true` when deploying HAMi; the ConfigMap and `ascend-device-plugin` are then deployed automatically. For details, see https://github.com/Project-HAMi/HAMi/blob/master/charts/hami/README.md#huawei-ascend
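
As a minimal sketch of that setting, a Helm values override could look like the fragment below. The nesting (`devices`, then `ascend`, then `enabled`) is inferred from the dotted key and is an assumption about the chart's layout; the HAMi chart README linked above is the authoritative source.

```yaml
# Hypothetical values override for the HAMi Helm chart (layout assumed
# from the dotted key `devices.ascend.enabled`).
devices:
  ascend:
    enabled: true   # asks HAMi to deploy the ConfigMap and ascend-device-plugin
```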

## Usage

You can allocate a slice of NPU by specifying both resource number and resource memory. For more examples, see [examples](./examples/)
To use an entire card exclusively, or to request multiple cards, set only the corresponding resource name (for example `huawei.com/Ascend910B`). To let multiple tasks share one NPU, set that resource request to 1 and also set the matching memory resource (for example `huawei.com/Ascend910B-memory`).

### Usage in HAMi

```yaml
...
@@ -81,3 +62,26 @@ You can allocate a slice of NPU by specifying both resource number and resource
# if you don't specify Ascend910B-memory, it will use a whole NPU.
huawei.com/Ascend910B-memory: "4096"
```
For more examples, see [examples](./examples/)
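
For comparison with the shared-NPU example, here is a sketch of the exclusive case: a container that requests whole cards by setting only the card resource and omitting the memory resource. The pod name, container name, and image are placeholders, not taken from this repository.

```yaml
# Sketch only: names and image below are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: ascend-whole-card-pod
spec:
  containers:
    - name: npu-container
      image: ubuntu:22.04
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          # No huawei.com/Ascend910B-memory entry, so whole NPUs are used.
          # "2" requests two cards; use "1" for a single whole card.
          huawei.com/Ascend910B: "2"
```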

### Usage in volcano

Volcano must be installed before use. For more information, see the [Volcano vNPU user guide](https://github.com/volcano-sh/volcano/tree/master/docs/user-guide/how_to_use_vnpu.md).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ascend-pod
spec:
  schedulerName: volcano
  containers:
    - name: ubuntu-container
      image: swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-pytorch:24.0.RC1-A2-1.11.0-ubuntu20.04
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          huawei.com/Ascend310P: "1"
          huawei.com/Ascend310P-memory: "4096"
```
73 changes: 41 additions & 32 deletions README_cn.md
@@ -2,15 +2,15 @@

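## Introduction

An Ascend device plugin based on the [HAMi](https://github.com/Project-HAMi/HAMi) scheduling mechanism.
The Ascend device plugin supports scheduling Ascend NPU devices in [HAMi](https://github.com/Project-HAMi/HAMi) and [volcano](https://github.com/volcano-sh/volcano).

Memory-based scheduling is supported. Memory is sliced according to Ascend virtualization templates, and the smallest template that satisfies the memory request is used as the container's memory. For template details, see the [configuration template](./config.yaml)
Ascend NPU virtualization slicing is configured through templates. During scheduling, the smallest template that satisfies the memory request is used as the container's memory. For the template configuration of each chip, see [here](./ascend-device-configmap.yaml)

Starting containers depends on [ascend-docker-runtime](https://gitee.com/ascend/ascend-docker-runtime).
## Prerequisites

## Compile
Deploy [ascend-docker-runtime](https://gitcode.com/Ascend/mind-cluster/tree/master/component/ascend-docker-runtime)

### Compile the binary
## Compile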

```bash
make all
@@ -24,47 +24,33 @@ docker buildx build -t $IMAGE_NAME .

## Deployment

Because of dependencies on HAMi, deployment is integrated into the HAMi deployment. Set the following field:

```
devices.ascend.enabled=true
```
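### Label the Node with the ascend label

For the resource name of each NPU device type, see the following fields in values.yaml. This component currently supports slicing for three NPU models (310p, 910A, 910B). If no changes are needed, the default configuration below can be used as is: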

```yaml
devices:
  ascend:
    enabled: true
    image: "ascend-device-plugin:master"
    imagePullPolicy: IfNotPresent
    extraArgs: []
    nodeSelector:
      ascend: "on"
    tolerations: []
    resources:
      - huawei.com/Ascend910A
      - huawei.com/Ascend910A-memory
      - huawei.com/Ascend910B
      - huawei.com/Ascend910B-memory
      - huawei.com/Ascend310P
      - huawei.com/Ascend310P-memory
```
kubectl label node {ascend-node} ascend=on
```

Label the NPU nodes in the cluster as follows:
### Deploy ConfigMap

```
kubectl label node {ascend-node} ascend=on
kubectl apply -f ascend-device-configmap.yaml
```

Finally, deploy ascend-device-plugin with the following command:
### Deploy `ascend-device-plugin`

```bash
kubectl apply -f ascend-device-plugin.yaml
```

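To use Ascend NPUs in HAMi, set `devices.ascend.enabled` to true when deploying HAMi; the ConfigMap and `ascend-device-plugin` are then deployed automatically. See https://github.com/Project-HAMi/HAMi/blob/master/charts/hami/README.md#huawei-ascend

## Usage

To use an entire card exclusively or to request multiple cards, set only the corresponding resourceName. If multiple tasks need to share one card, set resourceName to 1 and also set the corresponding ResourceMemoryName.

### Usage in HAMi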

```yaml
...
containers:
@@ -73,6 +59,29 @@ kubectl apply -f ascend-device-plugin.yaml
resources:
limits:
huawei.com/Ascend910B: "1"
# if memory is not specified, the whole card is used by default
# if the memory size is not specified, the whole card is used
huawei.com/Ascend910B-memory: "4096"
```
For more examples, see [examples](./examples/)

### Usage in volcano

volcano must be deployed in advance. For more information, see [here](https://github.com/volcano-sh/volcano/tree/master/docs/user-guide/how_to_use_vnpu.md)

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ascend-pod
spec:
  schedulerName: volcano
  containers:
    - name: ubuntu-container
      image: swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-pytorch:24.0.RC1-A2-1.11.0-ubuntu20.04
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          huawei.com/Ascend310P: "1"
          huawei.com/Ascend310P-memory: "4096"
```
File renamed without changes.
79 changes: 0 additions & 79 deletions config.yaml

This file was deleted.