diff --git a/README.md b/README.md
index e1a9d86..64feba6 100644
--- a/README.md
+++ b/README.md
@@ -2,13 +2,13 @@
 ## Introduction
 
-This Ascend device plugin is implemented for [HAMi](https://github.com/Project-HAMi/HAMi) scheduling.
+This Ascend device plugin is implemented for [HAMi](https://github.com/Project-HAMi/HAMi) and [volcano](https://github.com/volcano-sh/volcano) scheduling.
 
-Memory slicing is supported based on virtualization template, lease available template is automatically used. For detailed information, check [templeate](./config.yaml)
+Memory slicing is supported based on virtualization templates; the smallest template that satisfies the memory request is selected automatically. For detailed information, check the [templates](./ascend-device-configmap.yaml)
 
 ## Prerequisites
 
-[ascend-docker-runtime](https://gitee.com/ascend/ascend-docker-runtime)
+[ascend-docker-runtime](https://gitcode.com/Ascend/mind-cluster/tree/master/component/ascend-docker-runtime)
 
 ## Compile
 
@@ -24,51 +24,32 @@ docker buildx build -t $IMAGE_NAME .
 
 ## Deployment
 
-Due to dependencies with HAMi, you need to set
+### Label the Node with `ascend=on`
 
-```
-devices.ascend.enabled=true
-```
-
-during HAMi installation. For more details, see 'devices' section in values.yaml.
-```yaml
-devices:
-  ascend:
-    enabled: true
-    image: "ascend-device-plugin:master"
-    imagePullPolicy: IfNotPresent
-    extraArgs: []
-    nodeSelector:
-      ascend: "on"
-    tolerations: []
-    resources:
-      - huawei.com/Ascend910A
-      - huawei.com/Ascend910A-memory
-      - huawei.com/Ascend910B
-      - huawei.com/Ascend910B-memory
-      - huawei.com/Ascend310P
-      - huawei.com/Ascend310P-memory
 ```
+kubectl label node {ascend-node} ascend=on
+```
-Note that resources here(hawei.com/Ascend910A,huawei.com/Ascend910B,...) is managed in hami-scheduler-device configMap. It defines three different templates(910A,910B,310P).
-
-label your NPU nodes with 'ascend=on'
+### Deploy ConfigMap
 
 ```
-kubectl label node {ascend-node} ascend=on
+kubectl apply -f ascend-device-configmap.yaml
 ```
 
-Deploy ascend-device-plugin by running
+### Deploy `ascend-device-plugin`
 
 ```bash
 kubectl apply -f ascend-device-plugin.yaml
 ```
+If scheduling Ascend devices in HAMi, simply set `devices.ascend.enabled` to true when deploying HAMi, and the ConfigMap and `ascend-device-plugin` will be deployed automatically; refer to https://github.com/Project-HAMi/HAMi/blob/master/charts/hami/README.md#huawei-ascend
 
 ## Usage
 
-You can allocate a slice of NPU by specifying both resource number and resource memory. For more examples, see [examples](./examples/)
+To use an entire card exclusively, or to request multiple cards, you only need to set the corresponding resourceName. If multiple tasks need to share the same NPU, set the corresponding resource request to 1 and configure the appropriate resourceMemoryName.
+
+### Usage in HAMi
 
 ```yaml
 ...
@@ -81,3 +62,26 @@ You can allocate a slice of NPU by specifying both resource number and resource 
          # if you don't specify Ascend910B-memory, it will use a whole NPU.
          huawei.com/Ascend910B-memory: "4096"
 ```
+For more examples, see [examples](./examples/)
+
+### Usage in volcano
+
+Volcano must be installed before use; for more information, see [here](https://github.com/volcano-sh/volcano/tree/master/docs/user-guide/how_to_use_vnpu.md)
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: ascend-pod
+spec:
+  schedulerName: volcano
+  containers:
+    - name: ubuntu-container
+      image: swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-pytorch:24.0.RC1-A2-1.11.0-ubuntu20.04
+      command: ["sleep"]
+      args: ["100000"]
+      resources:
+        limits:
+          huawei.com/Ascend310P: "1"
+          huawei.com/Ascend310P-memory: "4096"
+```
\ No newline at end of file
diff --git a/README_cn.md b/README_cn.md
index 1dafac1..156ca53 100644
--- a/README_cn.md
+++ b/README_cn.md
@@ -2,15 +2,15 @@
 ## 说明
 
-基于[HAMi](https://github.com/Project-HAMi/HAMi)调度机制的ascend device plugin。
+Ascend device plugin 是用来支持在 [HAMi](https://github.com/Project-HAMi/HAMi) 和 [volcano](https://github.com/volcano-sh/volcano) 中调度昇腾 NPU 设备。
 
-支持基于显存调度,显存是基于昇腾的虚拟化模板来切分的,会找到满足显存需求的最小模板来作为容器的显存。模版的具体信息参考[配置模版](./config.yaml)
+昇腾 NPU 虚拟化切分是通过模板来配置的,在调度时会找到满足显存需求的最小模板来作为容器的显存。各芯片的模板配置信息参考[这里](./ascend-device-configmap.yaml)
 
-启动容器依赖[ascend-docker-runtime](https://gitee.com/ascend/ascend-docker-runtime)。
+## 环境要求
 
-## 编译
+部署 [ascend-docker-runtime](https://gitcode.com/Ascend/mind-cluster/tree/master/component/ascend-docker-runtime)
 
-### 编译二进制文件
+## 编译
 
 ```bash
 make all
@@ -24,47 +24,33 @@ docker buildx build -t $IMAGE_NAME .
 ## 部署
 
-由于和HAMi的一些依赖关系,部署集成在HAMi的部署中,指定以下字段:
-
-```
-devices.ascend.enabled=true
-```
+### 给 Node 打 ascend 标签
 
-相关的每一种NPU设备的资源名,参考values.yaml中的以下字段,目前本组件支持3种型号的NPU切片(310p,910A,910B)若不需要修改的话可以直接使用以下的默认配置:
-```yaml
-devices:
-  ascend:
-    enabled: true
-    image: "ascend-device-plugin:master"
-    imagePullPolicy: IfNotPresent
-    extraArgs: []
-    nodeSelector:
-      ascend: "on"
-    tolerations: []
-    resources:
-      - huawei.com/Ascend910A
-      - huawei.com/Ascend910A-memory
-      - huawei.com/Ascend910B
-      - huawei.com/Ascend910B-memory
-      - huawei.com/Ascend310P
-      - huawei.com/Ascend310P-memory
+```
+kubectl label node {ascend-node} ascend=on
 ```
 
-将集群中的NPU节点打上如下标签:
+### 部署 ConfigMap
 
 ```
-kubectl label node {ascend-node} ascend=on
+kubectl apply -f ascend-device-configmap.yaml
 ```
 
-最后使用以下指令部署ascend-device-plugin
+### 部署 `ascend-device-plugin`
 
 ```bash
 kubectl apply -f ascend-device-plugin.yaml
 ```
+如果要在 HAMi 中使用昇腾 NPU,在部署 HAMi 时设置 `devices.ascend.enabled` 为 true,会自动部署 ConfigMap 和 `ascend-device-plugin`,参考 https://github.com/Project-HAMi/HAMi/blob/master/charts/hami/README.md#huawei-ascend
+
 ## 使用
 
+如果要独占整卡或者申请多张卡,只需要设置对应的 resourceName 即可。如果多个任务要共享同一张卡,需要将对应的资源申请数量设置为 1,并且设置对应的 resourceMemoryName。
+
+### 在 HAMi 中使用
+
 ```yaml
 ...
 containers:
@@ -73,6 +59,29 @@ kubectl apply -f ascend-device-plugin.yaml
       resources:
         limits:
           huawei.com/Ascend910B: "1"
-          # 不填写显存默认使用整张卡
+          # 如果不指定显存大小,就会使用整张卡
           huawei.com/Ascend910B-memory: "4096"
 ```
+更多示例参考 [examples](./examples/)
+
+### 在 volcano 中使用
+
+在 volcano 中使用时需要提前部署好 volcano,更多信息请[参考这里](https://github.com/volcano-sh/volcano/tree/master/docs/user-guide/how_to_use_vnpu.md)
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: ascend-pod
+spec:
+  schedulerName: volcano
+  containers:
+    - name: ubuntu-container
+      image: swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-pytorch:24.0.RC1-A2-1.11.0-ubuntu20.04
+      command: ["sleep"]
+      args: ["100000"]
+      resources:
+        limits:
+          huawei.com/Ascend310P: "1"
+          huawei.com/Ascend310P-memory: "4096"
+```
\ No newline at end of file
diff --git a/hami-scheduler-device.yaml b/ascend-device-configmap.yaml
similarity index 100%
rename from hami-scheduler-device.yaml
rename to ascend-device-configmap.yaml
diff --git a/config.yaml b/config.yaml
deleted file mode 100644
index 945e692..0000000
--- a/config.yaml
+++ /dev/null
@@ -1,79 +0,0 @@
-vnpus:
-- chipName: 910A
-  commonWord: Ascend910A
-  resourceName: huawei.com/Ascend910A
-  resourceMemoryName: huawei.com/Ascend910A-memory
-  memoryAllocatable: 32768
-  memoryCapacity: 32768
-  aiCore: 30
-  templates:
-    - name: vir02
-      memory: 2184
-      aiCore: 2
-    - name: vir04
-      memory: 4369
-      aiCore: 4
-    - name: vir08
-      memory: 8738
-      aiCore: 8
-    - name: vir16
-      memory: 17476
-      aiCore: 16
-- chipName: 910B3
-  commonWord: Ascend910B3
-  resourceName: huawei.com/Ascend910B3
-  resourceMemoryName: huawei.com/Ascend910B3-memory
-  memoryAllocatable: 65536
-  memoryCapacity: 65536
-  aiCore: 20
-  aiCPU: 7
-  templates:
-    - name: vir05_1c_16g
-      memory: 16384
-      aiCore: 5
-      aiCPU: 1
-    - name: vir10_3c_32g
-      memory: 32768
-      aiCore: 10
-      aiCPU: 3
-- chipName: 310P3
-  commonWord: Ascend310P
-  resourceName: huawei.com/Ascend310P
-  resourceMemoryName: huawei.com/Ascend310P-memory
-  memoryAllocatable: 21527
-  memoryCapacity: 24576
-  aiCore: 8
-  aiCPU: 7
-  templates:
-    - name: vir01
-      memory: 3072
-      aiCore: 1
-      aiCPU: 1
-    - name: vir02
-      memory: 6144
-      aiCore: 2
-      aiCPU: 2
-    - name: vir04
-      memory: 12288
-      aiCore: 4
-      aiCPU: 4
-- chipName: 910ProB
-  commonWord: Ascend910ProB
-  resourceName: huawei.com/Ascend910ProB
-  resourceMemoryName: huawei.com/Ascend910ProB-memory
-  memoryAllocatable: 32768
-  memoryCapacity: 32768
-  aiCore: 30
-  templates:
-    - name: vir02
-      memory: 2184
-      aiCore: 2
-    - name: vir04
-      memory: 4369
-      aiCore: 4
-    - name: vir08
-      memory: 8738
-      aiCore: 8
-    - name: vir16
-      memory: 17476
-      aiCore: 16
\ No newline at end of file
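
For intuition, the template-selection rule both READMEs describe — the scheduler picks the smallest virtualization template whose memory covers the requested `-memory` value — can be sketched as follows. This is an illustrative sketch, not the plugin's actual code: the function name and data layout are hypothetical, and only the memory values are taken from the 310P entry of the deleted config.yaml.

```python
# Illustrative sketch of "smallest template that satisfies the memory request".
# Template names and memory sizes come from the 310P entry shown in the diff;
# pick_template() is a hypothetical helper, not part of the plugin's API.

TEMPLATES_310P = [
    {"name": "vir04", "memory": 12288},
    {"name": "vir01", "memory": 3072},
    {"name": "vir02", "memory": 6144},
]

def pick_template(templates, requested_memory):
    """Return the smallest template whose memory covers the request, or None."""
    for tpl in sorted(templates, key=lambda t: t["memory"]):
        if tpl["memory"] >= requested_memory:
            return tpl
    return None  # no template fits; the request effectively needs a whole card

# A pod requesting huawei.com/Ascend310P-memory: "4096" gets vir02 (6144 MB),
# the smallest template at or above 4096 MB.
print(pick_template(TEMPLATES_310P, 4096)["name"])  # -> vir02
```

Under this rule, a container may receive more memory than it asked for (the template rounds the request up), which is why the READMEs point at the ConfigMap for the exact per-chip template sizes.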