Description
1. Problem description
I created a pod with nvidia.com/gpucores=20. When I started the training task in the pod, the HAMi-WEB asset overview showed 100% compute utilization, but it should be capped at 20%.
2. Environment Configuration
I configured gpuCorePolicy=force
3. Here is the YAML file I used to create the pod:
cat test.yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-pod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      containers:
        - name: simple-container
          image: pytorch/pytorch:2.9.0-cuda13.0-cudnn9-runtime
          command: ["python", "test_gpu.py"]
          env:
            - name: LIBCUDA_LOG_LEVEL
              value: "4"
          resources:
            requests:
              cpu: "1"
              memory: "1Gi"
              nvidia.com/gpu: "1"
              nvidia.com/gpucores: 20
              #nvidia.com/gpumem: "4000"
            limits:
              cpu: "1"
              memory: "1Gi"
              nvidia.com/gpu: "1"
              nvidia.com/gpucores: 20
              #nvidia.com/gpumem: "4000"
          volumeMounts:
            - name: data-volume
              mountPath: /workspace
            - name: shm-volume
              mountPath: /dev/shm
      volumes:
        - name: data-volume
          hostPath:
            path: /root/vgpu
            type: Directory
        - name: shm-volume
          emptyDir:
            medium: Memory
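The Deployment runs `test_gpu.py`, which is not included above. For reference, a minimal stand-in that produces sustained GPU load looks like the sketch below (the workload is my assumption, not the original script); with the core limiter enforcing `nvidia.com/gpucores: 20`, utilization observed via `nvidia-smi` inside the pod should settle near 20%, which can then be compared against the HAMi-WEB figure.

```python
# Hypothetical stand-in for test_gpu.py (the original script is not shown
# in this issue). It spins dense matmuls so the HAMi core limiter has a
# steady compute load to throttle.
import time

def run_for(seconds: float) -> int:
    """Run GPU matmuls for `seconds`; return how many iterations completed."""
    import torch  # available in the pytorch/pytorch:2.9.0-cuda13.0 image
    a = torch.randn(4096, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")
    iters = 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        a = a @ b  # sustained compute for the core limiter to throttle
        iters += 1
    torch.cuda.synchronize()  # make sure all queued work finished
    return iters

if __name__ == "__main__":
    try:
        # Run long enough to watch utilization settle in nvidia-smi / HAMi-WEB.
        print("iterations completed:", run_for(600.0))
    except Exception as exc:  # no torch / no GPU outside the pod
        print("skipped:", exc)
```

While this runs, `kubectl exec` into the pod and check `nvidia-smi` to see which utilization value the limiter actually reports.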