Open
Description
It appears that DCGM is not able to detect a GPU that is stuck in the EBUSY state because of a pending recovery action.
Example:
# GPU 1 is stuck in the EBUSY state because of a GSP issue
nvidia-smi --query-gpu=index,gpu_recovery_action --format=csv,noheader
0, None
1, Reset <= GPU is unusable, reset required
2, None
3, None
4, None
5, None
6, None
7, None
dcgmi dmon -c 1 -e 230
#Entity XIDER
ID
GPU 0 0
GPU 1 120
GPU 2 0
GPU 3 0
GPU 4 0
GPU 5 0
GPU 6 0
GPU 7 0
This means that GPU 1 is completely unusable, but the health check reports nothing:
$ dcgmi health -s a
Health monitor systems set successfully.
$ dcgmi health -c -j
{
    "body" :
    {
        "Overall Health" :
        {
            "value" : "Healthy"
        }
    },
    "header" :
    [
        "Health Monitor Report"
    ]
}
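Until DCGM reports this condition, one possible stopgap is to cross-check the nvidia-smi query shown above. The following is only a minimal sketch, not anything DCGM provides: the function names are hypothetical, and it assumes that any gpu_recovery_action value other than "None" should be treated as unhealthy.

```python
import csv
import io
import subprocess

# Same query as in the example above.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,gpu_recovery_action",
    "--format=csv,noheader",
]


def parse_recovery_actions(csv_text):
    """Map GPU index -> recovery action from 'csv,noheader' output."""
    actions = {}
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        actions[int(row[0])] = row[1].strip()
    return actions


def gpus_needing_recovery(actions):
    """Return sorted GPU indices whose recovery action is not 'None'.

    Assumption: every value other than 'None' means the GPU needs attention.
    """
    return sorted(idx for idx, action in actions.items() if action != "None")


def check_host():
    """Hypothetical helper: run nvidia-smi and return the affected GPU indices."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return gpus_needing_recovery(parse_recovery_actions(result.stdout))
```

Calling check_host() on the affected node would be expected to return the index of the GPU with the pending reset, which could then be fed into whatever alerting is used alongside dcgmi health.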