`gpu_healthcheck.sh` is a batteries-included playbook for testing used or unknown NVIDIA GPUs on Ubuntu. It collects hardware/driver info, logs temperatures/clocks/power at 1 Hz, runs a headless OpenGL stress test (glmark2), an optional CUDA stress test (gpu-burn), and an optional PyTorch CUDA micro-benchmark. All outputs are saved to a timestamped log folder.
- Inventory: `uname`, `lspci`, `nvidia-smi -L` / `nvidia-smi -q`
- Telemetry (1 Hz): `nvidia-smi … --format=csv -l 1` and `nvidia-smi dmon` (see the sketch after this list)
- Sanity: `glxinfo` (first 50 lines)
- OpenGL stress: `glmark2 --off-screen` (headless)
- CUDA stress (optional): build & run gpu-burn if the CUDA toolkit (`nvcc`) is present
- Optional compute check: PyTorch CUDA matmul timing in a venv
- Final snapshot: `nvidia-smi -q` after tests
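For reference, the 1 Hz telemetry can be reproduced by hand. The `--query-gpu` field list below is only an illustration; the script's actual field list (elided above as "…") may differ:

```bash
# Illustrative only: a plausible 1 Hz telemetry logger plus per-sample dmon log.
nvidia-smi \
  --query-gpu=timestamp,name,temperature.gpu,power.draw,utilization.gpu,clocks.sm,clocks.mem,memory.used \
  --format=csv -l 1 >> nvidia_telemetry.csv &

nvidia-smi dmon -d 1 >> nvidia_dmon.log &
```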
⚠️ Designed for NVIDIA on Ubuntu. AMD/ROCm variant not included (ask if you want it).
- Ubuntu 20.04/22.04/24.04 with the NVIDIA proprietary driver installed (`nvidia-smi` works)
- Internet access (for `apt` and `git`), unless `APT_INSTALL=0`
- CUDA toolkit only if you want gpu-burn (otherwise it's skipped; see the sketch below)
```bash
# 1) Save the script
chmod +x gpu_healthcheck.sh

# 2) Run (will apt-install needed tools unless disabled)
./gpu_healthcheck.sh
```

After it finishes, check the created folder:
`gpu_health_logs_YYYYMMDD_HHMMSS/`

Key files:

- `baseline.txt` – system & GPU inventory
- `nvidia_telemetry.csv` – 1 Hz temps/power/util/clocks during the run
- `nvidia_dmon.log` – per-sample stats
- `glmark2_offscreen.log` – OpenGL stress score/output
- `gpu_burn_build.log`, `gpu_burn_run.log` – present only if CUDA was found
- `torch_cuda_test.log` – present only if `PYTORCH_RUN=1`
- `final_nvidia_smi_q.txt` – end-state snapshot
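A couple of quick ways to inspect those logs (a sketch; adjust the folder glob to your actual run):

```bash
# Pull the OpenGL stress score from the glmark2 log
grep -i "glmark2 Score" gpu_health_logs_*/glmark2_offscreen.log

# Follow temps/clocks live while a run is still in progress
tail -f gpu_health_logs_*/nvidia_telemetry.csv
```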
You can tweak behavior via env vars (set before the script command):
| Variable | Default | What it does |
|---|---|---|
| `GLMARK2_DURATION_SEC` | `300` | Seconds to run glmark2 (off-screen). Increase for longer OpenGL stress. |
| `GPUBURN_DURATION_SEC` | `600` | Seconds to run gpu-burn (only if the CUDA toolkit is installed). |
| `PYTORCH_RUN` | `0` | `1` = run a small PyTorch CUDA matmul test in an isolated venv. |
| `APT_INSTALL` | `1` | `0` = don't apt-install anything (assumes tools are already present). |
| `LOG_ROOT` | `"$PWD"` | Parent directory where the timestamped log folder is created. |
Longer stress:

```bash
GLMARK2_DURATION_SEC=900 GPUBURN_DURATION_SEC=3600 ./gpu_healthcheck.sh
```

Skip apt installs (air-gapped / preprovisioned node):

```bash
APT_INSTALL=0 ./gpu_healthcheck.sh
```

Include PyTorch compute sanity:

```bash
PYTORCH_RUN=1 ./gpu_healthcheck.sh
```

Write logs to a different path:

```bash
LOG_ROOT=/var/log/gpuchecks ./gpu_healthcheck.sh
```

- Thermals: Under sustained load, temps should stabilize (typically < 85 °C). Sudden spikes plus clock drops indicate throttling or a cooling issue (a quick post-run check is sketched after this list).
- Stability: `gpu_burn_run.log` should finish without errors (if run). Kernel `Xid` errors or crashes are red flags.
- VRAM/Artifacts: While glmark2 runs, watch for driver resets or log errors. (For visual artifact checks, also run a Unigine or GpuTest donut externally.)
- Power/Clocks: In `nvidia_telemetry.csv`, power draw and clocks should rise and remain stable near expected values for the card/TDP.
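A minimal post-run sanity pass, assuming the telemetry CSV header includes a `temperature.gpu` column (the exact field list depends on the script's query):

```bash
# Peak GPU temperature seen during the run; finds the temperature column by
# its CSV header name rather than assuming a fixed position.
awk -F', ' 'NR==1 { for (i=1; i<=NF; i++) if ($i ~ /temperature\.gpu/) c=i; next }
            c && $c+0 > max { max = $c+0 }
            END { print "Peak GPU temp:", max, "C" }' nvidia_telemetry.csv

# Kernel Xid errors logged during/after the run are a red flag
sudo dmesg | grep -i xid
```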
- `nvidia-smi: command not found` or "couldn't communicate with driver": install/activate the NVIDIA proprietary driver and reboot. Secure Boot may require MOK enrollment.
- gpu-burn skipped: you don't have the CUDA toolkit (`nvcc`). Install CUDA for the full stress test, or ignore it if not needed.
- Wayland/compositor overhead: the test uses off-screen rendering and should be fine headless. For gameplay testing, try an Xorg session if you see stutter.
- Networking disabled: set `APT_INSTALL=0` and pre-install the dependencies: `git build-essential glmark2 mesa-utils vulkan-tools nvtop python3 python3-venv python3-pip`.
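For the air-gapped case, one way to pre-install that list ahead of time (standard apt usage; the script may install a slightly different set):

```bash
# Pre-install the tools the script would otherwise apt-install itself
sudo apt-get update
sudo apt-get install -y git build-essential glmark2 mesa-utils vulkan-tools nvtop \
  python3 python3-venv python3-pip
```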
These tests push near-peak heat and power. Ensure adequate airflow and monitor temps while running. Stop if temps approach your card’s throttle/shutdown range (often mid-80s to 90 °C depending on model and cooler). Avoid running intensive tests on unstable power or in dusty/blocked enclosures.
- Quick health check (5–10 min): default run → inspect `glmark2_offscreen.log` and `nvidia_telemetry.csv`.
- Deep burn-in (1–12 hrs): raise `GLMARK2_DURATION_SEC` / `GPUBURN_DURATION_SEC` → confirm no errors or throttling.
- Compute sanity for ML: `PYTORCH_RUN=1` → check `torch_cuda_test.log` for CUDA availability and rough matmul timing (a standalone sketch of that check follows this list).
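If you want the same signal without the script, a minimal stand-in for the PyTorch check looks like this (run inside any venv with `torch` installed; the script's tensor size, venv layout, and output format may differ):

```bash
# Hypothetical stand-in for the PYTORCH_RUN=1 step: confirm CUDA is visible to
# PyTorch and time one large matmul. Numbers are only a rough sanity signal.
python3 - <<'PY'
import time
import torch

assert torch.cuda.is_available(), "CUDA not visible to PyTorch"
x = torch.randn(8192, 8192, device="cuda")
torch.cuda.synchronize()
t0 = time.time()
y = x @ x
torch.cuda.synchronize()
print(f"8192x8192 matmul: {time.time() - t0:.3f}s on {torch.cuda.get_device_name(0)}")
PY
```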
If you want, I can add a companion script to zip and summarize the log folder or to integrate GpuTest/Unigine runs into the same log bundle.