GPU Health Check (Ubuntu, NVIDIA)

gpu_healthcheck.sh is a batteries-included playbook for testing used or unknown NVIDIA GPUs on Ubuntu. It collects hardware and driver info, logs temperatures/clocks/power at 1 Hz, runs a headless OpenGL stress test (glmark2), an optional CUDA stress test (gpu-burn), and an optional PyTorch CUDA micro-benchmark. All output is saved to a timestamped log folder.


What it does

  • Inventory: uname, lspci, nvidia-smi -L/-q
  • Telemetry (1 Hz): nvidia-smi … --format=csv -l 1 and nvidia-smi dmon (sketched after this list)
  • Sanity: glxinfo (first 50 lines)
  • OpenGL stress: glmark2 --off-screen (headless)
  • CUDA stress (optional): build & run gpu-burn if CUDA toolkit (nvcc) is present
  • Optional compute check: PyTorch CUDA matmul timing in a venv
  • Final snapshot: nvidia-smi -q after tests
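
For reference, the 1 Hz telemetry collection boils down to roughly the following (a minimal sketch; the exact query fields used by the script may differ):

# Hedged sketch of the telemetry loggers; the field list is illustrative
nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,utilization.gpu,clocks.sm,clocks.mem \
  --format=csv -l 1 > nvidia_telemetry.csv &
nvidia-smi dmon -d 1 > nvidia_dmon.log &
# when run by hand like this, remember to kill %1 %2 afterwards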

⚠️ Designed for NVIDIA GPUs on Ubuntu. An AMD/ROCm variant is not included.


Requirements

  • Ubuntu 20.04/22.04/24.04 with NVIDIA proprietary driver installed (nvidia-smi works)
  • Internet access (for apt and git), unless APT_INSTALL=0
  • CUDA toolkit only if you want gpu-burn (otherwise it’s skipped)
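
Before running, a quick pre-flight check can confirm the prerequisites (a minimal sketch, not part of the script):

# Driver responding? CUDA toolkit present (determines whether gpu-burn runs)?
nvidia-smi -L || echo "NVIDIA driver not responding - install/activate it first"
command -v nvcc >/dev/null && echo "nvcc found: gpu-burn will be built and run" \
                           || echo "no nvcc: the gpu-burn step will be skipped"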

Quick start

# 1) Save the script and make it executable
chmod +x gpu_healthcheck.sh

# 2) Run (will apt-install needed tools unless disabled)
./gpu_healthcheck.sh

After it finishes, check the created folder:

gpu_health_logs_YYYYMMDD_HHMMSS/

Key files:

  • baseline.txt – system & GPU inventory
  • nvidia_telemetry.csv – 1 Hz temps/power/util/clocks during the run (a quick-look snippet follows this list)
  • nvidia_dmon.log – per-sample stats
  • glmark2_offscreen.log – OpenGL stress score/output
  • gpu_burn_build.log, gpu_burn_run.log – present only if CUDA was found
  • torch_cuda_test.log – present only if PYTORCH_RUN=1
  • final_nvidia_smi_q.txt – end-state snapshot
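
A quick way to eyeball the telemetry afterwards (a sketch only; the temperature column index depends on the field order the script queries, so adjust $2 if needed):

# Peak GPU temperature recorded in the CSV, assuming temperature is the 2nd field
awk -F', ' 'NR>1 && $2+0 > max {max=$2+0} END {print "max temp:", max, "C"}' \
  gpu_health_logs_*/nvidia_telemetry.csv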

Modes & environment variables

You can tweak behavior via environment variables (set them on the command line before the script):

Variable               Default   What it does
GLMARK2_DURATION_SEC   300       Seconds to run glmark2 (off-screen). Increase for longer OpenGL stress.
GPUBURN_DURATION_SEC   600       Seconds to run gpu-burn (only if the CUDA toolkit is installed).
PYTORCH_RUN            0         1 = run a small PyTorch CUDA matmul test in an isolated venv.
APT_INSTALL            1         0 = don't apt-install anything (assumes tools are already present).
LOG_ROOT               "$PWD"    Parent directory where the timestamped log folder is created.

Examples

Longer stress:

GLMARK2_DURATION_SEC=900 GPUBURN_DURATION_SEC=3600 ./gpu_healthcheck.sh

Skip apt installs (air-gapped / preprovisioned node):

APT_INSTALL=0 ./gpu_healthcheck.sh

Include PyTorch compute sanity:

PYTORCH_RUN=1 ./gpu_healthcheck.sh

Write logs to a different path:

LOG_ROOT=/var/log/gpuchecks ./gpu_healthcheck.sh

Interpreting results (accept/reject cues)

  • Thermals: Under sustained load, temps should stabilize (typ. < 85 °C). Sudden spikes + clock drops = throttling/cooling issue.
  • Stability: gpu_burn_run.log should finish without errors (if run). Kernel Xid errors or crashes are red flags (a quick dmesg check is sketched after this list).
  • VRAM/Artifacts: While glmark2 runs, watch for driver resets or log errors. (For visual artifact checks, also run a Unigine/GpuTest donut externally.)
  • Power/Clocks: In nvidia_telemetry.csv, power draw and clocks should rise and remain stable near expected values for the card/TDP.
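
A simple post-run check for the Xid errors mentioned above (NVIDIA driver faults are normally logged to the kernel log with an "Xid" tag; exact wording can vary by driver version):

# Any NVIDIA Xid messages in the kernel log?
sudo dmesg | grep -i "xid" || echo "no Xid messages found"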

Troubleshooting

  • nvidia-smi: command not found or “couldn’t communicate with driver”
    Install/activate the NVIDIA proprietary driver and reboot. Secure Boot may require MOK enrollment.
  • gpu-burn skipped
    You don’t have the CUDA toolkit (nvcc). Install CUDA for full stress or ignore if not needed.
  • Wayland/compositor overhead
    The test uses off-screen rendering and should be fine headless. For gameplay testing, try an Xorg session if you see stutter.
  • Networking disabled
    Set APT_INSTALL=0 and pre-install dependencies:
    git build-essential glmark2 mesa-utils vulkan-tools nvtop python3 python3-venv python3-pip.
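
For example, on a machine or image that still has network access, those packages can be staged in advance (same names as above):

sudo apt-get update && sudo apt-get install -y \
  git build-essential glmark2 mesa-utils vulkan-tools nvtop \
  python3 python3-venv python3-pip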

Safety notes

These tests push near-peak heat and power. Ensure adequate airflow and monitor temps while running. Stop if temps approach your card’s throttle/shutdown range (often mid-80s to 90 °C depending on model and cooler). Avoid running intensive tests on unstable power or in dusty/blocked enclosures.
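
While a long run is in progress, a second terminal can watch temperatures and power live, either with nvtop (installed as a dependency) or with a plain nvidia-smi loop (the field names below are standard query fields):

# Refresh temperature, power draw, and SM clock every second
watch -n 1 nvidia-smi --query-gpu=temperature.gpu,power.draw,clocks.sm --format=csv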


Common workflows

  • Quick health check (5–10 min): default run → inspect glmark2_offscreen.log and nvidia_telemetry.csv.
  • Deep burn-in (1–12 hrs): raise GLMARK2_DURATION_SEC/GPUBURN_DURATION_SEC (example after this list) → confirm no errors or throttling.
  • Compute sanity for ML: PYTORCH_RUN=1 → check torch_cuda_test.log for CUDA availability and rough matmul timing.
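
An overnight burn-in along those lines might look like this (durations are illustrative: roughly 1 h of glmark2 and 4 h of gpu-burn):

GLMARK2_DURATION_SEC=3600 GPUBURN_DURATION_SEC=14400 PYTORCH_RUN=1 ./gpu_healthcheck.sh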

A natural follow-up would be a companion script that zips and summarizes the log folder, or that folds GpuTest/Unigine runs into the same log bundle (rough sketch below).
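
A rough sketch of that idea, using the folder and file names the script already produces (the glmark2 score line is assumed to contain "Score"):

LOGDIR=$(ls -dt gpu_health_logs_* | head -1)                # most recent run
tar czf "${LOGDIR}.tar.gz" "$LOGDIR"                        # bundle the whole log folder
grep -i "score" "$LOGDIR/glmark2_offscreen.log" | tail -1   # glmark2 score line, if present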
