GPU Health Check (Ubuntu, NVIDIA)

gpu_healthcheck.sh is a batteries-included playbook for testing used or unknown NVIDIA GPUs on Ubuntu. It collects hardware and driver info, logs temperatures/clocks/power at 1 Hz, runs a headless OpenGL stress test (glmark2), an optional CUDA stress test (gpu-burn), and an optional PyTorch CUDA micro-benchmark. All output is saved to a timestamped log folder.


What it does

  • Inventory: uname, lspci, nvidia-smi -L/-q
  • Telemetry (1 Hz): nvidia-smi … --format=csv -l 1 and nvidia-smi dmon (sketched after this list)
  • Sanity: glxinfo (first 50 lines)
  • OpenGL stress: glmark2 --off-screen (headless)
  • CUDA stress (optional): build & run gpu-burn if CUDA toolkit (nvcc) is present
  • Optional compute check: PyTorch CUDA matmul timing in a venv
  • Final snapshot: nvidia-smi -q after tests
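
For reference, the 1 Hz telemetry collection boils down to roughly the following (a minimal sketch; the exact query fields used by the script may differ):

# Hedged sketch of the telemetry loggers; the field list is illustrative
nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,utilization.gpu,clocks.sm,clocks.mem \
  --format=csv -l 1 > nvidia_telemetry.csv &
nvidia-smi dmon -d 1 > nvidia_dmon.log &
# when run by hand like this, remember to kill %1 %2 afterwards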

⚠️ Designed for NVIDIA GPUs on Ubuntu. An AMD/ROCm variant is not included.


Requirements

  • Ubuntu 20.04/22.04/24.04 with NVIDIA proprietary driver installed (nvidia-smi works)
  • Internet access (for apt and git), unless APT_INSTALL=0
  • CUDA toolkit only if you want gpu-burn (otherwise it’s skipped)
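
Before running, a quick pre-flight check can confirm the prerequisites (a minimal sketch, not part of the script):

# Driver responding? CUDA toolkit present (determines whether gpu-burn runs)?
nvidia-smi -L || echo "NVIDIA driver not responding - install/activate it first"
command -v nvcc >/dev/null && echo "nvcc found: gpu-burn will be built and run" \
                           || echo "no nvcc: the gpu-burn step will be skipped"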

Quick start

# 1) Save the script and make it executable
chmod +x gpu_healthcheck.sh

# 2) Run (will apt-install needed tools unless disabled)
./gpu_healthcheck.sh

After it finishes, check the created folder:

gpu_health_logs_YYYYMMDD_HHMMSS/

Key files:

  • baseline.txt – system & GPU inventory
  • nvidia_telemetry.csv – 1 Hz temps/power/util/clocks during the run (a quick-look snippet follows this list)
  • nvidia_dmon.log – per-sample stats
  • glmark2_offscreen.log – OpenGL stress score/output
  • gpu_burn_build.log, gpu_burn_run.log – present only if CUDA was found
  • torch_cuda_test.log – present only if PYTORCH_RUN=1
  • final_nvidia_smi_q.txt – end-state snapshot
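
A quick way to eyeball the telemetry afterwards (a sketch only; the temperature column index depends on the field order the script queries, so adjust $2 if needed):

# Peak GPU temperature recorded in the CSV, assuming temperature is the 2nd field
awk -F', ' 'NR>1 && $2+0 > max {max=$2+0} END {print "max temp:", max, "C"}' \
  gpu_health_logs_*/nvidia_telemetry.csv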

Modes & environment variables

You can tweak behavior via environment variables (set them on the command line before the script):

Variable               Default   What it does
GLMARK2_DURATION_SEC   300       Seconds to run glmark2 (off-screen). Increase for longer OpenGL stress.
GPUBURN_DURATION_SEC   600       Seconds to run gpu-burn (only if the CUDA toolkit is installed).
PYTORCH_RUN            0         1 = run a small PyTorch CUDA matmul test in an isolated venv.
APT_INSTALL            1         0 = don't apt-install anything (assumes tools are already present).
LOG_ROOT               "$PWD"    Parent directory where the timestamped log folder is created.

Examples

Longer stress:

GLMARK2_DURATION_SEC=900 GPUBURN_DURATION_SEC=3600 ./gpu_healthcheck.sh

Skip apt installs (air-gapped / preprovisioned node):

APT_INSTALL=0 ./gpu_healthcheck.sh

Include PyTorch compute sanity:

PYTORCH_RUN=1 ./gpu_healthcheck.sh

Write logs to a different path:

LOG_ROOT=/var/log/gpuchecks ./gpu_healthcheck.sh

Interpreting results (accept/reject cues)

  • Thermals: Under sustained load, temps should stabilize (typ. < 85 °C). Sudden spikes + clock drops = throttling/cooling issue.
  • Stability: gpu_burn_run.log should finish without errors (if run). Kernel Xid errors or crashes are red flags (a quick dmesg check is sketched after this list).
  • VRAM/Artifacts: While glmark2 runs, watch for driver resets or log errors. (For visual artifact checks, also run a Unigine/GpuTest donut externally.)
  • Power/Clocks: In nvidia_telemetry.csv, power draw and clocks should rise and remain stable near expected values for the card/TDP.
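
A simple post-run check for the Xid errors mentioned above (NVIDIA driver faults are normally logged to the kernel log with an "Xid" tag; exact wording can vary by driver version):

# Any NVIDIA Xid messages in the kernel log?
sudo dmesg | grep -i "xid" || echo "no Xid messages found"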

Troubleshooting

  • nvidia-smi: command not found or “couldn’t communicate with driver”
    Install/activate the NVIDIA proprietary driver and reboot. Secure Boot may require MOK enrollment.
  • gpu-burn skipped
    You don’t have the CUDA toolkit (nvcc). Install CUDA for full stress or ignore if not needed.
  • Wayland/compositor overhead
    The test uses off-screen rendering and should be fine headless. For gameplay testing, try an Xorg session if you see stutter.
  • Networking disabled
    Set APT_INSTALL=0 and pre-install dependencies:
    git build-essential glmark2 mesa-utils vulkan-tools nvtop python3 python3-venv python3-pip.
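
For example, on a machine or image that still has network access, those packages can be staged in advance (same names as above):

sudo apt-get update && sudo apt-get install -y \
  git build-essential glmark2 mesa-utils vulkan-tools nvtop \
  python3 python3-venv python3-pip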

Safety notes

These tests push near-peak heat and power. Ensure adequate airflow and monitor temps while running. Stop if temps approach your card’s throttle/shutdown range (often mid-80s to 90 °C depending on model and cooler). Avoid running intensive tests on unstable power or in dusty/blocked enclosures.
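
While a long run is in progress, a second terminal can watch temperatures and power live, either with nvtop (installed as a dependency) or with a plain nvidia-smi loop (the field names below are standard query fields):

# Refresh temperature, power draw, and SM clock every second
watch -n 1 nvidia-smi --query-gpu=temperature.gpu,power.draw,clocks.sm --format=csv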


Common workflows

  • Quick health check (5–10 min): default run → inspect glmark2_offscreen.log and nvidia_telemetry.csv.
  • Deep burn-in (1–12 hrs): raise GLMARK2_DURATION_SEC/GPUBURN_DURATION_SEC (example after this list) → confirm no errors or throttling.
  • Compute sanity for ML: PYTORCH_RUN=1 → check torch_cuda_test.log for CUDA availability and rough matmul timing.
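
An overnight burn-in along those lines might look like this (durations are illustrative: roughly 1 h of glmark2 and 4 h of gpu-burn):

GLMARK2_DURATION_SEC=3600 GPUBURN_DURATION_SEC=14400 PYTORCH_RUN=1 ./gpu_healthcheck.sh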

A natural follow-up would be a companion script that zips and summarizes the log folder, or that folds GpuTest/Unigine runs into the same log bundle (rough sketch below).
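
A rough sketch of that idea, using the folder and file names the script already produces (the glmark2 score line is assumed to contain "Score"):

LOGDIR=$(ls -dt gpu_health_logs_* | head -1)                # most recent run
tar czf "${LOGDIR}.tar.gz" "$LOGDIR"                        # bundle the whole log folder
grep -i "score" "$LOGDIR/glmark2_offscreen.log" | tail -1   # glmark2 score line, if present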
