A lightweight GPU cluster capacity-planning simulator for AI workloads.
Models utilization, queueing delay, packing / fragmentation, and a simplified cost view.
This project is intentionally infrastructure-focused.
It demonstrates how AI compute systems behave under load, rather than how to train or optimize ML models.
Given a cluster configuration and workload arrival patterns, the simulator estimates:
- GPU utilization over time
- Queue depth and wait time (including P50 / P95 indicators)
- Inference SLA violations
- Cost efficiency (GPU-hours with overhead)
This mirrors the real tradeoffs faced in AI infrastructure, platform, and MLOps environments.
Execution flow
- Workload generator creates training and inference jobs
- Scheduler assigns jobs to GPUs (capturing packing + fragmentation effects)
- Simulator advances time and tracks system state
- Metrics and reports are generated for analysis
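A compressed sketch of this loop is below. All class and field names are illustrative assumptions, not the package's actual API; it exists only to show how packing and queueing interact per tick.

```python
# Toy sketch of the tick loop described above; names are illustrative
# assumptions, not the package's actual API.
from dataclasses import dataclass, field

@dataclass
class Job:
    kind: str       # "training" or "inference"
    gpus: int       # GPUs requested
    remaining: int  # ticks of work left

@dataclass
class Cluster:
    free_gpus: int
    running: list = field(default_factory=list)

def tick(cluster: Cluster, queue: list, total_gpus: int) -> float:
    # Scheduler: naive first-fit packing. Jobs that don't fit keep
    # waiting, which is where queueing and fragmentation effects appear.
    still_waiting = []
    for job in queue:
        if job.gpus <= cluster.free_gpus:
            cluster.free_gpus -= job.gpus
            cluster.running.append(job)
        else:
            still_waiting.append(job)
    queue[:] = still_waiting

    utilization = 1 - cluster.free_gpus / total_gpus

    # Advance time: finished jobs release their GPUs.
    for job in list(cluster.running):
        job.remaining -= 1
        if job.remaining == 0:
            cluster.running.remove(job)
            cluster.free_gpus += job.gpus
    return utilization

# Four small inference jobs land first and strand the 8-GPU training job,
# so the cluster sits at 50% utilization while the big job queues.
cluster = Cluster(free_gpus=8)
queue = [Job("inference", 1, 2) for _ in range(4)] + [Job("training", 8, 10)]
print([round(tick(cluster, queue, 8), 2) for _ in range(16)])
```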
Getting started

```bash
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
pip install -e ".[dev]"
```

Run a single scenario:

```bash
gpu-sim run --config configs/base.yaml \
  --out results/base_tick.csv \
  --out-jobs results/base_jobs.csv
```

Outputs
- `results/*_tick.csv` → utilization, queue depth, running/completed jobs
- `results/*_jobs.csv` → per-job wait time, assignment, SLA violations
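For quick post-hoc analysis, something like the sketch below works. The column names (`wait_time`, `sla_violated`, `utilization`) are assumptions about the CSV schema; check the actual headers before relying on them.

```python
# Hedged sketch: summarize one run's outputs with pandas.
# Column names (wait_time, sla_violated, utilization) are assumptions --
# verify them against the real CSV headers first.
import pandas as pd

jobs = pd.read_csv("results/base_jobs.csv")
ticks = pd.read_csv("results/base_tick.csv")

print("P50 wait:", jobs["wait_time"].quantile(0.50))
print("P95 wait:", jobs["wait_time"].quantile(0.95))
print("SLA violation rate:", jobs["sla_violated"].mean())
print("Mean utilization:", ticks["utilization"].mean())
```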
| Config | Purpose |
|---|---|
| `configs/base.yaml` | Balanced training + inference baseline |
| `configs/mixed_workloads.yaml` | Contention and fragmentation under mixed loads |
| `configs/inference_spike.yaml` | SLA pressure during inference bursts |
| `configs/nvl_vs_pcie.yaml` | Sensitivity to reduced interconnect efficiency |
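To see which knobs a scenario exposes before editing it, you can dump the parsed YAML; the section names in the comment are guesses, not a documented schema.

```python
# Print the top-level sections of a scenario config. The exact keys
# (cluster, workload, scheduler, sla, ...) are assumptions, not a
# documented schema -- inspect the output to see what's really there.
import yaml  # PyYAML

with open("configs/base.yaml") as f:
    cfg = yaml.safe_load(f)

for section, value in cfg.items():
    print(section, "->", value)
```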
Run all scenarios:

```bash
gpu-sim run --config configs/base.yaml --out results/base_tick.csv --out-jobs results/base_jobs.csv
gpu-sim run --config configs/mixed_workloads.yaml --out results/mixed_tick.csv --out-jobs results/mixed_jobs.csv
gpu-sim run --config configs/inference_spike.yaml --out results/spike_tick.csv --out-jobs results/spike_jobs.csv
gpu-sim run --config configs/nvl_vs_pcie.yaml --out results/pcie_tick.csv --out-jobs results/pcie_jobs.csv
```
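Equivalently, a small driver script keeps the config-to-output mapping in one place (a convenience sketch around the CLI, not part of the package):

```python
# Convenience sketch: run every scenario through the gpu-sim CLI.
# The short names mirror the commands above; nothing here is part
# of gpu-sim itself.
import subprocess

SCENARIOS = {
    "configs/base.yaml": "base",
    "configs/mixed_workloads.yaml": "mixed",
    "configs/inference_spike.yaml": "spike",
    "configs/nvl_vs_pcie.yaml": "pcie",
}

for config, name in SCENARIOS.items():
    subprocess.run(
        [
            "gpu-sim", "run",
            "--config", config,
            "--out", f"results/{name}_tick.csv",
            "--out-jobs", f"results/{name}_jobs.csv",
        ],
        check=True,  # fail fast if any scenario errors out
    )
```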
Insights
- High average utilization does not guarantee low latency (see the toy queueing model below)
- Mixed workloads reduce effective throughput due to fragmentation
- Queue depth grows rapidly under bursty inference traffic
- SLA violations often appear before GPUs look fully saturated
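The utilization/latency bullets are classic queueing behavior. A toy M/M/1 calculation (an intuition aid, not the simulator's actual model) shows why waits explode well before 100% utilization:

```python
# Toy M/M/1 intuition, not the simulator's model: mean queueing delay
# W_q = rho / (mu * (1 - rho)) blows up long before rho reaches 1.
mu = 10.0  # service rate: jobs completed per unit time

for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    wait = rho / (mu * (1 - rho))  # mean time spent waiting in queue
    print(f"utilization={rho:.0%}  mean wait={wait:.3f}")
```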
Run an interactive dashboard to compare scenarios and visualize utilization and queueing behavior.
```bash
pip install -r requirements.txt
streamlit run app.py
```

The dashboard supports:
- Scenario selection
- Live simulation execution
- KPI cards (utilization, P95 wait, SLA rate, cost)
- Time-series plots for utilization and queue depth
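app.py is the real implementation; the skeleton below only illustrates the Streamlit pattern behind KPI cards and time-series plots, and every file and column name in it is a placeholder assumption:

```python
# Illustrative Streamlit skeleton -- NOT the project's app.py.
# File names and column names are placeholder assumptions.
import pandas as pd
import streamlit as st

scenario = st.selectbox("Scenario", ["base", "mixed", "spike", "pcie"])
ticks = pd.read_csv(f"results/{scenario}_tick.csv")
jobs = pd.read_csv(f"results/{scenario}_jobs.csv")

# KPI cards
c1, c2, c3 = st.columns(3)
c1.metric("Mean utilization", f"{ticks['utilization'].mean():.0%}")
c2.metric("P95 wait", f"{jobs['wait_time'].quantile(0.95):.1f}")
c3.metric("SLA violation rate", f"{jobs['sla_violated'].mean():.1%}")

# Time-series plots
st.line_chart(ticks[["utilization", "queue_depth"]])
```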
Run tests and lint:

```bash
pytest -q
ruff check .
```

CI runs automatically via GitHub Actions.
AI infrastructure performance is rarely a pure “GPU count” problem.
This simulator highlights why real-world outcomes depend on:
- Workload mix (training vs inference)
- Scheduling and packing
- Queueing dynamics
- SLA design
- Cost vs reliability tradeoffs
These considerations are central to modern AI platforms, cloud infrastructure, and MLOps systems.
Possible extensions
- Pool partitioning (training vs inference)
- Priority scheduling and preemption
- Trace-driven workloads
- Multi-cluster comparisons
- More detailed cost models (power, networking)


