DataStates-LLM is a high-performance I/O engine for GPU-accelerated workloads, with a particular focus on large-scale DeepSpeed/Megatron training. It provides lazy, asynchronous checkpointing backed by CUDA and io_uring, allowing you to overlap checkpoint I/O with forward/backward passes and reduce checkpoint overheads at scale.
For a detailed description of the design, implementation, and evaluation against state-of-the-art checkpointing engines, see our HPDC’24 paper:
Avinash Maurya, Robert Underwood, M. Mustafa Rafique, Franck Cappello, and Bogdan Nicolae.
“DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models”.
HPDC’24: The 33rd International Symposium on High-Performance Parallel and Distributed Computing (Pisa, Italy, 2024).
- Lazy, asynchronous checkpointing for large language model training
- GPU-aware checkpoint engine (CUDA) with host/device tiers
- io_uring-based file I/O path for low-overhead persistence
- C++ core library with a thin Python binding (nanobind)
- Integrates with DeepSpeed’s async checkpointing API
- Designed for HPC/AI environments (multi-GPU, multi-node)
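To make the idea of asynchronous checkpointing concrete, here is a minimal pure-Python sketch. It is not the DataStates-LLM API: the class `AsyncCheckpointer`, its methods, and `demo` are hypothetical names chosen for illustration, and the eager `deepcopy` snapshot is a simplification of the lazy, GPU-aware copies the engine performs. It only shows the general pattern of taking a snapshot and persisting it in the background while training continues.

```python
import copy
import os
import pickle
import tempfile
import threading


class AsyncCheckpointer:
    """Toy illustration of background checkpointing (hypothetical, not the real API)."""

    def __init__(self):
        self._worker = None

    def save(self, state, path):
        # Make sure the previous checkpoint has finished before starting a new one.
        self.wait()
        # Blocking part: take a consistent in-memory snapshot of the state.
        snapshot = copy.deepcopy(state)
        # Non-blocking part: write the snapshot to disk on a background thread.
        self._worker = threading.Thread(target=self._write, args=(snapshot, path))
        self._worker.start()

    @staticmethod
    def _write(snapshot, path):
        # Write to a temporary file first, then rename it into place.
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp_path, path)

    def wait(self):
        # Block until the in-flight checkpoint (if any) is fully persisted.
        if self._worker is not None:
            self._worker.join()
            self._worker = None


def demo():
    state = {"step": 0, "weights": [0.0, 0.0, 0.0]}
    ckpt = AsyncCheckpointer()
    path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
    ckpt.save(state, path)
    # "Training" keeps mutating the live state while the write is in flight.
    state["step"] = 1
    state["weights"][0] = 1.0
    ckpt.wait()
    with open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    print(demo())
```

The restored checkpoint reflects the state at the time `save` was called, not the later mutations, which is the property an asynchronous engine has to preserve while overlapping I/O with computation.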
- Linux (tested on recent x86_64 distributions)
- C++17 compiler (GCC 11+ recommended)
- CUDA toolkit (matching your cluster environment)
- liburing development headers
- CMake ≥ 3.15
- Python ≥ 3.8
- (For Python bindings)
  - PyTorch
  - fasteners
  - nanobind (pulled in automatically by install.sh)
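Before building, it can help to confirm that the prerequisites listed above are visible from the environment you plan to install into. The following is a rough sketch, not part of the project: the function `check_prerequisites` is a hypothetical helper, it assumes the compiler is invoked as `g++` and the CUDA toolkit exposes `nvcc` on `PATH`, and it does not check tool versions or the liburing headers.

```python
import importlib.util
import shutil
import sys


def check_prerequisites():
    """Return {name: bool} for the prerequisites that are cheap to probe."""
    report = {
        "python>=3.8": sys.version_info >= (3, 8),
        "cmake": shutil.which("cmake") is not None,
        # Assumption: the C++ compiler is reachable as g++.
        "c++ compiler (g++)": shutil.which("g++") is not None,
        # Assumption: the CUDA toolkit puts nvcc on PATH.
        "cuda toolkit (nvcc)": shutil.which("nvcc") is not None,
    }
    # Python-side dependencies needed for the bindings.
    for module in ("torch", "fasteners"):
        report[module] = importlib.util.find_spec(module) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'ok     ' if ok else 'MISSING'}  {name}")
```

Run it inside the Python or conda environment you intend to activate for `install.sh`, since the module checks depend on the active interpreter.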
git clone https://github.com/DataStates/datastates-llm.git
cd datastates-llm/
# Activate your target Python/conda environment first.
# By default, install.sh will:
# - detect the active Python env
# - install into its site-packages
# - build C++ core + Python bindings
./install.sh

By default, install.sh builds the C++ core library and the Python bindings and installs them into the active environment's site-packages.
You can also control the install prefix and whether Python bindings are built:
# 1st arg: install prefix (optional)
# 2nd arg: build Python bindings? [on/off/yes/no/1/0]
# Example: install into a custom prefix WITH Python bindings
./install.sh /path/to/prefix on
# Example: install core library only, no Python bindings,
# into the active Python environment’s site-packages
./install.sh "" off

C++ core engine test
# Run the test binary directly
./build/tests/test_core_engine
# Or run through ctest
cd build/
ctest

Python tests
python tests/python/test_base_state_provider.py # Without DeepSpeed
python tests/python/test_llm_ckpt_state_engine.py # With DeepSpeed

DeepSpeed provides an official tutorial on enabling DataStates-based asynchronous checkpointing through a single JSON entry in the config file: Official DeepSpeed Tutorial
That tutorial covers:
- Configuring DeepSpeed to use DataStates-LLM as the asynchronous checkpoint backend
- Relevant DeepSpeed configuration options
- Example training scripts integrating DataStates-LLM
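As a rough illustration of what "a single JSON entry" might look like, the sketch below adds one block to an existing DeepSpeed config dictionary. The key `datastates_ckpt` and the field `host_cache_size` are assumptions recalled from the tutorial and are not stated in this README, so verify the exact names, units, and defaults there before relying on them. The helper `enable_datastates` is a hypothetical name used only for this example.

```python
import json


def enable_datastates(ds_config, host_cache_size=16):
    """Return a copy of a DeepSpeed config dict with a DataStates entry added.

    ASSUMPTION: the key "datastates_ckpt" and the field "host_cache_size"
    are taken from memory of the tutorial; check them against the tutorial.
    """
    updated = dict(ds_config)  # shallow copy, leaves the caller's dict untouched
    updated["datastates_ckpt"] = {"host_cache_size": host_cache_size}
    return updated


if __name__ == "__main__":
    base = {"train_batch_size": 8}
    print(json.dumps(enable_datastates(base), indent=2))
```

Keeping the entry in a small helper like this makes it easy to toggle the backend on or off when generating configs for different runs.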
We welcome feedback, bug reports, and contributions.
- File issues and feature requests via the GitHub Issue tracker.
- Contributions in the form of bug fixes, portability improvements, and integration with additional frameworks are particularly appreciated.