DataStates-LLM: Asynchronous I/O Engine

DataStates-LLM is a high-performance I/O engine for GPU-accelerated workloads, with a particular focus on large-scale DeepSpeed/Megatron training. It provides lazy, asynchronous checkpointing backed by CUDA and io_uring, allowing you to overlap checkpoint I/O with forward/backward passes and reduce checkpoint overheads at scale.

For a detailed description of the design, implementation, and evaluation against state-of-the-art checkpointing engines, see our HPDC’24 paper:

Avinash Maurya, Robert Underwood, M. Mustafa Rafique, Franck Cappello, and Bogdan Nicolae.
“DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models”.
HPDC’24: The 33rd International Symposium on High-Performance Parallel and Distributed Computing (Pisa, Italy, 2024).

1. Features Overview

- Lazy, asynchronous checkpointing for large language model training
- GPU-aware checkpoint engine (CUDA) with host/device tiers
- io_uring-based file I/O path for low-overhead persistence
- C++ core library with a thin Python binding (nanobind)
- Integrates with DeepSpeed’s async checkpointing API
- Designed for HPC/AI environments (multi-GPU, multi-node)
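
To give a feel for what overlapping checkpoint I/O with training means, here is a toy sketch in plain Python. It is not the DataStates-LLM API and does not use CUDA or io_uring: the class `AsyncCheckpointer` and the function `train` are hypothetical names made up for illustration, and the snapshot is copied eagerly rather than lazily. It only shows the general pattern of handing a snapshot to a background writer while the training loop keeps running.

```python
import os
import pickle
import threading


class AsyncCheckpointer:
    """Toy illustration: persist a snapshot in a background thread."""

    def __init__(self):
        self._worker = None

    def save(self, state, path):
        # Allow at most one flush in flight: wait for the previous one first.
        self.wait()
        # Shallow copy is enough here because the toy state holds only
        # immutable values (int, float); real tensors would need a real copy.
        snapshot = dict(state)
        self._worker = threading.Thread(target=self._flush, args=(snapshot, path))
        self._worker.start()

    @staticmethod
    def _flush(snapshot, path):
        # Write to a temporary file, then rename, so readers never
        # observe a partially written checkpoint.
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp, path)

    def wait(self):
        if self._worker is not None:
            self._worker.join()
            self._worker = None


def train(steps, ckpt_every, ckpt_dir):
    state = {"step": 0, "weight": 0.0}
    ckpt = AsyncCheckpointer()
    for step in range(1, steps + 1):
        state["step"] = step
        state["weight"] += 0.1  # stand-in for forward/backward/update
        if step % ckpt_every == 0:
            # Returns as soon as the snapshot is taken; the write
            # proceeds while the next iterations run.
            ckpt.save(state, os.path.join(ckpt_dir, f"step{step}.pkl"))
    ckpt.wait()  # make sure the last checkpoint is on disk
    return state
```

In this sketch the training loop only blocks if a new checkpoint is requested before the previous one has finished writing.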

2. Installation and Tests

2.1. Prerequisites

- Linux (tested on recent x86_64 distributions)
- C++17 compiler (GCC 11+ recommended)
- CUDA toolkit (matching your cluster environment)
- liburing development headers
- CMake ≥ 3.15
- Python ≥ 3.8
- For the Python bindings:
  - PyTorch
  - fasteners
  - nanobind (pulled in automatically by install.sh)
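
Before building, it can help to confirm that the basic tools are on your PATH. The following is a minimal sketch, not part of the repository: `check_prerequisites` is a hypothetical helper, it only checks that the tools are present (not the CMake or GCC versions), and it does not look for the liburing headers because their location varies between systems.

```python
import shutil
import sys


def check_prerequisites():
    """Return a dict mapping each prerequisite to True (found) or False."""
    return {
        "python>=3.8": sys.version_info >= (3, 8),
        "cmake": shutil.which("cmake") is not None,
        # Any one of these compiler drivers is accepted as "a C++ compiler".
        "c++ compiler": any(shutil.which(c) for c in ("g++", "c++", "clang++")),
        "nvcc (CUDA toolkit)": shutil.which("nvcc") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'ok     ' if ok else 'MISSING'}  {name}")
```

Run it inside the Python/conda environment you plan to install into, so the Python version check reflects the right interpreter.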

2.2. Clone and Build

git clone https://github.com/DataStates/datastates-llm.git
cd datastates-llm/

# Activate your target Python/conda environment first.
# By default, install.sh will:
#   - detect the active Python env
#   - install into its site-packages
#   - build C++ core + Python bindings
./install.sh

install.sh also accepts two optional positional arguments that control the install prefix and whether the Python bindings are built:

# 1st arg: install prefix (optional)
# 2nd arg: build Python bindings? [on/off/yes/no/1/0]

# Example: install into a custom prefix WITH Python bindings
./install.sh /path/to/prefix on

# Example: install the core library only (no Python bindings).
# The empty first argument keeps the default prefix
# (the active Python environment’s site-packages).
./install.sh "" off

2.3. Tests

C++ core engine test

# Run the test binary directly
./build/tests/test_core_engine
# Or run through CTest
cd build/
ctest

Python tests

python tests/python/test_base_state_provider.py # Without DeepSpeed
python tests/python/test_llm_ckpt_state_engine.py # With DeepSpeed

3. Using DataStates-LLM with DeepSpeed

DeepSpeed provides an official tutorial on enabling DataStates-based asynchronous checkpointing through a single JSON entry in the DeepSpeed config file.

That tutorial covers:

- Configuring DeepSpeed to use DataStates-LLM as the asynchronous checkpoint backend
- Relevant DeepSpeed configuration options
- Example training scripts integrating DataStates-LLM
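
As a rough idea of what that single JSON entry looks like, the sketch below generates a DeepSpeed config file from Python. Treat the details as assumptions to verify against the DeepSpeed tutorial: the key names `datastates_ckpt` and `host_cache_size` are recalled from that tutorial rather than taken from this README, the unit of the cache size is assumed to be gigabytes, and `make_ds_config` / `write_ds_config` are hypothetical helper names.

```python
import json


def make_ds_config(host_cache_size=16, train_batch_size=8):
    """Build a minimal DeepSpeed config dict with a DataStates entry.

    ASSUMPTION: the "datastates_ckpt" / "host_cache_size" key names and
    their meaning should be checked against the official DeepSpeed tutorial.
    """
    return {
        "train_batch_size": train_batch_size,
        "datastates_ckpt": {"host_cache_size": host_cache_size},
    }


def write_ds_config(path, config):
    """Write the config dict to a JSON file that DeepSpeed can load."""
    with open(path, "w") as f:
        json.dump(config, f, indent=2)
```

For example, `write_ds_config("ds_config.json", make_ds_config())` produces a file you could pass to your training script in place of a hand-written config, after adding the other options your run needs.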

4. Contributions and Issues

We welcome feedback, bug reports, and contributions.

- File issues and feature requests via the GitHub Issue tracker.
- Contributions in the form of bug fixes, portability improvements, and integration with additional frameworks are particularly appreciated.
