10 changes: 5 additions & 5 deletions .github/workflows/build.yml

```diff
@@ -17,8 +17,8 @@ jobs:
       include:
         - os: ubuntu-latest
           python: "3.12"
-        - os: macOS-latest
-          python: "3.12"
+        # - os: macOS-latest
+        #   python: "3.12"
     timeout-minutes: 15

     steps:
@@ -39,6 +39,6 @@ jobs:
           import lamindb as ln
           ln.Project(name='Arrayloader benchmarks v2').save()
           "
-      - run: python scripts/run_data_loading_benchmark_on_tahoe100m.py MappedCollection
-      - run: python scripts/run_data_loading_benchmark_on_tahoe100m.py scDataset
-      - run: python scripts/run_data_loading_benchmark_on_tahoe100m.py annbatch
+      - run: python scripts/run_loading_benchmark_on_collection.py MappedCollection
+      - run: python scripts/run_loading_benchmark_on_collection.py scDataset
+      - run: python scripts/run_loading_benchmark_on_collection.py annbatch
```
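The inline `python -c` step in the workflow above can be reproduced locally before invoking any benchmark script. A minimal sketch, assuming you have already run `lamin connect laminlabs/arrayloader-benchmarks` (or `lamin init` for a fresh instance); `ln.Project` and `.save()` are the same lamindb calls the workflow uses:

```python
# Local equivalent of the workflow's bootstrap step: registers the project
# record that the benchmark runs reference. The project name is taken from
# the workflow above.
import lamindb as ln

project = ln.Project(name="Arrayloader benchmarks v2").save()
print(project)  # shows the registered project record
```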
42 changes: 27 additions & 15 deletions README.md

````diff
@@ -1,31 +1,43 @@
-# `arrayloader-benchmarks`: Data loader benchmarks for scRNA-seq counts et al.
+# Data loader benchmarks for scRNA-seq counts et al.
 
 _A collaboration between scverse, Lamin, and anyone interested in contributing!_
 
 This repository contains benchmarking scripts & utilities for scRNA-seq data loaders and lets anyone collaboratively contribute new benchmarking results.
 
-A user can choose between different benchmarking dataset collections:
+## Quickstart
 
-https://lamin.ai/laminlabs/arrayloader-benchmarks/collections
+Setup:
 
-<img width="500" height="481" alt="image" src="https://github.com/user-attachments/assets/b539b13a-9b50-4f66-9b51-16d32fd8566b" />
+```bash
+git clone https://github.com/laminlabs/arrayloader-benchmarks
+cd arrayloader-benchmarks
+uv pip install -e ".[scdataset,annbatch]"  # provide the tools you'd like to install
+lamin connect laminlabs/arrayloader-benchmarks  # to contribute results to the hosted lamindb instance; or call `lamin init` to create a new lamindb instance
+```
 
 Typical calls of the main benchmarking script are:
 
-```bash
-python scripts/run_data_loading_benchmark_on_tahoe100m.py annbatch  # run with collection Tahoe100M_tiny, n_datasets = 1
-python scripts/run_data_loading_benchmark_on_tahoe100m.py MappedCollection  # run MappedCollection
-python scripts/run_data_loading_benchmark_on_tahoe100m.py scDataset  # run scDataset
-python scripts/run_data_loading_benchmark_on_tahoe100m.py annbatch --n_datasets -1  # run against all datasets in the collection
-python scripts/run_data_loading_benchmark_on_tahoe100m.py annbatch --collection Tahoe100M --n_datasets -1  # run against the full 100M cells
-python scripts/run_data_loading_benchmark_on_tahoe100m.py annbatch --collection Tahoe100M --n_datasets 1  # run against the first dataset, 2M cells
-python scripts/run_data_loading_benchmark_on_tahoe100m.py annbatch --collection Tahoe100M --n_datasets 5  # run against the first 5 datasets, 10M cells
-```
+```bash
+cd scripts
+python run_loading_benchmark_on_collection.py annbatch  # run annbatch on collection Tahoe100M_tiny, n_datasets = 1
+python run_loading_benchmark_on_collection.py MappedCollection  # run MappedCollection
+python run_loading_benchmark_on_collection.py scDataset  # run scDataset
+python run_loading_benchmark_on_collection.py annbatch --n_datasets -1  # run against all datasets in the collection
+python run_loading_benchmark_on_collection.py annbatch --collection Tahoe100M --n_datasets -1  # run against the full 100M cells
+python run_loading_benchmark_on_collection.py annbatch --collection Tahoe100M --n_datasets 1  # run against the first dataset, 2M cells
+python run_loading_benchmark_on_collection.py annbatch --collection Tahoe100M --n_datasets 5  # run against the first 5 datasets, 10M cells
+```
 
-Parameters and results for each run are automatically tracked in a parquet file. Source code and datasets are tracked via data lineage.
+You can choose between different benchmarking [dataset collections](https://lamin.ai/laminlabs/arrayloader-benchmarks/collections).
 
+<img width="1000" height="600" alt="image" src="https://github.com/user-attachments/assets/b539b13a-9b50-4f66-9b51-16d32fd8566b" />
+<br>
+<br>
 
-<img width="1298" height="904" alt="image" src="https://github.com/user-attachments/assets/60c3262f-1bdc-44a4-a488-4784918a6905" />
+When running the script, [parameters and results](https://lamin.ai/laminlabs/arrayloader-benchmarks/artifact/0EiozNVjberZTFHa) are automatically tracked in a parquet file, along with source code, run environment, and input and output datasets.
 
-Results can be downloaded and reproduced from here: https://lamin.ai/laminlabs/arrayloader-benchmarks/artifact/0EiozNVjberZTFHa
+<img width="1000" height="904" alt="image" src="https://github.com/user-attachments/assets/60c3262f-1bdc-44a4-a488-4784918a6905" />
+<br>
+<br>
 
 Note: A previous version of this repo contained the benchmarking scripts accompanying the 2024 blog post: [lamin.ai/blog/arrayloader-benchmarks](https://lamin.ai/blog/arrayloader-benchmarks).
````
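The usage examples in the README diff above imply a small command-line surface for `run_loading_benchmark_on_collection.py`: a positional loader name plus `--collection` and `--n_datasets` flags. The following is a hypothetical reconstruction inferred only from those calls, not from the script's source, which may differ:

```python
# Hypothetical sketch of the CLI implied by the README examples above;
# the real scripts/run_loading_benchmark_on_collection.py may differ.
import argparse

parser = argparse.ArgumentParser(
    description="Benchmark a data loader on a lamindb dataset collection."
)
# positional loader name, matching the three loaders benchmarked in CI
parser.add_argument("loader", choices=["MappedCollection", "scDataset", "annbatch"])
# defaults inferred from the comments: collection Tahoe100M_tiny, n_datasets = 1
parser.add_argument("--collection", default="Tahoe100M_tiny")
parser.add_argument("--n_datasets", type=int, default=1)  # -1 = all datasets
args = parser.parse_args()
print(f"benchmarking {args.loader} on {args.collection} (n_datasets={args.n_datasets})")
```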
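The tracked results can also be inspected without re-running anything, since the README links the results parquet as a lamindb artifact. A minimal sketch, assuming you are connected to `laminlabs/arrayloader-benchmarks`; the uid is taken from the artifact URL above:

```python
# Load the benchmark-results parquet referenced in the README.
# Assumes `lamin connect laminlabs/arrayloader-benchmarks` has been run;
# the uid comes from the artifact URL above.
import lamindb as ln

artifact = ln.Artifact.get("0EiozNVjberZTFHa")
df = artifact.load()  # parquet artifacts load as a pandas DataFrame
print(df.head())
```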