diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
index 99fe6c8..f31ec54 100644
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -17,8 +17,8 @@ jobs:
         include:
           - os: ubuntu-latest
             python: "3.12"
-          - os: macOS-latest
-            python: "3.12"
+          # - os: macOS-latest
+          #   python: "3.12"
     timeout-minutes: 15
     steps:
@@ -39,6 +39,6 @@ jobs:
           import lamindb as ln
           ln.Project(name='Arrayloader benchmarks v2').save()
           "
-      - run: python scripts/run_data_loading_benchmark_on_tahoe100m.py MappedCollection
-      - run: python scripts/run_data_loading_benchmark_on_tahoe100m.py scDataset
-      - run: python scripts/run_data_loading_benchmark_on_tahoe100m.py annbatch
+      - run: python scripts/run_loading_benchmark_on_collection.py MappedCollection
+      - run: python scripts/run_loading_benchmark_on_collection.py scDataset
+      - run: python scripts/run_loading_benchmark_on_collection.py annbatch
diff --git a/README.md b/README.md
index c958dff..d84d070 100644
--- a/README.md
+++ b/README.md
@@ -1,31 +1,43 @@
-# `arrayloader-benchmarks`: Data loader benchmarks for scRNA-seq counts et al.
+# Data loader benchmarks for scRNA-seq counts et al.
 
 _A collaboration between scverse, Lamin, and anyone interested in contributing!_
 
 This repository contains benchmarking scripts & utilities for scRNA-seq data loaders and lets you collaboratively contribute new benchmarking results.
 
-A user can choose between different benchmarking dataset collections:
+## Quickstart
 
-https://lamin.ai/laminlabs/arrayloader-benchmarks/collections
+Setup:
 
-image
+```bash
+git clone https://github.com/laminlabs/arrayloader-benchmarks
+cd arrayloader-benchmarks
+uv pip install -e ".[scdataset,annbatch]"  # select the loaders you'd like to install
+lamin connect laminlabs/arrayloader-benchmarks  # contribute results to the hosted lamindb instance; call `lamin init` to create a new instance instead
+```
 
 Typical calls of the main benchmarking script are:
 
+```bash
+cd scripts
+python run_loading_benchmark_on_collection.py annbatch  # run annbatch on collection Tahoe100M_tiny, n_datasets = 1
+python run_loading_benchmark_on_collection.py MappedCollection  # run MappedCollection
+python run_loading_benchmark_on_collection.py scDataset  # run scDataset
+python run_loading_benchmark_on_collection.py annbatch --n_datasets -1  # run against all datasets in the collection
+python run_loading_benchmark_on_collection.py annbatch --collection Tahoe100M --n_datasets -1  # run against the full 100M cells
+python run_loading_benchmark_on_collection.py annbatch --collection Tahoe100M --n_datasets 1  # run against the first dataset, 2M cells
+python run_loading_benchmark_on_collection.py annbatch --collection Tahoe100M --n_datasets 5  # run against the first 5 datasets, 10M cells
 ```
-python scripts/run_data_loading_benchmark_on_tahoe100m.py annbatch # run with collection Tahoe100M_tiny, n_datasets = 1
-python scripts/run_data_loading_benchmark_on_tahoe100m.py MappedCollection # run MappedCollection
-python scripts/run_data_loading_benchmark_on_tahoe100m.py scDataset # run scDataset
-python scripts/run_data_loading_benchmark_on_tahoe100m.py annbatch --n_datasets -1 # run against all datasets in the collection
-python scripts/run_data_loading_benchmark_on_tahoe100m.py annbatch --collection Tahoe100M --n_datasets -1 # run against the full 100M cells
-python scripts/run_data_loading_benchmark_on_tahoe100m.py annbatch --collection Tahoe100M --n_datasets 1 # run against the the first dataset, 2M cells
-python scripts/run_data_loading_benchmark_on_tahoe100m.py annbatch --collection Tahoe100M --n_datasets 5 # run against the the first dataset, 10M cells
-```
-Parameters and results for each run are automatically tracked in a parquet file. Source code and datasets are tracked via data lineage.
+You can choose between different benchmarking [dataset collections](https://lamin.ai/laminlabs/arrayloader-benchmarks/collections).
+
+image
+
-image
+When running the script, [parameters and results](https://lamin.ai/laminlabs/arrayloader-benchmarks/artifact/0EiozNVjberZTFHa) are automatically tracked in a parquet file, along with source code, run environment, and input and output datasets.
-Results can be downloaded and reproduced from here: https://lamin.ai/laminlabs/arrayloader-benchmarks/artifact/0EiozNVjberZTFHa
+image
+
 Note: A previous version of this repo contained the benchmarking scripts accompanying the 2024 blog post: [lamin.ai/blog/arrayloader-benchmarks](https://lamin.ai/blog/arrayloader-benchmarks).
diff --git a/scripts/run_data_loading_benchmark_on_tahoe100m.py b/scripts/run_loading_benchmark_on_collection.py
similarity index 100%
rename from scripts/run_data_loading_benchmark_on_tahoe100m.py
rename to scripts/run_loading_benchmark_on_collection.py
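The renamed script's actual timing logic lives in `scripts/run_loading_benchmark_on_collection.py` and is not shown in this diff. As a rough illustration of the quantity such a benchmark measures, here is a minimal, hypothetical throughput harness; the function name, batch shape, and stand-in loader are invented for this sketch and are not taken from the repo:

```python
import time

def benchmark_loader(loader, n_batches=100):
    """Return throughput in samples/sec over the first `n_batches` batches
    of any iterable that yields batches with a length."""
    start = time.perf_counter()
    n_samples = 0
    for i, batch in enumerate(loader):
        if i == n_batches:
            break
        n_samples += len(batch)
    elapsed = time.perf_counter() - start
    return n_samples / elapsed

# Stand-in for a real data loader: 50 batches of 64 "cells" each.
fake_loader = ([0.0] * 64 for _ in range(50))
print(f"{benchmark_loader(fake_loader):,.0f} samples/sec")
```

Because the harness only assumes an iterable of sized batches, the same loop could wrap a `MappedCollection`, `scDataset`, or `annbatch` iterator interchangeably, which is what makes a head-to-head comparison on one collection meaningful.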