From a7325c42ba11fd5b2fd1b2c20e0fc39b3943ed33 Mon Sep 17 00:00:00 2001
From: Alex Wolf
Date: Fri, 26 Sep 2025 16:36:14 +0200
Subject: [PATCH 01/13] =?UTF-8?q?=F0=9F=93=9D=20Polish=20readme?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .github/workflows/build.yml | 6 ++---
 README.md | 26 +++++++++----------
 ...=> run_loading_benchmark_on_collection.py} | 0
 3 files changed, 15 insertions(+), 17 deletions(-)
 rename scripts/{run_data_loading_benchmark_on_tahoe100m.py => run_loading_benchmark_on_collection.py} (100%)

diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
index 99fe6c8..ed3490e 100644
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -39,6 +39,6 @@ jobs:
 import lamindb as ln
 ln.Project(name='Arrayloader benchmarks v2').save()
 "
- - run: python scripts/run_data_loading_benchmark_on_tahoe100m.py MappedCollection
- - run: python scripts/run_data_loading_benchmark_on_tahoe100m.py scDataset
- - run: python scripts/run_data_loading_benchmark_on_tahoe100m.py annbatch
+ - run: python run_loading_benchmark_on_collection.py MappedCollection
+ - run: python run_loading_benchmark_on_collection.py scDataset
+ - run: python run_loading_benchmark_on_collection.py annbatch
diff --git a/README.md b/README.md
index c958dff..70fdbbc 100644
--- a/README.md
+++ b/README.md
@@ -4,28 +4,26 @@ _A collaboration between scverse, Lamin, and anyone interested in contributing!_
 This repository contains benchmarking scripts & utilities for scRNA-seq data loaders and allows to collaboratively contribute new benchmarking results.
-A user can choose between different benchmarking dataset collections:
+A user can choose between different benchmarking [dataset collections](https://lamin.ai/laminlabs/arrayloader-benchmarks/collections).
-https://lamin.ai/laminlabs/arrayloader-benchmarks/collections
-
-image
+image
 Typical calls of the main benchmarking script are:
 ```
-python scripts/run_data_loading_benchmark_on_tahoe100m.py annbatch # run with collection Tahoe100M_tiny, n_datasets = 1
-python scripts/run_data_loading_benchmark_on_tahoe100m.py MappedCollection # run MappedCollection
-python scripts/run_data_loading_benchmark_on_tahoe100m.py scDataset # run scDataset
-python scripts/run_data_loading_benchmark_on_tahoe100m.py annbatch --n_datasets -1 # run against all datasets in the collection
-python scripts/run_data_loading_benchmark_on_tahoe100m.py annbatch --collection Tahoe100M --n_datasets -1 # run against the full 100M cells
-python scripts/run_data_loading_benchmark_on_tahoe100m.py annbatch --collection Tahoe100M --n_datasets 1 # run against the the first dataset, 2M cells
-python scripts/run_data_loading_benchmark_on_tahoe100m.py annbatch --collection Tahoe100M --n_datasets 5 # run against the the first dataset, 10M cells
+git clone https://github.com/laminlabs/arrayloader-benchmarks
+cd scripts
+python run_loading_benchmark_on_collection.py annbatch # run with collection Tahoe100M_tiny, n_datasets = 1
+python run_loading_benchmark_on_collection.py MappedCollection # run MappedCollection
+python run_loading_benchmark_on_collection.py scDataset # run scDataset
+python run_loading_benchmark_on_collection.py annbatch --n_datasets -1 # run against all datasets in the collection
+python run_loading_benchmark_on_collection.py annbatch --collection Tahoe100M --n_datasets -1 # run against the full 100M cells
+python run_loading_benchmark_on_collection.py annbatch --collection Tahoe100M --n_datasets 1 # run against the the first dataset, 2M cells
+python run_loading_benchmark_on_collection.py annbatch --collection Tahoe100M --n_datasets 5 # run against the the first dataset, 10M cells
 ```
-Parameters and results for each run are automatically tracked in a parquet file. Source code and datasets are tracked via data lineage.
+When running the script, [parameters and results](https://lamin.ai/laminlabs/arrayloader-benchmarks/artifact/0EiozNVjberZTFHa) are automatically tracked in a parquet file, along with source code and datasets.
 image
-Results can be downloaded and reproduced from here: https://lamin.ai/laminlabs/arrayloader-benchmarks/artifact/0EiozNVjberZTFHa
-
 Note: A previous version of this repo contained the benchmarking scripts accompanying the 2024 blog post: [lamin.ai/blog/arrayloader-benchmarks](https://lamin.ai/blog/arrayloader-benchmarks).
diff --git a/scripts/run_data_loading_benchmark_on_tahoe100m.py b/scripts/run_loading_benchmark_on_collection.py
similarity index 100%
rename from scripts/run_data_loading_benchmark_on_tahoe100m.py
rename to scripts/run_loading_benchmark_on_collection.py

From b3fd56add06b13af8f057109e388f1ff9e6524be Mon Sep 17 00:00:00 2001
From: Alex Wolf
Date: Fri, 26 Sep 2025 16:38:48 +0200
Subject: [PATCH 02/13] =?UTF-8?q?=F0=9F=92=84=20Prettier?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 README.md | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 70fdbbc..d169a3d 100644
--- a/README.md
+++ b/README.md
@@ -4,14 +4,17 @@ _A collaboration between scverse, Lamin, and anyone interested in contributing!_
 This repository contains benchmarking scripts & utilities for scRNA-seq data loaders and allows to collaboratively contribute new benchmarking results.
-A user can choose between different benchmarking [dataset collections](https://lamin.ai/laminlabs/arrayloader-benchmarks/collections).
+You can choose between different benchmarking [dataset collections](https://lamin.ai/laminlabs/arrayloader-benchmarks/collections).
 image
+
 Typical calls of the main benchmarking script are:
 ```
 git clone https://github.com/laminlabs/arrayloader-benchmarks
+cd arrayloader-benchmarks
+uv pip install --system -e .[scdataset,annbatch] # provide tools you'd like to install
 cd scripts
 python run_loading_benchmark_on_collection.py annbatch # run with collection Tahoe100M_tiny, n_datasets = 1
 python run_loading_benchmark_on_collection.py MappedCollection # run MappedCollection
@@ -25,5 +28,6 @@ python run_loading_benchmark_on_collection.py annbatch --collection Tahoe100M --
 When running the script, [parameters and results](https://lamin.ai/laminlabs/arrayloader-benchmarks/artifact/0EiozNVjberZTFHa) are automatically tracked in a parquet file, along with source code and datasets.
 image
+
 Note: A previous version of this repo contained the benchmarking scripts accompanying the 2024 blog post: [lamin.ai/blog/arrayloader-benchmarks](https://lamin.ai/blog/arrayloader-benchmarks).

From 263976bd4646589010532efb47f568239dbf7876 Mon Sep 17 00:00:00 2001
From: Alex Wolf
Date: Fri, 26 Sep 2025 16:39:25 +0200
Subject: [PATCH 03/13] =?UTF-8?q?=E2=99=BB=EF=B8=8F=20More=20vertical=20sp?=
 =?UTF-8?q?ace?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index d169a3d..77e4da2 100644
--- a/README.md
+++ b/README.md
@@ -8,6 +8,7 @@ You can choose between different benchmarking [dataset collections](https://lami
 image
+
 Typical calls of the main benchmarking script are:
@@ -29,5 +30,6 @@ When running the script, [parameters and results](https://lamin.ai/laminlabs/arr
 image
+
 Note: A previous version of this repo contained the benchmarking scripts accompanying the 2024 blog post: [lamin.ai/blog/arrayloader-benchmarks](https://lamin.ai/blog/arrayloader-benchmarks).

From 78acaa2a03fd182a415176e983eb654c3b73e573 Mon Sep 17 00:00:00 2001
From: Alex Wolf
Date: Fri, 26 Sep 2025 16:40:33 +0200
Subject: [PATCH 04/13] =?UTF-8?q?=E2=99=BB=EF=B8=8F=20Prettify?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 README.md | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 77e4da2..6e5b89a 100644
--- a/README.md
+++ b/README.md
@@ -4,11 +4,7 @@ _A collaboration between scverse, Lamin, and anyone interested in contributing!_
 This repository contains benchmarking scripts & utilities for scRNA-seq data loaders and allows to collaboratively contribute new benchmarking results.
-You can choose between different benchmarking [dataset collections](https://lamin.ai/laminlabs/arrayloader-benchmarks/collections).
-
-image
-
-
+## Quickstart
 Typical calls of the main benchmarking script are:
@@ -25,6 +21,14 @@ python run_loading_benchmark_on_collection.py annbatch --collection Tahoe100M --
 python run_loading_benchmark_on_collection.py annbatch --collection Tahoe100M --n_datasets 1 # run against the the first dataset, 2M cells
 python run_loading_benchmark_on_collection.py annbatch --collection Tahoe100M --n_datasets 5 # run against the the first dataset, 10M cells
 ```
+
+
+
+You can choose between different benchmarking [dataset collections](https://lamin.ai/laminlabs/arrayloader-benchmarks/collections).
+
+image
+
+
 When running the script, [parameters and results](https://lamin.ai/laminlabs/arrayloader-benchmarks/artifact/0EiozNVjberZTFHa) are automatically tracked in a parquet file, along with source code and datasets.

From 8b5cc1c2c0059a9a5ad21641fff7027392bcb19d Mon Sep 17 00:00:00 2001
From: Alex Wolf
Date: Fri, 26 Sep 2025 16:41:37 +0200
Subject: [PATCH 05/13] =?UTF-8?q?=F0=9F=92=84=20Prettier?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 6e5b89a..d1dff0c 100644
--- a/README.md
+++ b/README.md
@@ -8,12 +8,12 @@ This repository contains benchmarking scripts & utilities for scRNA-seq data loa
 Typical calls of the main benchmarking script are:
-```
+```bash
 git clone https://github.com/laminlabs/arrayloader-benchmarks
 cd arrayloader-benchmarks
-uv pip install --system -e .[scdataset,annbatch] # provide tools you'd like to install
+uv pip install --system -e ".[scdataset,annbatch]" # provide tools you'd like to install
 cd scripts
-python run_loading_benchmark_on_collection.py annbatch # run with collection Tahoe100M_tiny, n_datasets = 1
+python run_loading_benchmark_on_collection.py annbatch # run annbatch on collection Tahoe100M_tiny, n_datasets = 1
 python run_loading_benchmark_on_collection.py MappedCollection # run MappedCollection
 python run_loading_benchmark_on_collection.py scDataset # run scDataset
 python run_loading_benchmark_on_collection.py annbatch --n_datasets -1 # run against all datasets in the collection

From 4bffdedd30be98d83e83f7138a835f6e022216fc Mon Sep 17 00:00:00 2001
From: Alex Wolf
Date: Fri, 26 Sep 2025 16:42:23 +0200
Subject: [PATCH 06/13] =?UTF-8?q?=E2=99=BB=EF=B8=8F=20Prettier?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 README.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/README.md b/README.md
index d1dff0c..848a854 100644
--- a/README.md
+++ b/README.md
@@ -22,7 +22,6 @@ python run_loading_benchmark_on_collection.py annbatch --collection Tahoe100M --
 python run_loading_benchmark_on_collection.py annbatch --collection Tahoe100M --n_datasets 5 # run against the the first dataset, 10M cells
 ```
-
 You can choose between different benchmarking [dataset collections](https://lamin.ai/laminlabs/arrayloader-benchmarks/collections).
@@ -30,7 +29,7 @@ You can choose between different benchmarking [dataset collections](https://lami

-When running the script, [parameters and results](https://lamin.ai/laminlabs/arrayloader-benchmarks/artifact/0EiozNVjberZTFHa) are automatically tracked in a parquet file, along with source code and datasets.
+When running the script, [parameters and results](https://lamin.ai/laminlabs/arrayloader-benchmarks/artifact/0EiozNVjberZTFHa) are automatically tracked in a parquet file, along with source code, run environment, and input and output datasets.
 image
From c44c1a80bef595ef943d6b6622c0800c40b600ae Mon Sep 17 00:00:00 2001
From: Alex Wolf
Date: Fri, 26 Sep 2025 16:42:50 +0200
Subject: [PATCH 07/13] =?UTF-8?q?=F0=9F=92=84=20Format?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 README.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/README.md b/README.md
index 848a854..8cc2c21 100644
--- a/README.md
+++ b/README.md
@@ -21,7 +21,6 @@ python run_loading_benchmark_on_collection.py annbatch --collection Tahoe100M --
 python run_loading_benchmark_on_collection.py annbatch --collection Tahoe100M --n_datasets 1 # run against the the first dataset, 2M cells
 python run_loading_benchmark_on_collection.py annbatch --collection Tahoe100M --n_datasets 5 # run against the the first dataset, 10M cells
 ```
-
 You can choose between different benchmarking [dataset collections](https://lamin.ai/laminlabs/arrayloader-benchmarks/collections).

From 9325d28fc87337f506a5144b2bf208b5d9970ca9 Mon Sep 17 00:00:00 2001
From: Alex Wolf
Date: Fri, 26 Sep 2025 16:44:12 +0200
Subject: [PATCH 08/13] =?UTF-8?q?=F0=9F=92=84=20Format?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 8cc2c21..efbd37c 100644
--- a/README.md
+++ b/README.md
@@ -24,13 +24,13 @@ python run_loading_benchmark_on_collection.py annbatch --collection Tahoe100M --
 You can choose between different benchmarking [dataset collections](https://lamin.ai/laminlabs/arrayloader-benchmarks/collections).
-image
+image

 When running the script, [parameters and results](https://lamin.ai/laminlabs/arrayloader-benchmarks/artifact/0EiozNVjberZTFHa) are automatically tracked in a parquet file, along with source code, run environment, and input and output datasets.
-image
+image

From 2ac06e6d1a609ce340732c5dd732701243b0d3fd Mon Sep 17 00:00:00 2001
From: Alex Wolf
Date: Fri, 26 Sep 2025 16:44:56 +0200
Subject: [PATCH 09/13] =?UTF-8?q?=F0=9F=91=B7=20No=20need=20for=20macOS=20?=
 =?UTF-8?q?runner?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .github/workflows/build.yml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
index ed3490e..04b4e7f 100644
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -17,8 +17,8 @@ jobs:
 include:
 - os: ubuntu-latest
 python: "3.12"
- - os: macOS-latest
- python: "3.12"
+ # - os: macOS-latest
+ # python: "3.12"
 timeout-minutes: 15
 steps:

From 22a57a8b7f0a942773e6817d86e6c6dfbc8e3d12 Mon Sep 17 00:00:00 2001
From: Alex Wolf
Date: Fri, 26 Sep 2025 16:47:38 +0200
Subject: [PATCH 10/13] =?UTF-8?q?=F0=9F=92=9A=20Fix=20and=20complete?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .github/workflows/build.yml | 1 +
 README.md | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
index 04b4e7f..b0ca937 100644
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -39,6 +39,7 @@ jobs:
 import lamindb as ln
 ln.Project(name='Arrayloader benchmarks v2').save()
 "
+ - run: cd scripts
 - run: python run_loading_benchmark_on_collection.py MappedCollection
 - run: python run_loading_benchmark_on_collection.py scDataset
 - run: python run_loading_benchmark_on_collection.py annbatch
diff --git a/README.md b/README.md
index efbd37c..3f73dbf 100644
--- a/README.md
+++ b/README.md
@@ -11,7 +11,9 @@ Typical calls of the main benchmarking script are:
 ```bash
 git clone https://github.com/laminlabs/arrayloader-benchmarks
 cd arrayloader-benchmarks
-uv pip install --system -e ".[scdataset,annbatch]" # provide tools you'd like to install
+uv pip install -e ".[scdataset,annbatch]" # provide tools you'd like to install
+lamin init # to init a new lamindb instance
+# lamin connect laminlabs/arrayloader-benchmarks # to contribute results to the hosted lamindb instance
 cd scripts
 python run_loading_benchmark_on_collection.py annbatch # run annbatch on collection Tahoe100M_tiny, n_datasets = 1
 python run_loading_benchmark_on_collection.py MappedCollection # run MappedCollection

From 8a58b198795836b9fbb1cb64e86d000231c231f2 Mon Sep 17 00:00:00 2001
From: Alex Wolf
Date: Fri, 26 Sep 2025 16:50:01 +0200
Subject: [PATCH 11/13] =?UTF-8?q?=F0=9F=92=9A=20Fix?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .github/workflows/build.yml | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
index b0ca937..f31ec54 100644
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -39,7 +39,6 @@ jobs:
 import lamindb as ln
 ln.Project(name='Arrayloader benchmarks v2').save()
 "
- - run: cd scripts
- - run: python run_loading_benchmark_on_collection.py MappedCollection
- - run: python run_loading_benchmark_on_collection.py scDataset
- - run: python run_loading_benchmark_on_collection.py annbatch
+ - run: python scripts/run_loading_benchmark_on_collection.py MappedCollection
+ - run: python scripts/run_loading_benchmark_on_collection.py scDataset
+ - run: python scripts/run_loading_benchmark_on_collection.py annbatch

From 0fb795540f85dda10075283641871f94bf5d273f Mon Sep 17 00:00:00 2001
From: Alex Wolf
Date: Fri, 26 Sep 2025 16:51:19 +0200
Subject: [PATCH 12/13] =?UTF-8?q?=F0=9F=92=9A=20Fix?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 3f73dbf..3de77e4 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# `arrayloader-benchmarks`: Data loader benchmarks for scRNA-seq counts et al.
+# Data loader benchmarks for scRNA-seq counts et al.
 _A collaboration between scverse, Lamin, and anyone interested in contributing!_

From e75246f5e1f12df7ed3a40f07f21a29f600a8bcb Mon Sep 17 00:00:00 2001
From: Alex Wolf
Date: Fri, 26 Sep 2025 16:53:04 +0200
Subject: [PATCH 13/13] =?UTF-8?q?=F0=9F=93=9D=20Prettier?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 README.md | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 3de77e4..d84d070 100644
--- a/README.md
+++ b/README.md
@@ -6,14 +6,18 @@ This repository contains benchmarking scripts & utilities for scRNA-seq data loa
 ## Quickstart
-Typical calls of the main benchmarking script are:
+Setup:
 ```bash
 git clone https://github.com/laminlabs/arrayloader-benchmarks
 cd arrayloader-benchmarks
 uv pip install -e ".[scdataset,annbatch]" # provide tools you'd like to install
-lamin init # to init a new lamindb instance
-# lamin connect laminlabs/arrayloader-benchmarks # to contribute results to the hosted lamindb instance
+lamin connect laminlabs/arrayloader-benchmarks # to contribute results to the hosted lamindb instance, call `lamin init` to create a new lamindb instance
+```
+
+Typical calls of the main benchmarking script are:
+
+```bash
 cd scripts
 python run_loading_benchmark_on_collection.py annbatch # run annbatch on collection Tahoe100M_tiny, n_datasets = 1
 python run_loading_benchmark_on_collection.py MappedCollection # run MappedCollection