Fetching sample datasets remotely #465

jo-mueller · 2025-11-28T23:36:05Z

Title. With more and bigger datasets, I felt like the amount of data shipped with the plugin has gotten a little out of hand. Currently, we are looking at about 10 MB of sample data, which seems unnecessarily large. I looked into the pooch library (also used by scikit-image for its sample data), and it basically lets you download files from a data registry whenever they are needed.

I figured it would be neat to add the sample data as assets to the GitHub releases and then download everything from there at runtime. Of course this wouldn't work offline, but I figure you'd have a hard time installing the plugin (along with 10 MB of sample data) when offline anyway, so... 🤷‍♂️

In this PR, I essentially:

Implemented pooch in the _sample_data.py functions
Added a small script to pack up all sample data in a zip file and upload it as an asset to the latest GitHub release after the release has been created
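
For readers unfamiliar with pooch, the runtime fetch described above could look roughly like the sketch below. It is not the code from this PR: `BASE_URL`, `ARCHIVE`, the placeholder hash, and the helper names `pick_member` and `fetch_sample` are all hypothetical and only illustrate the pattern of downloading a zip archive once, caching it, and picking a single file out of it.

```python
from pathlib import Path

# Hypothetical placeholders, not taken from the PR.
BASE_URL = "https://example.org/releases/download/v0.0.0/"
ARCHIVE = "sample_data.zip"
# Placeholder digest; a real registry would hold the archive's actual SHA256.
REGISTRY = {ARCHIVE: "sha256:" + "0" * 64}


def pick_member(extracted: list[str], name: str) -> Path:
    """Return the extracted file whose basename equals `name`."""
    for candidate in extracted:
        if Path(candidate).name == name:
            return Path(candidate)
    raise FileNotFoundError(f"{name!r} not found in the extracted archive")


def fetch_sample(name: str) -> Path:
    """Download (or reuse the cached) archive, unzip it, return one member."""
    # Imported here so that importing this module triggers no download.
    import pooch

    fetcher = pooch.create(
        path=pooch.os_cache("napari-clusters-plotter"),  # per-user cache dir
        base_url=BASE_URL,
        registry=REGISTRY,
    )
    # The Unzip processor extracts the archive and returns the member paths.
    extracted = fetcher.fetch(ARCHIVE, processor=pooch.Unzip())
    return pick_member(extracted, name)
```

With a real release URL and hash in place, pooch would download on the first call only and serve later calls from the local cache after verifying the hash.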

Long copilot description

This pull request introduces significant improvements to how sample data is managed, distributed, and loaded in the project. The main changes include switching to remote fetching of sample data using the pooch library, refactoring sample data loading functions to use this new mechanism, updating the build and release workflow to handle sample data as a separate asset, and adjusting the packaging configuration to exclude sample data from the distribution. These updates make the sample data more maintainable, reduce package size, and streamline the release process.

Sample Data Handling and Loading:

Introduced the pooch library for remote fetching and caching of sample data, replacing previous methods that relied on local files. Added new functions load_image, load_tabular, and load_registry in _sample_data.py to load data from a remotely-fetched zip archive, and refactored all sample data functions to use these loaders. (src/napari_clusters_plotter/_sample_data.py)
Added a new script _create_sample_data_assets.py to create a zip archive of sample data and generate a data_registry.txt file with SHA256 hashes for integrity checks. (src/napari_clusters_plotter/_create_sample_data_assets.py, R1-R44)
Updated project dependencies to include pooch for sample data management. (pyproject.toml, L44-R45)
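
The asset-creation step in the list above can be illustrated with the standard library alone. The sketch below is an assumption about what such a script might do, not the contents of _create_sample_data_assets.py: the function names, the archive name `sample_data.zip`, and the choice to record a single hash for the whole archive are all hypothetical.

```python
import hashlib
import zipfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_assets(data_dir: Path, out_dir: Path) -> tuple[Path, Path]:
    """Zip every file under `data_dir` and write a one-line registry file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    archive = out_dir / "sample_data.zip"      # hypothetical asset name
    registry = out_dir / "data_registry.txt"

    # Collect the files before the archive exists so it never zips itself.
    files = sorted(p for p in data_dir.rglob("*") if p.is_file())
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            zf.write(f, arcname=f.relative_to(data_dir).as_posix())

    # "<file name> <algorithm>:<hex digest>" is the line format that
    # pooch registry files use.
    registry.write_text(f"{archive.name} sha256:{sha256_of(archive)}\n")
    return archive, registry
```

A release job could call `build_assets(Path("sample_data"), Path("dist_assets"))` and attach the resulting zip to the release, while only the small registry file ships with the package.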

Build and Release Workflow:

Refactored the GitHub Actions workflow to separate building the distribution, publishing to PyPI, and attaching sample data as a release asset. The workflow now builds and uploads a sample data zip file as a separate asset on GitHub Releases, and only includes a registry file in the package. (.github/workflows/test_and_deploy.yml, L66-R126)

Packaging and Distribution:

Modified MANIFEST.in to exclude sample data files from the package and only include the data_registry.txt file, reducing package size and ensuring sample data is fetched remotely. (MANIFEST.in, L5-R7)


jo-mueller and others added 12 commits November 28, 2025 23:03

remove spaces and brackets in file name

22443ab

add pooch to deps

467dd7f

pull sample data from pooch registry

79a635e

use trusted publishing workflow and upload assets upon release

ec55df0

untrack sample data

3e56ee2

don't ship sample data

115581b

Create create_sample_data_assets.py

f4f7a7c

moved registry creation into parent

05701f4

download registry overview from assets

f93f157

untrack data_registry

286e2cc

moved up asset upload

a6787f4

[pre-commit.ci] auto fixes from pre-commit.com hooks

9185c76

for more information, see https://pre-commit.ci
