Fetching sample datasets remotely #465
Open
Title. With more and bigger datasets, I felt like the amount of data shipped with the plugin has gotten a little out of hand. Currently we ship around ~10 MB of sample data, which seems unnecessarily large. I looked into the pooch library, which basically lets you download files (such as sample data; scikit-image uses it for the same purpose) from a data registry whenever they are needed.
I figured it would be a neat thing to add the sample data as assets to the GitHub releases and then download everything from there at runtime. Of course this wouldn't work while offline, but I figure you'd have a hard time installing the plugin (along with 10 MB of sample data) when offline anyway, so... 🤷♂️
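To illustrate the idea, here is a minimal sketch of fetching release assets with pooch. The base URL, release tag, cache name, and the fetch_sample/parse_registry_line helpers are all hypothetical, not the plugin's actual code:

```python
# Hypothetical sketch: download (and cache) a zipped sample dataset from a
# GitHub release using pooch, verifying it against a registry of hashes.


def fetch_sample(filename: str) -> list[str]:
    """Fetch a release asset and return the paths of its unzipped members."""
    import pooch  # imported lazily; only needed when data is actually fetched

    pup = pooch.create(
        path=pooch.os_cache("napari-clusters-plotter"),  # per-user cache dir
        base_url="https://github.com/ORG/REPO/releases/download/v1.0.0/",
        registry={filename: None},  # None skips the hash check; real code
        # would load hashes via pup.load_registry("data_registry.txt")
    )
    # pooch downloads the file once, caches it, and unpacks the archive.
    return pup.fetch(filename, processor=pooch.Unzip())


def parse_registry_line(line: str) -> tuple[str, str]:
    """Split one '<filename> <hash>' line of a pooch registry file."""
    name, file_hash = line.split()
    return name, file_hash
```

A caller would then do something like `fetch_sample("sample_data.zip")` and read the returned file paths with the usual image/table readers.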
In this PR, I essentially refactored the _sample_data.py functions to fetch the sample data remotely instead of shipping it with the package.

Long Copilot description:
This pull request introduces significant improvements to how sample data is managed, distributed, and loaded in the project. The main changes include switching to remote fetching of sample data using the pooch library, refactoring the sample data loading functions to use this new mechanism, updating the build and release workflow to handle sample data as a separate asset, and adjusting the packaging configuration to exclude sample data from the distribution. These updates make the sample data more maintainable, reduce package size, and streamline the release process.

Sample Data Handling and Loading:
- Introduced the pooch library for remote fetching and caching of sample data, replacing previous methods that relied on local files. Added new functions load_image, load_tabular, and load_registry in _sample_data.py to load data from a remotely-fetched zip archive, and refactored all sample data functions to use these loaders. (src/napari_clusters_plotter/_sample_data.py)
- Added a new script, _create_sample_data_assets.py, to create a zip archive of sample data and generate a data_registry.txt file with SHA256 hashes for integrity checks. (src/napari_clusters_plotter/_create_sample_data_assets.py, R1-R44)
- Added pooch as a dependency for sample data management. (pyproject.toml, L44-R45)

Build and Release Workflow:
- Updated the build and release workflow to handle the sample data as a separate release asset. (.github/workflows/test_and_deploy.yml, L66-R126)

Packaging and Distribution:
- Updated MANIFEST.in to exclude sample data files from the package and include only the data_registry.txt file, reducing package size and ensuring sample data is fetched remotely. (MANIFEST.in, L5-R7)
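For context, an asset-creation script along the lines of _create_sample_data_assets.py could be sketched as follows. The archive name, directory layout, and function name are illustrative assumptions, not the actual implementation:

```python
# Hypothetical sketch: bundle the sample data into a zip archive and record
# its SHA256 hash in a data_registry.txt file so pooch can verify downloads.
import hashlib
import zipfile
from pathlib import Path


def create_sample_data_assets(data_dir: Path, out_dir: Path) -> Path:
    """Zip everything under data_dir and write a pooch-style registry file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    archive = out_dir / "sample_data.zip"

    # Zip every file under the sample-data directory, keeping relative paths.
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(data_dir.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(data_dir))

    # Hash the archive and write one registry line: "<filename> <sha256 hex>"
    digest = hashlib.sha256(archive.read_bytes()).hexdigest()
    registry = out_dir / "data_registry.txt"
    registry.write_text(f"{archive.name} {digest}\n")
    return registry
```

The resulting sample_data.zip would be uploaded as a release asset by the workflow, while only data_registry.txt ships inside the package for integrity checks.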