@jo-mueller
Collaborator

Title. With more and bigger datasets, the amount of data shipped with the plugin has gotten a little out of hand. We are currently looking at around 10 MB of sample data, which seems unnecessarily large. I looked into the pooch library (also used by scikit-image for its sample data), which downloads files from a data registry whenever they are needed.

I figured it would be neat to add the sample data as assets to the GitHub releases and then download everything from there at runtime. Of course this wouldn't work offline, but you'd have a hard time installing the plugin (along with 10 MB of sample data) while offline anyway, so... 🤷‍♂️

In this PR, I essentially:

  • implemented pooch in the _sample_data.py functions
  • added a small script that packs up all sample data in a zip file
  • set up the upload of that zip file as an asset to the latest GitHub release, after the release has been created
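As a rough illustration of the fetching side, here is a minimal sketch of how pooch can pull a zip from a release asset URL and unpack it into a local cache. The repository name, asset name, cache folder and both function names are hypothetical placeholders, not taken from this PR; only the pooch calls (`create`, `os_cache`, `load_registry`, `fetch`, `Unzip`) are the library's own.

```python
from pathlib import Path

# Hypothetical placeholders: the real repository and asset name are not stated here.
GITHUB_REPO = "example-org/example-plugin"
ASSET_NAME = "sample_data.zip"


def release_asset_url(tag: str, repo: str = GITHUB_REPO) -> str:
    """Return the base URL under which GitHub serves the assets of release `tag`."""
    return f"https://github.com/{repo}/releases/download/{tag}/"


def fetch_sample_data(tag: str, registry_file: Path) -> list[str]:
    """Download the sample-data zip (or reuse the cached copy) and unzip it.

    Returns the paths of the extracted files.
    """
    # Imported lazily so that pooch is only needed when sample data is requested.
    import pooch

    fetcher = pooch.create(
        path=pooch.os_cache("example-plugin"),  # hypothetical cache folder name
        base_url=release_asset_url(tag),
        registry=None,
    )
    # The registry file maps file names to their expected hashes.
    fetcher.load_registry(registry_file)
    # The download only happens if the file is missing or its hash changed.
    return fetcher.fetch(ASSET_NAME, processor=pooch.Unzip())
```

After the first successful call the data comes from the local cache, so later calls work offline as long as the registry hash has not changed.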

Long copilot description

This pull request introduces significant improvements to how sample data is managed, distributed, and loaded in the project. The main changes include switching to remote fetching of sample data using the pooch library, refactoring sample data loading functions to use this new mechanism, updating the build and release workflow to handle sample data as a separate asset, and adjusting the packaging configuration to exclude sample data from the distribution. These updates make the sample data more maintainable, reduce package size, and streamline the release process.
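To make the "load from a fetched zip archive" idea concrete, the following is a small standard-library sketch that reads one CSV table out of a zip file without extracting it. The function name `read_table_from_zip` and the member path are made up for illustration; the loaders actually added in this PR may work differently (for example on already extracted files).

```python
import csv
import io
import zipfile
from pathlib import Path


def read_table_from_zip(zip_path: Path, member: str) -> list[dict[str, str]]:
    """Read one CSV file stored inside a zip archive into a list of row dicts."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            # ZipFile.open yields bytes; wrap it so csv can read text.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            # Materialise the rows before the archive is closed.
            return list(csv.DictReader(text))
```

All values come back as strings here; a real loader would most likely hand the data to pandas or a similar library instead.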

Sample Data Handling and Loading:

  • Introduced the pooch library for remote fetching and caching of sample data, replacing previous methods that relied on local files. Added new functions load_image, load_tabular, and load_registry in _sample_data.py to load data from a remotely-fetched zip archive, and refactored all sample data functions to use these loaders. (src/napari_clusters_plotter/_sample_data.py)
  • Added a new script _create_sample_data_assets.py to create a zip archive of sample data and generate a data_registry.txt file with SHA256 hashes for integrity checks. (src/napari_clusters_plotter/_create_sample_data_assets.py, R1-R44)
  • Updated project dependencies to include pooch for sample data management. (pyproject.toml, L44-R45)
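The asset-creation step can be sketched with the standard library alone. This is an assumption-laden illustration, not the contents of _create_sample_data_assets.py: the function names, the default zip name, and the choice to hash the zip itself (rather than each file inside it) are all guesses. The registry line uses the `<file name> <sha256 hex digest>` layout that pooch registry files use.

```python
import hashlib
import zipfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_assets(data_dir: Path, out_dir: Path, zip_name: str = "sample_data.zip") -> Path:
    """Zip every file under `data_dir` and write a one-line registry next to the zip.

    `out_dir` should live outside `data_dir`, otherwise the archive would
    try to include itself. Returns the path of the registry file.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    zip_path = out_dir / zip_name
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file in sorted(data_dir.rglob("*")):
            if file.is_file():
                # Store paths relative to the data folder, with forward slashes.
                archive.write(file, file.relative_to(data_dir).as_posix())
    registry = out_dir / "data_registry.txt"
    registry.write_text(f"{zip_name} {sha256_of(zip_path)}\n")
    return registry
```

Hashing the zip means any change to the sample data produces a new registry entry, so cached copies on user machines get refreshed on the next fetch.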

Build and Release Workflow:

  • Refactored the GitHub Actions workflow to separate building the distribution, publishing to PyPI, and attaching sample data as a release asset. The workflow now builds and uploads a sample data zip file as a separate asset on GitHub Releases, and only includes a registry file in the package. (.github/workflows/test_and_deploy.yml, L66-R126)

Packaging and Distribution:

  • Modified MANIFEST.in to exclude sample data files from the package and only include the data_registry.txt file, reducing package size and ensuring sample data is fetched remotely. (MANIFEST.in, L5-R7)
