Skip to content

Conversation

@philsmt
Copy link
Contributor

@philsmt philsmt commented Mar 22, 2022

As it came up again in European-XFEL/pasha#14, here's a proof-of-concept on how to move the ExtraDataFunctor implementation to EXtra-data itself. This would allow it to take advantage of private APIs as well as test properly (TBD). So far it's almost exactly the same implementation only with unnecessary import checks removed.

As the code is only loaded conditionally, it does not add an actual dependency. The import of gen_split_slices could even be removed once SourceData has its own .split_trains() method.

Any ideas for the module name?

@takluyver
Copy link
Member

Thanks. I think pasha_functor is fine for the module name - it shouldn't normally be visible in any case.

As we were just discussing, .split_trains() doesn't fully work for this, because the objects it produces don't know the index of their trains in the original object. I think it might be a useful enhancement for pasha to define a default splitting based on len(functor), so we don't need gen_split_slices everywhere. But we do also have basically the same code in read_machinery (as split_trains), so it's easy enough for this purpose.

@takluyver takluyver added the enhancement New feature or request label Apr 6, 2022
@takluyver
Copy link
Member

Do you want to add a test for the pasha integration, BTW?

@philsmt
Copy link
Contributor Author

philsmt commented Apr 27, 2022

Yes. How would the tests work dependency-wise? Do we add pasha as an optional feature, or not all?

I do not much like the hack for iterating SourceData objects anymore. I've looked into supporting it properly in the class natively, but I think the proper solution for this is to move various find-data-and-chunks methods from DataCollectiontoSourceData, in particular since we now have SourceData` objects anyway at all times. So let's move this to a later release, and I will remove the support here.

# different process than the functor was initially created in,
# close all file handles inherited from the parent collection to
# force re-opening them again in each child process.
if getpid() != self._parent_pid:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need this at all with the dependencies we require?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not really.

It's a bit tricky to be definitive because the change is in HDF5, and h5py can be built with a wide range of HDF5 versions. EXtra-data supports h5py 2.10, and the pre-built packages of 2.10 on PyPI have HDF5 1.10.4. So I might bump the required version to >= 3.0, just to be a bit cautious. I hope very few people are still stuck on h5py 2.x.

@takluyver
Copy link
Member

I would add pasha to the tests extra in setup.py (which already includes things like cloudpickle and dask).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants