@takluyver

This aims to make it easier to apply a function to each frame of multi-module detector data, like in this screenshot:

[screenshot: example usage with Dask]

Azimuthal integration is one particular motivating use case.

Design:

  • The core idea is to batch frames together, so we're submitting fewer, larger tasks (there's a rough sketch of this after the list)
    • Each task loads a chunk of data (by default ~1000 frames), then runs the function on each frame sequentially
  • Ideally, you can use any map method: local thread/process pools, Dask (as in the screenshot), clusterfutures...
  • If the per-frame function has parameter names like mask or cellId, the corresponding data will be loaded and passed to it.
  • It returns a list (one result per frame) by default, but you can also ask for an array back, which should be a bit more efficient than calling np.stack() on the list.
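
To make the design concrete, here is a rough sketch of the batching pattern, not the code in this PR: names like load_chunk and map_frames are made up, and the signature inspection just illustrates how per-frame data such as cellId could be matched to the function's parameter names.

```python
# Illustrative sketch only - load_chunk/map_frames are hypothetical names,
# not the API added in this PR.
import inspect
import numpy as np
from functools import partial
from concurrent.futures import ProcessPoolExecutor

def load_chunk(chunk_ids):
    # Stand-in for reading one chunk of detector frames from file.
    # Real multi-module data would be more like (n, 16, 512, 128).
    return np.zeros((len(chunk_ids), 8, 8), dtype=np.float32)

def process_chunk(func, chunk_ids, extra_arrays):
    """One task: load a chunk, then run func on each frame sequentially."""
    frames = load_chunk(chunk_ids)
    # Pass extra per-frame data (e.g. cellId) only if func names it as a parameter
    wanted = set(inspect.signature(func).parameters) & set(extra_arrays)
    return [func(frame, **{name: extra_arrays[name][fid] for name in wanted})
            for fid, frame in zip(chunk_ids, frames)]

def map_frames(func, n_frames, map_method, chunk_size=1000, extra_arrays=None):
    """Submit one task per chunk of ~chunk_size frames via any map method."""
    chunks = [range(i, min(i + chunk_size, n_frames))
              for i in range(0, n_frames, chunk_size)]
    task = partial(process_chunk, func, extra_arrays=extra_arrays or {})
    return [res for batch in map_method(task, chunks) for res in batch]

def frame_sum(frame):
    return frame.sum()

if __name__ == '__main__':
    with ProcessPoolExecutor() as pool:
        sums = map_frames(frame_sum, 5000, pool.map)  # one result per frame
```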

Concerns & questions:

  • I needed some kludgy, Dask-specific workarounds to get this working nicely with Dask, which was one of my main goals. In particular, Dask was spending an inordinately long time making unique names for the tasks, until I overrode it with random names (first sketch below).
  • Azimuthal integration was the motivating use case, but this design is actually kind of inefficient for it. If you construct the AzimuthalIntegrator outside the function, you have to send about 100 MB of data with each batch task (for AGIPD-1M: the 3D positions of each corner of ~1 million pixels). If you construct it inside the function, you redo that work for every frame. 🤔 (One possible middle ground is in the second sketch below.)
  • Possible extension: add a parameter so that if out_shared=True, workers write directly into a shared output array, rather than serialising data to send back (third sketch below).
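
For context on the naming issue: by default Dask derives a task key by tokenizing (hashing) the task's inputs, which can be slow. A hedged sketch of the random-name workaround, assuming the batch tasks are built with dask.delayed (the PR's actual code may differ):

```python
import uuid
from dask import delayed, compute

def process_batch(chunk_ids):
    # Placeholder: load this chunk and apply the per-frame function
    return [i * 2 for i in chunk_ids]

# dask_key_name sets the task key directly, so Dask doesn't have to hash
# the inputs to invent a unique, deterministic name for each task.
tasks = [delayed(process_batch)(range(i, i + 1000),
                                dask_key_name=f'batch-{uuid.uuid4().hex}')
         for i in range(0, 4000, 1000)]
results = compute(*tasks)
```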
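On the integrator trade-off, one possible middle ground (my suggestion, not something this PR does) is to construct the AzimuthalIntegrator lazily inside the worker and cache it per process, so it's neither serialised with every batch nor rebuilt for every frame. A sketch, where the .poni file is a placeholder for however the AGIPD geometry actually gets defined:

```python
import functools

@functools.lru_cache(maxsize=1)
def get_integrator():
    # Built at most once per worker process, since each process has its
    # own cache. The geometry file here is a placeholder.
    import pyFAI
    return pyFAI.load('agipd_geometry.poni')

def integrate_frame(frame):
    ai = get_integrator()                # cached after the first frame
    return ai.integrate1d(frame, 1000)   # 1000 radial bins
```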
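And a minimal sketch of what out_shared=True could look like with a local process pool, using multiprocessing.shared_memory so workers write results in place instead of pickling them to send back (again an illustration of the idea, not a proposed implementation):

```python
import numpy as np
from multiprocessing import shared_memory
from concurrent.futures import ProcessPoolExecutor

def worker(shm_name, shape, start, chunk):
    # Attach to the parent's output array by name and write results in place
    shm = shared_memory.SharedMemory(name=shm_name)
    out = np.ndarray(shape, dtype=np.float64, buffer=shm.buf)
    for i, frame in enumerate(chunk):
        out[start + i] = frame.sum()     # placeholder per-frame function
    shm.close()

if __name__ == '__main__':
    n_frames, chunk_size = 4000, 1000
    shm = shared_memory.SharedMemory(create=True, size=n_frames * 8)
    result = np.ndarray((n_frames,), dtype=np.float64, buffer=shm.buf)
    data = np.ones((n_frames, 64, 64))
    with ProcessPoolExecutor() as pool:
        futs = [pool.submit(worker, shm.name, result.shape, s,
                            data[s:s + chunk_size])
                for s in range(0, n_frames, chunk_size)]
        [f.result() for f in futs]       # wait for all batches
    # ... use result here ...
    shm.close()
    shm.unlink()
```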

