@takluyver

This aims to make it easier to apply a function to each frame of multi-module detector data, like in this screenshot:

[screenshot: example usage with Dask]

Azimuthal integration is one particular motivating use case.

Design:

  • The core idea is to batch frames together, so we're submitting fewer, larger tasks (there's a rough sketch of this after the list)
    • Each task loads a chunk of data (by default ~1000 frames), then runs the function on each frame sequentially
  • Ideally, you can use any map method: local thread/process pools, Dask (as in the screenshot), clusterfutures...
  • If the per-frame function has parameter names like mask or cellId, the corresponding data will be loaded and passed to it.
  • It returns a list (one result per frame) by default, but you can also ask for an array back, which should be a bit more efficient than calling np.stack() on the list.
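
To make the design concrete, here is a rough sketch of the batching pattern, not the code in this PR: names like load_chunk and map_frames are made up, and the signature inspection just illustrates how per-frame data such as cellId could be matched to the function's parameter names.

```python
# Illustrative sketch only - load_chunk/map_frames are hypothetical names,
# not the API added in this PR.
import inspect
import numpy as np
from functools import partial
from concurrent.futures import ProcessPoolExecutor

def load_chunk(chunk_ids):
    # Stand-in for reading one chunk of detector frames from file.
    # Real multi-module data would be more like (n, 16, 512, 128).
    return np.zeros((len(chunk_ids), 8, 8), dtype=np.float32)

def process_chunk(func, chunk_ids, extra_arrays):
    """One task: load a chunk, then run func on each frame sequentially."""
    frames = load_chunk(chunk_ids)
    # Pass extra per-frame data (e.g. cellId) only if func names it as a parameter
    wanted = set(inspect.signature(func).parameters) & set(extra_arrays)
    return [func(frame, **{name: extra_arrays[name][fid] for name in wanted})
            for fid, frame in zip(chunk_ids, frames)]

def map_frames(func, n_frames, map_method, chunk_size=1000, extra_arrays=None):
    """Submit one task per chunk of ~chunk_size frames via any map method."""
    chunks = [range(i, min(i + chunk_size, n_frames))
              for i in range(0, n_frames, chunk_size)]
    task = partial(process_chunk, func, extra_arrays=extra_arrays or {})
    return [res for batch in map_method(task, chunks) for res in batch]

def frame_sum(frame):
    return frame.sum()

if __name__ == '__main__':
    with ProcessPoolExecutor() as pool:
        sums = map_frames(frame_sum, 5000, pool.map)  # one result per frame
```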

Concerns & questions:

  • I needed some kludgy, Dask-specific workarounds to get this working nicely with Dask, which was one of my main goals. In particular, Dask was spending an inordinately long time making unique names for the tasks, until I overrode it with random names (first sketch below).
  • Azimuthal integration was the motivating use case, but this design is actually kind of inefficient for it. If you construct the AzimuthalIntegrator outside the function, you have to send about 100 MB of data with each batch task (for AGIPD-1M: the 3D positions of each corner of ~1 million pixels). If you construct it inside the function, you redo that work for every frame. 🤔 (One possible middle ground is in the second sketch below.)
  • Possible extension: add a parameter so that if out_shared=True, workers write directly into a shared output array, rather than serialising data to send back (third sketch below).
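
For context on the naming issue: by default Dask derives a task key by tokenizing (hashing) the task's inputs, which can be slow. A hedged sketch of the random-name workaround, assuming the batch tasks are built with dask.delayed (the PR's actual code may differ):

```python
import uuid
from dask import delayed, compute

def process_batch(chunk_ids):
    # Placeholder: load this chunk and apply the per-frame function
    return [i * 2 for i in chunk_ids]

# dask_key_name sets the task key directly, so Dask doesn't have to hash
# the inputs to invent a unique, deterministic name for each task.
tasks = [delayed(process_batch)(range(i, i + 1000),
                                dask_key_name=f'batch-{uuid.uuid4().hex}')
         for i in range(0, 4000, 1000)]
results = compute(*tasks)
```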
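On the integrator trade-off, one possible middle ground (my suggestion, not something this PR does) is to construct the AzimuthalIntegrator lazily inside the worker and cache it per process, so it's neither serialised with every batch nor rebuilt for every frame. A sketch, where the .poni file is a placeholder for however the AGIPD geometry actually gets defined:

```python
import functools

@functools.lru_cache(maxsize=1)
def get_integrator():
    # Built at most once per worker process, since each process has its
    # own cache. The geometry file here is a placeholder.
    import pyFAI
    return pyFAI.load('agipd_geometry.poni')

def integrate_frame(frame):
    ai = get_integrator()                # cached after the first frame
    return ai.integrate1d(frame, 1000)   # 1000 radial bins
```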
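And a minimal sketch of what out_shared=True could look like with a local process pool, using multiprocessing.shared_memory so workers write results in place instead of pickling them to send back (again an illustration of the idea, not a proposed implementation):

```python
import numpy as np
from multiprocessing import shared_memory
from concurrent.futures import ProcessPoolExecutor

def worker(shm_name, shape, start, chunk):
    # Attach to the parent's output array by name and write results in place
    shm = shared_memory.SharedMemory(name=shm_name)
    out = np.ndarray(shape, dtype=np.float64, buffer=shm.buf)
    for i, frame in enumerate(chunk):
        out[start + i] = frame.sum()     # placeholder per-frame function
    shm.close()

if __name__ == '__main__':
    n_frames, chunk_size = 4000, 1000
    shm = shared_memory.SharedMemory(create=True, size=n_frames * 8)
    result = np.ndarray((n_frames,), dtype=np.float64, buffer=shm.buf)
    data = np.ones((n_frames, 64, 64))
    with ProcessPoolExecutor() as pool:
        futs = [pool.submit(worker, shm.name, result.shape, s,
                            data[s:s + chunk_size])
                for s in range(0, n_frames, chunk_size)]
        [f.result() for f in futs]       # wait for all batches
    # ... use result here ...
    shm.close()
    shm.unlink()
```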

