DPL Analysis: improve grouping performance further #14600
Conversation
Error while checking build/O2/fullCI_slc9 for 67d8b69 at 2025-09-03 12:00: Full log here.
@ddobrigk could you check with your benchmark? |
I'm running the benchmarks now - @aalkin can you confirm these changes are for the standard sorted grouping and not unsorted? Thanks! |
Hi @ktf @aalkin, sorry for not writing before: here is the outcome of the benchmarks. It seems the new code has better performance in the asymptotic regime of large groups and is essentially unaltered for small group cases: see attached figure. I would say this can be merged so that we improve the asymptotic behaviour. Thanks a lot for the work! Maybe a question to @aalkin : that unsorted performance trend is still pretty bad, at 1000x worse than the custom solution in the large-group limit. It would be good if we could improve that at some point as well...
Merging following @ddobrigk's comments. We should have some sort of benchmark / validation suite for O2Physics and run it in the CI.
Co-authored-by: ALICE Action Bot <alibuild@cern.ch>

@ddobrigk @ktf
By skipping most of the heavy processing for zero-size slices and avoiding the allocations made by the value_counts Arrow function - instead calculating sizes and offsets in a single loop over the index column - I was able to reduce the processing time by an extra ~1 second per 100 000 events.
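For illustration only, here is a minimal sketch of that kind of single-pass size/offset computation over an index column sorted by group id; the function and variable names are hypothetical and this is not the actual DPL code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: given an index column sorted by group id (e.g. tracks
// sorted by collision id), derive per-group sizes and offsets in a single pass,
// without invoking Arrow's value_counts and without allocating anything extra
// for empty groups.
void computeSizesAndOffsets(std::vector<int32_t> const& sortedIndex, std::size_t numGroups,
                            std::vector<int64_t>& sizes, std::vector<int64_t>& offsets)
{
  sizes.assign(numGroups, 0);
  offsets.assign(numGroups, 0);
  std::size_t row = 0;
  // Rows with a negative index (unassigned entries) belong to no group; skip them.
  while (row < sortedIndex.size() && sortedIndex[row] < 0) {
    ++row;
  }
  for (std::size_t group = 0; group < numGroups; ++group) {
    offsets[group] = static_cast<int64_t>(row);
    // Advance over all rows belonging to this group; a zero-size group needs no
    // extra work and simply shares its offset with the next group.
    while (row < sortedIndex.size() && static_cast<std::size_t>(sortedIndex[row]) == group) {
      ++row;
    }
    sizes[group] = static_cast<int64_t>(row) - offsets[group];
  }
}
```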
Profiling shows that the remaining ~3 seconds per 100 000 events are spent almost entirely in the creation of non-empty slices, which involves string-based lookup of columns and binding of the iterators. In addition, a considerable fraction of the time goes into the destruction of the shared pointers to Arrow tables when the slices go out of scope.
The "naive" code by @ddobrigk, since it does not construct any tables at all, avoids the overhead completely. But we cannot, of course, use this as a generic approach.
The next improvement I can make is to avoid constructing empty Arrow table slices altogether, while still providing the soa::Table slices to the user with zero size and no bindings. That should reduce the processing time by roughly another 15%. Switching from string-based column lookup to order-based lookup would give a huge performance boost as well, but for that we need to determine whether we still have any data in which the branch order in the stored trees differs from the column order in the data model.
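As a rough illustration of the lookup difference (a sketch against the public Apache Arrow C++ API, not the DPL internals): with name-based access every slice pays for string comparisons against the schema, while positional access is a plain indexed read that only works if the stored branch order matches the data-model column order.

```cpp
#include <arrow/table.h>

#include <memory>
#include <string>

// Name-based lookup: resolved per access via string comparison against the schema.
std::shared_ptr<arrow::ChunkedArray> columnByName(std::shared_ptr<arrow::Table> const& table,
                                                  std::string const& name)
{
  return table->GetColumnByName(name); // returns nullptr if the column is absent
}

// Order-based lookup: a plain indexed access, valid only when the column order
// in the stored tree is guaranteed to match the data-model order.
std::shared_ptr<arrow::ChunkedArray> columnByPosition(std::shared_ptr<arrow::Table> const& table,
                                                      int index)
{
  return table->column(index);
}
```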