[NN Clusterizer] CCDB fetching within reco workflow #14841

ChSonnabend · 2025-11-21T15:34:26Z

This PR removes the previous CCDB fetching and adds fetching during initialisation in GPUWorkflowSpec.cxx.

github-actions · 2025-11-21T15:34:36Z

REQUEST FOR PRODUCTION RELEASES:
To request your PR to be included in production software, please add the corresponding labels called "async-" to your PR. Add the labels directly (if you have the permissions) or add a comment of the form (note that labels are separated by a ",")

+async-label <label1>, <label2>, !<label3> ...

This will add <label1> and <label2> and remove <label3>.

The following labels are available
async-2023-pbpb-apass4
async-2023-pp-apass4
async-2024-pp-apass1
async-2022-pp-apass7
async-2024-pp-cpass0
async-2024-PbPb-apass1
async-2024-ppRef-apass1
async-2024-PbPb-apass2
async-2023-PbPb-apass5

ChSonnabend · 2025-11-21T15:41:46Z

@davidrohr
A couple of comments / questions:

1. I now call *mConfParam = mConfig->ReadConfigurableParam(); in the task initialisation since I need the options from GPUSettingsList.h. Is this fine?
2. The current implementation only loads the network from CCDB into a file and then loads it from that file back into the ONNX framework later on. I still need to add the direct read from the char* and the propagation to the clusterization task. Here is the question: the overhead is minimal as these files are very small. Would this even be necessary, or is it acceptable to just fetch it as a file and load it back in at a later stage? Is file fetching desirable in all cases, or should it also work without storing a local file?
3. I currently load the files from the CCDB as char* using .payload and then split off the actual file content and header data manually (see dumpOnnxToFile in GPUWorkflowSpec.cxx). Is there a smarter way? Maybe a function that I can use instead of manually searching for a substring?
All of this is tested and works for file fetching and loading into the reco workflow.

ktf · 2025-11-21T16:20:26Z

Please, no intermediate files.

davidrohr · 2025-11-21T16:25:29Z

@ChSonnabend :

For 1.: I do not think this will cause a problem now, but I do not really like it, since it can have side effects. But this should be easy to fix. Let me think a bit. Perhaps we can even move the ReadConfigurableParam to the constructor, to call it only once. I will have a look when I am back. Otherwise, we can also access the settings directly from a local instance.

Creating local files is not a good idea. E.g., it is not even guaranteed that the workdir is writeable. If you create local files, they should go to some tmpdir and have temporary names. E.g., if 2 workflows run in parallel in the same directory, they must not overwrite each other's files.
In general, all calibration objects should go to the GPUCalibObjectsTemplate struct. So far, I only allow members that derive from FlatObject or are POD structs. However, this is not enforced. I think if you add a ptr to a CCDB object here, it should not be a problem.
You should get the data from the CCDB objects from within GPURecoWorkflowSpec::finaliseCCDBTPC(...). I assume you can get the ptrs in the same way as we do for the other objects. For sure you should not do a string search.
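
For the case where a local file is still wanted, the tmpdir and temporary-name requirement above can be sketched as follows. This is a minimal illustration, not code from this PR: the helper name writeToUniqueTempFile and the nn_model_ prefix are hypothetical, and it assumes a POSIX system where mkstemp is available.

```cpp
#include <cstddef>
#include <cstdlib>     // mkstemp (POSIX)
#include <filesystem>
#include <stdexcept>
#include <string>
#include <vector>
#include <unistd.h>    // write, close

// Hypothetical helper: writes `data` to a uniquely named file in the system
// temp directory and returns its path. The caller must remove the file.
std::string writeToUniqueTempFile(const std::vector<char>& data)
{
  // mkstemp replaces the trailing XXXXXX with a unique suffix and creates the
  // file atomically, so two workflows running in parallel get distinct files.
  std::string path = (std::filesystem::temp_directory_path() / "nn_model_XXXXXX").string();
  int fd = mkstemp(path.data());
  if (fd < 0) {
    throw std::runtime_error("could not create temporary file");
  }
  std::size_t written = 0;
  while (written < data.size()) {
    ssize_t n = write(fd, data.data() + written, data.size() - written);
    if (n <= 0) {
      // Clean up the partially written file before reporting the failure.
      close(fd);
      std::filesystem::remove(path);
      throw std::runtime_error("could not write temporary file");
    }
    written += static_cast<std::size_t>(n);
  }
  close(fd);
  return path;
}
```

Since the file name is generated at run time, nothing is written into the working directory, which may not be writeable.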

One point we have to think about though is: Do we want to allow changing the network during processing? I.e., what happens when the time range of the CCDB object does not cover the entire run? In that case, we will receive a new CCDB object at some point while we are processing. That also means that the old CCDB object might be gone by then, so we must no longer access its memory.
If we do not want to allow this, we should initialize at the first time frame, but when we receive a different object at a later time frame, we should throw a fatal indicating that this is not supported.
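
The "initialize once, fail on change" option can be sketched as below. This is only an illustration under stated assumptions: the class FixedModelHolder and its methods are hypothetical, a std::runtime_error stands in for the fatal, and the payload is copied into an owned buffer so that the memory of an expired CCDB object is never touched.

```cpp
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical holder for the serialized network.
class FixedModelHolder
{
 public:
  // Called whenever a payload arrives. The first payload is copied and kept;
  // any later payload that differs from it is rejected.
  void onPayload(const char* data, std::size_t size)
  {
    std::vector<char> incoming(data, data + size);
    if (!mInitialized) {
      mModel = std::move(incoming);
      mInitialized = true;
      return;
    }
    if (incoming != mModel) {
      // Stand-in for the fatal mentioned above.
      throw std::runtime_error("NN model changed during processing, this is not supported");
    }
  }

  // Owned copy of the first payload; valid independently of the CCDB object.
  const std::vector<char>& model() const { return mModel; }

 private:
  std::vector<char> mModel;
  bool mInitialized = false;
};
```

Receiving the same payload again is accepted silently, so a refetch of an identical object would not abort the run in this sketch.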

Finally, GPU reconstruction supports processing 2 time frames at the same time to hide GPU transfer delays. For this reason, I create a copy of all the calib objects, so when the second TF arrives, I still have the old object for the first TF. This is a bit hacky. And if we do not need to support changing the NN objects during a run, we do not need this functionality for the NN objects.
But this updating of objects is the reason that we have all these additional buffers like mdEdxCalibContainerBufferNew in finaliseCCDBTPC. If we really do not need this, it should be OK to directly store the pointers we are getting from the CCDB fetching to mCalibObjects (the GPUCalibObjectsTemplate struct), without creating a copy.
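
To illustrate why an in-flight time frame needs its own view of the calibration when objects can be updated, here is a small sketch. It is not how finaliseCCDBTPC handles this (that code uses explicit copies in buffers such as mdEdxCalibContainerBufferNew); it swaps in reference-counted snapshots as an alternative technique, and the names CalibSnapshot and CalibStore are hypothetical. The sketch is single-threaded and not thread-safe.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Hypothetical immutable view of one calibration payload.
struct CalibSnapshot {
  std::vector<char> payload;
};

// Hypothetical store: update() installs a new snapshot, acquire() hands out
// the current one. A time frame that acquired the old snapshot keeps it alive
// until it releases its shared_ptr.
class CalibStore
{
 public:
  void update(std::vector<char> payload)
  {
    mCurrent = std::make_shared<const CalibSnapshot>(CalibSnapshot{std::move(payload)});
  }

  std::shared_ptr<const CalibSnapshot> acquire() const { return mCurrent; }

 private:
  std::shared_ptr<const CalibSnapshot> mCurrent;
};
```

If the NN objects never change during a run, none of this is needed and a single stored pointer is sufficient, as noted above.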

alibuild · 2025-11-22T03:19:25Z

Error while checking build/O2/fullCI_slc9 for 125f3e2 at 2025-11-22 16:23:

## sw/BUILD/O2-latest/log
CMake Error at /sw/slc9_x86-64/CMake/v3.31.6-6/share/cmake-3.31/Modules/FindPackageHandleStandardArgs.cmake:233 (message):

Full log here.

ChSonnabend · 2025-11-23T10:56:21Z

The new version now uses the internal char* buffer to load the model. Loading is successful in the clusterization task, however it fails at some internal stage within ONNX runtime (although model loading from file works and the buffer and size that I pass are correct). I am not sure yet what the problem is. Investigating.

Otherwise, is this solution feasible? Dumping to file is implemented and can be switched on or off on demand.

alibuild · 2025-11-23T11:34:20Z

Error while checking build/O2/fullCI_slc9 for 5ce258c at 2025-11-23 12:34:

## sw/BUILD/O2-latest/log
CMake Error at /sw/slc9_x86-64/CMake/v3.31.6-6/share/cmake-3.31/Modules/FindPackageHandleStandardArgs.cmake:233 (message):

Full log here.

alibuild · 2025-11-25T00:09:35Z

Error while checking build/O2/fullCI_slc9 for 4fed621 at 2025-11-25 01:09:

## sw/BUILD/O2Physics-latest/log
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState: (line repeated 48 times)
[0 more errors; see full log]

Full log here.

ChSonnabend · 2025-11-25T10:29:08Z

Open points:

Specifying different metadata fetches the same object at runtime (@ktf)
*mConfParam = mConfig->ReadConfigurableParam(); needs to be called in order to have the settings available (@davidrohr)
Is there a better place to put O2/GPU/GPUTracking/utils/convert_onnx_to_root_serialized.C (a utility script that makes the fetching and uploading easier)?

alibuild · 2025-11-25T19:23:48Z

Error while checking build/O2/fullCI_slc9 for 6cba1f3 at 2025-11-25 20:23:

## sw/BUILD/O2-latest/log
CMake Error at cmake/O2ReportNonTestedMacros.cmake:89 (message):

Full log here.

alibuild · 2025-11-27T02:50:10Z

Error while checking build/O2/fullCI_slc9 for 979a8d5 at 2025-11-27 03:50:

## sw/BUILD/O2Physics-latest/log
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState: (line repeated 48 times)
[0 more errors; see full log]

Full log here.

ChSonnabend added 10 commits October 18, 2025 00:55

Improve GPU filling kernel speed

9c8984c

Adjusting parameter bounds and additional GPU kernel optimizations

a075c43

Adding back if statement for early exit

587c3e6

const'ing + fixing CPU kernel

6e43257

Remiving print statements

bb795c4

Fixing CI build issue

f7cdc0b

Merge branch 'dev' into devel

8c7d5f4

Merge branch 'AliceO2Group:dev' into devel

a2aaf8e

Working version of NN CCDB fetching and loading to file

3775044

Cleanup

a963c01

ChSonnabend requested review from a team, davidrohr, shahor02 and wiechula as code owners November 21, 2025 15:34

Please consider the following formatting changes

caf20fc

alibuild mentioned this pull request Nov 21, 2025

Please consider the following formatting changes to #14841 ChSonnabend/AliceO2#38

Merged

Merge pull request #38 from alibuild/alibot-cleanup-14841

125f3e2

ChSonnabend and others added 2 commits November 23, 2025 11:52

Using char* buffer for model loading

5284b01

Please consider the following formatting changes

ab19782

alibuild mentioned this pull request Nov 23, 2025

Please consider the following formatting changes to #14841 ChSonnabend/AliceO2#39

Merged

Merge pull request #39 from alibuild/alibot-cleanup-14841

5ce258c

Bug-fix

4fed621

ChSonnabend and others added 2 commits November 25, 2025 11:26

Working version of CCDB fetching and loading into ROOT class of std::vector<char>

e7cd6fa

Please consider the following formatting changes

9ed60e9

alibuild mentioned this pull request Nov 25, 2025

Please consider the following formatting changes to #14841 ChSonnabend/AliceO2#40

Merged

Merge pull request #40 from alibuild/alibot-cleanup-14841

ae1d630

Disable dumpToFile by default

6cba1f3

Moving macro, adding o2-test

5c6d214

ChSonnabend requested a review from a team as a code owner November 25, 2025 20:05

Merge branch 'dev' into devel

979a8d5

davidrohr merged commit 1d620f2 into AliceO2Group:dev Nov 28, 2025
9 of 11 checks passed