@ChSonnabend
Collaborator

This PR removes the previous CCDB fetching and adds fetching during initialisation in GPUWorkflowSpec.cxx

@github-actions
Contributor

REQUEST FOR PRODUCTION RELEASES:
To request your PR to be included in production software, please add the corresponding labels called "async-" to your PR. Add the labels directly (if you have the permissions) or add a comment of the form (note that labels are separated by a ",")

+async-label <label1>, <label2>, !<label3> ...

This will add <label1> and <label2> and remove <label3>.

The following labels are available
async-2023-pbpb-apass4
async-2023-pp-apass4
async-2024-pp-apass1
async-2022-pp-apass7
async-2024-pp-cpass0
async-2024-PbPb-apass1
async-2024-ppRef-apass1
async-2024-PbPb-apass2
async-2023-PbPb-apass5

@ChSonnabend
Collaborator Author

@davidrohr
A couple of comments / questions:

  1. I now call *mConfParam = mConfig->ReadConfigurableParam(); in the task initialisation since I need the options from GPUSettingsList.h. Is this fine?
  2. The current implementation only loads the network from CCDB into a file and then loads it from that file back into the ONNX framework later on. I still need to add the direct read from the char* and the propagation to the clusterization task. My question: the overhead is minimal, as these files are very small. Is the direct read even necessary, or is it acceptable to fetch the network as a file and load it back in at a later stage? Is file fetching desirable in all cases, or should it also work without storing a local file?
  3. I currently load the files from the CCDB as char* using .payload and then split off the actual file content and header data manually (see: dumpOnnxToFile in GPUWorkflowSpec.cxx). Is there a smarter way? Maybe a function that I can use instead of manually searching for a substring?
    All of this is tested and works for file fetching and loading into the reco workflow.

Please consider the following formatting changes to AliceO2Group#14841
@ktf
Member

ktf commented Nov 21, 2025

Please, no intermediate files.

@davidrohr
Collaborator

@ChSonnabend :

For 1.: I do not think this will cause a problem now, but I do not really like it, since it can have side effects. But this should be easy to fix. Let me think a bit. Perhaps we can even move the ReadConfigurableParam to the constructor, to call it only once. I will have a look when I am back. Otherwise, we can also access the settings directly from a local instance.

  2. Creating local files is not a good idea. E.g., it is not even guaranteed that the workdir is writeable. If you create local files, they should go to some tmpdir and have temporary names. E.g., if 2 workflows run in parallel in the same directory, they must not overwrite each other's files.
    In general, all calibration objects should go to the GPUCalibObjectsTemplate struct. So far I only allow members that derive from FlatObject or are POD structs. However, this is not enforced. I think if you add a ptr to a CCDB object here, it should not be a problem.

  3. You should get the data from the CCDB objects from within GPURecoWorkflowSpec::finaliseCCDBTPC(...). I assume you can get the ptrs in the same way as we do for the other objects. For sure you should not do a string search.
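
To illustrate the point above about temporary names in a tmpdir, here is a minimal sketch, assuming a POSIX system. The helper name dumpToUniqueTempFile and the nn_model_ prefix are hypothetical and not part of O2. mkstemp creates the file atomically under a unique name, so two workflows running in parallel cannot overwrite each other's files.

```cpp
#include <cstddef>
#include <cstdlib>    // mkstemp (POSIX)
#include <filesystem>
#include <stdexcept>
#include <string>
#include <unistd.h>   // write, close (POSIX)

// Hypothetical helper (not an O2 function): writes `size` bytes to a uniquely
// named file in the system temp directory and returns the path of that file.
std::string dumpToUniqueTempFile(const char* data, std::size_t size)
{
  // mkstemp replaces the trailing XXXXXX with a unique suffix.
  std::string path = (std::filesystem::temp_directory_path() / "nn_model_XXXXXX").string();
  int fd = mkstemp(path.data());
  if (fd < 0) {
    throw std::runtime_error("mkstemp failed");
  }
  std::size_t written = 0;
  while (written < size) {
    ssize_t n = write(fd, data + written, size - written);
    if (n < 0) {
      close(fd);
      std::filesystem::remove(path);
      throw std::runtime_error("write to temporary file failed");
    }
    written += static_cast<std::size_t>(n);
  }
  close(fd);
  return path;
}
```

The caller is responsible for removing the file once it is no longer needed.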

One point we have to think about though is: Do we want to allow changing the network during the processing? I.e., what happens when the time range of the CCDB object does not cover the entire run? In that case, we will receive a new CCDB object at some point while we are processing. That also means that the old CCDB object might be gone by then, so we must no longer access its memory.
If we do not want to allow this, we should initialize at the first time frame, and when we receive a different object at a later time frame, we should throw a fatal indicating that this is not supported.
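
The "initialize once, reject later changes" option could look roughly like the sketch below. It is only an illustration: the class name FixedNetworkGuard is hypothetical, std::runtime_error stands in for the fatal log, and comparing the raw bytes is an assumption about how "a different object" would be detected. The guard keeps its own copy of the payload, so it does not depend on the lifetime of the buffer it was given.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical guard (not an O2 class): stores the network payload on the
// first call and rejects any later payload that differs from the stored one.
class FixedNetworkGuard
{
 public:
  // Returns true if the payload was stored (first call) and false if the
  // identical payload was delivered again. Throws if the payload changed.
  bool update(const char* data, std::size_t size)
  {
    if (!mInitialized) {
      mPayload.assign(data, data + size); // own a copy of the bytes
      mInitialized = true;
      return true;
    }
    const bool same = size == mPayload.size() && std::equal(data, data + size, mPayload.begin());
    if (!same) {
      // Stand-in for a fatal error in the real workflow.
      throw std::runtime_error("network object changed during processing: not supported");
    }
    return false;
  }

  const std::vector<char>& payload() const { return mPayload; }

 private:
  bool mInitialized = false;
  std::vector<char> mPayload;
};
```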

Finally, GPU reconstruction supports processing 2 time frames at the same time to hide GPU transfer delays. For this reason, I create a copy of all the calib objects, so when the second TF arrives, I still have the old object for the first TF. This is a bit hacky. And if we do not need to support changing the NN objects during a run, we do not need this functionality for the NN objects.
But this updating of objects is the reason that we have all these additional buffers like mdEdxCalibContainerBufferNew in finaliseCCDBTPC. If we really do not need this, it should be OK to directly store the pointers we are getting from the CCDB fetching to mCalibObjects (the GPUCalibObjectsTemplate struct), without creating a copy.
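
For readers unfamiliar with the problem, the sketch below shows the general idea of keeping the old object alive for a time frame that is still in flight. It is not how finaliseCCDBTPC does it (that code uses explicit "New" buffers and copies). The names CalibObject and CalibHolder are hypothetical, and std::shared_ptr is an assumption chosen only to keep the example short.

```cpp
#include <memory>
#include <string>
#include <utility>

// Stand-in for a calibration object; the real objects are far more complex.
struct CalibObject {
  std::string version;
};

// Hypothetical holder: each time frame takes a snapshot of the current object.
// An update replaces the current object, but a snapshot taken earlier keeps
// the old object alive until the time frame using it has finished.
class CalibHolder
{
 public:
  void update(CalibObject obj)
  {
    mCurrent = std::make_shared<CalibObject>(std::move(obj));
  }

  std::shared_ptr<const CalibObject> snapshot() const { return mCurrent; }

 private:
  std::shared_ptr<const CalibObject> mCurrent;
};
```

If the NN objects never change during a run, none of this is needed and the pointer can be stored directly, as described above.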

@alibuild
Collaborator

alibuild commented Nov 22, 2025

Error while checking build/O2/fullCI_slc9 for 125f3e2 at 2025-11-22 16:23:

## sw/BUILD/O2-latest/log
CMake Error at /sw/slc9_x86-64/CMake/v3.31.6-6/share/cmake-3.31/Modules/FindPackageHandleStandardArgs.cmake:233 (message):

Full log here.

@ChSonnabend
Collaborator Author

The new version now uses the internal char* buffer to load the model. Loading is successful in the clusterization task, however it fails at some internal stage within ONNX runtime (although model loading from file works and the buffer and size that I pass are correct). I am not sure yet what the problem is. Investigating.

Otherwise, is this solution feasible? Dumping to file is implemented and can be switched on or off on demand.

@alibuild
Collaborator

Error while checking build/O2/fullCI_slc9 for 5ce258c at 2025-11-23 12:34:

## sw/BUILD/O2-latest/log
CMake Error at /sw/slc9_x86-64/CMake/v3.31.6-6/share/cmake-3.31/Modules/FindPackageHandleStandardArgs.cmake:233 (message):

Full log here.

@alibuild
Collaborator

Error while checking build/O2/fullCI_slc9 for 4fed621 at 2025-11-25 01:09:

## sw/BUILD/O2Physics-latest/log
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState: (line repeated 48 times)
[0 more errors; see full log]

Full log here.

@ChSonnabend
Collaborator Author

Open points:

  • Specifying different metadata fetches the same object at runtime (@ktf)
  • *mConfParam = mConfig->ReadConfigurableParam(); needs to be called in order to have the settings available (@davidrohr)
  • Is there a better place to put O2/GPU/GPUTracking/utils/convert_onnx_to_root_serialized.C (a utility script that makes the fetching and uploading easier)?

@alibuild
Collaborator

Error while checking build/O2/fullCI_slc9 for 6cba1f3 at 2025-11-25 20:23:

## sw/BUILD/O2-latest/log
CMake Error at cmake/O2ReportNonTestedMacros.cmake:89 (message):

Full log here.

@ChSonnabend ChSonnabend requested a review from a team as a code owner November 25, 2025 20:05
@alibuild
Collaborator

Error while checking build/O2/fullCI_slc9 for 979a8d5 at 2025-11-27 03:50:

## sw/BUILD/O2Physics-latest/log
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState: (line repeated 48 times)
[0 more errors; see full log]

Full log here.

@davidrohr davidrohr merged commit 1d620f2 into AliceO2Group:dev Nov 28, 2025
9 of 11 checks passed

4 participants