NN clusterizer: Bug-fixes and adding deterministic mode #14530

ChSonnabend · 2025-07-21T09:35:44Z

Adding deterministic mode: Unfortunately this seems to be a general problem with NN frameworks: Determinism cannot be guaranteed on GPU. Variation was tested to using FP16 networks with ~5x10e-5% variation for classification networks on FST with ~3e8 digit peaks. Tested with float32 model and switching off optimisations, but also this is not deterministic

github-actions · 2025-07-21T09:35:54Z

REQUEST FOR PRODUCTION RELEASES:
To request your PR to be included in production software, please add the corresponding labels called "async-" to your PR. Add the labels directly (if you have the permissions) or add a comment of the form (note that labels are separated by a ",")

+async-label <label1>, <label2>, !<label3> ...

This will add <label1> and <label2> and removes <label3>.

The following labels are available
async-2023-pbpb-apass4
async-2023-pp-apass4
async-2024-pp-apass1
async-2022-pp-apass7
async-2024-pp-cpass0
async-2024-PbPb-apass1
async-2024-ppRef-apass1
async-2024-PbPb-apass2
async-2023-PbPb-apass5

Please consider the following formatting changes to AliceO2Group#14530

alibuild · 2025-07-21T09:59:50Z

Error while checking build/O2/fullCI_slc9 for 05a64bb at 2025-07-21 11:59:

## sw/BUILD/O2-latest/log
/sw/SOURCES/O2/14530-slc9_x86-64/0/GPU/GPUTracking/Global/GPUChainTrackingClusterizer.cxx:659:17: error: unused variable 'nnFillInputTimers' [-Werror=unused-variable]
/sw/SOURCES/O2/14530-slc9_x86-64/0/GPU/GPUTracking/Global/GPUChainTrackingClusterizer.cxx:665:17: error: unused variable 'nnPublishingTimers' [-Werror=unused-variable]
make[2]: *** [GPU/GPUTracking/CMakeFiles/GPUTracking.dir/build.make:1245: GPU/GPUTracking/CMakeFiles/GPUTracking.dir/Global/GPUChainTrackingClusterizer.cxx.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:606: GPU/GPUTracking/CMakeFiles/GPUTracking.dir/all] Error 2
make: *** [Makefile:136: all] Error 2

Full log here.

…nistic on GPU -> general problem with ONNX)

Please consider the following formatting changes to AliceO2Group#14530

ChSonnabend · 2025-07-26T23:31:45Z

@davidrohr Ready to merge. After being added to the people who can rerun the CI, I might also have the option to merge: I now have the button "Enable auto-merge (squash)". Still want to wait for your approval in any case, but only changed things for the NN clusterizer.

davidrohr · 2025-07-27T17:47:44Z

GPU/GPUTracking/Global/GPUChainTrackingClusterizer.cxx

  GPUTPCNNClusterizerHost nnApplications[GetProcessingSettings().nTPCClustererLanes];

+  // Maximum of 4 lanes supported
+  HighResTimer* nnTimers[12] = {


Could you obtain the timers only if

if (GetProcessingSettings().nn.applyNNclusterizer && GetProcessingSettings().debugLevel >= 1)

? You can still declare them unconditionally, but fill the array with valid timers only in an if condition.

davidrohr · 2025-07-27T17:49:07Z

@davidrohr Ready to merge. After being added to the people who can rerun the CI, I might also have the option to merge: I now have the button "Enable auto-merge (squash)". Still want to wait for your approval in any case, but only changed things for the NN clusterizer.

If you click enable auto merge, it will merge once you have the approvals. That is totally fine.

davidrohr · 2025-07-28T08:01:46Z

Common/ML/src/OrtInterface.cxx

    mEnableProfiling = (optionsMap.contains("enable-profiling") ? std::stoi(optionsMap["enable-profiling"]) : 0);
    mEnableOptimizations = (optionsMap.contains("enable-optimizations") ? std::stoi(optionsMap["enable-optimizations"]) : 0);
    mEnvName = (optionsMap.contains("onnx-environment-name") ? optionsMap["onnx-environment-name"] : "onnx_model_inference");
+    mDeterministicMode = (optionsMap.contains("deterministic-compute") ? std::stoi(optionsMap["deterministic-compute"]) : 0);


Why don't you use the existing flag for the deterministic mode? GPU_proc.deterministicGPUReconstruction

Because from what I could see it actually doesn't do what it's supposed to do (it's not deterministic even when setting that flag to 1, also reported like this in several other forum entries). On the other hand one can switch it on like this manually. But I can make it an or-statement to set it to 1 if GPU_proc.deterministicGPUReconstruction

Done. Did in the clusterizer code though, not directly in the OrtInterface class

Well, the GPU_proc.deterministicGPUReconstruction should guarantee that we get deterministic results. If we are not there yet in the NN clusterization then we have to work to understand what is non-deterministic.
But anything that is needed for deterministic results should automatically be enabled by this flag.
If you want to have more fine grained settings to enable partial deterministic stuff, you can add them in the NN settings. But the global flag should automatically enable all of them.

Please consider the following formatting changes to AliceO2Group#14530

…#14530) * Adding first version of kernel timers * Removing GPU_CONFIG_KEY from dpl-workflow.sh to set my own values * Bug fixes * undoing changes in dpl-workflow.sh * Furhter fixes and beautifications * Please consider the following formatting changes * Removing unused timers * Moving Stop() of classification timer * Adding force method to fill input like it is done on GPU * Removing unnecessary static asserts * Adding deterministic mode (unfortunately that did not make it deterministic on GPU -> general problem with ONNX) * Please consider the following formatting changes * Adjusting for comment * Adding deterministic mode * Please consider the following formatting changes --------- Co-authored-by: ALICE Action Bot <alibuild@cern.ch>

ChSonnabend added 9 commits June 28, 2025 20:17

Adding first version of kernel timers

5ef448c

Merge branch 'dev' into onnx_gpu_timer

520de51

Removing GPU_CONFIG_KEY from dpl-workflow.sh to set my own values

fcf46d9

Merge branch 'dev' into onnx_gpu_timer

f235071

Merge branch 'dev' into onnx_gpu_timer

4d64371

Merge branch 'dev' into onnx_gpu_timer

a221eb4

Bug fixes

7f70002

undoing changes in dpl-workflow.sh

ff34a9d

Furhter fixes and beautifications

05a64bb

ChSonnabend requested a review from davidrohr as a code owner July 21, 2025 09:35

Please consider the following formatting changes

8f06331

alibuild mentioned this pull request Jul 21, 2025

Please consider the following formatting changes to #14530 ChSonnabend/AliceO2#30

Merged

Merge pull request #30 from alibuild/alibot-cleanup-14530

27e6ead

Please consider the following formatting changes to AliceO2Group#14530

ChSonnabend added 4 commits July 21, 2025 12:52

Removing unused timers

152f459

Moving Stop() of classification timer

5801e3a

Adding force method to fill input like it is done on GPU

f4dcbaa

Removing unnecessary static asserts

e6482ab

ChSonnabend changed the title ~~NN clusterizer: Bug-fixes and beautifications~~ [WIP] NN clusterizer: Bug-fixes and beautifications Jul 23, 2025

Adding deterministic mode (unfortunately that did not make it determi…

f2d2b86

…nistic on GPU -> general problem with ONNX)

ChSonnabend requested a review from a team as a code owner July 25, 2025 14:14

ChSonnabend changed the title ~~[WIP] NN clusterizer: Bug-fixes and beautifications~~ NN clusterizer: Bug-fixes and adding deterministic mode Jul 25, 2025

Please consider the following formatting changes

f03fdc3

alibuild mentioned this pull request Jul 25, 2025

Please consider the following formatting changes to #14530 ChSonnabend/AliceO2#31

Merged

Merge pull request #31 from alibuild/alibot-cleanup-14530

38c3856

Please consider the following formatting changes to AliceO2Group#14530

davidrohr reviewed Jul 27, 2025

View reviewed changes

Adjusting for comment

52235a3

ChSonnabend enabled auto-merge (squash) July 27, 2025 22:26

ChSonnabend requested a review from davidrohr July 28, 2025 07:50

davidrohr reviewed Jul 28, 2025

View reviewed changes

ChSonnabend and others added 2 commits July 28, 2025 11:23

Adding deterministic mode

8c87d37

Please consider the following formatting changes

f8139a7

alibuild mentioned this pull request Jul 28, 2025

Please consider the following formatting changes to #14530 ChSonnabend/AliceO2#32

Merged

Merge pull request #32 from alibuild/alibot-cleanup-14530

7bec287

Please consider the following formatting changes to AliceO2Group#14530

ChSonnabend requested a review from davidrohr July 28, 2025 12:45

davidrohr approved these changes Jul 28, 2025

View reviewed changes

davidrohr disabled auto-merge July 28, 2025 14:01

davidrohr merged commit 69f6737 into AliceO2Group:dev Jul 28, 2025
14 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

NN clusterizer: Bug-fixes and adding deterministic mode #14530

NN clusterizer: Bug-fixes and adding deterministic mode #14530

Uh oh!

ChSonnabend commented Jul 21, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Jul 21, 2025

Uh oh!

alibuild commented Jul 21, 2025

Uh oh!

ChSonnabend commented Jul 26, 2025

Uh oh!

davidrohr Jul 27, 2025

Uh oh!

davidrohr commented Jul 27, 2025

Uh oh!

davidrohr Jul 28, 2025

Uh oh!

ChSonnabend Jul 28, 2025

Uh oh!

ChSonnabend Jul 28, 2025

Uh oh!

davidrohr Jul 28, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

3 participants

NN clusterizer: Bug-fixes and adding deterministic mode #14530

NN clusterizer: Bug-fixes and adding deterministic mode #14530

Uh oh!

Conversation

ChSonnabend commented Jul 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Jul 21, 2025

Uh oh!

alibuild commented Jul 21, 2025

Uh oh!

ChSonnabend commented Jul 26, 2025

Uh oh!

davidrohr Jul 27, 2025

Choose a reason for hiding this comment

Uh oh!

davidrohr commented Jul 27, 2025

Uh oh!

davidrohr Jul 28, 2025

Choose a reason for hiding this comment

Uh oh!

ChSonnabend Jul 28, 2025

Choose a reason for hiding this comment

Uh oh!

ChSonnabend Jul 28, 2025

Choose a reason for hiding this comment

Uh oh!

davidrohr Jul 28, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

3 participants

ChSonnabend commented Jul 21, 2025 •

edited

Loading