Skip to content

Conversation

@ChSonnabend
Copy link
Collaborator

@ChSonnabend ChSonnabend commented Jul 21, 2025

Adding deterministic mode: Unfortunately this seems to be a general problem with NN frameworks: Determinism cannot be guaranteed on GPU. Variation was tested to using FP16 networks with ~5x10e-5% variation for classification networks on FST with ~3e8 digit peaks. Tested with float32 model and switching off optimisations, but also this is not deterministic

@ChSonnabend ChSonnabend requested a review from davidrohr as a code owner July 21, 2025 09:35
@github-actions
Copy link
Contributor

REQUEST FOR PRODUCTION RELEASES:
To request your PR to be included in production software, please add the corresponding labels called "async-" to your PR. Add the labels directly (if you have the permissions) or add a comment of the form (note that labels are separated by a ",")

+async-label <label1>, <label2>, !<label3> ...

This will add <label1> and <label2> and removes <label3>.

The following labels are available
async-2023-pbpb-apass4
async-2023-pp-apass4
async-2024-pp-apass1
async-2022-pp-apass7
async-2024-pp-cpass0
async-2024-PbPb-apass1
async-2024-ppRef-apass1
async-2024-PbPb-apass2
async-2023-PbPb-apass5

Please consider the following formatting changes to AliceO2Group#14530
@alibuild
Copy link
Collaborator

Error while checking build/O2/fullCI_slc9 for 05a64bb at 2025-07-21 11:59:

## sw/BUILD/O2-latest/log
/sw/SOURCES/O2/14530-slc9_x86-64/0/GPU/GPUTracking/Global/GPUChainTrackingClusterizer.cxx:659:17: error: unused variable 'nnFillInputTimers' [-Werror=unused-variable]
/sw/SOURCES/O2/14530-slc9_x86-64/0/GPU/GPUTracking/Global/GPUChainTrackingClusterizer.cxx:665:17: error: unused variable 'nnPublishingTimers' [-Werror=unused-variable]
make[2]: *** [GPU/GPUTracking/CMakeFiles/GPUTracking.dir/build.make:1245: GPU/GPUTracking/CMakeFiles/GPUTracking.dir/Global/GPUChainTrackingClusterizer.cxx.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:606: GPU/GPUTracking/CMakeFiles/GPUTracking.dir/all] Error 2
make: *** [Makefile:136: all] Error 2

Full log here.

@ChSonnabend ChSonnabend changed the title NN clusterizer: Bug-fixes and beautifications [WIP] NN clusterizer: Bug-fixes and beautifications Jul 23, 2025
@ChSonnabend ChSonnabend requested a review from a team as a code owner July 25, 2025 14:14
@ChSonnabend ChSonnabend changed the title [WIP] NN clusterizer: Bug-fixes and beautifications NN clusterizer: Bug-fixes and adding deterministic mode Jul 25, 2025
Please consider the following formatting changes to AliceO2Group#14530
@ChSonnabend
Copy link
Collaborator Author

@davidrohr Ready to merge. After being added to the people who can rerun the CI, I might also have the option to merge: I now have the button "Enable auto-merge (squash)". Still want to wait for your approval in any case, but only changed things for the NN clusterizer.

GPUTPCNNClusterizerHost nnApplications[GetProcessingSettings().nTPCClustererLanes];

// Maximum of 4 lanes supported
HighResTimer* nnTimers[12] = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you obtain the timers only if

if (GetProcessingSettings().nn.applyNNclusterizer && GetProcessingSettings().debugLevel >= 1)

? You can still declare them unconditionally, but fill the array with valid timers only in an if condition.

@davidrohr
Copy link
Collaborator

@davidrohr Ready to merge. After being added to the people who can rerun the CI, I might also have the option to merge: I now have the button "Enable auto-merge (squash)". Still want to wait for your approval in any case, but only changed things for the NN clusterizer.

If you click enable auto merge, it will merge once you have the approvals. That is totally fine.

@ChSonnabend ChSonnabend enabled auto-merge (squash) July 27, 2025 22:26
@ChSonnabend ChSonnabend requested a review from davidrohr July 28, 2025 07:50
mEnableProfiling = (optionsMap.contains("enable-profiling") ? std::stoi(optionsMap["enable-profiling"]) : 0);
mEnableOptimizations = (optionsMap.contains("enable-optimizations") ? std::stoi(optionsMap["enable-optimizations"]) : 0);
mEnvName = (optionsMap.contains("onnx-environment-name") ? optionsMap["onnx-environment-name"] : "onnx_model_inference");
mDeterministicMode = (optionsMap.contains("deterministic-compute") ? std::stoi(optionsMap["deterministic-compute"]) : 0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't you use the existing flag for the deterministic mode? GPU_proc.deterministicGPUReconstruction

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because from what I could see it actually doesn't do what it's supposed to do (it's not deterministic even when setting that flag to 1, also reported like this in several other forum entries). On the other hand one can switch it on like this manually. But I can make it an or-statement to set it to 1 if GPU_proc.deterministicGPUReconstruction

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Did in the clusterizer code though, not directly in the OrtInterface class

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the GPU_proc.deterministicGPUReconstruction should guarantee that we get deterministic results. If we are not there yet in the NN clusterization then we have to work to understand what is non-deterministic.
But anything that is needed for deterministic results should automatically be enabled by this flag.
If you want to have more fine grained settings to enable partial deterministic stuff, you can add them in the NN settings. But the global flag should automatically enable all of them.

Please consider the following formatting changes to AliceO2Group#14530
@ChSonnabend ChSonnabend requested a review from davidrohr July 28, 2025 12:45
@davidrohr davidrohr disabled auto-merge July 28, 2025 14:01
@davidrohr davidrohr merged commit 69f6737 into AliceO2Group:dev Jul 28, 2025
14 checks passed
mhemmer-cern pushed a commit to mhemmer-cern/AliceO2 that referenced this pull request Sep 9, 2025
…#14530)

* Adding first version of kernel timers

* Removing GPU_CONFIG_KEY from dpl-workflow.sh to set my own values

* Bug fixes

* undoing changes in dpl-workflow.sh

* Furhter fixes and beautifications

* Please consider the following formatting changes

* Removing unused timers

* Moving Stop() of classification timer

* Adding force method to fill input like it is done on GPU

* Removing unnecessary static asserts

* Adding deterministic mode (unfortunately that did not make it deterministic on GPU -> general problem with ONNX)

* Please consider the following formatting changes

* Adjusting for comment

* Adding deterministic mode

* Please consider the following formatting changes

---------

Co-authored-by: ALICE Action Bot <alibuild@cern.ch>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants