-
Notifications
You must be signed in to change notification settings - Fork 483
NN clusterizer: Bug-fixes and adding deterministic mode #14530
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
REQUEST FOR PRODUCTION RELEASES: This will add The following labels are available |
Please consider the following formatting changes to AliceO2Group#14530
|
Error while checking build/O2/fullCI_slc9 for 05a64bb at 2025-07-21 11:59: Full log here. |
…nistic on GPU -> general problem with ONNX)
Please consider the following formatting changes to AliceO2Group#14530
|
@davidrohr Ready to merge. After being added to the people who can rerun the CI, I might also have the option to merge: I now have the button "Enable auto-merge (squash)". Still want to wait for your approval in any case, but only changed things for the NN clusterizer. |
| GPUTPCNNClusterizerHost nnApplications[GetProcessingSettings().nTPCClustererLanes]; | ||
|
|
||
| // Maximum of 4 lanes supported | ||
| HighResTimer* nnTimers[12] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you obtain the timers only if
if (GetProcessingSettings().nn.applyNNclusterizer && GetProcessingSettings().debugLevel >= 1)
? You can still declare them unconditionally, but fill the array with valid timers only in an if condition.
If you click enable auto merge, it will merge once you have the approvals. That is totally fine. |
| mEnableProfiling = (optionsMap.contains("enable-profiling") ? std::stoi(optionsMap["enable-profiling"]) : 0); | ||
| mEnableOptimizations = (optionsMap.contains("enable-optimizations") ? std::stoi(optionsMap["enable-optimizations"]) : 0); | ||
| mEnvName = (optionsMap.contains("onnx-environment-name") ? optionsMap["onnx-environment-name"] : "onnx_model_inference"); | ||
| mDeterministicMode = (optionsMap.contains("deterministic-compute") ? std::stoi(optionsMap["deterministic-compute"]) : 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't you use the existing flag for the deterministic mode? GPU_proc.deterministicGPUReconstruction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because from what I could see it actually doesn't do what it's supposed to do (it's not deterministic even when setting that flag to 1, also reported like this in several other forum entries). On the other hand one can switch it on like this manually. But I can make it an or-statement to set it to 1 if GPU_proc.deterministicGPUReconstruction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Did in the clusterizer code though, not directly in the OrtInterface class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the GPU_proc.deterministicGPUReconstruction should guarantee that we get deterministic results. If we are not there yet in the NN clusterization then we have to work to understand what is non-deterministic.
But anything that is needed for deterministic results should automatically be enabled by this flag.
If you want to have more fine grained settings to enable partial deterministic stuff, you can add them in the NN settings. But the global flag should automatically enable all of them.
Please consider the following formatting changes to AliceO2Group#14530
…#14530) * Adding first version of kernel timers * Removing GPU_CONFIG_KEY from dpl-workflow.sh to set my own values * Bug fixes * undoing changes in dpl-workflow.sh * Furhter fixes and beautifications * Please consider the following formatting changes * Removing unused timers * Moving Stop() of classification timer * Adding force method to fill input like it is done on GPU * Removing unnecessary static asserts * Adding deterministic mode (unfortunately that did not make it deterministic on GPU -> general problem with ONNX) * Please consider the following formatting changes * Adjusting for comment * Adding deterministic mode * Please consider the following formatting changes --------- Co-authored-by: ALICE Action Bot <alibuild@cern.ch>
Adding deterministic mode: Unfortunately this seems to be a general problem with NN frameworks: Determinism cannot be guaranteed on GPU. Variation was tested to using FP16 networks with ~5x10e-5% variation for classification networks on FST with ~3e8 digit peaks. Tested with float32 model and switching off optimisations, but also this is not deterministic