ITS: GPU: better use of streams #14563
Signed-off-by: Felix Schlepper <felix.schlepper@cern.ch>
"only" 15% seems very good to me! Is it on pp or Pb-Pb?

35 kHz Pb-Pb. In principle yes, 15% is nice indeed, but I think with somewhat better handling of the scheduling this can go to 30%.

@f3sch, I get

@shahor02, thanks. What is your local CUDA & toolkit version? The CI builds with CUDA 12.9.86.

@f3sch thanks, indeed, my CUDA is older, will try to update:

@shahor02 sure, let me know if you don't succeed. I think for now we can also simply put an #ifdef around this section, since it is not performance critical. I only changed it in preparation for something else.

Makes better use of multiple streams; this also works in deterministic mode. The overall gain on 100 TFs is only 15% of the total time, because the total is dominated by the track fitting, which was not touched; the speedup of the individual modified steps is larger.
One other problem I observed: after the counting phase, the required cudaMallocAsync calls still force the insert phase to be essentially sequential, since they are blocking (the same holds for the temporary allocations when calculating the LUTs). If this can be resolved, the speedup would double.
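For illustration, the count-then-insert pattern described above can be sketched roughly as follows. This is a minimal sketch and not the code of this PR: `countKernel`, `insertKernel`, `countThenInsert`, and the "keep even values" selection are hypothetical stand-ins, and one stream per layer is an assumption. It also raises the release threshold of the device's default memory pool, which by default is zero, so freed memory is handed back to the OS at the next synchronization and later `cudaMallocAsync` calls have to obtain it again. Whether this removes the serialization observed here is untested.

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

// Hypothetical counting kernel: counts the even entries of `in`.
__global__ void countKernel(const int* in, int n, int* count)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n && in[i] % 2 == 0) {
    atomicAdd(count, 1);
  }
}

// Hypothetical insert kernel: writes the even entries into `out` (unordered).
__global__ void insertKernel(const int* in, int n, int* out, int* cursor)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n && in[i] % 2 == 0) {
    out[atomicAdd(cursor, 1)] = in[i];
  }
}

// Runs count + insert for `nLayers` independent inputs, one stream each.
// Returns the total number of inserted entries, or -1 on error.
long countThenInsert(int nLayers, int n)
{
  if (nLayers <= 0 || n <= 0) {
    return -1;
  }
  // Keep freed memory cached in the default pool instead of returning it to the OS.
  cudaMemPool_t pool;
  if (cudaDeviceGetDefaultMemPool(&pool, 0) != cudaSuccess) {
    return -1;
  }
  uint64_t threshold = UINT64_MAX;
  cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &threshold);

  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  std::vector<int> hIn(n);
  for (int i = 0; i < n; ++i) {
    hIn[i] = i;
  }
  int* hCounts = nullptr; // pinned, so the device-to-host copies can be asynchronous
  cudaMallocHost((void**)&hCounts, nLayers * sizeof(int));
  std::vector<cudaStream_t> streams(nLayers);
  std::vector<int*> in(nLayers), counters(nLayers), out(nLayers);

  // Counting phase: everything is enqueued per stream, no host-side wait.
  for (int l = 0; l < nLayers; ++l) {
    cudaStreamCreate(&streams[l]);
    cudaMallocAsync((void**)&in[l], n * sizeof(int), streams[l]);
    cudaMallocAsync((void**)&counters[l], 2 * sizeof(int), streams[l]); // [0] count, [1] cursor
    // hIn is pageable, so this copy may be staged synchronously by the runtime.
    cudaMemcpyAsync(in[l], hIn.data(), n * sizeof(int), cudaMemcpyHostToDevice, streams[l]);
    cudaMemsetAsync(counters[l], 0, 2 * sizeof(int), streams[l]);
    countKernel<<<blocks, threads, 0, streams[l]>>>(in[l], n, counters[l]);
    cudaMemcpyAsync(&hCounts[l], counters[l], sizeof(int), cudaMemcpyDeviceToHost, streams[l]);
  }

  // Insert phase: the host waits only for the count of the layer it sizes next.
  long total = 0;
  for (int l = 0; l < nLayers; ++l) {
    cudaStreamSynchronize(streams[l]);
    const size_t bytes = (hCounts[l] > 0 ? hCounts[l] : 1) * sizeof(int);
    cudaMallocAsync((void**)&out[l], bytes, streams[l]);
    insertKernel<<<blocks, threads, 0, streams[l]>>>(in[l], n, out[l], counters[l] + 1);
    total += hCounts[l];
  }

  for (int l = 0; l < nLayers; ++l) {
    cudaFreeAsync(out[l], streams[l]);
    cudaFreeAsync(in[l], streams[l]);
    cudaFreeAsync(counters[l], streams[l]);
    cudaStreamSynchronize(streams[l]);
    cudaStreamDestroy(streams[l]);
  }
  cudaFreeHost(hCounts);
  return cudaGetLastError() == cudaSuccess ? total : -1;
}
```

With the input filled as `0 .. n-1`, `countThenInsert(4, 1000)` counts 500 even values per layer and returns 2000 on a device that supports the stream-ordered allocator. The host-side `cudaStreamSynchronize` before each sized allocation is the part that remains inherently ordered in this sketch; sizing the output from an upper bound instead of the exact count would be one way to avoid it, at the cost of extra memory.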