ITS: GPU: better use of streams #14563

f3sch · 2025-08-08T11:02:29Z

Makes better use of multiple streams and also works in deterministic mode. The overall gain on 100 TFs is only 15% of the total time, which is dominated by the track fitting (not touched here), but the speedup of the individual steps is larger.

Total time in its_time_benchmarks_old.txt: 108960.237939
Total time in its_time_benchmarks_new.txt: 91904.873816

  Difference (file2 - file1): -17055.364123
  Relative change: -15.65%

Metric: cell_finding
  its_time_benchmarks_old.txt: 5229.111419 (4.80% of total)
  its_time_benchmarks_new.txt: 3776.888696 (4.11% of total)
  Difference (file2 - file1): -1452.222723
  Relative change: -27.77%

Metric: neighbour_finding
  its_time_benchmarks_old.txt: 15058.566217 (13.82% of total)
  its_time_benchmarks_new.txt: 13635.751449 (14.84% of total)
  Difference (file2 - file1): -1422.814768
  Relative change: -9.45%

Metric: road_finding
  its_time_benchmarks_old.txt: 54911.194296 (50.40% of total)
  its_time_benchmarks_new.txt: 53212.793432 (57.90% of total)
  Difference (file2 - file1): -1698.400864
  Relative change: -3.09%

Metric: timeframe_initialisation
  its_time_benchmarks_old.txt: 1323.038784 (1.21% of total)
  its_time_benchmarks_new.txt: 1205.854911 (1.31% of total)
  Difference (file2 - file1): -117.183873
  Relative change: -8.86%

Metric: tracklet_finding
  its_time_benchmarks_old.txt: 32438.327223 (29.77% of total)
  its_time_benchmarks_new.txt: 20073.585328 (21.84% of total)
  Difference (file2 - file1): -12364.741895
  Relative change: -38.12%

One other problem I observed: after the counting phase, the required cudaMallocAsync calls still force the insert phase to be essentially sequential, since they are blocking (the same holds for the temporary allocations when calculating the LUTs). If this can be resolved, the speedup would double.
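To make the dependency concrete, here is a host-only C++ sketch of the usual count, scan, allocate, insert pattern. It is not the PR's GPU code: `countOutputs`, `Compacted` and `countScanInsert` are hypothetical names, and the per-item rule is invented. The point is that the output buffer in step 3 can only be sized once step 1 has finished, which is where the description above places the blocking allocation.

```cpp
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical stand-in for a per-item kernel: item i yields (i % 3) outputs.
std::size_t countOutputs(std::size_t item) { return item % 3; }

struct Compacted {
  std::vector<std::size_t> offsets;  // offsets[i] = first slot of item i, back() = total
  std::vector<int> values;           // tightly packed outputs
};

Compacted countScanInsert(std::size_t nItems) {
  // 1) Counting phase: how many outputs does each item produce?
  std::vector<std::size_t> counts(nItems);
  for (std::size_t i = 0; i < nItems; ++i) {
    counts[i] = countOutputs(i);
  }

  // 2) Prefix sum turns the counts into write offsets (leading 0 kept).
  std::vector<std::size_t> offsets(nItems + 1, 0);
  std::partial_sum(counts.begin(), counts.end(), offsets.begin() + 1);

  // 3) Allocation: the size is only known after steps 1 and 2.
  std::vector<int> values(offsets.back());

  // 4) Insert phase: every item writes into its own, disjoint slot range.
  for (std::size_t i = 0; i < nItems; ++i) {
    for (std::size_t k = 0; k < counts[i]; ++k) {
      values[offsets[i] + k] = static_cast<int>(i);
    }
  }
  return {std::move(offsets), std::move(values)};
}
```

Steps 1 and 4 are independent across items, so they parallelise well; step 3 sits between them and, if the allocation blocks, serialises whatever follows it.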

Signed-off-by: Felix Schlepper <felix.schlepper@cern.ch>

github-actions · 2025-08-08T11:02:37Z

REQUEST FOR PRODUCTION RELEASES:
To request your PR to be included in production software, please add the corresponding labels called "async-" to your PR. Add the labels directly (if you have the permissions) or add a comment of the form (note that labels are separated by a ",")

+async-label <label1>, <label2>, !<label3> ...

This will add <label1> and <label2> and removes <label3>.

The following labels are available
async-2023-pbpb-apass4
async-2023-pp-apass4
async-2024-pp-apass1
async-2022-pp-apass7
async-2024-pp-cpass0
async-2024-PbPb-apass1
async-2024-ppRef-apass1
async-2024-PbPb-apass2
async-2023-PbPb-apass5

mconcas · 2025-08-09T11:50:15Z

"only" 15% seems very good to me! Is it on pp or Pb-Pb?

f3sch · 2025-08-09T11:52:23Z

35 kHz Pb-Pb. In principle, yes, 15% is nice indeed, but I think with slightly better handling of the scheduling this can go to 30%.

f3sch · 2025-08-09T11:56:00Z

You can see that the counting phase indeed works nicely on different streams but then with the memory allocations, the compute phase is essentially sequential again.

shahor02 · 2025-08-09T23:23:37Z

@f3sch, I get

FAILED: Detectors/ITSMFT/ITS/tracking/GPU/cuda/CMakeFiles/O2lib-ITStrackingCUDA.dir/TrackingKernels.cu.o 
/usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DBOOST_ASIO_HAS_HAS_STD_CHRONO -DBOOST_ATOMIC_DYN_LINK -DBOOST_ATOMIC_NO_LIB -DBOOST_CONTAINER_DYN_LINK -DBOOST_CONTAINER_NO_LIB -DBOOST_ERROR_CODE_HEADER_ONLY -DBOOST_FILESYSTEM_DYN_LINK -DBOOST_FILESYSTEM_NO_LIB -DBOOST_IOSTREAMS_DYN_LINK -DBOOST_IOSTREAMS_NO_LIB -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_PROGRAM_OPTIONS_NO_LIB -DBOOST_REGEX_DYN_LINK -DBOOST_REGEX_NO_LIB -DDPL_ENABLE_BACKTRACE -DENABLE_UPGRADES -DFAIRMQ_HAS_STD_FILESYSTEM=1 -DFAIRMQ_HAS_STD_PMR=1 -DFMT_SHARED -DGENERATORS_WITH_HEPMC3 -DGENERATORS_WITH_PYTHIA8 -DGPUCA_GPUTYPE_AMPERE -DO2lib_ITStrackingCUDA_EXPORTS -DRANS_ENABLE_JSON -DRANS_OPENMP -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/ITSMFT/ITS/tracking/GPU -I/home/shahoian/alice/sw/BUILD/O2-latest/O2/Detectors/ITSMFT/ITS/tracking/GPU/cuda -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/ITSMFT/ITS/tracking/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/GPU/Common -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Framework/Logger/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Common/Constants/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/DataFormats/Detectors/ITSMFT/common/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/ITSMFT/common/base/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Common/MathUtils/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/DataFormats/Detectors/Common/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/GPU/Utils -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/DataFormats/Headers/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/DataFormats/MemoryResources/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Utilities/rANS/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Framework/Core/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/DataFormats/simulation/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Framework/Foundation/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Framework/Foundation/3rdparty/include 
-I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Utilities/PCG/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Framework/Foundation/3rdparty/x9 -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Common/Utils/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/DataFormats/common/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Common/SimConfig/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/DataFormats/Parameters/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Common/Types/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/CCDB/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/DataFormats/Calibration/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/DataFormats/Reconstruction/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Common/Field/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/DataFormats/Detectors/CTP/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/ITSMFT/ITS/base/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/Base/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/GPU/GPUTracking -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/GPU/GPUTracking/Definitions -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/GPU/GPUTracking/DataTypes -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/GPU/TPCFastTransformation -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/GPU/TPCFastTransformation/devtools -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/TPC/spacecharge/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/TPC/base/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/DataFormats/Detectors/TPC/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Utilities/DataSampling/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Algorithm/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/Raw/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Framework/Utils/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/Upgrades/ITS3/base/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/ITSMFT/ITS/reconstruction/include 
-I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/ITSMFT/common/reconstruction/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/DataFormats/Detectors/ITSMFT/ITS/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/fmt/11.1.2-4/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/FairLogger/v2.1.0-5/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/ms_gsl/4.0.0-16/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/ROOT/v6-32-06-alice8-5/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/Vc/1.4.5-3/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/boost/v1.83.0-alice2-35/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/FairMQ/v1.10.0-2/include/fairmq -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/FairMQ/v1.10.0-2/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/abseil/20240722.0-2/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/RapidJSON/v1.1.0-alice2-19/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/Configuration/master-local1/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/Monitoring/v3.19.8-4/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/curl/7.70.0-18/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/arrow/v20.0.0-alice1-2/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/Clang/v18.1.8-12/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/libuv/v1.40.0-18/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/TBB/v2021.5.0-18/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/FairRoot/v18.4.9-alice3-81/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/VMC/v2-0-90/include/vmc -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/libjalienO2/0.1.4-23/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/Common-O2/v1.6.3-18/include -Xcompiler "-fPIC -O2 " -allow-unsupported-compiler --expt-relaxed-constexpr --extended-lambda -Xcompiler -Wno-attributes -Wno-deprecated-gpu-targets --ftz=true -Xcudafe 
--diag_suppress=114 --allow-unsupported-compiler -Xcompiler "-O2 -g -DNDEBUG -Wno-unknown-warning-option" -O2 -g -DNDEBUG -Xptxas -O4 -Xcompiler -O4 -use_fast_math --ftz=true -std=c++20 "--generate-code=arch=compute_89,code=[compute_89,sm_89]" -Xcompiler=-fPIC -MD -MT Detectors/ITSMFT/ITS/tracking/GPU/cuda/CMakeFiles/O2lib-ITStrackingCUDA.dir/TrackingKernels.cu.o -MF Detectors/ITSMFT/ITS/tracking/GPU/cuda/CMakeFiles/O2lib-ITStrackingCUDA.dir/TrackingKernels.cu.o.d -x cu -rdc=true -c /home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/ITSMFT/ITS/tracking/GPU/cuda/TrackingKernels.cu -o Detectors/ITSMFT/ITS/tracking/GPU/cuda/CMakeFiles/O2lib-ITStrackingCUDA.dir/TrackingKernels.cu.o
/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/ITSMFT/ITS/tracking/GPU/cuda/TrackingKernels.cu(1254): error: name followed by "::" must be a class or namespace name
    o2::gpu::internal::GPUReconstructionChkErr(cub::DeviceTransform::Transform(nullptr, transform_bytes, d_temp_valid, cellNeighbours, newSize, gpu::pair_to_first<int, int>(), stream.get()), "/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/ITSMFT/ITS/tracking/GPU/cuda/TrackingKernels.cu", 1254, true);
                                                    ^

/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/ITSMFT/ITS/tracking/GPU/cuda/TrackingKernels.cu(1257): error: name followed by "::" must be a class or namespace name
    o2::gpu::internal::GPUReconstructionChkErr(cub::DeviceTransform::Transform(transform_temp, transform_bytes, d_temp_valid, cellNeighbours, newSize, gpu::pair_to_first<int, int>(), stream.get()), "/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/ITSMFT/ITS/tracking/GPU/cuda/TrackingKernels.cu", 1257, true);

f3sch · 2025-08-10T10:53:22Z

@shahor02, thanks. What is your local cuda & toolkit version? The CI builds with cuda 12.9.86

shahor02 · 2025-08-10T10:59:47Z

@f3sch thanks, indeed, my CUDA is older, will try to update:

Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0

f3sch · 2025-08-10T11:09:46Z

@shahor02 sure, let me know if you don't succeed. For now we could also simply put an #ifdef around this section, since it is not performance critical. I only changed it in preparation for something else.

f3sch added 6 commits August 8, 2025 12:54

ITS: GPU: put trackleting properly on different streams

a41bfc4

Signed-off-by: Felix Schlepper <felix.schlepper@cern.ch>

ITS: GPU: put cell finding on different streams

3139476

Signed-off-by: Felix Schlepper <felix.schlepper@cern.ch>

ITS: GPU: put cell neighbour finding on different streams

85287db

Signed-off-by: Felix Schlepper <felix.schlepper@cern.ch>

ITS: GPU: use external allocator for temp storage

5fc126f

Signed-off-by: Felix Schlepper <felix.schlepper@cern.ch>

ITS: GPU: simplify stream synchronization

a299329

Signed-off-by: Felix Schlepper <felix.schlepper@cern.ch>

ITS: GPU: resolve added TODOs

6e7242d

Signed-off-by: Felix Schlepper <felix.schlepper@cern.ch>

f3sch marked this pull request as ready for review August 8, 2025 11:22

f3sch requested review from fprino, mconcas and shahor02 as code owners August 8, 2025 11:22

mconcas approved these changes Aug 9, 2025

f3sch merged commit 80c4d14 into AliceO2Group:dev Aug 9, 2025
13 checks passed

f3sch deleted the its/gpu_ms branch August 9, 2025 14:51
