Conversation

@f3sch
Collaborator

@f3sch f3sch commented Aug 8, 2025

Makes better use of multiple streams and also works in deterministic mode. The overall gain on 100 TFs is only about 15% of the total time, because the total is dominated by the track fitting, which was not touched; the speedup of the individual steps is larger.

Total time in its_time_benchmarks_old.txt: 108960.237939
Total time in its_time_benchmarks_new.txt: 91904.873816

  Difference (file2 - file1): -17055.364123
  Relative change: -15.65%

Metric: cell_finding
  its_time_benchmarks_old.txt: 5229.111419 (4.80% of total)
  its_time_benchmarks_new.txt: 3776.888696 (4.11% of total)
  Difference (file2 - file1): -1452.222723
  Relative change: -27.77%

Metric: neighbour_finding
  its_time_benchmarks_old.txt: 15058.566217 (13.82% of total)
  its_time_benchmarks_new.txt: 13635.751449 (14.84% of total)
  Difference (file2 - file1): -1422.814768
  Relative change: -9.45%

Metric: road_finding
  its_time_benchmarks_old.txt: 54911.194296 (50.40% of total)
  its_time_benchmarks_new.txt: 53212.793432 (57.90% of total)
  Difference (file2 - file1): -1698.400864
  Relative change: -3.09%

Metric: timeframe_initialisation
  its_time_benchmarks_old.txt: 1323.038784 (1.21% of total)
  its_time_benchmarks_new.txt: 1205.854911 (1.31% of total)
  Difference (file2 - file1): -117.183873
  Relative change: -8.86%

Metric: tracklet_finding
  its_time_benchmarks_old.txt: 32438.327223 (29.77% of total)
  its_time_benchmarks_new.txt: 20073.585328 (21.84% of total)
  Difference (file2 - file1): -12364.741895
  Relative change: -38.12%

One other problem I observed is that, after the counting phase, the required cudaMallocAsync calls still force the insert phase to be essentially sequential, since they are blocking (the same holds for the temporary allocations when calculating the LUTs). If this can be resolved, the speedup would double.
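
To make the dependency concrete, the following is a host-only C++ analogue of a count/insert scheme. It is a sketch under stated assumptions, not the code in this PR: `outputsFor` and `countThenInsert` are hypothetical names, the per-candidate work is a placeholder, and the real implementation runs as GPU kernels on several streams. It only illustrates why an allocation sits between the two phases: the output size is unknown until the counts have been turned into a LUT of write offsets.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical placeholder for the per-candidate work: candidate i
// produces (i % 3) outputs. The real code would run a GPU kernel here.
inline int outputsFor(int candidate) { return candidate % 3; }

std::vector<int> countThenInsert(int nCandidates)
{
  const std::size_t n = static_cast<std::size_t>(nCandidates);

  // Phase 1 (count): how many outputs does each candidate produce?
  std::vector<int> counts(n);
  for (int i = 0; i < nCandidates; ++i) {
    counts[static_cast<std::size_t>(i)] = outputsFor(i);
  }

  // LUT: exclusive prefix sum of the counts. lut[i] is the write offset
  // of candidate i and lut.back() is the total number of outputs.
  std::vector<int> lut(n + 1, 0);
  std::partial_sum(counts.begin(), counts.end(), lut.begin() + 1);

  // Allocation: the size is only known now. In the GPU version this is
  // the step that, per the comment above, serialises the streams.
  std::vector<int> out(static_cast<std::size_t>(lut.back()));

  // Phase 2 (insert): each candidate writes into its own slice of `out`.
  for (std::size_t i = 0; i < n; ++i) {
    for (int k = 0; k < counts[i]; ++k) {
      out[static_cast<std::size_t>(lut[i] + k)] = static_cast<int>(i);
    }
  }
  return out;
}
```

For example, `countThenInsert(5)` gives counts `0, 1, 2, 0, 1`, offsets `0, 0, 1, 3, 3` with a total of 4, and the output `1, 2, 2, 4`.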

f3sch added 6 commits August 8, 2025 12:54
Signed-off-by: Felix Schlepper <felix.schlepper@cern.ch>
@github-actions
Contributor

github-actions bot commented Aug 8, 2025

REQUEST FOR PRODUCTION RELEASES:
To request your PR to be included in production software, please add the corresponding labels called "async-" to your PR. Add the labels directly (if you have the permissions) or add a comment of the form (note that labels are separated by a ",")

+async-label <label1>, <label2>, !<label3> ...

This will add <label1> and <label2> and remove <label3>.

The following labels are available
async-2023-pbpb-apass4
async-2023-pp-apass4
async-2024-pp-apass1
async-2022-pp-apass7
async-2024-pp-cpass0
async-2024-PbPb-apass1
async-2024-ppRef-apass1
async-2024-PbPb-apass2
async-2023-PbPb-apass5

@f3sch f3sch marked this pull request as ready for review August 8, 2025 11:22
@mconcas
Collaborator

mconcas commented Aug 9, 2025

"only" 15% seems very good to me! Is it on pp or Pb-Pb?

@f3sch
Collaborator Author

f3sch commented Aug 9, 2025

35 kHz Pb-Pb. Yes, 15% is indeed nice in principle, but I think that with somewhat better handling of the scheduling this can go to 30%.

@f3sch
Collaborator Author

f3sch commented Aug 9, 2025

You can see that the counting phase indeed works nicely on different streams, but then, with the memory allocations, the compute phase is essentially sequential again.
image

@f3sch f3sch merged commit 80c4d14 into AliceO2Group:dev Aug 9, 2025
13 checks passed
@f3sch f3sch deleted the its/gpu_ms branch August 9, 2025 14:51
@shahor02
Collaborator

shahor02 commented Aug 9, 2025

@f3sch, I get

FAILED: Detectors/ITSMFT/ITS/tracking/GPU/cuda/CMakeFiles/O2lib-ITStrackingCUDA.dir/TrackingKernels.cu.o 
/usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DBOOST_ASIO_HAS_HAS_STD_CHRONO -DBOOST_ATOMIC_DYN_LINK -DBOOST_ATOMIC_NO_LIB -DBOOST_CONTAINER_DYN_LINK -DBOOST_CONTAINER_NO_LIB -DBOOST_ERROR_CODE_HEADER_ONLY -DBOOST_FILESYSTEM_DYN_LINK -DBOOST_FILESYSTEM_NO_LIB -DBOOST_IOSTREAMS_DYN_LINK -DBOOST_IOSTREAMS_NO_LIB -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_PROGRAM_OPTIONS_NO_LIB -DBOOST_REGEX_DYN_LINK -DBOOST_REGEX_NO_LIB -DDPL_ENABLE_BACKTRACE -DENABLE_UPGRADES -DFAIRMQ_HAS_STD_FILESYSTEM=1 -DFAIRMQ_HAS_STD_PMR=1 -DFMT_SHARED -DGENERATORS_WITH_HEPMC3 -DGENERATORS_WITH_PYTHIA8 -DGPUCA_GPUTYPE_AMPERE -DO2lib_ITStrackingCUDA_EXPORTS -DRANS_ENABLE_JSON -DRANS_OPENMP -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/ITSMFT/ITS/tracking/GPU -I/home/shahoian/alice/sw/BUILD/O2-latest/O2/Detectors/ITSMFT/ITS/tracking/GPU/cuda -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/ITSMFT/ITS/tracking/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/GPU/Common -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Framework/Logger/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Common/Constants/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/DataFormats/Detectors/ITSMFT/common/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/ITSMFT/common/base/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Common/MathUtils/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/DataFormats/Detectors/Common/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/GPU/Utils -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/DataFormats/Headers/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/DataFormats/MemoryResources/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Utilities/rANS/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Framework/Core/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/DataFormats/simulation/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Framework/Foundation/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Framework/Foundation/3rdparty/include 
-I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Utilities/PCG/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Framework/Foundation/3rdparty/x9 -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Common/Utils/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/DataFormats/common/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Common/SimConfig/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/DataFormats/Parameters/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Common/Types/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/CCDB/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/DataFormats/Calibration/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/DataFormats/Reconstruction/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Common/Field/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/DataFormats/Detectors/CTP/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/ITSMFT/ITS/base/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/Base/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/GPU/GPUTracking -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/GPU/GPUTracking/Definitions -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/GPU/GPUTracking/DataTypes -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/GPU/TPCFastTransformation -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/GPU/TPCFastTransformation/devtools -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/TPC/spacecharge/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/TPC/base/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/DataFormats/Detectors/TPC/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Utilities/DataSampling/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Algorithm/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/Raw/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Framework/Utils/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/Upgrades/ITS3/base/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/ITSMFT/ITS/reconstruction/include 
-I/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/ITSMFT/common/reconstruction/include -I/home/shahoian/alice/sw/SOURCES/O2/dev/0/DataFormats/Detectors/ITSMFT/ITS/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/fmt/11.1.2-4/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/FairLogger/v2.1.0-5/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/ms_gsl/4.0.0-16/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/ROOT/v6-32-06-alice8-5/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/Vc/1.4.5-3/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/boost/v1.83.0-alice2-35/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/FairMQ/v1.10.0-2/include/fairmq -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/FairMQ/v1.10.0-2/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/abseil/20240722.0-2/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/RapidJSON/v1.1.0-alice2-19/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/Configuration/master-local1/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/Monitoring/v3.19.8-4/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/curl/7.70.0-18/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/arrow/v20.0.0-alice1-2/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/Clang/v18.1.8-12/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/libuv/v1.40.0-18/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/TBB/v2021.5.0-18/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/FairRoot/v18.4.9-alice3-81/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/VMC/v2-0-90/include/vmc -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/libjalienO2/0.1.4-23/include -isystem /home/shahoian/alice/sw/ubuntu2204_x86-64/Common-O2/v1.6.3-18/include -Xcompiler "-fPIC -O2 " -allow-unsupported-compiler --expt-relaxed-constexpr --extended-lambda -Xcompiler -Wno-attributes -Wno-deprecated-gpu-targets --ftz=true -Xcudafe 
--diag_suppress=114 --allow-unsupported-compiler -Xcompiler "-O2 -g -DNDEBUG -Wno-unknown-warning-option" -O2 -g -DNDEBUG -Xptxas -O4 -Xcompiler -O4 -use_fast_math --ftz=true -std=c++20 "--generate-code=arch=compute_89,code=[compute_89,sm_89]" -Xcompiler=-fPIC -MD -MT Detectors/ITSMFT/ITS/tracking/GPU/cuda/CMakeFiles/O2lib-ITStrackingCUDA.dir/TrackingKernels.cu.o -MF Detectors/ITSMFT/ITS/tracking/GPU/cuda/CMakeFiles/O2lib-ITStrackingCUDA.dir/TrackingKernels.cu.o.d -x cu -rdc=true -c /home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/ITSMFT/ITS/tracking/GPU/cuda/TrackingKernels.cu -o Detectors/ITSMFT/ITS/tracking/GPU/cuda/CMakeFiles/O2lib-ITStrackingCUDA.dir/TrackingKernels.cu.o
/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/ITSMFT/ITS/tracking/GPU/cuda/TrackingKernels.cu(1254): error: name followed by "::" must be a class or namespace name
    o2::gpu::internal::GPUReconstructionChkErr(cub::DeviceTransform::Transform(nullptr, transform_bytes, d_temp_valid, cellNeighbours, newSize, gpu::pair_to_first<int, int>(), stream.get()), "/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/ITSMFT/ITS/tracking/GPU/cuda/TrackingKernels.cu", 1254, true);
                                                    ^

/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/ITSMFT/ITS/tracking/GPU/cuda/TrackingKernels.cu(1257): error: name followed by "::" must be a class or namespace name
    o2::gpu::internal::GPUReconstructionChkErr(cub::DeviceTransform::Transform(transform_temp, transform_bytes, d_temp_valid, cellNeighbours, newSize, gpu::pair_to_first<int, int>(), stream.get()), "/home/shahoian/alice/sw/SOURCES/O2/dev/0/Detectors/ITSMFT/ITS/tracking/GPU/cuda/TrackingKernels.cu", 1257, true);

@f3sch
Collaborator Author

f3sch commented Aug 10, 2025

@shahor02, thanks. What is your local CUDA & toolkit version? The CI builds with CUDA 12.9.86.

@shahor02
Collaborator

@f3sch thanks, indeed, my CUDA is older, will try to update:

Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0

@f3sch
Collaborator Author

f3sch commented Aug 10, 2025

@shahor02 sure, let me know if you don't succeed. I think for now we can also simply put an #ifdef around this section, since it is not performance critical. I only changed it in preparation for something else.
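
The guard suggested above could look roughly like the host-only C++ sketch below. Everything in it is an assumption made for illustration: the macro `SKETCH_HAVE_NEW_TRANSFORM`, the functor `PairToFirst` (a stand-in for the `gpu::pair_to_first<int, int>` seen in the log) and the function `extractFirsts` are hypothetical, and the thread does not establish which version check actually separates CUDA 12.8 from 12.9. In the real file the guarded branch would keep the `cub::DeviceTransform::Transform` call and the fallback would be a device-side equivalent; here both branches are plain host code so that the sketch compiles without CUDA.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical switch: the build system would define this macro only
// when the toolkit provides the newer API.
// #define SKETCH_HAVE_NEW_TRANSFORM 1

// Hypothetical host stand-in for a functor that keeps the first element.
struct PairToFirst {
  int operator()(const std::pair<int, int>& p) const { return p.first; }
};

std::vector<int> extractFirsts(const std::vector<std::pair<int, int>>& pairs)
{
  std::vector<int> firsts(pairs.size());
#ifdef SKETCH_HAVE_NEW_TRANSFORM
  // Guarded path: std::transform stands in for the newer library call.
  std::transform(pairs.begin(), pairs.end(), firsts.begin(), PairToFirst{});
#else
  // Fallback path for older toolkits: same result, no newer API needed.
  for (std::size_t i = 0; i < pairs.size(); ++i) {
    firsts[i] = pairs[i].first;
  }
#endif
  return firsts;
}
```

Both branches return the same result, so such a guard affects only which toolkits can compile the file, which fits the remark that this section is not performance critical.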
