
Conversation

@jameslamb
Member

@jameslamb jameslamb commented Jan 5, 2026

Contributes to rapidsai/build-planning#236

Tests that CI here will work with the changes from rapidsai/shared-workflows#483,
switches CUDA 13 builds to CUDA 13.1.0 and adds some CUDA 13.1.0 test jobs.


@jameslamb
Member Author

pip-based builds (but not conda) are failing like this:

  -- Found cuco: /tmp/pip-build-env-ycpr1l82/normal/lib/python3.13/site-packages/libraft/lib64/cmake/cuco/cuco-config.cmake (found version "0.0.1")
  CMake Error at /tmp/pip-build-env-ycpr1l82/normal/lib/python3.13/site-packages/cmake/data/share/cmake-4.2/Modules/FindPackageHandleStandardArgs.cmake:290 (message):
    Could NOT find NCCL (missing: NCCL_LIBRARY NCCL_INCLUDE_DIR)
  Call Stack (most recent call first):
    /tmp/pip-build-env-ycpr1l82/normal/lib/python3.13/site-packages/cmake/data/share/cmake-4.2/Modules/FindPackageHandleStandardArgs.cmake:654 (_FPHSA_FAILURE_MESSAGE)
    /tmp/pip-build-env-ycpr1l82/normal/lib/python3.13/site-packages/libraft/lib64/cmake/raft/FindNCCL.cmake:69 (find_package_handle_standard_args)
    /tmp/pip-build-env-ycpr1l82/normal/lib/python3.13/site-packages/cmake/data/share/cmake-4.2/Modules/CMakeFindDependencyMacro.cmake:93 (find_package)
    /tmp/pip-build-env-ycpr1l82/normal/lib/python3.13/site-packages/cmake/data/share/cmake-4.2/Modules/CMakeFindDependencyMacro.cmake:125 (__find_dependency_common)
    /tmp/pip-build-env-ycpr1l82/normal/lib/python3.13/site-packages/libraft/lib64/cmake/raft/raft-distributed-dependencies.cmake:11 (find_dependency)
    /tmp/pip-build-env-ycpr1l82/normal/lib/python3.13/site-packages/libraft/lib64/cmake/raft/raft-config.cmake:84 (include)
    CMakeLists.txt:28 (find_package)

(build link)

That's happening on the cuml CUDA 13.1.0 PR too: rapidsai/cuml#7650
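For reference, NCCL_LIBRARY and NCCL_INCLUDE_DIR in that error are user-settable CMake cache variables, so a local build can be unblocked by passing them explicitly. A minimal sketch, with illustrative system paths rather than anything used in CI:

# Sketch only: pass the two cache variables the find module reports as missing.
# The paths here are examples for a system NCCL install, not values used in CI.
cmake -S . -B build \
  -DNCCL_INCLUDE_DIR=/usr/include \
  -DNCCL_LIBRARY=/usr/lib/x86_64-linux-gnu/libnccl.so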

@jameslamb
Member Author

I looked at a recent successful PR CI run with CUDA 13.0 to see where NCCL was found. For both the devcontainers and wheels CI, it looks like it comes from system installations.

In pip devcontainers:

...
2025-12-20T20:07:45.9911177Z [2025-12-20T20:07:45.990Z] #16 2.921 W: http://ppa.launchpad.net/git-core/ppa/ubuntu/dists/noble/InRelease: Signature by key E1DD270288B4E6030699E45FA1715D88E1DF1F24 uses weak algorithm (rsa1024)
2025-12-20T20:07:45.9911973Z #16 2.922 Installing dev CUDA toolkit...
2025-12-20T20:07:51.1131583Z [2025-12-20T20:07:51.112Z] #16 8.194 Installing packages: cuda-nvml-dev-13-0 cuda-compiler-13-0 cuda-minimal-build-13-0 cuda-command-line-tools-13-0 cuda-nsight-compute-13-0 cuda-nsight-systems-13-0 cuda-cudart-dev-13-0 cuda-nvrtc-dev-13-0 libnvjitlink-dev-13-0 cuda-opencl-dev-13-0 libcublas-dev-13-0 libcusparse-dev-13-0 libcufft-dev-13-0 libcufile-dev-13-0 libcurand-dev-13-0 libcusolver-dev-13-0 libnpp-dev-13-0 libnvjpeg-dev-13-0 libnccl2=*+cuda13.0 libnccl-dev=*+cuda13.0
...
  #16 9.500 libnccl2 is already the newest version (2.28.9-1+cuda13.0).
  #16 9.500 libnccl-dev is already the newest version (2.28.9-1+cuda13.0).
...
  [2025-12-20T20:08:34.939Z] Collecting nvidia-nccl-cu13>=2.19 (from -r /tmp/rapids.requirements.txt (line 11))
  [2025-12-20T20:08:35.057Z]   Downloading https://pypi.nvidia.com/nvidia-nccl-cu13/nvidia_nccl_cu13-2.28.9-py3-none-manylinux_2_18_x86_64.whl (196.5 MB)
...
  -- Found NCCL: /usr/lib/x86_64-linux-gnu/libnccl.so

(build link)

And in raft-dask wheel builds:

...
  Collecting nvidia-nccl-cu13>=2.19
    Downloading https://pypi.nvidia.com/nvidia-nccl-cu13/nvidia_nccl_cu13-2.28.9-py3-none-manylinux_2_18_aarch64.whl (196.6 MB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 196.6/196.6 MB 243.3 MB/s  0:00:00
...
  -- Found NCCL: /usr/lib64/libnccl.so

(build link)

On this PR, the devcontainer build doesn't have those install lines for libnccl2 / libnccl-dev, and in the wheel builds I don't see that system installation either.

Checking the new ci-wheel images, it seems we're no longer shipping a system install of NCCL.

$ docker run --rm rapidsai/ci-wheel:cuda13.0.2-rockylinux8-py3.10 find / -name 'libnccl.so*'
/usr/lib64/libnccl.so.2.28.9
/usr/lib64/libnccl.so
/usr/lib64/libnccl.so.2

$ docker run --rm rapidsai/ci-wheel:cuda13.1.0-rockylinux8-py3.10 find / -name 'libnccl.so*'
# (empty)
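Side note: the nvidia-nccl-cu13 wheel does ship libnccl, but (assuming the usual layout of the nvidia-* wheels) it lands under site-packages/nvidia/nccl rather than /usr/lib64, so FindNCCL won't see it without a hint. A quick way to check that in the new image (command sketch):

# Command sketch (wheel layout assumed): install the NCCL wheel in the new image
# and see where its libnccl lands; it ends up under site-packages, not /usr/lib64.
docker run --rm rapidsai/ci-wheel:cuda13.1.0-rockylinux8-py3.10 \
  sh -c "pip install --quiet --extra-index-url=https://pypi.nvidia.com 'nvidia-nccl-cu13>=2.19' && find / -name 'libnccl.so*' 2>/dev/null"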

@jakirkham
Member

The latest version of NCCL is 2.29.2-1. This is what the conda builds are picking up.

Perhaps we still need an update on the wheel side?

@jameslamb
Member Author

jameslamb commented Jan 6, 2026

Perhaps we still need an update on the wheel side?

I don't think the issue is one of versions, because the CMake code looking for NCCL isn't looking for any specific version. It should generally be fine for the nvidia-nccl-cu{12,13} wheels installed in the build environment to be a little behind what's on conda-forge.

BUT... I'm noticing now that we install that library from wheels in the build environment, yet end up picking it up from system libraries instead. Looks like it may have been that way since #2629 ... I'm not sure which part is the bug: that we install the wheels in the build environment despite not needing them, or that RAFT wheel builds find the system-installed NCCL when we want them to find the one from the wheels.
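
If the latter is the bug, a rough sketch of what pointing the build at the wheel-provided NCCL could look like (the nvidia/nccl/{include,lib} layout under site-packages is an assumption, and this is not the change made in this PR):

# Sketch only, not the change made here: hint the find module at the NCCL that the
# nvidia-nccl-cu13 wheel installs into the build environment's site-packages
# (the nvidia/nccl/{include,lib} layout is assumed).
SITE_PACKAGES="$(python -c 'import sysconfig; print(sysconfig.get_paths()["purelib"])')"
cmake -S . -B build \
  -DNCCL_INCLUDE_DIR="${SITE_PACKAGES}/nvidia/nccl/include" \
  -DNCCL_LIBRARY="${SITE_PACKAGES}/nvidia/nccl/lib/libnccl.so.2"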

"installcuRAND": true,
"installcuSPARSE": true
"installcuSPARSE": true,
"installNCCL": true
@jameslamb
Member Author

This is not working, and I see why.

The code processing this flag is explicitly looking for packages with +cuda13.1 in their versions.

if [ "${INSTALLNCCL:-false}" = true ] \
&& test -n "$(apt-cache search libnccl2 2>/dev/null)" \
&& apt-cache policy libnccl2 2>/dev/null | grep -q "+${cuda_tag}"; then
    PKGS+=("libnccl2=*+${cuda_tag}");
    if [ "${INSTALLDEVPACKAGES:-false}" = true ]; then
        PKGS+=("libnccl-dev=*+${cuda_tag}");
    fi
fi

(rapidsai/devcontainers - features/src/cuda/install.sh)

And those don't exist today.
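
That can be confirmed with the same apt-cache check the feature performs (command sketch):

# Command sketch: re-run the check from install.sh for CUDA 13.1. No output means
# the CUDA apt repo doesn't publish a libnccl2 build tagged +cuda13.1 yet.
apt-cache policy libnccl2 2>/dev/null | grep '+cuda13.1'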

So options I see:

  • relax that condition
  • add a manual install in the devcontainer.json here and in other RAPIDS repos (like in postCreateCommand or something; see the sketch after this list)
  • get RAFT wheel builds using the files delivered by nvidia-nccl-cu{12,13} wheels (build and test against CUDA 13.1.0 #2912 (comment))
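
For that second option, the manual install would look something like this (sketch only; the package names and the exact lifecycle hook are assumptions, not what was merged):

# Sketch of the manual-install option (not what was merged): install NCCL directly
# from the CUDA apt repo in a devcontainer lifecycle hook such as postCreateCommand,
# without relying on the feature's +cuda13.1 version match.
sudo apt-get update
sudo apt-get install -y --no-install-recommends libnccl2 libnccl-dev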

Member

That code is 3 years old. Sounds worthwhile to revisit whether it can be changed.

@jameslamb
Member Author

For sure. I hacked around it here like this: fabf7f9, which seems to at least allow all RAFT components to compile: https://github.com/rapidsai/raft/actions/runs/20757771141/job/59604424645?pr=2912

But it definitely feels like a hack. I opened rapidsai/devcontainers#644 to discuss.

@jameslamb
Member Author

We decided offline to skip CUDA 13.1 pip devcontainers in CI for now to keep moving, and to restore them hopefully soon when there are CUDA 13.1 apt packages for NCCL: rapidsai/devcontainers#644 (comment)

@jameslamb
Member Author

I just pushed a56e104

Thinking about this some more, we should at least still include the CUDA 13.1 devcontainers in the affected repos, even if we can't test them in CI, so they can be used for other testing. For example, the cuda13.1-conda devcontainers are not affected by this issue and could be used to debug CUDA 13.1 builds.

@jameslamb
Member Author

There are now apt packages for NCCL + CUDA 13.1, so no need to skip anything any more 😁

#2912 (comment)

@jameslamb
Member Author

/ok to test

@jameslamb jameslamb changed the title WIP: build and test against CUDA 13.1.0 build and test against CUDA 13.1.0 (except devcontainers) Jan 7, 2026
@jameslamb jameslamb requested a review from gforsyth January 7, 2026 14:49
@jameslamb jameslamb marked this pull request as ready for review January 7, 2026 14:49
@jameslamb jameslamb requested review from a team as code owners January 7, 2026 14:49
@jameslamb
Member Author

Just pushed 946981f

There are now libnccl-dev=*+cuda13.1 apt packages available, so we should be able to use CUDA 13.1 devcontainers here and in CI, and not need a follow-up PR later.

@jameslamb jameslamb changed the title build and test against CUDA 13.1.0 (except devcontainers) build and test against CUDA 13.1.0 Jan 7, 2026
@jameslamb
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 2ca8b09 into rapidsai:main Jan 7, 2026
109 checks passed
@jameslamb jameslamb deleted the cuda13.1.0-workflows branch January 7, 2026 19:08

Labels

improvement: Improvement / enhancement to an existing function
non-breaking: Non-breaking change
