build and test against CUDA 13.1.0 #2912
Conversation
That's happening on the …
Looking at a recent successful PR CI run with CUDA 13.0 to see where NCCL was found: for the devcontainers and wheels CI, it looks like they're system installations (the corresponding build logs show NCCL being installed as a system package in each). On this PR, the devcontainer build doesn't have those install lines showing up. Checking the new ci-wheel images directly:

$ docker run --rm rapidsai/ci-wheel:cuda13.0.2-rockylinux8-py3.10 find / -name 'libnccl.so*'
/usr/lib64/libnccl.so.2.28.9
/usr/lib64/libnccl.so
/usr/lib64/libnccl.so.2
$ docker run --rm rapidsai/ci-wheel:cuda13.1.0-rockylinux8-py3.10 find / -name 'libnccl.so*'
# (empty)
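To confirm that the NCCL in the older image really is a system package (rather than something copied in), an ownership query along these lines could be run; this is just an illustrative check, not something the CI does:

```bash
# Ask which RPM owns the NCCL library found in the CUDA 13.0 wheel image
docker run --rm rapidsai/ci-wheel:cuda13.0.2-rockylinux8-py3.10 \
  rpm -qf /usr/lib64/libnccl.so.2
```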
The latest version of NCCL is … Perhaps we still need an update on the wheel side?
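For reference, one way to check which NCCL wheel versions are currently published (illustrative; `pip index versions` is still marked experimental in pip):

```bash
# List the nvidia-nccl-cu13 wheel versions available on PyPI
pip index versions nvidia-nccl-cu13
```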
I don't think the issue is one of versions, because the CMake code looking for NCCL isn't looking for any specific version. It should generally be ok for the build to pick up whichever NCCL it finds.

BUT... I'm noticing now that we are installing that library from wheels in the build environment, but end up picking it up from system libraries. Looks like it may have been that way since #2629. I'm not sure which is the bug: that we're installing the wheels in the build environment despite not needing them, or that RAFT wheel builds are finding system-installed NCCL when we want them to find the one from wheels.
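A quick way to tell which NCCL a built artifact actually linked against would be something like this (the library path is hypothetical; substitute the real built `.so`):

```bash
# Show the NCCL dependency of a built shared library (path is hypothetical)
ldd /path/to/libraft.so | grep -i nccl
```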
  "installcuRAND": true,
- "installcuSPARSE": true
+ "installcuSPARSE": true,
+ "installNCCL": true
This is not working, and I see why.
The code processing this flag is explicitly looking for packages with +cuda13.1 in their versions.
if [ "${INSTALLNCCL:-false}" = true ] \
&& test -n "$(apt-cache search libnccl2 2>/dev/null)" \
&& apt-cache policy libnccl2 2>/dev/null | grep -q "+${cuda_tag}"; then
PKGS+=("libnccl2=*+${cuda_tag}");
if [ "${INSTALLDEVPACKAGES:-false}" = true ]; then
PKGS+=("libnccl-dev=*+${cuda_tag}");
fi
fi

(rapidsai/devcontainers, features/src/cuda/install.sh)
And those don't exist today.
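For reference, the condition in that script can be reproduced by hand inside one of the CUDA 13.1 devcontainer images (a sketch, assuming the NVIDIA apt repository is configured the same way the feature script expects):

```bash
# Does any libnccl2 candidate carry a +cuda13.1 version tag?
apt-cache policy libnccl2 2>/dev/null | grep "+cuda13.1" \
  || echo "no libnccl2 packages with a +cuda13.1 version found"
```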
So options I see:

- relax that condition
- add a manual install in the `devcontainer.json` here and in other RAPIDS repos (like in `postCreateCommand` or something); see the sketch after this list
- get RAFT wheel builds using the files delivered by the `nvidia-nccl-cu{12,13}` wheels (build and test against CUDA 13.1.0 #2912 (comment))
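As a rough sketch of the second option (hypothetical, with the package names taken from the install script above), a manual install wired into `postCreateCommand` might look like:

```bash
# Hypothetical postCreateCommand helper: install NCCL from the CUDA apt repository
# (assumes the NVIDIA apt repo is already set up in the devcontainer image)
sudo apt-get update
sudo apt-get install -y --no-install-recommends libnccl2 libnccl-dev
```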
That code is 3 years old. Sounds worthwhile to revisit whether it can be changed.
For sure. I hacked around it here like this: fabf7f9, which seems to at least allow all RAFT components to compile: https://github.com/rapidsai/raft/actions/runs/20757771141/job/59604424645?pr=2912

But it definitely feels like a hack. Opened rapidsai/devcontainers#644 to discuss.
We decided offline to skip CUDA 13.1 pip devcontainers in CI for now to keep moving, and to restore them hopefully soon when there are CUDA 13.1 apt packages for NCCL: rapidsai/devcontainers#644 (comment)
I just pushed a56e104
Thinking about this some more, we should at least still include the CUDA 13.1 devcontainers in the affected repos, even if we can't test them in CI, so they can be used for other testing. For example, the cuda13.1-conda devcontainers are not affected by this issue and could be used to debug CUDA 13.1 builds.
There are now apt packages for NCCL + CUDA 13.1, so no need to skip anything any more 😁
/ok to test
Just pushed 946981f. There are now …
/merge |
Contributes to rapidsai/build-planning#236.

Tests that CI here will work with the changes from rapidsai/shared-workflows#483, switches CUDA 13 builds to CUDA 13.1.0, and adds some CUDA 13.1.0 test jobs.