Undefined symbol: ncclCommRegister #64

@MC952-arch

Hi, I've run into an MSCCL issue when running the allreduce test with the latest msccl, nccl-tests, and msccl-tools repos.

// msccl install step
git clone https://github.com/microsoft/msccl.git
cd msccl/
make -j src.build
cd ../

// nccl-tests install step
git clone https://github.com/nvidia/nccl-tests.git
cd nccl-tests/
make MPI=1 NCCL_HOME=../msccl/build/ -j
cd ../

// msccl-tools install step
git clone https://github.com/microsoft/msccl-tools.git
cd msccl-tools/
pip install .
cd ../
python msccl-tools/examples/mscclang/allreduce_a100_allpairs.py --protocol=LL 8 2 > test.xml

// allreduce test
mpirun -np 8 -x LD_LIBRARY_PATH=msccl/build/lib/:$LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV -x MSCCL_XML_FILES=test.xml -x NCCL_ALGO=MSCCL,RING,TREE nccl-tests/build/all_reduce_perf -b 128 -e 32MB -f 2 -g 1 -c 1 -n 100 -w 100 -G 100 -z 0

// Error
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.

nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister

mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[14987,1],0]
Exit code: 127

Can you help me figure out this issue?
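
A quick check that may help narrow this down: a "symbol lookup error" at startup means the dynamic loader found a libnccl.so, but that library does not export the symbol the test binary was linked against. The sketch below asks whether the MSCCL build exports ncclCommRegister. The helper name `has_symbol` is hypothetical, and the library path `msccl/build/lib/libnccl.so` is an assumption based on the build steps above (the actual file name in your build may differ). If the symbol is missing, one possible explanation, not confirmed here, is that nccl-tests was compiled against a newer NCCL header than the NCCL version MSCCL is based on.

```shell
# Hypothetical helper: succeeds if shared library $1 exports dynamic symbol $2.
# Relies on `nm -D` from binutils; a missing or unreadable file counts as "no".
has_symbol() {
    nm -D --defined-only "$1" 2>/dev/null | grep -qw "$2"
}

# Path assumed from the install steps above; adjust to your layout.
if has_symbol msccl/build/lib/libnccl.so ncclCommRegister; then
    echo "libnccl.so exports ncclCommRegister"
else
    echo "libnccl.so does not export ncclCommRegister (or was not found)"
fi
```

Running it from the directory that contains both `msccl/` and `nccl-tests/` keeps the relative path consistent with the mpirun command above.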
