Description
Hi, I've run into an MSCCL issue when running the allreduce test with the latest msccl, nccl-tests, and msccl-tools repos.
// msccl install step
git clone https://github.com/microsoft/msccl.git
cd msccl/
make -j src.build
cd ../
// nccl-tests install step
git clone https://github.com/nvidia/nccl-tests.git
cd nccl-tests/
make MPI=1 NCCL_HOME=../msccl/build/ -j
cd ../
// msccl-tools install step
git clone https://github.com/microsoft/msccl-tools.git
cd msccl-tools/
pip install .
cd ../
python msccl-tools/examples/mscclang/allreduce_a100_allpairs.py --protocol=LL 8 2 > test.xml
cd ../
// allreduce test
mpirun -np 8 -x LD_LIBRARY_PATH=msccl/build/lib/:$LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV -x MSCCL_XML_FILES=test.xml -x NCCL_ALGO=MSCCL,RING,TREE nccl-tests/build/all_reduce_perf -b 128 -e 32MB -f 2 -g 1 -c 1 -n 100 -w 100 -G 100 -z 0
// Error
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[14987,1],0]
Exit code: 127
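In case it helps with triage, here is a minimal sketch of a check for whether the shared library loaded at runtime actually exports the missing symbol. The `has_symbol` helper name is hypothetical, and the path assumes the MSCCL build places `libnccl.so` under `msccl/build/lib/` relative to the working directory, as the `LD_LIBRARY_PATH` above implies.

```shell
# Hypothetical helper: succeeds if the shared library exports the given symbol.
has_symbol() {
    lib="$1"
    sym="$2"
    # nm -D lists the dynamic symbol table; --defined-only skips undefined references.
    nm -D --defined-only "$lib" 2>/dev/null | grep -qw "$sym"
}

# Assumed path: the MSCCL build is expected to place libnccl.so here.
if has_symbol msccl/build/lib/libnccl.so ncclCommRegister; then
    echo "ncclCommRegister is exported"
else
    echo "ncclCommRegister is not exported (or the library was not found)"
fi
```

If the symbol turns out to be absent, that would suggest the test binary was compiled against headers from a different NCCL version than the library it loads at runtime, though this is only a guess.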
Can you help me figure out this issue?