I’m writing to help anyone who faces the error “RuntimeError: Distributed package doesn’t have NCCL built in” when running the command:
python partfield_inference.py -c configs/final/demo.yaml --opts continue_ckpt model/model_objaverse.ckpt result_name partfield_features/objaverse dataset.data_path data/objaverse_samples
while trying to extract the feature field. I ran into the same issue.
This error happens because partfield_inference.py configures the PyTorch Lightning Trainer with the DDP strategy, which is meant for distributing work across multiple GPUs. On CUDA devices, DDP uses the NCCL backend for communication, and PyTorch builds for Windows do not include NCCL, so setting up DDP fails with this error even on a single-GPU machine.
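To confirm that this is the cause on your machine, a quick check along these lines may help. It is only a minimal sketch and not part of PartField: the helper name nccl_status is hypothetical, and it relies only on torch.distributed.is_available() and torch.distributed.is_nccl_available().

```python
import platform


def nccl_status():
    """Report whether the installed PyTorch build can use the NCCL backend.

    Hypothetical helper for diagnosis only; it is not part of PartField.
    """
    status = {
        "platform": platform.system(),
        "torch_installed": False,
        "distributed": False,
        "nccl": False,
    }
    try:
        import torch.distributed as dist
    except ImportError:
        # PyTorch is not installed in this environment.
        return status

    status["torch_installed"] = True
    status["distributed"] = dist.is_available()
    # Only query NCCL when the distributed package itself is available.
    status["nccl"] = status["distributed"] and dist.is_nccl_available()
    return status


print(nccl_status())
```

If "nccl" comes back False, any DDP setup that asks for the NCCL backend is expected to fail with the error above.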
So, I resolved the issue by changing:
strategy = DDPStrategy(find_unused_parameters=True)
to:
strategy = "auto"
inside the file partfield_inference.py.
According to the PyTorch Lightning documentation, "auto" lets the Trainer choose a strategy based on the available hardware, so it does not set up DDP on a single-GPU machine.
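If you would rather keep DDP for multi-GPU Linux machines and only fall back to "auto" elsewhere, the choice could be made at runtime. The following is a rough sketch under my own assumptions, not the project's code: should_use_ddp and detect_hardware are hypothetical helper names, and the last comment shows how the result might be wired into partfield_inference.py.

```python
def should_use_ddp(num_gpus: int, nccl_available: bool) -> bool:
    """Use DDP only when there are several GPUs and NCCL is available."""
    return num_gpus > 1 and nccl_available


def detect_hardware():
    """Return (number of CUDA devices, NCCL availability).

    Falls back to (0, False) when PyTorch is not installed.
    """
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        return 0, False
    nccl = dist.is_available() and dist.is_nccl_available()
    return torch.cuda.device_count(), nccl


num_gpus, nccl = detect_hardware()
use_ddp = should_use_ddp(num_gpus, nccl)
print(f"GPUs: {num_gpus}, NCCL: {nccl}, use DDP: {use_ddp}")

# Assumed wiring inside partfield_inference.py (untested against the repo):
#   strategy = DDPStrategy(find_unused_parameters=True) if use_ddp else "auto"
```

On a single-GPU Windows machine this resolves to "auto", which matches the manual change described above.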
Best,
Amir