RuntimeError: Distributed package doesn't have NCCL built in #13

@AHHHZ975

I’m writing to help anyone who hits the error “RuntimeError: Distributed package doesn’t have NCCL built in” when running the following command to extract the feature field:

python partfield_inference.py -c configs/final/demo.yaml --opts continue_ckpt model/model_objaverse.ckpt result_name partfield_features/objaverse dataset.data_path data/objaverse_samples

I ran into the same issue.

This error happens because partfield_inference.py configures the PyTorch Lightning Trainer with the DDP strategy, which distributes work across GPUs. On GPU devices, DDP initializes torch.distributed with the NCCL backend, and NCCL is not supported on Windows, so the PyTorch builds for Windows do not include it. As a result, the script fails with the NCCL error even on a single-GPU machine.
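To confirm that this is the cause on your machine, you can ask PyTorch directly whether its distributed package was built with NCCL. The snippet below is a minimal diagnostic sketch, not part of the PartField code; the helper name nccl_status is made up for illustration. It relies only on torch.distributed.is_available() and torch.distributed.is_nccl_available().

```python
import platform


def nccl_status():
    """Report whether the installed PyTorch build can use the NCCL backend.

    Hypothetical helper for diagnosis only; it is not part of PartField.
    """
    status = {
        "platform": platform.system(),
        "torch_installed": False,
        "distributed": False,
        "nccl": False,
    }
    try:
        import torch.distributed as dist
    except ImportError:
        # PyTorch is not installed in this environment.
        return status

    status["torch_installed"] = True
    status["distributed"] = bool(dist.is_available())
    # Only query NCCL when the distributed package itself is available.
    if status["distributed"]:
        status["nccl"] = bool(dist.is_nccl_available())
    return status


if __name__ == "__main__":
    print(nccl_status())
```

If "nccl" comes back False, any strategy that asks for the NCCL backend will likely fail with the same RuntimeError.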

So, I resolved the issue by changing:

strategy = DDPStrategy(find_unused_parameters=True)

to:
strategy = "auto"

inside the file partfield_inference.py.

According to the PyTorch Lightning documentation, "auto" lets the Trainer choose a strategy based on the selected accelerator and the number of available devices, so a single-GPU setup runs without DDP.
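If you would rather keep DDP for multi-GPU Linux machines and only fall back where NCCL is missing, the choice can be made at runtime. This is a rough sketch under a few assumptions: the helper name pick_trainer_kwargs is hypothetical, the import path may be pytorch_lightning or lightning.pytorch depending on the installed version, and I have not checked how partfield_inference.py passes its other Trainer arguments.

```python
def pick_trainer_kwargs(num_devices, nccl_available):
    """Return Trainer keyword arguments for the strategy and device count.

    Hypothetical helper: falls back to a single device with strategy="auto"
    when there is only one GPU or when NCCL is not available.
    """
    if num_devices <= 1 or not nccl_available:
        # Single-device run: Lightning does not set up DDP in this case.
        return {"strategy": "auto", "devices": 1}

    # Import lazily so the fallback path works without touching DDP at all.
    try:
        from lightning.pytorch.strategies import DDPStrategy
    except ImportError:
        from pytorch_lightning.strategies import DDPStrategy

    # Same strategy the script used originally.
    return {
        "strategy": DDPStrategy(find_unused_parameters=True),
        "devices": num_devices,
    }


if __name__ == "__main__":
    # Example: a Windows machine with one GPU and no NCCL.
    print(pick_trainer_kwargs(num_devices=1, nccl_available=False))
```

The returned dictionary could then be unpacked into the Trainer call, for example Trainer(**pick_trainer_kwargs(n, has_nccl), ...), with n and has_nccl taken from torch.cuda.device_count() and torch.distributed.is_nccl_available().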

Best,
Amir
