Multi GPU support #71
Open
## Summary
This adds `Torch::NN::DataParallel` for multi-GPU training, allowing automatic batch splitting across GPUs.

> What about torch-ddl?
I'd missed the other PR adding multi-GPU (and even distributed) workloads :) I still think this is worth submitting, since it's a much smaller changeset and has value of its own: using multiple GPUs locally is simpler than setting up a cluster for distributed training.
## Usage
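A minimal end-to-end sketch of the intended use (assumptions: a machine with two or more GPUs, a hypothetical `Net` module, `batches` as a stand-in for a data loader, and a `DataParallel.new(model)` constructor mirroring the PyTorch counterpart):

```ruby
require "torch"

model = Net.new.to("cuda")                     # module lives on the default CUDA device
dp_model = Torch::NN::DataParallel.new(model)  # wrap it for multi-GPU execution

criterion = Torch::NN::CrossEntropyLoss.new
optimizer = Torch::Optim::SGD.new(model.parameters, lr: 0.01)

batches.each do |x, y|
  # each batch is split across the available GPUs automatically;
  # the outputs are gathered back onto the source device
  output = dp_model.call(x.to("cuda"))
  loss = criterion.call(output, y.to("cuda"))

  optimizer.zero_grad
  loss.backward
  optimizer.step
end
```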
### Models that return loss
If the model returns a scalar loss (e.g., `[logits, loss]`), use `dp_model.backward` instead of `loss.backward`. This is necessary because gathering scalar tensors across devices breaks the autograd graph: `backward` calls backward on each replica's loss separately, then reduces the gradients onto the original module.
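A hedged sketch of the difference, continuing the usage example above (a hypothetical model whose forward returns `[logits, loss]`; whether `backward` takes the gathered loss as an argument is an assumption, not the exact signature from this changeset):

```ruby
logits, loss = dp_model.call(x, y)

optimizer.zero_grad
# loss.backward          # would break: gathering the scalar loss broke the autograd graph
dp_model.backward(loss)  # backward on each replica's loss, gradients reduced to the original module
optimizer.step
```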
## What's included
CUDA device management:

- `Torch::CUDA.current_device` - get the current CUDA device index
- `Torch::CUDA.set_device(id)` - set the current CUDA device (useful for testing devices)
- `Torch::CUDA.synchronize` - wait for all CUDA operations to complete
- `Torch::CUDA.nccl_available?` - check if NCCL is available (useful for checking whether DataParallel can run)
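For example (a rough sketch; `Torch::CUDA.available?` and `Torch::CUDA.device_count` already exist in torch.rb, the rest are the helpers added here, and the printed values are machine-dependent):

```ruby
require "torch"

if Torch::CUDA.available? && Torch::CUDA.nccl_available?
  puts Torch::CUDA.device_count    # e.g. 2
  puts Torch::CUDA.current_device  # e.g. 0

  Torch::CUDA.set_device(1)        # make device 1 the current device
  puts Torch::CUDA.current_device  # => 1

  Torch::CUDA.synchronize          # block until all queued CUDA work finishes
end
```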
DataParallel:

- `Torch::NN::DataParallel` - wraps a module for multi-GPU training
- `Torch::NN::Parallel.replicate` - copies modules to multiple devices
- `Torch::NN::Parallel.parallel_apply` - runs the forward pass on the replicas in parallel
- `Torch::NN._scatter` / `Torch::NN._gather` - split and combine tensors across devices (internal methods; the underscore naming mimics PyTorch)
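Conceptually these compose the same way PyTorch's DataParallel does. A rough sketch of the flow (`model` and `input` are placeholders, and the argument lists are assumptions rather than the exact signatures in this changeset):

```ruby
device_ids = (0...Torch::CUDA.device_count).to_a

# 1. split the input batch across the devices
inputs = Torch::NN._scatter(input, device_ids)
# 2. copy the wrapped module onto each device
replicas = Torch::NN::Parallel.replicate(model, device_ids)
# 3. run each replica on its chunk of the batch in parallel
outputs = Torch::NN::Parallel.parallel_apply(replicas, inputs)
# 4. combine the outputs back on the source device
result = Torch::NN._gather(outputs, device_ids.first)
```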
## Testing

Verified with nanogpt-rb on 2 GPUs. Both GPUs were utilized, but the speedup was... well, negative, since they were badly mismatched (RTX 4090 and GTX 1050 Ti) 😅 A better-matched pair should show an actual improvement.