feat: Add multi-GPU support with data/model/tensor parallelism #1153
Conversation
This PR fixes several issues that prevent DiffSynth-Studio from running on non-CUDA devices (Apple Silicon MPS and CPU):

1. base_pipeline.py: Check if empty_cache exists before calling it
   - Only CUDA has torch.cuda.empty_cache(); MPS and CPU don't have this method
2. siglip2_image_encoder.py: Remove hardcoded device="cuda" default
   - Now auto-detects the device from the model parameters
   - Falls back to the specified device if provided
3. dinov3_image_encoder.py: Remove hardcoded device="cuda" default
   - Same fix as siglip2_image_encoder.py
4. vram/layers.py: Check if mem_get_info exists before calling it
   - Only CUDA and NPU have mem_get_info()
   - For MPS/CPU, assume enough memory is available

These changes enable running Qwen-Image pipelines on Apple Silicon Macs and CPU-only machines without requiring any monkey-patching workarounds.
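For illustration, a minimal sketch of the guard pattern these fixes rely on (the helper names below are hypothetical, not the actual code in base_pipeline.py or vram/layers.py):

```python
from typing import Optional
import torch

def clear_device_cache(device: torch.device) -> None:
    # torch.cuda and torch.mps provide empty_cache(); plain CPU does not,
    # so only call it when the backend actually exposes the method.
    backend = getattr(torch, device.type, None)
    if backend is not None and hasattr(backend, "empty_cache"):
        backend.empty_cache()

def resolve_device(model: torch.nn.Module, device: Optional[str] = None) -> torch.device:
    # Prefer an explicitly requested device; otherwise follow the model's own
    # parameters instead of assuming "cuda".
    if device is not None:
        return torch.device(device)
    return next(model.parameters()).device

def free_memory_bytes(device: torch.device) -> float:
    # Only CUDA (and NPU) expose mem_get_info(); on MPS/CPU assume enough memory.
    backend = getattr(torch, device.type, None)
    if backend is not None and hasattr(backend, "mem_get_info"):
        free, _total = backend.mem_get_info(device)
        return float(free)
    return float("inf")
```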
This commit adds comprehensive multi-GPU support for DiffSynth-Studio pipelines, enabling efficient distributed inference across multiple GPUs.

New distributed module (diffsynth/distributed/):
- parallel.py: Basic distributed utilities (init, barrier, broadcast, etc.)
- tensor_parallel.py: Column/row parallel linear layers for tensor parallelism
- data_parallel.py: Data parallel utilities for batch processing
- multi_gpu.py: High-level multi-GPU pipeline wrapper

Three parallelism modes supported:
1. Model Parallel: Distribute different models to different GPUs
   - Automatically balances load based on model size
   - Custom device map support for fine-grained control
2. Data Parallel: Same model on all GPUs, process different batches
   - Scatter/gather utilities for batch distribution
   - Compatible with the torchrun launcher
3. Tensor Parallel: Split large layers across GPUs
   - Column-parallel and row-parallel linear layers
   - Automatic layer selection based on feature size

New methods in BasePipeline:
- enable_multi_gpu(mode, device_map, tensor_parallel_layers)
- get_model_distribution()
- print_model_distribution()

Example script: examples/multi_gpu_inference.py
- Demonstrates all three parallelism modes
- Includes a custom device map example
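A minimal usage sketch of the new BasePipeline methods (the `load_pipeline` helper and the device-map keys are placeholders, not names from this PR):

```python
# Placeholder construction: build the pipeline however it is normally created.
pipe = load_pipeline(...)

# Model parallelism with automatic placement balanced by model size.
pipe.enable_multi_gpu(mode="model")
pipe.print_model_distribution()

# Or pin individual models to specific devices with a custom device map
# (the keys "dit", "text_encoder", "vae" are illustrative).
pipe.enable_multi_gpu(
    mode="model",
    device_map={"dit": "cuda:0", "text_encoder": "cuda:1", "vae": "cuda:1"},
)
print(pipe.get_model_distribution())
```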
Summary of Changes

Hello @longkeyy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances DiffSynth-Studio by integrating robust multi-GPU capabilities. It provides flexible strategies for distributed inference, allowing users to leverage multiple GPUs for improved performance and scalability. The new distributed module supports model, data, and tensor parallelism, catering to use cases ranging from fitting large models into memory to accelerating batch processing. This change aims to make the framework more efficient and accessible for high-performance computing environments.
Code Review
This pull request introduces comprehensive multi-GPU support, a significant and well-structured feature. The new distributed module provides clean abstractions for model, data, and tensor parallelism. The changes to existing code to accommodate multi-GPU environments are also thoughtful. My review focuses on a few key areas to enhance robustness and clarity: a critical bug in the data parallelism logic that could cause deadlocks, a couple of high-severity issues related to an unguarded import and an incomplete API, and a medium-severity issue in the tensor parallelism logic that limits its applicability. Addressing these points will make this excellent contribution even more solid.
```python
if rank == dst:
    return {k: gather_outputs(v, dim, dst) for k, v in output.items()}
return output
```
There is a critical bug in the gather_outputs function when handling dictionaries. The recursive call to gather_outputs for the dictionary values is only performed on the destination rank (dst). Since gather_outputs may use collective communication operations (like all_gather), which must be called by all processes, this will cause a deadlock when ranks other than dst do not participate.
To fix this, the recursive call must be executed on all ranks.
```diff
- if rank == dst:
-     return {k: gather_outputs(v, dim, dst) for k, v in output.items()}
- return output
+ # Recursively gather each value in the dictionary.
+ # This must be executed on all ranks to avoid deadlocks in collective calls.
+ return {k: gather_outputs(v, dim, dst) for k, v in output.items()}
```
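For context, a minimal sketch of the gather pattern the fix relies on (not the PR's actual implementation; it assumes torch.distributed is already initialized and uses all_gather so every rank receives the result):

```python
import torch
import torch.distributed as dist

def gather_outputs(output, dim=0, dst=0):
    """Recursively gather tensors (or dicts of tensors) across ranks.

    Every rank must reach the collective call; if some ranks skip it,
    the others block inside all_gather and the job deadlocks.
    """
    if isinstance(output, torch.Tensor):
        gathered = [torch.empty_like(output) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, output)  # collective: all ranks participate
        return torch.cat(gathered, dim=dim)
    if isinstance(output, dict):
        # Recurse on every rank, not just dst, so nested collectives stay in sync.
        return {k: gather_outputs(v, dim, dst) for k, v in output.items()}
    return output
```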
```python
def enable_multi_gpu(
    self,
    mode: str = "auto",
    device_map: Optional[Dict[str, str]] = None,
    tensor_parallel_layers: Optional[List[str]] = None,
):
```
The enable_multi_gpu function has some inconsistencies that could confuse users:

- The imports MultiGPUPipeline and enable_multi_gpu from ..distributed on line 347 are unused within this function.
- The function's docstring states it supports the auto, model, tensor, and data modes. However, the implementation only contains logic for mode="model"; the other modes will silently do nothing.

This discrepancy between documentation and behavior can lead to unexpected results. Please consider either:

- Removing the unused imports and updating the docstring to clarify that this method only handles model parallelism, or
- Refactoring this method to correctly handle all documented modes, possibly by delegating to the more comprehensive logic in diffsynth/distributed/multi_gpu.py (a rough sketch of such a dispatch follows).
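A possible shape for that delegation, purely as a sketch (the apply_* helper names and the import path are assumptions, not the PR's actual API):

```python
def enable_multi_gpu(self, mode="auto", device_map=None, tensor_parallel_layers=None):
    # Sketch: dispatch every documented mode instead of silently ignoring non-"model" ones.
    from .distributed import multi_gpu  # hypothetical import; adjust to the real module layout

    if mode in ("auto", "model"):
        multi_gpu.apply_model_parallel(self, device_map=device_map)      # hypothetical helper
    elif mode == "data":
        multi_gpu.apply_data_parallel(self)                              # hypothetical helper
    elif mode == "tensor":
        multi_gpu.apply_tensor_parallel(self, layers=tensor_parallel_layers)  # hypothetical helper
    else:
        raise ValueError(f"Unknown multi-GPU mode: {mode!r}")
    return self
```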
```python
import torch_npu
torch.npu.set_device(local_rank)
_DEVICE = torch.device(f"npu:{local_rank}")
```
The import of torch_npu is not guarded. If device_type is 'npu' but the torch_npu package is not installed, this will raise an ImportError and crash the program. It's better to wrap this import in a try...except ImportError block to handle this case gracefully and provide a clear error message to the user.
```diff
- import torch_npu
- torch.npu.set_device(local_rank)
- _DEVICE = torch.device(f"npu:{local_rank}")
+ try:
+     import torch_npu
+     torch.npu.set_device(local_rank)
+     _DEVICE = torch.device(f"npu:{local_rank}")
+ except ImportError:
+     raise ImportError("NPU device type requested, but torch_npu is not installed.")
```
```python
best_gpu = min(
    range(num_gpus),
    key=lambda i: gpu_usage[i] if gpu_usage[i] + size <= max_memory_per_gpu.get(i, float('inf')) else float('inf')
)
```
The logic to find the best GPU is a bit complex and hard to read. Using float('inf') to disqualify GPUs that don't have enough memory is clever, but it makes the key function less intuitive. Consider refactoring this for clarity, for example by pre-filtering the list of available GPUs.
```python
# Find a GPU that has enough space
available_gpus = [
    i for i in range(num_gpus)
    if gpu_usage[i] + size <= max_memory_per_gpu.get(i, float('inf'))
]
if not available_gpus:
    # No GPU has enough space, use CPU offload
    device_map[name] = "cpu"
    continue
# Assign to the available GPU with the least usage
best_gpu = min(available_gpus, key=lambda i: gpu_usage[i])
```

Quoted code for the next comment (the layer parallelization check):

```python
layer.in_features % world_size == 0 and
layer.out_features % world_size == 0
```
The condition to decide whether a layer should be parallelized is too strict. It requires both in_features and out_features to be divisible by world_size.
However, for column parallelism, only out_features needs to be divisible, and for row parallelism, only in_features is required to be divisible. Since TensorParallelLinear automatically chooses the parallelism mode based on which dimension is larger, the check should be relaxed to match this logic. This will enable tensor parallelism for more layers.
```diff
- layer.in_features % world_size == 0 and
- layer.out_features % world_size == 0
+ ((layer.out_features >= layer.in_features and layer.out_features % world_size == 0) or
+  (layer.out_features < layer.in_features and layer.in_features % world_size == 0))
```
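To illustrate why only one dimension needs to be divisible, here is a generic sketch of the two split directions (not the PR's TensorParallelLinear implementation):

```python
import torch.nn as nn

def shard_linear(layer: nn.Linear, world_size: int, rank: int) -> nn.Linear:
    # Column parallelism shards the output dimension: each rank computes a slice
    # of the outputs, so only out_features must divide evenly by world_size.
    # Row parallelism shards the input dimension: each rank consumes a slice of
    # the inputs (partial results are summed via all_reduce), so only in_features
    # must divide evenly.
    if layer.out_features >= layer.in_features:
        assert layer.out_features % world_size == 0
        shard = nn.Linear(layer.in_features, layer.out_features // world_size,
                          bias=layer.bias is not None)
        # nn.Linear weight has shape (out_features, in_features): split dim 0.
        shard.weight.data.copy_(layer.weight.data.chunk(world_size, dim=0)[rank])
        if layer.bias is not None:
            shard.bias.data.copy_(layer.bias.data.chunk(world_size, dim=0)[rank])
    else:
        assert layer.in_features % world_size == 0
        # Row parallel: split dim 1 of the weight; the bias is typically added
        # once after the all_reduce, so it is omitted from the shard here.
        shard = nn.Linear(layer.in_features // world_size, layer.out_features, bias=False)
        shard.weight.data.copy_(layer.weight.data.chunk(world_size, dim=1)[rank])
    return shard
```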
Summary
This PR adds comprehensive multi-GPU support for DiffSynth-Studio pipelines, enabling efficient distributed inference across multiple GPUs.
New Features
New distributed module (diffsynth/distributed/):

- parallel.py
- tensor_parallel.py
- data_parallel.py
- multi_gpu.py

Three parallelism modes supported:

- Model Parallel: Distribute different models to different GPUs
- Data Parallel: Same model on all GPUs, process different batches
- Tensor Parallel: Split large layers across GPUs
New methods in BasePipeline:
- enable_multi_gpu(mode, device_map, tensor_parallel_layers)
- get_model_distribution()
- print_model_distribution()

Usage Examples
For Data Parallel and Tensor Parallel, use with torchrun (see the sketch below).
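As an illustration of running under torchrun, e.g. `torchrun --nproc_per_node=4 examples/multi_gpu_inference.py` (a generic sketch; the real example script may be structured differently):

```python
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for every spawned process.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Initialize the default process group; NCCL is the usual backend for GPUs.
    dist.init_process_group(backend="nccl")

    # ... build the pipeline, call enable_multi_gpu(mode="data") or mode="tensor",
    # and run inference on this rank's slice of the batch ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```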
Files Changed
diffsynth/__init__.py- Export distributed modulediffsynth/diffusion/base_pipeline.py- Add multi-GPU methodsdiffsynth/distributed/- New distributed module (5 files)examples/multi_gpu_inference.py- Usage example