feat: Add multi-GPU support with data/model/tensor parallelism #1153
Conversation
This PR fixes several issues that prevent DiffSynth-Studio from running on non-CUDA devices (Apple Silicon MPS and CPU):

1. base_pipeline.py: Check if empty_cache exists before calling it
   - Only CUDA has torch.cuda.empty_cache(); MPS and CPU don't have this method
2. siglip2_image_encoder.py: Remove hardcoded device="cuda" default
   - Now auto-detects the device from the model parameters
   - Falls back to the specified device if provided
3. dinov3_image_encoder.py: Remove hardcoded device="cuda" default
   - Same fix as siglip2_image_encoder.py
4. vram/layers.py: Check if mem_get_info exists before calling it
   - Only CUDA and NPU have mem_get_info()
   - For MPS/CPU, assume enough memory is available

These changes enable running Qwen-Image pipelines on Apple Silicon Macs and CPU-only machines without requiring any monkey-patching workarounds.
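For illustration, a minimal sketch of the guard pattern these fixes rely on (the helper names below are hypothetical, not the actual code in base_pipeline.py or vram/layers.py):

```python
from typing import Optional
import torch

def clear_device_cache(device: torch.device) -> None:
    # torch.cuda and torch.mps provide empty_cache(); plain CPU does not,
    # so only call it when the backend actually exposes the method.
    backend = getattr(torch, device.type, None)
    if backend is not None and hasattr(backend, "empty_cache"):
        backend.empty_cache()

def resolve_device(model: torch.nn.Module, device: Optional[str] = None) -> torch.device:
    # Prefer an explicitly requested device; otherwise follow the model's own
    # parameters instead of assuming "cuda".
    if device is not None:
        return torch.device(device)
    return next(model.parameters()).device

def free_memory_bytes(device: torch.device) -> float:
    # Only CUDA (and NPU) expose mem_get_info(); on MPS/CPU assume enough memory.
    backend = getattr(torch, device.type, None)
    if backend is not None and hasattr(backend, "mem_get_info"):
        free, _total = backend.mem_get_info(device)
        return float(free)
    return float("inf")
```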
This commit adds comprehensive multi-GPU support for DiffSynth-Studio pipelines, enabling efficient distributed inference across multiple GPUs.

New distributed module (diffsynth/distributed/):
- parallel.py: Basic distributed utilities (init, barrier, broadcast, etc.)
- tensor_parallel.py: Column/row parallel linear layers for tensor parallelism
- data_parallel.py: Data parallel utilities for batch processing
- multi_gpu.py: High-level multi-GPU pipeline wrapper

Three parallelism modes supported:
1. Model Parallel: Distribute different models to different GPUs
   - Automatically balances load based on model size
   - Custom device map support for fine-grained control
2. Data Parallel: Same model on all GPUs, process different batches
   - Scatter/gather utilities for batch distribution
   - Compatible with the torchrun launcher
3. Tensor Parallel: Split large layers across GPUs
   - Column-parallel and row-parallel linear layers
   - Automatic layer selection based on feature size

New methods in BasePipeline:
- enable_multi_gpu(mode, device_map, tensor_parallel_layers)
- get_model_distribution()
- print_model_distribution()

Example script: examples/multi_gpu_inference.py
- Demonstrates all three parallelism modes
- Includes a custom device map example
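A minimal usage sketch of the new BasePipeline methods (the `load_pipeline` helper and the device-map keys are placeholders, not names from this PR):

```python
# Placeholder construction: build the pipeline however it is normally created.
pipe = load_pipeline(...)

# Model parallelism with automatic placement balanced by model size.
pipe.enable_multi_gpu(mode="model")
pipe.print_model_distribution()

# Or pin individual models to specific devices with a custom device map
# (the keys "dit", "text_encoder", "vae" are illustrative).
pipe.enable_multi_gpu(
    mode="model",
    device_map={"dit": "cuda:0", "text_encoder": "cuda:1", "vae": "cuda:1"},
)
print(pipe.get_model_distribution())
```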
Summary of Changes

Hello @longkeyy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances DiffSynth-Studio by integrating robust multi-GPU capabilities. It provides flexible strategies for distributed inference, allowing users to leverage multiple GPUs for improved performance and scalability. The new distributed module supports model, data, and tensor parallelism, catering to use cases ranging from fitting large models into memory to accelerating batch processing. This change aims to make the framework more efficient and accessible for high-performance computing environments.
Code Review
This pull request introduces comprehensive multi-GPU support, a significant and well-structured feature. The new distributed module provides clean abstractions for model, data, and tensor parallelism. The changes to existing code to accommodate multi-GPU environments are also thoughtful. My review focuses on a few key areas to enhance robustness and clarity: a critical bug in the data parallelism logic that could cause deadlocks, a couple of high-severity issues related to an unguarded import and an incomplete API, and a medium-severity issue in the tensor parallelism logic that limits its applicability. Addressing these points will make this excellent contribution even more solid.
```python
if rank == dst:
    return {k: gather_outputs(v, dim, dst) for k, v in output.items()}
return output
```
There is a critical bug in the gather_outputs function when handling dictionaries. The recursive call to gather_outputs for the dictionary values is only performed on the destination rank (dst). Since gather_outputs may use collective communication operations (like all_gather), which must be called by all processes, this will cause a deadlock when ranks other than dst do not participate.
To fix this, the recursive call must be executed on all ranks.
```diff
- if rank == dst:
-     return {k: gather_outputs(v, dim, dst) for k, v in output.items()}
- return output
+ # Recursively gather each value in the dictionary.
+ # This must be executed on all ranks to avoid deadlocks in collective calls.
+ return {k: gather_outputs(v, dim, dst) for k, v in output.items()}
```
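For context, a minimal sketch of the gather pattern the fix relies on (not the PR's actual implementation; it assumes torch.distributed is already initialized and uses all_gather so every rank receives the result):

```python
import torch
import torch.distributed as dist

def gather_outputs(output, dim=0, dst=0):
    """Recursively gather tensors (or dicts of tensors) across ranks.

    Every rank must reach the collective call; if some ranks skip it,
    the others block inside all_gather and the job deadlocks.
    """
    if isinstance(output, torch.Tensor):
        gathered = [torch.empty_like(output) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, output)  # collective: all ranks participate
        return torch.cat(gathered, dim=dim)
    if isinstance(output, dict):
        # Recurse on every rank, not just dst, so nested collectives stay in sync.
        return {k: gather_outputs(v, dim, dst) for k, v in output.items()}
    return output
```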
```python
def enable_multi_gpu(
    self,
    mode: str = "auto",
    device_map: Optional[Dict[str, str]] = None,
    tensor_parallel_layers: Optional[List[str]] = None,
):
```
The enable_multi_gpu function has some inconsistencies that could confuse users:

- The imports MultiGPUPipeline and enable_multi_gpu from ..distributed on line 347 are unused within this function.
- The function's docstring states it supports the auto, model, tensor, and data modes. However, the implementation only contains logic for mode="model"; the other modes will silently do nothing.

This discrepancy between documentation and behavior can lead to unexpected results. Please consider either:

- Removing the unused imports and updating the docstring to clarify that this method only handles model parallelism, or
- Refactoring this method to correctly handle all documented modes, possibly by delegating to the more comprehensive logic in diffsynth/distributed/multi_gpu.py (a rough sketch of such a dispatch follows).
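A possible shape for that delegation, purely as a sketch (the apply_* helper names and the import path are assumptions, not the PR's actual API):

```python
def enable_multi_gpu(self, mode="auto", device_map=None, tensor_parallel_layers=None):
    # Sketch: dispatch every documented mode instead of silently ignoring non-"model" ones.
    from .distributed import multi_gpu  # hypothetical import; adjust to the real module layout

    if mode in ("auto", "model"):
        multi_gpu.apply_model_parallel(self, device_map=device_map)      # hypothetical helper
    elif mode == "data":
        multi_gpu.apply_data_parallel(self)                              # hypothetical helper
    elif mode == "tensor":
        multi_gpu.apply_tensor_parallel(self, layers=tensor_parallel_layers)  # hypothetical helper
    else:
        raise ValueError(f"Unknown multi-GPU mode: {mode!r}")
    return self
```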
```python
import torch_npu
torch.npu.set_device(local_rank)
_DEVICE = torch.device(f"npu:{local_rank}")
```
The import of torch_npu is not guarded. If device_type is 'npu' but the torch_npu package is not installed, this will raise an ImportError and crash the program. It's better to wrap this import in a try...except ImportError block to handle this case gracefully and provide a clear error message to the user.
```diff
- import torch_npu
- torch.npu.set_device(local_rank)
- _DEVICE = torch.device(f"npu:{local_rank}")
+ try:
+     import torch_npu
+     torch.npu.set_device(local_rank)
+     _DEVICE = torch.device(f"npu:{local_rank}")
+ except ImportError:
+     raise ImportError("NPU device type requested, but torch_npu is not installed.")
```
```python
best_gpu = min(
    range(num_gpus),
    key=lambda i: gpu_usage[i] if gpu_usage[i] + size <= max_memory_per_gpu.get(i, float('inf')) else float('inf')
)
```
The logic to find the best GPU is a bit complex and hard to read. Using float('inf') to disqualify GPUs that don't have enough memory is clever, but it makes the key function less intuitive. Consider refactoring this for clarity, for example by pre-filtering the list of available GPUs.
```python
# Find a GPU that has enough space
available_gpus = [
    i for i in range(num_gpus)
    if gpu_usage[i] + size <= max_memory_per_gpu.get(i, float('inf'))
]
if not available_gpus:
    # No GPU has enough space, use CPU offload
    device_map[name] = "cpu"
    continue
# Assign to the available GPU with the least usage
best_gpu = min(available_gpus, key=lambda i: gpu_usage[i])
```

Quoted code for the next comment (the layer parallelization check):

```python
layer.in_features % world_size == 0 and
layer.out_features % world_size == 0
```
The condition to decide whether a layer should be parallelized is too strict. It requires both in_features and out_features to be divisible by world_size.
However, for column parallelism, only out_features needs to be divisible, and for row parallelism, only in_features is required to be divisible. Since TensorParallelLinear automatically chooses the parallelism mode based on which dimension is larger, the check should be relaxed to match this logic. This will enable tensor parallelism for more layers.
```diff
- layer.in_features % world_size == 0 and
- layer.out_features % world_size == 0
+ ((layer.out_features >= layer.in_features and layer.out_features % world_size == 0) or
+  (layer.out_features < layer.in_features and layer.in_features % world_size == 0))
```
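To illustrate why only one dimension needs to be divisible, here is a generic sketch of the two split directions (not the PR's TensorParallelLinear implementation):

```python
import torch.nn as nn

def shard_linear(layer: nn.Linear, world_size: int, rank: int) -> nn.Linear:
    # Column parallelism shards the output dimension: each rank computes a slice
    # of the outputs, so only out_features must divide evenly by world_size.
    # Row parallelism shards the input dimension: each rank consumes a slice of
    # the inputs (partial results are summed via all_reduce), so only in_features
    # must divide evenly.
    if layer.out_features >= layer.in_features:
        assert layer.out_features % world_size == 0
        shard = nn.Linear(layer.in_features, layer.out_features // world_size,
                          bias=layer.bias is not None)
        # nn.Linear weight has shape (out_features, in_features): split dim 0.
        shard.weight.data.copy_(layer.weight.data.chunk(world_size, dim=0)[rank])
        if layer.bias is not None:
            shard.bias.data.copy_(layer.bias.data.chunk(world_size, dim=0)[rank])
    else:
        assert layer.in_features % world_size == 0
        # Row parallel: split dim 1 of the weight; the bias is typically added
        # once after the all_reduce, so it is omitted from the shard here.
        shard = nn.Linear(layer.in_features // world_size, layer.out_features, bias=False)
        shard.weight.data.copy_(layer.weight.data.chunk(world_size, dim=1)[rank])
    return shard
```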
Summary
This PR adds comprehensive multi-GPU support for DiffSynth-Studio pipelines, enabling efficient distributed inference across multiple GPUs.
New Features
New distributed module (diffsynth/distributed/):

- parallel.py
- tensor_parallel.py
- data_parallel.py
- multi_gpu.py

Three parallelism modes supported:

- Model Parallel: Distribute different models to different GPUs
- Data Parallel: Same model on all GPUs, process different batches
- Tensor Parallel: Split large layers across GPUs
New methods in BasePipeline:
- enable_multi_gpu(mode, device_map, tensor_parallel_layers)
- get_model_distribution()
- print_model_distribution()

Usage Examples
For Data Parallel and Tensor Parallel, use with torchrun (see the sketch below).
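As an illustration of running under torchrun, e.g. `torchrun --nproc_per_node=4 examples/multi_gpu_inference.py` (a generic sketch; the real example script may be structured differently):

```python
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for every spawned process.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Initialize the default process group; NCCL is the usual backend for GPUs.
    dist.init_process_group(backend="nccl")

    # ... build the pipeline, call enable_multi_gpu(mode="data") or mode="tensor",
    # and run inference on this rank's slice of the batch ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```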
Files Changed
diffsynth/__init__.py- Export distributed modulediffsynth/diffusion/base_pipeline.py- Add multi-GPU methodsdiffsynth/distributed/- New distributed module (5 files)examples/multi_gpu_inference.py- Usage example