
Conversation

@longkeyy

Summary

This PR adds comprehensive multi-GPU support for DiffSynth-Studio pipelines, enabling efficient distributed inference across multiple GPUs.

New Features

New distributed module (diffsynth/distributed/):

parallel.py: Basic distributed utilities (init, barrier, broadcast, all_reduce, all_gather)
tensor_parallel.py: Column/row parallel linear layers for tensor parallelism
data_parallel.py: Data parallel utilities for batch processing
multi_gpu.py: High-level multi-GPU pipeline wrapper

Three parallelism modes supported:

  1. Model Parallel: Distribute different models to different GPUs

    • Automatically balances load based on model size
    • Custom device map support for fine-grained control
  2. Data Parallel: Same model on all GPUs, process different batches

    • Scatter/gather utilities for batch distribution
    • Compatible with torchrun launcher
  3. Tensor Parallel: Split large layers across GPUs (a minimal sketch follows after this list)

    • Column-parallel and row-parallel linear layers
    • Automatic layer selection based on feature size
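
To make the tensor-parallel idea concrete, here is a minimal sketch of a column-parallel linear layer built directly on torch.distributed. The class name and structure are illustrative only and are not taken from the PR's TensorParallelLinear; it assumes a process group has already been initialized (e.g. via torchrun).

import torch
import torch.nn as nn
import torch.distributed as dist

class ColumnParallelLinearSketch(nn.Module):
    """Each rank holds a shard of the weight along out_features; the full
    output is reassembled with an all_gather (inference-only sketch)."""
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0, "out_features must be divisible by world_size"
        # Local shard: full in_features, a slice of out_features.
        self.local = nn.Linear(in_features, out_features // world_size, bias=bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_out = self.local(x)
        shards = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(shards, local_out.contiguous())
        # Concatenate the per-rank shards back into the full output features.
        return torch.cat(shards, dim=-1)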

New methods in BasePipeline:

  • enable_multi_gpu(mode, device_map, tensor_parallel_layers)
  • get_model_distribution()
  • print_model_distribution()

Usage Examples

# Model Parallel - auto distribution
pipe = QwenImagePipeline.from_pretrained(...)
pipe.enable_multi_gpu(mode="model")

# Model Parallel - custom device map
pipe.enable_multi_gpu(
    mode="model",
    device_map={
        "dit": "cuda:0",
        "text_encoder": "cuda:1",
        "vae": "cuda:1",
    }
)

# Check distribution
pipe.print_model_distribution()

For Data Parallel and Tensor Parallel, use with torchrun:

torchrun --nproc_per_node=2 script.py --mode data
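
As a rough illustration of the data-parallel pattern, here is a minimal driver sketch built on plain torch.distributed rather than the PR's data_parallel utilities; the prompt list and the commented-out pipeline calls are placeholders.

import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")
    rank, world_size = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    prompts = ["a cat", "a dog", "a bird", "a fish"]  # placeholder batch
    local_prompts = prompts[rank::world_size]         # each rank handles its slice

    # pipe = QwenImagePipeline.from_pretrained(...)   # same model on every rank
    # images = [pipe(prompt=p) for p in local_prompts]

    dist.barrier()  # ensure all ranks finish before results are gathered or saved
    dist.destroy_process_group()

if __name__ == "__main__":
    main()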

Test Plan

  • Model parallel mode tested with 2 GPUs
  • Custom device map functionality verified
  • Backward compatible - no changes to existing single-GPU workflows
  • Data parallel mode with torchrun
  • Tensor parallel mode with torchrun

Files Changed

  • diffsynth/__init__.py - Export distributed module
  • diffsynth/diffusion/base_pipeline.py - Add multi-GPU methods
  • diffsynth/distributed/ - New distributed module (5 files)
  • examples/multi_gpu_inference.py - Usage example

Non-CUDA Device Fixes

This PR also fixes several issues that prevented DiffSynth-Studio from running on non-CUDA devices (Apple Silicon MPS and CPU):

1. base_pipeline.py: Check if empty_cache exists before calling it
   - Only CUDA has torch.cuda.empty_cache()
   - MPS and CPU don't have this method

2. siglip2_image_encoder.py: Remove hardcoded device="cuda" default
   - Auto-detects the device from the model parameters when none is specified
   - Uses the explicitly specified device when one is provided

3. dinov3_image_encoder.py: Remove hardcoded device="cuda" default
   - Same fix as siglip2_image_encoder.py

4. vram/layers.py: Check if mem_get_info exists before calling it
   - Only CUDA and NPU have mem_get_info()
   - For MPS/CPU, assume enough memory is available

These changes enable running Qwen-Image pipelines on Apple Silicon Macs
and CPU-only machines without requiring any monkey-patching workarounds.
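
The empty_cache guard described in item 1 can be pictured roughly as follows; this is an illustrative sketch, not the actual code in base_pipeline.py, and the helper name is made up here.

import torch

def clear_device_cache(device: str) -> None:
    device_type = device.split(":")[0]           # e.g. "cuda:0" -> "cuda"
    backend = getattr(torch, device_type, None)  # torch.cuda, torch.mps, torch.npu, ...
    # Only call empty_cache() when the backend actually provides it; on CPU
    # (and backends without the method) the call is simply skipped.
    if backend is not None and hasattr(backend, "empty_cache"):
        backend.empty_cache()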
@gemini-code-assist
Contributor

Summary of Changes

Hello @longkeyy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances DiffSynth-Studio by integrating robust multi-GPU capabilities. It provides flexible strategies for distributed inference, allowing users to leverage multiple GPUs for improved performance and scalability. The new distributed module supports model, data, and tensor parallelism, catering to various use cases from fitting large models into memory to accelerating batch processing. This change aims to make the framework more efficient and accessible for high-performance computing environments.

Highlights

  • Comprehensive Multi-GPU Support: Introduces a new diffsynth/distributed/ module to enable efficient distributed inference across multiple GPUs for DiffSynth-Studio pipelines.
  • Three Parallelism Modes: Implements support for Model Parallelism (distributing different models to different GPUs), Data Parallelism (processing different batches with the same model on all GPUs), and Tensor Parallelism (splitting large layers across GPUs).
  • Enhanced Pipeline API: Adds enable_multi_gpu, get_model_distribution, and print_model_distribution methods to BasePipeline for easy configuration and monitoring of multi-GPU setups.
  • Automatic Device Mapping: Includes functionality to automatically balance model components across GPUs based on their size, alongside support for fine-grained custom device maps.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces comprehensive multi-GPU support, a significant and well-structured feature. The new distributed module provides clean abstractions for model, data, and tensor parallelism. The changes to existing code to accommodate multi-GPU environments are also thoughtful. My review focuses on a few key areas to enhance robustness and clarity: a critical bug in the data parallelism logic that could cause deadlocks, a couple of high-severity issues related to an unguarded import and an incomplete API, and a medium-severity issue in the tensor parallelism logic that limits its applicability. Addressing these points will make this excellent contribution even more solid.

Comment on lines +115 to +117
if rank == dst:
    return {k: gather_outputs(v, dim, dst) for k, v in output.items()}
return output

critical

There is a critical bug in the gather_outputs function when handling dictionaries. The recursive call to gather_outputs for the dictionary values is only performed on the destination rank (dst). Since gather_outputs may use collective communication operations (like all_gather), which must be called by all processes, this will cause a deadlock when ranks other than dst do not participate.

To fix this, the recursive call must be executed on all ranks.

Suggested change
- if rank == dst:
-     return {k: gather_outputs(v, dim, dst) for k, v in output.items()}
- return output
+ # Recursively gather each value in the dictionary.
+ # This must be executed on all ranks to avoid deadlocks in collective calls.
+ return {k: gather_outputs(v, dim, dst) for k, v in output.items()}

Comment on lines +325 to +330
def enable_multi_gpu(
    self,
    mode: str = "auto",
    device_map: Optional[Dict[str, str]] = None,
    tensor_parallel_layers: Optional[List[str]] = None,
):

high

The enable_multi_gpu function has some inconsistencies that could confuse users:

  1. The imports MultiGPUPipeline and enable_multi_gpu from ..distributed on line 347 are unused within this function.
  2. The function's docstring states it supports auto, model, tensor, and data modes. However, the implementation only contains logic for mode="model". Other modes will silently do nothing.

This discrepancy between documentation and behavior can lead to unexpected results. Please consider either:

  • Removing the unused imports and updating the docstring to clarify that this method only handles model parallelism.
  • Refactoring this method to correctly handle all documented modes, possibly by delegating to the more comprehensive logic in diffsynth/distributed/multi_gpu.py.

Comment on lines +101 to +103
import torch_npu
torch.npu.set_device(local_rank)
_DEVICE = torch.device(f"npu:{local_rank}")

high

The import of torch_npu is not guarded. If device_type is 'npu' but the torch_npu package is not installed, this will raise an ImportError and crash the program. It's better to wrap this import in a try...except ImportError block to handle this case gracefully and provide a clear error message to the user.

Suggested change
- import torch_npu
- torch.npu.set_device(local_rank)
- _DEVICE = torch.device(f"npu:{local_rank}")
+ try:
+     import torch_npu
+     torch.npu.set_device(local_rank)
+     _DEVICE = torch.device(f"npu:{local_rank}")
+ except ImportError:
+     raise ImportError("NPU device type requested, but torch_npu is not installed.")

Comment on lines +120 to +123
best_gpu = min(
    range(num_gpus),
    key=lambda i: gpu_usage[i] if gpu_usage[i] + size <= max_memory_per_gpu.get(i, float('inf')) else float('inf')
)

medium

The logic to find the best GPU is a bit complex and hard to read. Using float('inf') to disqualify GPUs that don't have enough memory is clever, but it makes the key function less intuitive. Consider refactoring this for clarity, for example by pre-filtering the list of available GPUs.

        # Find a GPU that has enough space
        available_gpus = [i for i in range(num_gpus) if gpu_usage[i] + size <= max_memory_per_gpu.get(i, 0)]

        if not available_gpus:
            # No GPU has enough space, use CPU offload
            device_map[name] = "cpu"
            continue

        # Assign to the available GPU with the least usage
        best_gpu = min(available_gpus, key=lambda i: gpu_usage[i])

Comment on lines +495 to +496
layer.in_features % world_size == 0 and
layer.out_features % world_size == 0

medium

The condition to decide whether a layer should be parallelized is too strict. It requires both in_features and out_features to be divisible by world_size.

However, for column parallelism, only out_features needs to be divisible, and for row parallelism, only in_features is required to be divisible. Since TensorParallelLinear automatically chooses the parallelism mode based on which dimension is larger, the check should be relaxed to match this logic. This will enable tensor parallelism for more layers.

Suggested change
- layer.in_features % world_size == 0 and
- layer.out_features % world_size == 0
+ ((layer.out_features >= layer.in_features and layer.out_features % world_size == 0) or
+  (layer.out_features < layer.in_features and layer.in_features % world_size == 0))
