
Conversation

arrdel (Contributor) commented on Dec 6, 2025

Fixes #42661

Problem

When using TP with a user-provided parallelism_config, the Trainer overwrites the entire config object, discarding the user's settings.

Solution

  • Update only the tp_size attribute when a config is provided
  • Create a new config only when none is provided
  • Preserve all user-provided parallelism settings

Changes

  • Fixed logic in trainer.py lines 5074-5087
  • Removed redundant nested checks
  • User-provided settings (dp_size, pp_size, cp_backend, etc.) are now preserved

Fixes huggingface#42661

When using TP with a user-provided parallelism_config, the Trainer was
incorrectly overwriting the entire config object with a new
ParallelismConfig(tp_size=model.tp_size), discarding all user-provided
settings (dp_size, pp_size, cp_backend, etc.).

Changes:
- If the user provides a parallelism_config, update only its tp_size attribute
- If no config is provided, create a new ParallelismConfig with tp_size
- Removed redundant nested condition check
- Fixed logic flow to check accelerate version first

This ensures user-provided parallelism configurations are preserved
during TP-only training.
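
For illustration, here is a self-contained toy sketch of the old vs. new behavior. This is not the actual trainer.py or accelerate code: the ParallelismConfig stand-in and the resolve_parallelism_config helper are hypothetical, with field names taken from the PR description.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ParallelismConfig:
        # Toy stand-in for accelerate.ParallelismConfig; field names follow the PR text.
        tp_size: int = 1
        dp_size: int = 1
        pp_size: int = 1
        cp_backend: Optional[str] = None

    def resolve_parallelism_config(
        user_config: Optional[ParallelismConfig], model_tp_size: int
    ) -> ParallelismConfig:
        """Fixed behavior: reuse the user's config and only sync tp_size."""
        if user_config is not None:
            # Old behavior replaced user_config with ParallelismConfig(tp_size=model_tp_size),
            # silently dropping dp_size, pp_size, cp_backend, etc.
            user_config.tp_size = model_tp_size
            return user_config
        # Only create a fresh config when the user did not provide one.
        return ParallelismConfig(tp_size=model_tp_size)

    cfg = ParallelismConfig(dp_size=2, cp_backend="gloo")
    resolved = resolve_parallelism_config(cfg, model_tp_size=4)
    assert resolved is cfg and resolved.tp_size == 4 and resolved.dp_size == 2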
Copilot AI review requested due to automatic review settings December 6, 2025 02:03

Copilot AI left a comment

Pull request overview

This PR fixes a bug where the Trainer was discarding user-provided parallelism_config settings when Tensor Parallelism (TP) was enabled. Previously, the entire config object was being overwritten; now only the tp_size attribute is updated while preserving other user settings.

Key changes:

  • Fixed conditional logic to update tp_size in existing user configs instead of creating a new config
  • Restructured the version checks to be clearer and avoid redundant nesting

Comment on lines +5078 to +5085

    if self.args.parallelism_config is not None:
        # Update tp_size in user-provided config instead of overwriting it
        self.args.parallelism_config.tp_size = self.model.tp_size
    else:
        # Only create new config if user didn't provide one
        from accelerate import ParallelismConfig

        args["parallelism_config"] = ParallelismConfig(tp_size=self.model.tp_size)

(This block sits inside the existing accelerate-version check; when the installed accelerate is too old, the code still raises ValueError("Requires accelerate>1.10.1 to use Tensor Parallelism.").)

Copilot AI commented on Dec 6, 2025

This bug fix should have test coverage to ensure that user-provided parallelism_config objects are preserved when TP is enabled. Consider adding a test that:

  1. Creates a ParallelismConfig with custom settings (e.g., dp_size, cp_size)
  2. Passes it to TrainingArguments along with a model that has tp_size > 1
  3. Verifies that after trainer initialization, the parallelism_config still has the original custom settings AND the updated tp_size
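
A minimal sketch of such a test, assuming pytest, a multi-device TP-capable environment, and that accelerate's ParallelismConfig accepts a cp_size argument; the test name, tiny checkpoint, and exact field names are illustrative and not part of this PR.

    # Hypothetical test sketch; adjust field names to the installed accelerate version.
    from accelerate import ParallelismConfig
    from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

    def test_user_parallelism_config_preserved(tmp_path):
        # 1. A config with a custom, non-TP setting that must survive Trainer init.
        user_config = ParallelismConfig(cp_size=2)

        # 2. Pass it through TrainingArguments together with a TP-sharded model.
        args = TrainingArguments(output_dir=str(tmp_path), parallelism_config=user_config)
        model = AutoModelForCausalLM.from_pretrained(
            "hf-internal-testing/tiny-random-LlamaForCausalLM", tp_plan="auto"
        )
        trainer = Trainer(model=model, args=args)

        # 3. The original object is kept; only tp_size is synced to the model.
        assert trainer.args.parallelism_config is user_config
        assert trainer.args.parallelism_config.cp_size == 2
        assert trainer.args.parallelism_config.tp_size == model.tp_size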

Development

Successfully merging this pull request may close these issues.

Bug in processing of self.args.parallelism_config inside Trainer
