Multi stage pipeline parallelism support #418
base: main
Conversation
…elism support. (WIP)
…odel. Also made None returns more visible in get_module_class_from_name().
…ith interleaved 1F1B.
…tack traces/views).
- Switched loss comparisons from abs=1e-16 to rel=1e-2. It still needs to be investigated why this is necessary for some configurations (see the sketch below).
- Additional configs and test setups, currently commented out due to their long runtime.
- Easier configurability of expected checkpoint paths (for debugging/experimenting).
- Better error logging.
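For reference, a minimal sketch of the difference between the two tolerance modes, assuming the comparison uses pytest.approx (the loss values below are made up):

```python
import pytest

# Hypothetical losses from a reference run and a resumed/parallelized run.
reference_loss = 10.4321
recomputed_loss = 10.4305

# Old check: an absolute tolerance of 1e-16 fails for even tiny float drift.
assert recomputed_loss != pytest.approx(reference_loss, abs=1e-16)

# New check: a relative tolerance of 1e-2 allows up to ~1% deviation and passes here.
assert recomputed_loss == pytest.approx(reference_loss, rel=1e-2)
```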
else:
    assert len(app_state.model_parts) == 1, "Expected a single model part for non-OptimizersList optimizer."
    sd = get_optimizer_state_dict(
        model=app_state.model_parts[0],
        optimizers=app_state.optimizer,
        # NOTE: Flattening is required for pipeline parallelism to work correctly.
        # see https://github.com/pytorch/torchtitan/blob/b291ad662493b63d25b038a30a915082d3617baf/torchtitan/components/checkpoint.py#L193-L214
        options=StateDictOptions(flatten_optimizer_state_dict=True),
    )
Should we remove this, since in the case of PP we now always have an OptimizersList, which takes care of the flattening?
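Purely to illustrate the question: if flattening is only ever needed for PP, and PP is now always served by OptimizersList, the single-optimizer branch could look like the sketch below. This is hypothetical only and not the actual modalities implementation.

```python
# Hypothetical simplification of the non-OptimizersList branch:
# no flatten_optimizer_state_dict needed for the single-optimizer (non-PP) case.
sd = get_optimizer_state_dict(
    model=app_state.model_parts[0],
    optimizers=app_state.optimizer,
)
```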
@model_validator(mode="before")
@classmethod
def warn_deprecated_alias(cls, data: Any) -> Any:
    if isinstance(data, dict) and "wrapped_model" in data:
        warnings.warn(
            "Field 'wrapped_model' is deprecated. Use 'wrapped_model_or_parts' instead.",
            DeprecationWarning,
            stacklevel=3,
        )
    return data
Should we use this deprecation warning? If yes, should we also use it in other configs where a field got renamed to its plural form?
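If we keep it and want to reuse it for other pluralized fields, one option is a small shared helper along these lines; this is only a sketch, and the helper name and call sites are assumptions:

```python
import warnings
from typing import Any


def warn_renamed_field(data: Any, old: str, new: str) -> Any:
    """Warn when a deprecated (singular) field name is still present in a raw config dict."""
    if isinstance(data, dict) and old in data:
        warnings.warn(
            f"Field '{old}' is deprecated. Use '{new}' instead.",
            DeprecationWarning,
            stacklevel=3,
        )
    return data
```

Each affected config's `@model_validator(mode="before")` would then just return `warn_renamed_field(data, "wrapped_model", "wrapped_model_or_parts")`.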
| # ("gpt2_train_num_steps_7_pp.yaml", "gpt2_warm_start_from_step_4_fsdp2.yaml", 4, 2), | ||
| # ("gpt2_train_num_steps_7_pp.yaml", "gpt2_warm_start_from_step_4_fsdp2_grad_accu.yaml", 4, 2), | ||
| # ("gpt2_train_num_steps_7_tp.yaml", "gpt2_warm_start_from_step_4_fsdp2.yaml", 4, 2), |
These are currently deactivated due to their long runtime. Should we activate them anyway?
The first and the third commented-out configs are the same, right?
I don't think that
("gpt2_train_num_steps_7_pp.yaml", "gpt2_warm_start_from_step_4_fsdp2.yaml", 4, 2),
is necessary since we already test
("gpt2_train_num_steps_7_pp_tp.yaml", "gpt2_warm_start_from_step_4_fsdp2.yaml", 8, 2),
which is the same setup + data parallelism, correct?
And since we have
("gpt2_train_num_steps_7_pp_tp.yaml", "gpt2_warm_start_from_step_4_grad_accu.yaml", 8, 1),
we can probably skip
# ("gpt2_train_num_steps_7_pp.yaml", "gpt2_warm_start_from_step_4_fsdp2_grad_accu.yaml", 4, 2),
Yeah, these configs are mostly useful for debugging with fewer ranks. Probably makes sense to have them turned off (or even delete them in the future).
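Instead of deleting or commenting them out, another option would be tagging the expensive parametrizations so they are skipped by default but still runnable on demand. A sketch, assuming a `slow` marker is registered in the project's pytest configuration and using a placeholder name for the parametrize list:

```python
import pytest

WARM_START_SETTINGS = [
    ("gpt2_train_num_steps_7_pp_tp.yaml", "gpt2_warm_start_from_step_4_fsdp2.yaml", 8, 2),
    # Kept, but only collected when explicitly requested, e.g. `pytest -m slow`.
    pytest.param(
        "gpt2_train_num_steps_7_pp.yaml",
        "gpt2_warm_start_from_step_4_fsdp2.yaml",
        4,
        2,
        marks=pytest.mark.slow,
    ),
]
```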
( # FIXME wpe and drop probably should not get the higher weight
    ["transformer.wte", "transformer.wpe", "transformer.drop"],
    self._input_layer_equivalence,
),
I added this FIXME; does anyone have an opinion on whether I can remove wpe and drop from this list?
rrutmann left a comment
There are some tests failing for me:
/workspaces/modalities/tests/conversion/gpt2/test_conversion_model.py::test_convert_model_checkpoint_produces_same_logits_as_original[gpt2_config_test.yaml-False]
TypeError: check_model_inputs.<locals>.wrapped_fn() got an unexpected keyword argument 'input_ids'
/workspaces/modalities/tests/conversion/gpt2/test_convert_gpt2.py::test_converting_gpt2_does_not_change_outputs[gpt2_config_test.yaml-False]
TypeError: check_model_inputs.<locals>.wrapped_fn() got an unexpected keyword argument 'input_ids'
/workspaces/modalities/tests/fsdp2_parallelization/test_tensor_parallelism.py::TestTensorParallelism::test_tp_sharding[swiglu-fsdp2_config_path1-tp_config_path1]
torch.multiprocessing.spawn.ProcessExitedException: process 2 terminated with signal SIGABRT
As well as an error when collecting one of the tests:
______ ERROR collecting tests/checkpointing/test_checkpoint_conversion.py ______
tests/checkpointing/test_checkpoint_conversion.py:59: in <module>
    @pytest.mark.skipif(
/home/richard-rutmann/.local/lib/python3.11/site-packages/_pytest/mark/structures.py:401: in __call__
    store_mark(unwrapped_func, self.mark, stacklevel=3)
/home/richard-rutmann/.local/lib/python3.11/site-packages/_pytest/mark/structures.py:466: in store_mark
    warnings.warn(MARKED_FIXTURE, stacklevel=stacklevel)
E   pytest.PytestRemovedIn9Warning: Marks applied to fixtures have no effect
E   See docs: https://docs.pytest.org/en/stable/deprecations.html#applying-a-mark-to-a-fixture-function
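Regarding the collection error: pytest ignores marks applied to fixtures, so the skipif at test_checkpoint_conversion.py:59 would need to either move onto the tests that use the fixture or become an explicit skip inside the fixture body. A sketch of the second option; the fixture name and the skip condition are placeholders, since the original ones are not shown here:

```python
import pytest
import torch


@pytest.fixture
def converted_checkpoint():
    # Marks on fixtures have no effect (PytestRemovedIn9Warning),
    # so skip explicitly inside the fixture instead.
    if torch.cuda.device_count() < 1:  # placeholder for the original skipif condition
        pytest.skip("Test requires a CUDA device.")
    ...
```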
| # ("gpt2_train_num_steps_7_pp.yaml", "gpt2_warm_start_from_step_4_fsdp2.yaml", 4, 2), | ||
| # ("gpt2_train_num_steps_7_pp.yaml", "gpt2_warm_start_from_step_4_fsdp2_grad_accu.yaml", 4, 2), | ||
| # ("gpt2_train_num_steps_7_tp.yaml", "gpt2_warm_start_from_step_4_fsdp2.yaml", 4, 2), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first and the third commented-out configs are the same, right?
| # ("gpt2_train_num_steps_7_pp.yaml", "gpt2_warm_start_from_step_4_fsdp2.yaml", 4, 2), | ||
| # ("gpt2_train_num_steps_7_pp.yaml", "gpt2_warm_start_from_step_4_fsdp2_grad_accu.yaml", 4, 2), | ||
| # ("gpt2_train_num_steps_7_tp.yaml", "gpt2_warm_start_from_step_4_fsdp2.yaml", 4, 2), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that
("gpt2_train_num_steps_7_pp.yaml", "gpt2_warm_start_from_step_4_fsdp2.yaml", 4, 2),
is necessary since we already test
("gpt2_train_num_steps_7_pp_tp.yaml", "gpt2_warm_start_from_step_4_fsdp2.yaml", 8, 2),
which is the same setup + data parallelism, correct?
And since we have
("gpt2_train_num_steps_7_pp_tp.yaml", "gpt2_warm_start_from_step_4_grad_accu.yaml", 8, 1),
we can probably skip
# ("gpt2_train_num_steps_7_pp.yaml", "gpt2_warm_start_from_step_4_fsdp2_grad_accu.yaml", 4, 2),
rrutmann left a comment
Great work, thank you. A few tests are failing (see my comment), but aside from that, no major changes required from my side
Also enabled extra="forbid" in BaseModel to prevent accidental extra fields.
Note: Only strings are supported, not more complex path aliases.
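For readers who haven't used it: in Pydantic v2, `extra="forbid"` turns unknown keys into validation errors, and a deprecated name can be carried as a plain string alias. A minimal, illustrative sketch; the field names and types are not the actual modalities configs:

```python
from pydantic import BaseModel, ConfigDict, Field


class ExampleConfig(BaseModel):
    # Unknown keys now raise a ValidationError instead of being silently ignored.
    model_config = ConfigDict(extra="forbid", populate_by_name=True)

    # Plain string alias only; AliasPath/AliasChoices are not covered.
    wrapped_model_or_parts: str = Field(alias="wrapped_model")


ExampleConfig(wrapped_model="gpt2")            # accepted via the deprecated alias
ExampleConfig(wrapped_model_or_parts="gpt2")   # accepted via the new name
# ExampleConfig(typo_field="x")                # would raise a ValidationError
```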
…ecated all aliases created due to multi stage pp.
Co-authored-by: Richard Rutmann <97447451+rrutmann@users.noreply.github.com>
…s in code base. Also added missing deprecation marker for GPT2MFUCalculatorConfig.
What does this PR do?
Adds support for multi stage pipeline parallelism schedules, in particular interleaved 1F1B.
Issue #408
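For context on the schedule itself: interleaved 1F1B places several non-adjacent model chunks on each pipeline rank, shrinking the pipeline bubble compared to plain 1F1B at the cost of more communication. Below is a rough sketch of how such a schedule is driven via torch.distributed.pipelining; it is not the modalities integration, and model splitting, process-group setup, and the names local_chunks, num_stages, device, loss_fn, input_ids, and is_first_stage_rank are placeholders:

```python
from torch.distributed.pipelining import PipelineStage, ScheduleInterleaved1F1B

# With 2 pipeline ranks and 4 stages, rank 0 might hold stages 0 and 2,
# rank 1 stages 1 and 3 -- i.e. multiple PipelineStage objects per rank.
stages = [
    PipelineStage(chunk, stage_index=idx, num_stages=num_stages, device=device)
    for idx, chunk in local_chunks  # (global stage index, nn.Module) pairs on this rank
]

schedule = ScheduleInterleaved1F1B(stages, n_microbatches=8, loss_fn=loss_fn)

# The rank holding the first stage feeds the input; the others only participate.
# (Target/loss handling on the last stage is omitted for brevity.)
if is_first_stage_rank:
    schedule.step(input_ids)
else:
    schedule.step()
```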
General Changes
Breaking Changes
Checklist before submitting final PR
- Ran the tests (python tests/tests.py)
- Updated the changelog (CHANGELOG_DEV.md)