Conversation

@BlueCrescent
Collaborator

What does this PR do?

Implements checkpoint conversion for DCP checkpoints (FSDP2, PP, TP). For this, the checkpoint is first converted to a regular PyTorch checkpoint (together with a corresponding config) and then converted using the existing code.
Note: No new tests were added for now, due to the effort of creating and manipulating a DCP checkpoint as those tests would require.

General Changes

  • Added code for converting DCP checkpoints to PyTorch format.
  • Used this new code in the Huggingface conversion.

Breaking Changes

  • None

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

@BlueCrescent
Collaborator Author

Should we maybe move the conversion directory into the checkpointing directory with this PR (after review)?

Contributor

Copilot AI left a comment


Pull request overview

This PR adds support for converting distributed checkpoint (DCP) formats (FSDP2, PP, TP) to HuggingFace transformers format. The conversion is implemented as a two-step process: first converting DCP checkpoints to standard PyTorch format, then using the existing conversion pipeline to create HuggingFace models.

  • Added new convert_dcp_to_torch module to handle DCP-to-PyTorch checkpoint conversion
  • Extended the GPT-2 conversion script to support DCP checkpoints via --dcp flag
  • Introduced ConfigDictType type alias for better type consistency across configuration handling

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 11 comments.

File Description
src/modalities/checkpointing/convert_dcp_to_torch.py New module implementing DCP to PyTorch checkpoint conversion with config file transformation
src/modalities/conversion/gpt2/convert_gpt2.py Added --dcp flag support, new convert_gpt2_dcp function, and refactored main entry point
src/modalities/conversion/gpt2/conversion_model.py Updated type hints to use ConfigDictType and added dtype assertion in model checking
src/modalities/config/config.py Added ConfigDictType alias, new save_yaml_config_dict function, and fixed implicit return in resolver
src/modalities/models/utils.py Updated type hints to use ConfigDictType for consistency
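The config-handling additions listed above can be imagined roughly as follows. Only the names `ConfigDictType` and `save_yaml_config_dict` come from the PR; the definitions here are assumptions for illustration, not the actual modalities implementation:

```python
import os
from typing import Any

import yaml

# Hypothetical sketches of the two names this PR introduces; the real
# modalities implementation may differ.
ConfigDictType = dict[str, Any]  # a parsed YAML config as a plain dict


def save_yaml_config_dict(config: ConfigDictType, path: str) -> None:
    """Write a config dict to a YAML file, refusing to overwrite existing files."""
    # The PR notes that overwriting of existing config files is disabled.
    if os.path.exists(path):
        raise FileExistsError(f"Config file already exists: {path}")
    with open(path, "w", encoding="utf-8") as f:
        yaml.safe_dump(config, f)
```

Sharing one alias across `config.py`, `conversion_model.py`, and `utils.py` keeps the type hints consistent wherever config dicts are passed around.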


BlueCrescent and others added 6 commits November 28, 2025 14:22
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Base automatically changed from fix_rotary_transform_deferred_init to main December 2, 2025 14:03
- Now only loading model weights into memory (no optimizer or scheduler weights).
- Always creating a FP32 config since FSDP2 always has FP32 weights.
- Disabled overwriting of existing config files.
- Detection and warning if an attention implementation other than the Huggingface default is used, since this is not saved with the checkpoint.
- Correct handling and matching of FSDP2 mixed precision behavior (in particular for rotary positional embeddings).
Collaborator

@therealdavidos left a comment


nice one!

Comment on lines +81 to +89
self._env_override = EnvOverride(
{
"MASTER_ADDR": "localhost",
"MASTER_PORT": str(rdvz_port),
"RANK": str(global_rank),
"LOCAL_RANK": str(local_rank),
"WORLD_SIZE": str(world_size),
}
)
Collaborator


Isn't this typically taken care of by torchrun? Why do we need the CudaEnv class?

Collaborator Author


The MultiProcessingCudaEnv is useful when running distributed code directly from Python. Previously, we only used it in our unit tests. For conversion, it was also necessary in order to work with the DCP model in the conversion script.
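To make the point concrete: the environment variables that torchrun normally sets (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, `LOCAL_RANK`, `WORLD_SIZE`) have to be provided manually when the process group is initialized from a plain Python script. The `env_override` helper below is a hypothetical stand-in sketching what the PR's `EnvOverride` class presumably does; it is not the actual implementation:

```python
import os
from contextlib import contextmanager


@contextmanager
def env_override(overrides: dict[str, str]):
    """Temporarily set environment variables, restoring previous values on exit."""
    saved = {key: os.environ.get(key) for key in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for key, value in saved.items():
            if value is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = value


# Without torchrun, torch.distributed.init_process_group() reads these
# variables itself when the default "env://" init method is used.
rendezvous_env = {
    "MASTER_ADDR": "localhost",
    "MASTER_PORT": "29500",  # placeholder port
    "RANK": "0",
    "LOCAL_RANK": "0",
    "WORLD_SIZE": "1",
}
```

Restoring the previous values on exit matters when the conversion runs inside a larger process (e.g. a test suite) that has its own distributed configuration.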
