Conversation

@BlueCrescent
Collaborator

What does this PR do?

Implements checkpoint conversion for DCP checkpoints (FSDP2, PP, TP). For this, the checkpoint is first converted to a regular PyTorch checkpoint (together with a corresponding config) and then converted using the existing code.
Note: No new tests were added for now, due to the effort of creating and manipulating a DCP checkpoint as those tests would require.

General Changes

  • Added code for converting DCP checkpoints to PyTorch format.
  • Used this new code in the Huggingface conversion.

Breaking Changes

  • None

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

@BlueCrescent
Collaborator Author

Should we maybe move the conversion directory into the checkpointing directory with this PR (after review)?

Contributor

Copilot AI left a comment


Pull request overview

This PR adds support for converting distributed checkpoint (DCP) formats (FSDP2, PP, TP) to HuggingFace transformers format. The conversion is implemented as a two-step process: first converting DCP checkpoints to standard PyTorch format, then using the existing conversion pipeline to create HuggingFace models.

  • Added new convert_dcp_to_torch module to handle DCP-to-PyTorch checkpoint conversion
  • Extended the GPT-2 conversion script to support DCP checkpoints via --dcp flag
  • Introduced ConfigDictType type alias for better type consistency across configuration handling

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 11 comments.

File Description
src/modalities/checkpointing/convert_dcp_to_torch.py New module implementing DCP to PyTorch checkpoint conversion with config file transformation
src/modalities/conversion/gpt2/convert_gpt2.py Added --dcp flag support, new convert_gpt2_dcp function, and refactored main entry point
src/modalities/conversion/gpt2/conversion_model.py Updated type hints to use ConfigDictType and added dtype assertion in model checking
src/modalities/config/config.py Added ConfigDictType alias, new save_yaml_config_dict function, and fixed implicit return in resolver
src/modalities/models/utils.py Updated type hints to use ConfigDictType for consistency
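The config-handling additions listed above can be imagined roughly as follows. Only the names `ConfigDictType` and `save_yaml_config_dict` come from the PR; the definitions here are assumptions for illustration, not the actual modalities implementation:

```python
import os
from typing import Any

import yaml

# Hypothetical sketches of the two names this PR introduces; the real
# modalities implementation may differ.
ConfigDictType = dict[str, Any]  # a parsed YAML config as a plain dict


def save_yaml_config_dict(config: ConfigDictType, path: str) -> None:
    """Write a config dict to a YAML file, refusing to overwrite existing files."""
    # The PR notes that overwriting of existing config files is disabled.
    if os.path.exists(path):
        raise FileExistsError(f"Config file already exists: {path}")
    with open(path, "w", encoding="utf-8") as f:
        yaml.safe_dump(config, f)
```

Sharing one alias across `config.py`, `conversion_model.py`, and `utils.py` keeps the type hints consistent wherever config dicts are passed around.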


BlueCrescent and others added 6 commits November 28, 2025 14:22
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Base automatically changed from fix_rotary_transform_deferred_init to main December 2, 2025 14:03
- Now only loading model weights into memory (no optimizer or scheduler weights).
- Always creating a FP32 config since FSDP2 always has FP32 weights.
- Disabled overwriting of existing config files.
- Detection and warning if an attention implementation other than the Huggingface default is used, since this is not saved with the checkpoint.
- Correct handling and matching of FSDP2 mixed precision behavior (in particular for rotary positional embeddings).
Collaborator

@therealdavidos left a comment


nice one!

Comment on lines +81 to +89
self._env_override = EnvOverride(
{
"MASTER_ADDR": "localhost",
"MASTER_PORT": str(rdvz_port),
"RANK": str(global_rank),
"LOCAL_RANK": str(local_rank),
"WORLD_SIZE": str(world_size),
}
)
Collaborator


Isn't this typically taken care of by torchrun? Why do we need the CudaEnv class?

Collaborator Author


The MultiProcessingCudaEnv is useful when running distributed code directly from Python. Previously, we only used it in our unit tests. For conversion, it was also necessary in order to work with the DCP model in the conversion script.
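To make the point concrete: the environment variables that torchrun normally sets (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, `LOCAL_RANK`, `WORLD_SIZE`) have to be provided manually when the process group is initialized from a plain Python script. The `env_override` helper below is a hypothetical stand-in sketching what the PR's `EnvOverride` class presumably does; it is not the actual implementation:

```python
import os
from contextlib import contextmanager


@contextmanager
def env_override(overrides: dict[str, str]):
    """Temporarily set environment variables, restoring previous values on exit."""
    saved = {key: os.environ.get(key) for key in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for key, value in saved.items():
            if value is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = value


# Without torchrun, torch.distributed.init_process_group() reads these
# variables itself when the default "env://" init method is used.
rendezvous_env = {
    "MASTER_ADDR": "localhost",
    "MASTER_PORT": "29500",  # placeholder port
    "RANK": "0",
    "LOCAL_RANK": "0",
    "WORLD_SIZE": "1",
}
```

Restoring the previous values on exit matters when the conversion runs inside a larger process (e.g. a test suite) that has its own distributed configuration.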
