
Conversation


@patrocinio patrocinio commented Dec 8, 2025

Move `checkpoint_future.result()` before `optimizer.step()` so that the previous checkpoint completes before the weights are modified in place. This allows checkpointing to overlap better with the forward/backward passes.

Fixes #3584
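The ordering this change enforces can be sketched in plain Python. Note this is a minimal stand-in, not the tutorial's code: `save_checkpoint`, `state`, and the `ThreadPoolExecutor` future below are hypothetical placeholders for the model state and the future returned by `torch.distributed.checkpoint.async_save()`.

```python
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)

def save_checkpoint(state):
    # Hypothetical stand-in for DCP's staging/serialization work:
    # snapshot the state as it exists when the save actually runs.
    return dict(state)

state = {"weight": 0.0}  # stand-in for model weights
checkpoint_future = None
saved = []

for step in range(3):
    grad = 1.0  # forward/backward would run here, overlapping the save

    # Wait for the previous async checkpoint *before* the in-place
    # weight update (the optimizer.step() analogue below), so the
    # snapshot being written stays consistent.
    if checkpoint_future is not None:
        saved.append(checkpoint_future.result())

    state["weight"] -= grad  # optimizer.step() analogue: in-place update

    # Launch the next async checkpoint; it overlaps with the next
    # iteration's forward/backward pass.
    checkpoint_future = executor.submit(save_checkpoint, state)

# Drain the final in-flight save before exiting.
saved.append(checkpoint_future.result())
executor.shutdown()
```

Because `result()` is called before each in-place update, every snapshot reflects a consistent set of weights; calling it after `optimizer.step()` (as the tutorial did previously) risks the save racing with the update.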

Description

Checklist

  • The issue that is being fixed is referenced in the description (see "Fixes #ISSUE_NUMBER" above)
  • Only one issue is addressed in this pull request
  • Labels from the issue that this PR fixes are added to this pull request
  • No unnecessary issues are included in this pull request.

cc @wconstab @osalpekar @H-Huang @kwen2501

pytorch-bot bot commented Dec 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/3688

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f914e75 with merge base 7f8b6dc:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


Successfully merging this pull request may close these issues.

Feedback about Asynchronous Saving with Distributed Checkpoint (DCP)
