
Conversation


@patrocinio patrocinio commented Dec 8, 2025

Move `checkpoint_future.result()` before `optimizer.step()` so that the previous checkpoint completes before the weights are modified in place. This allows checkpointing to overlap better with the forward/backward passes.

Fixes #3584
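The ordering this change enforces can be sketched in plain Python. Note this is a minimal stand-in, not the tutorial's code: `save_checkpoint`, `state`, and the `ThreadPoolExecutor` future below are hypothetical placeholders for the model state and the future returned by `torch.distributed.checkpoint.async_save()`.

```python
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)

def save_checkpoint(state):
    # Hypothetical stand-in for DCP's staging/serialization work:
    # snapshot the state as it exists when the save actually runs.
    return dict(state)

state = {"weight": 0.0}  # stand-in for model weights
checkpoint_future = None
saved = []

for step in range(3):
    grad = 1.0  # forward/backward would run here, overlapping the save

    # Wait for the previous async checkpoint *before* the in-place
    # weight update (the optimizer.step() analogue below), so the
    # snapshot being written stays consistent.
    if checkpoint_future is not None:
        saved.append(checkpoint_future.result())

    state["weight"] -= grad  # optimizer.step() analogue: in-place update

    # Launch the next async checkpoint; it overlaps with the next
    # iteration's forward/backward pass.
    checkpoint_future = executor.submit(save_checkpoint, state)

# Drain the final in-flight save before exiting.
saved.append(checkpoint_future.result())
executor.shutdown()
```

Because `result()` is called before each in-place update, every snapshot reflects a consistent set of weights; calling it after `optimizer.step()` (as the tutorial did previously) risks the save racing with the update.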

Description

Checklist

  • The issue that is being fixed is referenced in the description (see "Fixes #ISSUE_NUMBER" above)
  • Only one issue is addressed in this pull request
  • Labels from the issue that this PR fixes are added to this pull request
  • No unnecessary issues are included in this pull request.

cc @wconstab @osalpekar @H-Huang @kwen2501

pytorch-bot bot commented Dec 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/3688

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f914e75 with merge base 7f8b6dc:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


Successfully merging this pull request may close these issues.

Feedback about Asynchronous Saving with Distributed Checkpoint (DCP)
