Conversation

@romerojosh
Collaborator

This PR fixes an issue in the handling of the optional extra_loss_args argument to the TorchFort supervised training function. A call to the reset routine was missing; this call is needed to preserve references to the user data in cases where the TorchFort backend has to migrate the user data from CPU to GPU (or GPU to CPU) for a training step. Without this reset, any changes the user makes to the arrays passed to torchfort_tensor_list_add_tensor for the extra loss arguments list would not propagate after the first training step. Workloads where the model and the extra loss args data already reside on the same device are not affected by this issue.

I've adjusted the supervised training tests to better cover this scenario.
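To illustrate the failure mode described above, here is a minimal, hypothetical sketch (in Python, not the actual TorchFort C++ implementation): a backend that migrates user data to another device caches a copy of the user's buffer, and unless it "resets" (re-reads) from the user's reference before each step, updates made between steps are silently dropped. The class and method names are invented for illustration only.

```python
import numpy as np

class ExtraArgCache:
    """Illustrative stand-in for a backend that must migrate user data
    to another device before each training step. Hypothetical; not the
    actual TorchFort implementation."""

    def __init__(self, user_array, reset_each_step):
        self.user_array = user_array          # reference to the user's buffer
        self.reset_each_step = reset_each_step
        self.device_copy = user_array.copy()  # initial "migration" to device

    def train_step(self):
        if self.reset_each_step:
            # The fix: re-read from the user's buffer so that changes
            # made between steps propagate into the migrated copy.
            self.device_copy = self.user_array.copy()
        return self.device_copy.sum()

user = np.array([1.0, 2.0, 3.0])
buggy = ExtraArgCache(user, reset_each_step=False)
fixed = ExtraArgCache(user, reset_each_step=True)

buggy.train_step()
fixed.train_step()                 # first step: both see [1, 2, 3]
user[:] = [10.0, 20.0, 30.0]       # user updates the extra loss args
print(buggy.train_step())          # 6.0  -> stale copy, update lost
print(fixed.train_step())          # 60.0 -> reset picks up new values
```

After the first step, only the variant that resets each step observes the user's updated values, which mirrors the behavior fixed in this PR.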

@romerojosh
Collaborator Author

/build_and_test

@github-actions

🚀 Build workflow triggered! View run

@github-actions

❌ Build workflow failed! View run

…e to user data.

Signed-off-by: Josh Romero <joshr@nvidia.com>
@romerojosh
Collaborator Author

/build_and_test

@github-actions

🚀 Build workflow triggered! View run

@github-actions

✅ Build workflow passed! View run

@romerojosh romerojosh merged commit 35c26d9 into master Dec 2, 2025
4 checks passed
@romerojosh romerojosh deleted the extra_loss_arg_reset branch December 5, 2025 00:20