feat: add context parallel support for SFT #420
Conversation
Signed-off-by: ashors1 <ashors@nvidia.com>
for more information, see https://pre-commit.ci Signed-off-by: NeMo-Aligner CI <nemo-aligner-ci@nvidia.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
for more information, see https://pre-commit.ci Signed-off-by: NeMo-Aligner CI <nemo-aligner-ci@nvidia.com>
Note that CP support for SFT was recently added to NeMo. We need a NeMo commit at least as recent as 8c921dc19a905d8b5a0f90f6e2a34607c2e0660d.
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
This reverts commit ea3b5ba.
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
terrykong left a comment
TODO:
- potentially rebase after https://github.com/NVIDIA/NeMo-Aligner/actions/runs/12149989736/job/33882060795?pr=405#step:5:743
- Presubmit didn't pass for SFT on that PR, so if that isn't resolved by Thursday, let's just merge this as-is (assuming CI passes), since #405 should be able to rebase on this easily.
  sync_batch_comm: False
  megatron_amp_O2: False
  encoder_seq_length: 4096 # the sequence length of the encoder model, it will be overwritten by loaded GPT model
  transformer_engine: True
For my education, why do we need to specify this now?
I don't actually think it's necessary since TE is enabled by default. I just wanted to make explicit the fact that we were using TE. But I will remove this
  pad_seq_length_to_mult = 16
  if model_cfg is not None:
      pad_seq_length_to_mult = (
          8 * model_cfg.get("tensor_model_parallel_size", 1) if model_cfg.get("sequence_parallel", False) else 16
      )
      pad_seq_length_to_mult *= model_cfg.get("context_parallel_size", 1)
According to that fp8 comment above, should this be:
Current:

  pad_seq_length_to_mult = 16
  if model_cfg is not None:
      pad_seq_length_to_mult = (
          8 * model_cfg.get("tensor_model_parallel_size", 1) if model_cfg.get("sequence_parallel", False) else 16
      )
      pad_seq_length_to_mult *= model_cfg.get("context_parallel_size", 1)

Suggested:

  pad_seq_length_to_mult = 16
  if model_cfg is not None:
      if model_cfg.get("sequence_parallel", False):
          pad_seq_length_to_mult = math.lcm(pad_seq_length_to_mult, model_cfg.get("tensor_model_parallel_size", 1))
      pad_seq_length_to_mult *= model_cfg.get("context_parallel_size", 1)
? From the comment, it sounds like if someone is doing fp8 SFT with TP=1 and sets sequence_parallel, then the padding would be too small.
That chunk was taken directly from here: https://github.com/NVIDIA/NeMo/blob/b847bf75c371931e4f17ea426741c1d023afa0c0/nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py#L262-L268, but the code does seem to contradict the comment. I'll follow up with the TE team
From the TE team: "when SP=True, the dimensions are flipped so the sequence dimension is first. So we only need to make sure it's divisible by 8 after the TP split to comply with TE's expectations."
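To make the resolved behavior concrete, here is a minimal illustrative sketch (not the NeMo implementation; the helper name and the example settings are assumptions) of how the padding multiple works out under the quoted code and the TE team's explanation:

  def pad_multiple(tp: int = 1, cp: int = 1, sequence_parallel: bool = False) -> int:
      # Illustrative only: mirrors the padding rule in the quoted snippet.
      # With sequence parallel, the sequence dimension is split across TP ranks,
      # so each TP shard must stay divisible by 8 for TE; otherwise pad to 16.
      mult = 8 * tp if sequence_parallel else 16
      # Context parallel splits the sequence further, so scale by CP as well.
      return mult * cp

  # Hypothetical example settings:
  print(pad_multiple(tp=2, cp=2, sequence_parallel=True))   # 8 * 2 * 2 = 32
  print(pad_multiple(tp=1, cp=2, sequence_parallel=True))   # 8 * 1 * 2 = 16
  print(pad_multiple(tp=4, cp=1, sequence_parallel=False))  # 16 * 1 = 16

For the TP=1 with sequence_parallel case raised above, the multiple is 8 * CP; per the TE team's reply, divisibility by 8 after the TP split is what TE expects, so the existing code stands.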
    fwd_bwd_function = get_forward_backward_func()
  - fwd_loss_fn = self.get_forward_output_and_loss_func(forward_only)
  + fwd_loss_fn = self.get_forward_output_and_loss_func(forward_only, tuning=True)
what does tuning do?
It controls the keys that are returned in the batch: https://github.com/NVIDIA/NeMo/blob/b847bf75c371931e4f17ea426741c1d023afa0c0/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py#L1211-L1222. If tuning=False, we don't return the keys that are necessary for sequence packing (and thus CP) in TE.
Note also that tuning is set to True in NeMo's SFT: https://github.com/NVIDIA/NeMo/blob/b847bf75c371931e4f17ea426741c1d023afa0c0/nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py#L407
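For intuition, here is a minimal sketch of the idea described above; this is not NeMo's actual code, and the key names (in particular cu_seqlens and max_seqlen) are assumptions about the packed-sequence metadata TE needs:

  from typing import Any, Dict

  def select_batch_keys(batch: Dict[str, Any], tuning: bool) -> Dict[str, Any]:
      # Keys every forward pass needs (illustrative).
      keys = ["tokens", "labels", "loss_mask", "position_ids", "attention_mask"]
      if tuning:
          # With tuning=True, also forward the packed-sequence metadata that
          # sequence packing (and thus CP) relies on; names are assumed here.
          keys += ["cu_seqlens", "max_seqlen"]
      return {k: batch[k] for k in keys if k in batch}

With tuning=False those extra keys are dropped, which matches the behavior described above: the batch no longer carries what TE needs for sequence packing and CP.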
Closing in favor of #430. @terrykong I'll address your comments there.
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Changelog
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
Checklist when contributing a new algorithm
max_steps=-1 and validation?
Additional Information