Description
Problem & Motivation
Issues were recently discovered in Megatron inference related to tensor parallelism with sequence_parallel=True, which is typically the recommended configuration when running with --tensor-parallel-size=N for N>1, in combination with materialize_only_last_token_logits=True.
First, infer.py has no --sequence-parallel argument, so we should add one as an option for testing; at train time, at least, sequence parallelism improves parallel efficiency for tensor parallelism.
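A minimal sketch of what adding the flag to infer.py could look like. The parser name and the other flags shown here are assumptions for illustration, not the actual infer.py interface:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch: wire a --sequence-parallel flag into an inference
    # entry point such as infer.py alongside the existing TP size option.
    parser = argparse.ArgumentParser(description="Evo2 inference (sketch)")
    parser.add_argument("--tensor-parallel-size", type=int, default=1)
    parser.add_argument(
        "--sequence-parallel",
        action="store_true",
        help="Enable sequence parallelism (only meaningful with "
             "--tensor-parallel-size > 1).",
    )
    return parser

# Example invocation mirroring the recommended TP>1 configuration.
args = build_parser().parse_args(
    ["--tensor-parallel-size", "2", "--sequence-parallel"]
)
print(args.sequence_parallel, args.tensor_parallel_size)
```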
At inference time, however, this setting can cause problems when materialize_only_last_token_logits=True, which appears to have recently become the default in Megatron (previously False).
Given the potential accuracy impact, and since this may require a change to https://github.com/NVIDIA-NeMo/NeMo/blob/main/nemo/collections/llm/gpt/model/megatron/hyena/hyena_model.py#L382-L389 analogous to https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/models/gpt/gpt_model.py#L581-L596, we should have multi-GPU test coverage for --tensor-parallel-size=2, and for --sequence-parallel once it is added to infer.py.
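To illustrate why materialize_only_last_token_logits interacts badly with sequence parallelism: under SP the sequence dimension is sharded across tensor-parallel ranks, so only one rank's shard actually contains the final token, and taking the last position of the local shard on every rank is wrong. The helper below is a hypothetical, single-process sketch of the indexing involved, not the actual Megatron or Hyena code:

```python
def locate_last_token(seq_len: int, sp_world_size: int) -> tuple[int, int]:
    """Return (rank, local_index) of the final token when the sequence
    dimension is split evenly across sequence-parallel ranks.

    Naively taking hidden_states[-1] on every rank under sequence
    parallelism selects the last token of each *shard*; only the last
    rank's shard holds the true final token, so its logits must be
    computed there (or the shards gathered first).
    """
    assert seq_len % sp_world_size == 0, "SP assumes an evenly divisible sequence"
    shard_len = seq_len // sp_world_size
    return sp_world_size - 1, shard_len - 1

# Sequence of 8 tokens split over 2 SP ranks: token 7 is local index 3 on rank 1.
print(locate_last_token(8, 2))
```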
BioNeMo Framework Version
Category
Model/Training
Proposed Solution
Add test coverage for multi-GPU generation. It should cover tp=2, cp=2, and pp=2 so that we have documented knowledge of which kinds of parallelism we support. Reuse one of the inference accuracy tests in test_evo2.py (for example, test_batch_generate) and verify that accuracy does not degrade.
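One way to check for accuracy degradation in such a test is to compare per-token log-probs from the multi-GPU run against a single-GPU baseline within a small tolerance. This helper is a hypothetical sketch of that comparison, independent of the actual test harness:

```python
import math

def logprobs_close(baseline: list[float], candidate: list[float],
                   atol: float = 1e-3) -> bool:
    """Hypothetical accuracy check for a multi-GPU generation test:
    per-token log-probs from a tp/cp/pp run should match the single-GPU
    baseline elementwise within an absolute tolerance."""
    if len(baseline) != len(candidate):
        return False
    return all(math.isclose(b, c, abs_tol=atol)
               for b, c in zip(baseline, candidate))

print(logprobs_close([-0.5, -1.2], [-0.5001, -1.2002]))  # small drift: passes
print(logprobs_close([-0.5, -1.2], [-0.5, -2.0]))        # large drift: fails
```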
Expected Benefits
Knowledge of when upstream changes break inference at multi-gpu scales.