Savitha/add phylo masking to llama3 #1351
Open
Add Phylogenetic Tag Masking Support (for training with the full OpenGenome2 pretraining dataset)
Summary
Extends the genomic data collator to support masking phylogenetic taxonomy tags for training on the full OpenGenome2 dataset. Uses Evo2's phylogenetic tag detection algorithm for sequences containing taxonomy annotations.
Description
This PR adds support for masking phylogenetic taxonomy tags that appear in OpenGenome2's full pretraining dataset. These tags have the format |d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria| and should not be predicted by the model since they are metadata annotations, not DNA sequence.
The implementation is the same as NeMo's Evo2 dataset and uses a state machine to detect taxonomy patterns between pipe delimiters. The masking is optional and disabled by default (only needed for the full dataset, not metagenomics-only training).
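To make the idea concrete, here is a minimal, simplified sketch of pipe-delimited tag masking. It is not the Evo2 or NeMo code and not the `mask_phylogenetic_tags()` function from this PR: the name `mask_tags_sketch`, the byte-level token IDs, and the nucleotide alphabet `ACGTNacgtn` are all assumptions made for illustration. Instead of a full state machine, it splits the sequence on pipes and masks every segment that contains a non-nucleotide character, which also covers a tag truncated at the start or end of a window.

```python
# Hypothetical, simplified illustration; not the function added in this PR.
PIPE = ord("|")
# Assumed nucleotide alphabet for byte-level tokens (illustrative choice).
DNA = frozenset(b"ACGTNacgtn")


def mask_tags_sketch(token_ids):
    """Return a 0/1 loss mask: 0 for pipes and non-DNA segments, 1 for DNA."""
    mask = [1] * len(token_ids)
    start = 0  # index where the current pipe-delimited segment begins
    for i in range(len(token_ids) + 1):
        at_end = i == len(token_ids)
        if at_end or token_ids[i] == PIPE:
            segment = token_ids[start:i]
            # Any non-nucleotide character marks the segment as a tag
            # (or a fragment of one), so the whole segment is masked.
            if any(t not in DNA for t in segment):
                for j in range(start, i):
                    mask[j] = 0
            if not at_end:
                mask[i] = 0  # the pipe delimiter itself is never predicted
            start = i + 1
    return mask


ids = list(b"ACGT|d__Bacteria;p__Proteobacteria|TTGA")
loss_mask = mask_tags_sketch(ids)
```

In this example only the eight nucleotide positions keep a mask value of 1. One limitation of the sketch is that a segment made up purely of nucleotide letters is never treated as a tag, which is one reason a real implementation needs more careful pattern detection.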
Key changes:
- mask_phylogenetic_tags() function (160 lines from Evo2 implementation)
- mask_phylo_tags parameter to GenomicDataCollator (default: False)
Backward compatibility:
Usage
Enable phylogenetic tag masking for full OpenGenome2 pretraining:
Or via Hydra config:
Order of operations:
This order is critical - phylo detection relies on pipes (|) and lowercase letters that would be removed by degenerate/uppercase masking.
Type of changes
Pre-submit Checklist