@savitha-eng savitha-eng commented Nov 26, 2025

Add Phylogenetic Tag Masking Support (for training with the full OpenGenome2 pretraining dataset)

Summary

Extends the genomic data collator to support masking phylogenetic taxonomy tags for training on the full OpenGenome2 dataset. Uses Evo2's phylogenetic tag detection algorithm for sequences containing taxonomy annotations.

Description

This PR adds support for masking phylogenetic taxonomy tags that appear in OpenGenome2's full pretraining dataset. These tags have the format |d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria| and should not be predicted by the model since they are metadata annotations, not DNA sequence.

The implementation mirrors the one in NeMo's Evo2 dataset and uses a state machine to detect taxonomy patterns between pipe delimiters. The masking is optional and disabled by default; it is only needed for the full dataset, not for metagenomics-only training.
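For reviewers unfamiliar with the tag format, the effect of the masking can be approximated with a short sketch. This is not the Evo2 state machine: it swaps in a regular expression, assumes byte-level token ids (each id is the character's ASCII code), and ignores tags truncated at a sequence boundary. The names `mask_phylo_tags_sketch` and `TAG_PATTERN` are hypothetical and do not appear in this PR.

```python
import re

# Label value that PyTorch's cross-entropy loss ignores by default.
IGNORE_INDEX = -100

# Assumed tag shape: a pipe, one or more "x__Name" fields separated by ';',
# an optional trailing ';', and a closing pipe.
TAG_PATTERN = re.compile(r"\|[a-z]__[^|;]*(?:;[a-z]__[^|;]*)*;?\|")


def mask_phylo_tags_sketch(sequence: str) -> list[int]:
    """Return per-character labels with complete taxonomy tags masked out."""
    # Assumption: byte-level tokenization, so the label is the character code.
    labels = [ord(ch) for ch in sequence]
    for match in TAG_PATTERN.finditer(sequence):
        # Mask the whole tag, including both pipe delimiters.
        for i in range(match.start(), match.end()):
            labels[i] = IGNORE_INDEX
    return labels


example = "ACGT|d__Bacteria;p__Proteobacteria|TTGA"
example_labels = mask_phylo_tags_sketch(example)
```

In this sketch the four bases on either side of the tag keep their labels, while every position from the opening pipe to the closing pipe is set to the ignore index.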

Key changes:

  • Add mask_phylogenetic_tags() function (160 lines from Evo2 implementation)
  • Add mask_phylo_tags parameter to GenomicDataCollator (default: False)
  • Fix masking order to: phylo → degenerate → uppercase (critical for correct detection)
  • Add tests for phylo masking cases (some tests also from Evo2 implementation)

Backward compatibility:

  • Default behavior unchanged (phylo masking disabled)
  • Existing metagenomics configs work without modification
  • Only affects training when explicitly enabled
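The default-off behavior can be illustrated with a toy collator. This is a sketch under stated assumptions, not the `GenomicDataCollator` from this PR: `SketchCollator`, `PAD_ID`, and the regex-based detection are all hypothetical stand-ins, and token ids are assumed to be ASCII codes.

```python
import re
from dataclasses import dataclass

IGNORE_INDEX = -100  # label value skipped by the loss
PAD_ID = 0           # hypothetical padding token id
TAG_PATTERN = re.compile(r"\|[a-z]__[^|]*\|")  # simplified stand-in for tag detection


@dataclass
class SketchCollator:
    # Off by default, so existing callers see unchanged labels.
    mask_phylo_tags: bool = False

    def __call__(self, sequences: list[str]) -> dict[str, list[list[int]]]:
        width = max(len(s) for s in sequences)
        input_ids, labels = [], []
        for seq in sequences:
            ids = [ord(ch) for ch in seq]
            row_labels = list(ids)
            if self.mask_phylo_tags:
                for m in TAG_PATTERN.finditer(seq):
                    row_labels[m.start():m.end()] = [IGNORE_INDEX] * (m.end() - m.start())
            pad = width - len(ids)
            # Inputs are padded with PAD_ID; padded label positions are ignored.
            input_ids.append(ids + [PAD_ID] * pad)
            labels.append(row_labels + [IGNORE_INDEX] * pad)
        return {"input_ids": input_ids, "labels": labels}


batch = ["ACGT|d__Bacteria|ACGT", "ACGT"]
default_out = SketchCollator()(batch)
enabled_out = SketchCollator(mask_phylo_tags=True)(batch)
```

With the flag left at its default, the labels for the first sequence equal its input ids; with the flag enabled, only the labels change and the input ids stay identical.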

Usage

Enable phylogenetic tag masking for full OpenGenome2 pretraining:

from dataset import create_bshd_dataloader

# For full dataset with taxonomy annotations:
dataloader, _ = create_bshd_dataloader(
    distributed_config=config,
    tokenizer_path="./example_checkpoint",
    load_dataset_kwargs={
        "path": "arcinstitute/opengenome2",
        "split": "train",
        "streaming": True,
    },
    micro_batch_size=4,
    mask_degenerate_bases=True,
    mask_phylo_tags=True,  # Enable for full dataset
    uppercase_labels=False,
)

Or via Hydra config:

dataset:
  mask_phylo_tags: true  # Enable for full pretraining
  mask_degenerate_bases: true
  uppercase_labels: false

Order of operations:

  1. Phylo masking (needs pipes and lowercase to detect tags)
  2. Degenerate masking (masks non-ACGT including pipes)
  3. Uppercase (optional, after detection)

This order is critical: phylo detection relies on pipes (|) and lowercase letters, which degenerate masking and uppercasing would otherwise remove.
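A small sketch of why the order matters, assuming (as the description suggests) that tag detection runs on the label array after earlier steps have modified it. The three helpers below are simplified, hypothetical stand-ins for the real masking functions, and token ids are again assumed to be ASCII codes.

```python
import re

IGNORE_INDEX = -100
TAG_PATTERN = re.compile(r"\|[a-z]__[^|]*\|")  # simplified stand-in for tag detection
DNA = set("ACGTacgt")


def _decode(labels: list[int]) -> str:
    # Already-masked positions become a NUL placeholder, which is never a pipe.
    return "".join(chr(t) if t != IGNORE_INDEX else "\x00" for t in labels)


def mask_phylo(labels: list[int]) -> list[int]:
    out = list(labels)
    for m in TAG_PATTERN.finditer(_decode(labels)):
        out[m.start():m.end()] = [IGNORE_INDEX] * (m.end() - m.start())
    return out


def mask_degenerate(labels: list[int]) -> list[int]:
    # Keep only A/C/G/T in either case; everything else is ignored by the loss.
    return [t if t != IGNORE_INDEX and chr(t) in DNA else IGNORE_INDEX for t in labels]


def uppercase(labels: list[int]) -> list[int]:
    return [ord(chr(t).upper()) if t != IGNORE_INDEX else t for t in labels]


labels = [ord(ch) for ch in "ACGT|d__Bacteria|acgt"]

# Order from this PR: phylo, then degenerate, then uppercase.
correct = uppercase(mask_degenerate(mask_phylo(labels)))

# Wrong order: degenerate masking removes the pipes before detection runs.
wrong = uppercase(mask_phylo(mask_degenerate(labels)))

kept_correct = sum(t != IGNORE_INDEX for t in correct)  # 8: only the flanking bases
kept_wrong = sum(t != IGNORE_INDEX for t in wrong)      # 12: a, c, t, a from "Bacteria" leak through
```

In the wrong order, the pipes are masked first, so no tag is detected, and the letters of "Bacteria" that happen to be valid bases remain as training targets.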


Type of changes

  • New feature (non-breaking change which adds functionality)

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly (docstrings in code)
  • I have added/updated tests as needed
  • All existing tests pass successfully (backward compatible, all tests passing)

Core Implementation:
- Add mask_phylogenetic_tags() to genomic_masking_functions.py
  - State machine from Evo2 for detecting taxonomy tags
  - Format: |d__Bacteria;p__Proteobacteria;...|
  - Follows the Evo2 NeMo implementation exactly
- Update GenomicDataCollator with mask_phylo_tags parameter

Features (Milestone 2):
- mask_phylo_tags: Default False (metagenomics doesn't need it)
- Enable for full OpenGenome2 pretraining (has phylo tags)
- Backward compatible: Existing code unaffected

Signed-off-by: savitha-eng <savithas@nvidia.com>

copy-pr-bot bot commented Nov 26, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Add test_dataloader_with_phylo_masking:
- Tests full pipeline (dataloader + phylo collator)
- Verifies phylo tags are masked in batches
- Uses realistic sequences with taxonomy annotations

Signed-off-by: savitha-eng <savithas@nvidia.com>
@savitha-eng savitha-eng marked this pull request as ready for review November 26, 2025 19:25