A Transformer-based AI system designed to decode the regulatory logic of the Drosophila (Fruit Fly) genome.
This repository includes a CI/CD pipeline that ensures the reproducibility of the experiment. Every commit automatically:
- Installs the specific PyTorch/NumPy environment.
- Downloads the Drosophila genome from UCSC.
- Trains the Transformer model on a subset of data.
- Publishes the trained model weights as an artifact.
This project treats the cell not as a biological entity, but as a Computational Operating System.
By applying Natural Language Processing (NLP) techniques—specifically Transformers and K-Mer Tokenization—to raw DNA sequences, this system aims to predict gene expression levels directly from the sequence, effectively "reading" the regulatory code of life.
Core Concept:
- Hardware: Chromatin Structure (Nucleosomes)
- Software: DNA Sequence (Promoters/Enhancers)
- Compiler: Neural Network (Genomic Decoder)
The system handles massive genomic files using a Lazy-Loading strategy to run efficiently on consumer hardware (MacBook Air), despite the dataset size (23M+ base pairs).
genomic-decoder-fly/
├── data/raw/ # 🛑 Stores dm6.fa (Lazy Loaded)
├── src/
│ ├── dataloader.py # 🧬 Biopython-based O(1) Disk Access
│ ├── tokenizer.py # 🔡 K-Mer Tokenization (Vocab Size: 69)
│ ├── dataset.py # 📦 PyTorch Dataset (Sliding Window)
│ └── model.py # 🧠 Transformer Encoder (1.2M Params)
├── scripts/
│ └── download_data.sh # 📜 Auto-fetch UCSC Genomes
├── .github/workflows/ # 🤖 CI/CD Research Pipeline
├── main.py # 🚀 Training Loop Simulation
├── README.md # 📄 Documentation
└── requirements.txt # Dependencies
- Problem: The Drosophila genome and Single-Cell Atlas are gigabytes in size. Loading them into RAM crashes standard laptops.
- Solution: Implemented GenomicDataLoader using Biopython indexing. This allows O(1) random access to any chromosome directly from disk, keeping RAM usage near zero.
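To illustrate why an on-disk index keeps memory usage low, here is a minimal sketch of offset-based FASTA access. It is not the repository's GenomicDataLoader and does not use Biopython; the function names `build_index` and `fetch` are hypothetical, and the sketch assumes every sequence line within a record (except the last) has the same width, as is typical for UCSC FASTA files.

```python
def build_index(path):
    """Scan the FASTA once and record where each record's sequence starts.

    Returns {name: (seq_offset, seq_length, bases_per_line, bytes_per_line)}.
    Only these four integers per record stay in memory, not the sequence.
    """
    index = {}
    name = None
    with open(path, "rb") as fh:
        while True:
            line = fh.readline()
            if not line:
                break
            if line.startswith(b">"):
                # Record name is the first word of the header line.
                name = line[1:].split()[0].decode()
                index[name] = [fh.tell(), 0, 0, 0]
            elif name is not None:
                entry = index[name]
                bases = len(line.rstrip(b"\r\n"))
                if entry[2] == 0:
                    # First sequence line defines the line geometry.
                    entry[2], entry[3] = bases, len(line)
                entry[1] += bases
    return {key: tuple(value) for key, value in index.items()}


def fetch(path, index, name, start, end):
    """Return bases [start, end) of record `name`, reading only the bytes needed."""
    offset, length, bases_per_line, bytes_per_line = index[name]
    end = min(end, length)
    if start >= end:
        return ""

    def byte_pos(i):
        # Map a base coordinate to its byte position, skipping line endings.
        return offset + (i // bases_per_line) * bytes_per_line + i % bases_per_line

    with open(path, "rb") as fh:
        fh.seek(byte_pos(start))
        raw = fh.read(byte_pos(end - 1) - byte_pos(start) + 1)
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()
```

Building the index costs one pass over the file; after that, each window is fetched with a single seek and a read proportional to the window size, so memory use depends on the window rather than on the genome.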
- Instead of One-Hot Encoding (A,C,G,T), we use K-Mers (e.g., ATG, TGC).
- This captures local context, similar to how words are more meaningful than letters in English.
- Vocabulary: 64 combinations (4^3 possible 3-mers) + Special Tokens ([CLS], [PAD]).
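The tokenization described above can be sketched as follows. This is an illustrative version, not the code in src/tokenizer.py: the function names, the stride parameter, and the `[UNK]` token for k-mers containing ambiguous bases such as N are assumptions, so the vocabulary size here (67) differs from the 69 reported for the repository.

```python
from itertools import product

# [PAD] and [CLS] come from the README; [UNK] is an assumption of this sketch.
SPECIAL_TOKENS = ["[PAD]", "[CLS]", "[UNK]"]


def build_vocab(k=3):
    """Map special tokens and all 4**k k-mers over ACGT to integer ids."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return {token: i for i, token in enumerate(SPECIAL_TOKENS + kmers)}


def tokenize(seq, vocab, k=3, stride=1):
    """Convert a DNA string into token ids, prefixed with [CLS]."""
    seq = seq.upper()
    ids = [vocab["[CLS]"]]
    for i in range(0, len(seq) - k + 1, stride):
        # Unknown k-mers (e.g. containing N) fall back to [UNK].
        ids.append(vocab.get(seq[i:i + k], vocab["[UNK]"]))
    return ids


vocab = build_vocab()
print(tokenize("ATGC", vocab))  # ids for [CLS], ATG, TGC
```

With a stride of 1 the k-mers overlap, so a sequence of length n yields n - k + 1 tokens; a stride of k would give non-overlapping tokens and a shorter input to the model.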
- We utilize a custom GenomicTransformer with Positional Encoding.
- Parameters: ~1.2 Million.
- Task: Regression (Predicting Transcriptome abundance from DNA).
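To give a feel for how the hyperparameters relate to the parameter count, here is a back-of-the-envelope sketch. The configuration used below (embedding dimension 128, feed-forward dimension 512, 6 layers, a single-output regression head, and a fixed positional encoding with no learned weights) is purely hypothetical and is not taken from the repository; it was picked only because it lands near the stated ~1.2M. The per-layer formula assumes a standard encoder layer with biased Q/K/V/output projections, a two-layer feed-forward block, and two layer norms.

```python
def encoder_layer_params(d_model, d_ff):
    """Parameter count of one standard Transformer encoder layer."""
    attention = 4 * d_model * d_model + 4 * d_model   # Q, K, V, output projections + biases
    feed_forward = 2 * d_model * d_ff + d_ff + d_model  # two linear layers + biases
    layer_norms = 2 * 2 * d_model                       # two norms, scale and shift each
    return attention + feed_forward + layer_norms


def model_params(vocab_size, d_model, d_ff, n_layers):
    """Embedding + encoder stack + scalar regression head (assumed layout)."""
    embedding = vocab_size * d_model
    head = d_model + 1  # linear map to one expression value, plus bias
    return embedding + n_layers * encoder_layer_params(d_model, d_ff) + head


# Hypothetical configuration, for illustration only.
print(model_params(vocab_size=69, d_model=128, d_ff=512, n_layers=6))
```

Note that the number of attention heads does not change the total, since heads only partition the same projection matrices; the encoder layers dominate the count while the 69-token embedding table contributes very little.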
- Install Dependencies:
  pip install -r requirements.txt
- Download Data:
  bash scripts/download_data.sh
- Run Pipeline:
  python main.py
This codebase is designed to be modular.
- Data Ingestion: Located in src/dataloader.py. To swap the organism (e.g., Human hg38), update the download script and point the loader to the new FASTA file.
- Model Config: Hyperparameters (Layers, Heads, Embedding Dim) can be adjusted in main.py.
- Phase 2: Integrate fly_cell_atlas.h5ad labels to train the model on real Gene Expression data.
- Phase 3: Implement "In-Silico Mutagenesis" to predict how mutations affect gene regulation.
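As a rough idea of what the planned in-silico mutagenesis could look like, the sketch below substitutes every base in a sequence with each alternative and records how a prediction function's output changes. Nothing here exists in the repository yet: the function names are hypothetical, and `toy_predict` (GC fraction) is only a stand-in for a trained model's forward pass.

```python
def single_base_variants(seq):
    """Yield (position, ref, alt, mutated_seq) for every single-base substitution."""
    for pos, ref in enumerate(seq):
        for alt in "ACGT":
            if alt != ref:
                yield pos, ref, alt, seq[:pos] + alt + seq[pos + 1:]


def mutation_effects(seq, predict):
    """Return {(pos, ref, alt): predicted change relative to the unmutated sequence}."""
    baseline = predict(seq)
    return {
        (pos, ref, alt): predict(mutated) - baseline
        for pos, ref, alt, mutated in single_base_variants(seq)
    }


def toy_predict(seq):
    # Stand-in scorer (GC fraction); a real run would call the trained model here.
    return (seq.count("G") + seq.count("C")) / len(seq)


effects = mutation_effects("ACGT", toy_predict)
# Positions whose substitutions shift the prediction most are candidate regulatory bases.
print(max(effects, key=lambda key: abs(effects[key])))
```

A sequence of length n produces 3n variants, so scoring them in batches would matter for long windows.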
Author: Lim Wen Gio
Open Source Research Prototype.