
End-to-end Deep Learning pipeline (PyTorch/Transformers) that reads raw DNA to predict gene expression. Features O(1) lazy-loading for massive datasets.


Lwg78/genomic-decoder-fly


🧬 Genomic Decoder (FlyOS)

A Transformer-based AI system designed to decode the regulatory logic of the Drosophila (Fruit Fly) genome.


☁️ Automated Research Pipeline

This repository includes a CI/CD pipeline that ensures the reproducibility of the experiment. Every commit automatically:

  1. Installs the specific PyTorch/NumPy environment.
  2. Downloads the Drosophila genome from UCSC.
  3. Trains the Transformer model on a subset of data.
  4. Publishes the trained model weights as an artifact.

View the Pipeline Logs

🔬 Executive Summary

This project treats the cell not as a biological entity, but as a Computational Operating System.

By applying Natural Language Processing (NLP) techniques—specifically Transformers and K-Mer Tokenization—to raw DNA sequences, this system aims to predict gene expression levels directly from the sequence, effectively "reading" the regulatory code of life.

Core Concept:

  • Hardware: Chromatin Structure (Nucleosomes)
  • Software: DNA Sequence (Promoters/Enhancers)
  • Compiler: Neural Network (Genomic Decoder)

🛠 Architecture

The system uses a lazy-loading strategy to read large genomic files on demand, so it runs efficiently on consumer hardware (a MacBook Air) despite the dataset size (23M+ base pairs).

genomic-decoder-fly/
├── data/raw/             # 🛑 Stores dm6.fa (Lazy Loaded)
├── src/
│   ├── dataloader.py     # 🧬 Biopython-based O(1) Disk Access
│   ├── tokenizer.py      # 🔡 K-Mer Tokenization (Vocab Size: 69)
│   ├── dataset.py        # 📦 PyTorch Dataset (Sliding Window)
│   └── model.py          # 🧠 Transformer Encoder (1.2M Params)
├── scripts/
│   └── download_data.sh  # 📜 Auto-fetch UCSC Genomes
├── .github/workflows/    # 🤖 CI/CD Research Pipeline
├── main.py               # 🚀 Training Loop Simulation
├── README.md             # 📄 Documentation
└── requirements.txt      # Dependencies
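The sliding-window dataset listed for src/dataset.py can be illustrated with a minimal sketch. This is not the repository's implementation: the class name `SlidingWindowDataset` and the `window_size` and `stride` parameters are assumptions made for illustration. The class follows the `__len__`/`__getitem__` protocol that a map-style PyTorch Dataset uses, but it is written in plain Python so it runs without PyTorch installed.

```python
class SlidingWindowDataset:
    """Map-style dataset sketch: item i is the window starting at i * stride.

    Hypothetical illustration only; names and parameters are not taken
    from the repository's src/dataset.py.
    """

    def __init__(self, sequence, window_size, stride):
        if window_size <= 0 or stride <= 0:
            raise ValueError("window_size and stride must be positive")
        self.sequence = sequence
        self.window_size = window_size
        self.stride = stride

    def __len__(self):
        # Number of full windows that fit inside the sequence.
        usable = len(self.sequence) - self.window_size
        return 0 if usable < 0 else usable // self.stride + 1

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self.stride
        return self.sequence[start:start + self.window_size]
```

Because windows are computed from an index rather than stored, the dataset itself holds no copies of the sequence, which fits the lazy-loading goal described above.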

🚀 Key Technical Features

1. Lazy Loading (Big Data Engineering)

  • Problem: Together, the Drosophila genome and the Single-Cell Atlas run to gigabytes of data. Loading them fully into RAM can exhaust memory on a standard laptop.
  • Solution: Implemented GenomicDataLoader using Biopython indexing. This gives O(1) lookup of any chromosome by name directly from disk, so only the requested record is read into RAM instead of the whole file.
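The idea behind index-based access can be sketched without Biopython. The example below is a simplified stand-in, not the repository's GenomicDataLoader: it scans a FASTA file once to record the byte offset of each header, then seeks straight to a record when it is requested. The function names `build_offset_index`, `fetch_record`, and `write_demo_fasta` are hypothetical, and the demo file is a tiny made-up FASTA so the sketch runs without dm6.fa.

```python
import os
import tempfile


def build_offset_index(fasta_path):
    """Scan the file once and map each record name to its header's byte offset."""
    index = {}
    with open(fasta_path, "rb") as handle:
        offset = handle.tell()
        line = handle.readline()
        while line:
            if line.startswith(b">"):
                # Record name is the first word after ">".
                name = line[1:].split()[0].decode("ascii")
                index[name] = offset
            offset = handle.tell()
            line = handle.readline()
    return index


def fetch_record(fasta_path, index, name):
    """Seek to one record and read only its sequence lines."""
    chunks = []
    with open(fasta_path, "rb") as handle:
        handle.seek(index[name])
        handle.readline()  # skip the header line
        for line in handle:
            if line.startswith(b">"):
                break  # next record starts here
            chunks.append(line.strip())
    return b"".join(chunks).decode("ascii")


def write_demo_fasta():
    """Create a tiny two-record FASTA file (made-up data for the demo)."""
    fd, path = tempfile.mkstemp(suffix=".fa")
    with os.fdopen(fd, "w") as handle:
        handle.write(">chrA demo\nACGTAC\nGTAC\n>chrB demo\nTTGACA\n")
    return path
```

The dictionary lookup is constant time, while reading a record still costs time proportional to that record's length. Only the index and the requested sequence are held in memory.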

2. K-Mer Tokenization (NLP for Biology)

  • Instead of One-Hot Encoding (A,C,G,T), we use K-Mers (e.g., ATG, TGC).
  • This captures local context, similar to how Words are more meaningful than Letters in English.
  • Vocabulary: 64 possible 3-mers (4^3) plus special tokens (including [CLS] and [PAD]); the architecture listing gives a total vocabulary size of 69.
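A minimal sketch of 3-mer tokenization follows. It is not the repository's tokenizer.py. It assumes overlapping windows with a stride of 1 (consistent with the ATG, TGC example, but not confirmed by the source), and it adds an [UNK] token for k-mers containing non-ACGT characters, which is an assumption of this sketch. Its vocabulary therefore has 67 entries and will not match the repository's size of 69.

```python
from itertools import product

# [PAD] and [CLS] are named in the README; [UNK] is an assumption of this sketch.
SPECIAL_TOKENS = ["[PAD]", "[CLS]", "[UNK]"]


def build_vocab(k=3):
    """Map special tokens and every possible k-mer over ACGT to integer ids."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return {token: i for i, token in enumerate(SPECIAL_TOKENS + kmers)}


def tokenize(sequence, vocab, k=3, stride=1):
    """Prefix [CLS], then emit one id per k-mer window (stride is assumed)."""
    sequence = sequence.upper()
    ids = [vocab["[CLS]"]]
    for start in range(0, len(sequence) - k + 1, stride):
        kmer = sequence[start:start + k]
        # K-mers with ambiguous bases such as N fall back to [UNK].
        ids.append(vocab.get(kmer, vocab["[UNK]"]))
    return ids
```

For example, `tokenize("ATGC", build_vocab())` yields the [CLS] id followed by the ids for ATG and TGC.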

3. Transformer Backbone

  • We utilize a custom GenomicTransformer with Positional Encoding.
  • Parameters: ~1.2 Million.
  • Task: Regression (Predicting Transcriptome abundance from DNA).
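To give a feel for where a figure of roughly 1.2 million parameters can come from, the sketch below counts the parameters of a standard Transformer encoder. Every hyperparameter in the usage note is hypothetical; the repository's actual values are set in main.py and may differ. The count assumes a standard encoder layer (self-attention with biased Q, K, V, and output projections, a two-layer feed-forward block, and two layer norms), a token embedding, a fixed (non-learned) positional encoding, and a single linear regression output.

```python
def encoder_layer_params(d_model, dim_feedforward):
    """Parameter count of one standard Transformer encoder layer."""
    # Q, K, V and output projections, each d_model x d_model plus a bias.
    attention = 4 * d_model * d_model + 4 * d_model
    # Two linear layers: d_model -> dim_feedforward -> d_model, with biases.
    feedforward = 2 * d_model * dim_feedforward + dim_feedforward + d_model
    # Two layer norms, each with a weight and a bias vector.
    layer_norms = 2 * 2 * d_model
    return attention + feedforward + layer_norms


def model_params(vocab_size, d_model, dim_feedforward, num_layers):
    """Embedding + encoder stack + one-output regression head (assumed layout)."""
    embedding = vocab_size * d_model
    encoder = num_layers * encoder_layer_params(d_model, dim_feedforward)
    regression_head = d_model + 1  # weights plus bias for a single output
    return embedding + encoder + regression_head
```

With the vocabulary size of 69 and a hypothetical configuration of `d_model=128`, `dim_feedforward=512`, and 6 layers, `model_params(69, 128, 512, 6)` comes to 1,198,593. The number of attention heads does not change the count, since heads only partition the same projection matrices.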

⚙️ How to Run

  1. Install Dependencies:

    pip install -r requirements.txt

  2. Download Data:

    bash scripts/download_data.sh

  3. Run Pipeline:

    python main.py

💻 Developer Guide: For AI & Data Engineers

This codebase is designed to be modular.

  • Data Ingestion: Located in src/dataloader.py. To swap the organism (e.g., Human hg38), update the download script and point the loader to the new FASTA file.
  • Model Config: Hyperparameters (Layers, Heads, Embedding Dim) can be adjusted in main.py.

🔮 Future Roadmap

  • Phase 2: Integrate fly_cell_atlas.h5ad labels to train the model on real Gene Expression data.
  • Phase 3: Implement "In-Silico Mutagenesis" to predict how mutations affect gene regulation.
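As a rough sketch of what in-silico mutagenesis involves: substitute each base in turn, re-score the mutated sequence, and record the change relative to the reference. The scoring function here, `gc_fraction_score`, is a deliberately trivial hypothetical stand-in for a trained model, and `in_silico_mutagenesis` is an illustrative name, not something from the repository.

```python
def gc_fraction_score(sequence):
    """Hypothetical stand-in for a trained model: returns the GC fraction."""
    if not sequence:
        return 0.0
    return sum(base in "GC" for base in sequence) / len(sequence)


def in_silico_mutagenesis(sequence, predict):
    """Score every single-base substitution relative to the reference sequence.

    Returns a dict keyed by (position, reference_base, alternate_base) whose
    value is predict(mutant) - predict(reference).
    """
    reference = predict(sequence)
    effects = {}
    for pos, ref_base in enumerate(sequence):
        for alt_base in "ACGT":
            if alt_base == ref_base:
                continue
            mutant = sequence[:pos] + alt_base + sequence[pos + 1:]
            effects[(pos, ref_base, alt_base)] = predict(mutant) - reference
    return effects
```

Any callable that maps a sequence to a number can be passed as `predict`, so a trained model's inference function could replace the stand-in. A sequence of length n produces 3n scored substitutions.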

Author: Lim Wen Gio
Open Source Research Prototype.
