🧬 Genomic Decoder (FlyOS)

A Transformer-based AI system designed to decode the regulatory logic of the Drosophila (Fruit Fly) genome.

☁️ Automated Research Pipeline

This repository includes a CI/CD pipeline that ensures the reproducibility of the experiment. Every commit automatically:

Installs the specific PyTorch/NumPy environment.
Downloads the Drosophila genome from UCSC.
Trains the Transformer model on a subset of data.
Publishes the trained model weights as an artifact.

View the Pipeline Logs

🔬 Executive Summary

This project treats the cell not as a biological entity, but as a Computational Operating System.

By applying Natural Language Processing (NLP) techniques—specifically Transformers and K-Mer Tokenization—to raw DNA sequences, this system aims to predict gene expression levels directly from the sequence, effectively "reading" the regulatory code of life.

Core Concept:

Hardware: Chromatin Structure (Nucleosomes)
Software: DNA Sequence (Promoters/Enhancers)
Compiler: Neural Network (Genomic Decoder)

🛠 Architecture

The system handles massive genomic files using a Lazy-Loading strategy to run efficiently on consumer hardware (MacBook Air), despite the dataset size (23M+ base pairs).

genomic-decoder-fly/
├── data/raw/             # 🛑 Stores dm6.fa (Lazy Loaded)
├── src/
│   ├── dataloader.py     # 🧬 Biopython-based O(1) Disk Access
│   ├── tokenizer.py      # 🔡 K-Mer Tokenization (Vocab Size: 69)
│   ├── dataset.py        # 📦 PyTorch Dataset (Sliding Window)
│   └── model.py          # 🧠 Transformer Encoder (1.2M Params)
├── scripts/
│   └── download_data.sh  # 📜 Auto-fetch UCSC Genomes
├── .github/workflows/    # 🤖 CI/CD Research Pipeline
├── main.py               # 🚀 Training Loop Simulation
├── README.md             # 📄 Documentation
└── requirements.txt      # Dependencies
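
The tree lists dataset.py as a sliding-window PyTorch Dataset but does not show how windows are cut. The following is a minimal sketch of that idea, not the repository's code: the class name, `window` and `stride` parameters are assumptions. It is written without a PyTorch import because a map-style dataset only needs `__len__` and `__getitem__`.

```python
class SlidingWindowDataset:
    """Hypothetical map-style dataset: fixed-size windows over one sequence."""

    def __init__(self, sequence, window=512, stride=256):
        if window <= 0 or stride <= 0:
            raise ValueError("window and stride must be positive")
        self.sequence = sequence
        self.window = window
        self.stride = stride

    def __len__(self):
        # Number of full windows that fit; a trailing partial window is dropped.
        if len(self.sequence) < self.window:
            return 0
        return (len(self.sequence) - self.window) // self.stride + 1

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        start = idx * self.stride
        return self.sequence[start:start + self.window]


if __name__ == "__main__":
    ds = SlidingWindowDataset("ACGTACGTAC", window=4, stride=3)
    for i in range(len(ds)):
        print(i, ds[i])
```

A stride smaller than the window gives overlapping windows, so a regulatory motif that straddles one window boundary still appears whole in a neighbouring window.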

🚀 Key Technical Features

1. Lazy Loading (Big Data Engineering)

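Problem: The Drosophila genome and the Single-Cell Atlas are large files, and the atlas in particular can reach gigabytes. Loading them fully into RAM can crash a standard laptop.
Solution: GenomicDataLoader uses Biopython indexing, which records where each chromosome starts in the FASTA file. Any chromosome can then be looked up by name in O(1) and read directly from disk, so memory holds only the sequence being requested rather than the whole genome.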
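
To make the indexing idea concrete, here is a standard-library sketch of offset-based FASTA access. It is an illustration of the principle, not the GenomicDataLoader implementation and not Biopython's API: `build_index` and `fetch` are hypothetical names, and the sketch assumes every sequence line in a record has the same width except the last.

```python
import os
import tempfile


def build_index(path):
    """Scan the FASTA once and record, per record, where its sequence starts."""
    index = {}
    entry = None
    with open(path, "rb") as fh:
        while True:
            line = fh.readline()
            if not line:
                break
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                entry = {"start": fh.tell(), "length": 0, "bases": 0, "width": 0}
                index[name] = entry
            elif entry is not None:
                stripped = line.rstrip(b"\r\n")
                if entry["bases"] == 0:
                    # Layout of a full line: bases per line and bytes per line.
                    entry["bases"] = len(stripped)
                    entry["width"] = len(line)
                entry["length"] += len(stripped)
    return index


def fetch(path, index, name, start, end):
    """Return bases [start, end) of one record by seeking to its byte offset."""
    e = index[name]
    end = min(end, e["length"])
    if e["length"] == 0 or start >= end:
        return ""
    first = e["start"] + (start // e["bases"]) * e["width"] + start % e["bases"]
    last = e["start"] + (end // e["bases"]) * e["width"] + end % e["bases"]
    with open(path, "rb") as fh:
        fh.seek(first)
        raw = fh.read(last - first)
    # Drop the line breaks that fall inside the requested byte range.
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.fa")
        with open(path, "w", newline="\n") as fh:
            fh.write(">chrA\nACGTACGT\nACGT\n>chrB\nTTTTGGGG\nCC\n")
        index = build_index(path)
        print(fetch(path, index, "chrB", 2, 10))
```

The index costs one pass over the file. After that, each lookup is a dictionary access plus a single seek, and only the requested slice is held in memory.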

2. K-Mer Tokenization (NLP for Biology)

Instead of one-hot encoding single bases (A, C, G, T), we tokenize the sequence into K-Mers (e.g., ATG, TGC).
Each token then carries local context, much as words carry more meaning than individual letters in English.
Vocabulary: the 64 possible 3-mers (4^3) plus special tokens such as [CLS] and [PAD].
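
A minimal sketch of 3-mer tokenization follows. It is not the code in src/tokenizer.py: the function names, the overlapping stride of 1, the token id order, and the extra [UNK] token (for k-mers containing N or other ambiguity codes) are all assumptions, so the vocabulary size here will not match the repository's.

```python
from itertools import product

# [PAD] and [CLS] come from the README; [UNK] is an assumption of this sketch.
SPECIAL_TOKENS = ["[PAD]", "[CLS]", "[UNK]"]


def build_vocab(k=3):
    """Map special tokens and every k-mer over ACGT to an integer id."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return {token: i for i, token in enumerate(SPECIAL_TOKENS + kmers)}


def tokenize(seq, vocab, k=3, stride=1, max_len=None):
    """Turn a DNA string into ids: [CLS], then one id per k-mer, then padding."""
    seq = seq.upper()
    ids = [vocab["[CLS]"]]
    for i in range(0, len(seq) - k + 1, stride):
        ids.append(vocab.get(seq[i:i + k], vocab["[UNK]"]))
    if max_len is not None:
        ids = ids[:max_len]
        ids += [vocab["[PAD]"]] * (max_len - len(ids))
    return ids


if __name__ == "__main__":
    vocab = build_vocab()
    print(len(vocab))
    print(tokenize("ATGC", vocab, max_len=5))
```

With stride 1 the k-mers overlap (ATGC yields ATG and TGC), which keeps one token per base position; a stride of k would give non-overlapping tokens and a sequence three times shorter.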

3. Transformer Backbone

We utilize a custom GenomicTransformer with Positional Encoding.
Parameters: ~1.2 Million.
Task: Regression (Predicting Transcriptome abundance from DNA).
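
For a sense of where a figure like ~1.2M comes from, the arithmetic below counts parameters for a standard Transformer encoder layout (the same layer structure as PyTorch's `nn.TransformerEncoderLayer`). The hyperparameters used are hypothetical and chosen only because they land near 1.2M; the actual values live in main.py and are not stated in this README. The sketch also assumes a fixed (parameter-free) positional encoding and a single-output regression head.

```python
def encoder_layer_params(d_model, d_ff):
    """Parameters in one standard encoder layer (weights and biases)."""
    attention = 4 * (d_model * d_model + d_model)  # Q, K, V and output projections
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)  # two LayerNorms, each with scale and shift
    return attention + feed_forward + layer_norms


def model_params(vocab_size, d_model, d_ff, n_layers):
    """Embedding table + encoder stack + linear regression head with one output."""
    embedding = vocab_size * d_model
    encoder = n_layers * encoder_layer_params(d_model, d_ff)
    head = d_model + 1
    return embedding + encoder + head


if __name__ == "__main__":
    # Hypothetical configuration, not taken from the repository.
    print(model_params(vocab_size=69, d_model=128, d_ff=512, n_layers=6))
```

The number of attention heads does not appear in the count: splitting the embedding dimension across heads changes how the projections are used, not how many weights they contain.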

⚙️ How to Run

Install Dependencies:
```
pip install -r requirements.txt
```
Download Data:
```
bash scripts/download_data.sh
```
Run Pipeline:
```
python main.py
```

💻 Developer Guide: For AI & Data Engineers

This codebase is designed to be modular.

Data Ingestion: Located in src/dataloader.py. To swap the organism (e.g., Human hg38), update the download script and point the loader to the new FASTA file.
Model Config: Hyperparameters (Layers, Heads, Embedding Dim) can be adjusted in main.py.

🔮 Future Roadmap

Phase 2: Integrate fly_cell_atlas.h5ad labels to train the model on real Gene Expression data.
Phase 3: Implement "In-Silico Mutagenesis" to predict how mutations affect gene regulation.
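
As a rough illustration of what Phase 3 could look like, the sketch below scores every single-base substitution against the unmutated sequence. Nothing here exists in the repository yet: `in_silico_mutagenesis` is a hypothetical name, `predict` stands in for a trained model, and the GC-fraction function is only a toy placeholder so the example runs on its own.

```python
def in_silico_mutagenesis(seq, predict):
    """Return {(position, ref_base, alt_base): predicted change} for all SNVs."""
    seq = seq.upper()
    reference_score = predict(seq)
    effects = {}
    for pos, ref_base in enumerate(seq):
        for alt in "ACGT":
            if alt == ref_base:
                continue
            mutant = seq[:pos] + alt + seq[pos + 1:]
            effects[(pos, ref_base, alt)] = predict(mutant) - reference_score
    return effects


def gc_fraction(seq):
    """Toy stand-in for a model: fraction of G/C bases in the sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


if __name__ == "__main__":
    effects = in_silico_mutagenesis("ATGC", gc_fraction)
    for key in sorted(effects):
        print(key, effects[key])
```

Positions where substitutions produce large changes in the prediction are candidates for functionally important bases, which is the usual way such scans are read.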

Author: Lim Wen Gio
Open Source Research Prototype.

Lwg78/genomic-decoder-fly
