Reproducible Snakemake pipeline for ML-ready Perturb-seq datasets
Transforming scRNA-seq CRISPR screens into balanced, harmonized datasets for AI/ML training
| Category | Tools |
|---|---|
| Single-Cell Analysis | Scanpy ≥1.9.3, AnnData ≥0.9.0, scrublet (doublet detection) |
| Normalization & Integration | scikit-learn, harmonypy, bbknn |
| Feature Engineering | gseapy (pathway analysis), dorothea (TF regulons) |
| Workflow | Snakemake ≥7.32.0, Mamba/Conda |
| Visualization | matplotlib, seaborn, Jupyter notebooks |
| Data Processing | numpy ≥1.23, pandas ≥2.0, scipy ≥1.10 |
This pipeline transforms raw single-cell RNA-seq data with genetic perturbations into high-quality, balanced datasets suitable for training ML/AI models to predict perturbation effects. It supports multiple dataset types and provides comprehensive quality control, normalization, and integration capabilities.
A rendered pipeline report for the demo data is available (see the Demo Report link under Documentation below).
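The pipeline reads and writes AnnData (`.h5ad`) files at every stage. Below is a minimal, hedged sketch of inspecting an input; the demo path comes from the quick start below, and the exact `obs` column names are dataset-specific assumptions:

```python
import anndata as ad

# Load the demo Perturb-seq subset (path from the quick start below)
adata = ad.read_h5ad("data_local/demo/replogle_subset.h5ad")

print(adata)                       # n_obs x n_vars matrix plus metadata
print(adata.obs.columns.tolist())  # per-cell metadata, including the perturbation label
# The pipeline expects one categorical obs column naming the targeted gene per
# cell; the column name is dataset-specific (see config/datasets.yaml).
```

Key features: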
- Automated Workflow: Snakemake-based pipeline with dependency management
- Quality Control: Comprehensive filtering of low-quality cells and genes
- Class Balancing: Smart downsampling to prevent model bias
- Batch Integration: Harmonization of multiple datasets using Harmony/BBKNN (see the sketch after this list)
- Feature Engineering: Biological feature extraction (pathways, TFs, regulatory networks)
- Cross-Validation: Leave-genes-out CV strategy for unseen perturbations
- Reproducibility: Fully reproducible with Conda environments and version control
- Interactive Notebooks: Jupyter notebooks for data exploration and visualization
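As referenced above, batch integration runs through Harmony or BBKNN. A minimal sketch of the Harmony route via Scanpy's external API (the input path and the `batch` column name are assumptions; `harmonypy` must be installed, as listed in the tool table):

```python
import scanpy as sc

# Any AnnData with a per-cell batch annotation (column name assumed here)
adata = sc.read_h5ad("results/demo/balanced.h5ad")

# Harmony corrects the PCA embedding, so compute PCs first
sc.pp.pca(adata, n_comps=50)

# Writes the corrected embedding to adata.obsm["X_pca_harmony"]
sc.external.pp.harmony_integrate(adata, key="batch")

# Downstream steps should build the neighbor graph on the corrected embedding
sc.pp.neighbors(adata, use_rep="X_pca_harmony")
```

Prerequisites: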
- Python 3.9+
- Conda/Mamba
- Hardware: 4+ cores, 8GB+ RAM (16GB recommended)
- Storage: ~20GB for demo data, ~100GB for full datasets
```bash
# Clone repository
git clone https://github.com/ACTN3Bioinformatics/VCC-project.git
cd VCC-project

# Create mamba environment
mamba env create -f environment.yml
mamba activate vcc2025
```

Download demonstration data (optimized subset of Replogle et al. 2022):

```bash
# Automatic download and preparation
snakemake download_demo_data --cores 1
# This creates: data_local/demo/replogle_subset.h5ad (~500MB, 10k cells)
```

Run the complete pipeline on the demo data:

```bash
snakemake --cores 4 --configfile config/datasets.yaml

# Run specific stages
snakemake results/demo/filtered.h5ad --cores 4  # QC only
snakemake results/demo/balanced.h5ad --cores 4  # Through balancing
snakemake results/demo/final.h5ad --cores 4     # Complete pipeline

# Dry-run to see execution plan
snakemake -n

# Test configuration
snakemake test --cores 1

# Generate workflow visualization
python scripts/generate_workflow_diagram.py
# Creates: docs/workflow_diagram.png (if graphviz is installed)
# or: docs/workflow_dag.txt (always works)
```

Launch the demo exploration notebook:

```bash
jupyter notebook notebooks/demo_exploration.ipynb

# Or explore processed results
jupyter notebook
```
Project structure:

```
VCC-project/
├── workflows/                   # Snakemake workflow definitions
│   ├── Snakefile                # Main workflow entry point
│   └── rules/                   # Individual pipeline rules
│       ├── download.smk         # Data acquisition
│       ├── qc.smk               # Quality control & filtering
│       ├── normalize.smk        # Normalization & scaling
│       ├── balance.smk          # Class balancing
│       ├── integrate.smk        # Batch integration (Harmony/BBKNN)
│       ├── split.smk            # Train/val/test splits
│       └── features.smk         # Feature engineering
├── scripts/                     # Core Python modules
│   ├── download_demo_data.py
│   ├── filter_normalize.py
│   ├── balance.py
│   ├── integration.py
│   ├── split_data.py
│   ├── feature_engineering.py
│   └── utils.py
├── config/                      # Configuration files
│   ├── datasets.yaml            # Dataset-specific parameters
│   └── config.yaml              # Global pipeline settings
├── data_local/                  # Local data storage (NOT tracked in Git)
│   ├── demo/                    # Demonstration datasets
│   ├── raw/                     # Raw input data
│   └── processed/               # Intermediate outputs
├── results/                     # Final pipeline outputs
│   └── demo/                    # Demo results
├── reports/                     # QC reports and visualizations
├── logs/                        # Snakemake and script logs
├── notebooks/                   # Jupyter notebooks
│   └── demo_exploration.ipynb   # Interactive demo notebook
├── docs/                        # Extended documentation
│   ├── PIPELINE_GUIDE.md        # Detailed pipeline guide
│   ├── QUICKSTART.md            # 5-minute tutorial
│   └── TROUBLESHOOTING.md       # Common issues
├── tests/                       # Unit tests
│   ├── test_qc.py
│   └── test_balance.py
├── environment.yml              # Conda environment specification
├── LICENSE                      # MIT License
├── CITATION.cff                 # Citation metadata
├── CONTRIBUTING.md              # Contribution guidelines
└── README.md                    # This file
```
The pipeline consists of modular stages executed by Snakemake:
- Data Acquisition - Download and prepare demo data
- Quality Control - Filter low-quality cells and genes (Scanpy)
- Normalization - Count normalization and log transformation
- Class Balancing - Balance perturbation classes (scikit-learn)
- Batch Integration - Harmonize datasets using Harmony or BBKNN (optional)
- Feature Engineering - Extract biological features (gseapy, dorothea)
- Data Splitting - Create train/val/test splits (leave-genes-out strategy)
- Benchmarking - Evaluate baseline models
For detailed information, see docs/PIPELINE_GUIDE.md.
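As a rough illustration, the QC and normalization stages map onto the following Scanpy calls using the demo thresholds from `config/datasets.yaml` (a sketch, not the pipeline's exact code; `scripts/filter_normalize.py` is authoritative):

```python
import scanpy as sc

adata = sc.read_h5ad("data_local/demo/replogle_subset.h5ad")  # demo input (path assumed)

# QC: flag mitochondrial genes and compute per-cell metrics
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

# Apply the demo thresholds (min_genes, min_cells_per_gene, max_genes, max_pct_mt)
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata = adata[adata.obs["n_genes_by_counts"] <= 6000].copy()
adata = adata[adata.obs["pct_counts_mt"] < 15].copy()

# Normalization: depth-normalize to 10k counts, log1p, z-score per gene
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.scale(adata, max_value=10)
```

Supported dataset types: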
| Dataset | Description | Size | Purpose | Processing |
|---|---|---|---|---|
| Demo | Replogle K562 subset | ~10k cells | Testing/Learning | Full pipeline |
| Training | H1-hESC CRISPRi | ~300k cells | Model training | Full QC + balancing |
| Validation | H1-hESC validation | ~50k cells | Model selection | Same as training |
| Test | Unseen perturbations | ~50k cells | Final evaluation | Minimal processing |
| Public | External datasets | Variable | Pre-training/augmentation | Full integration |
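The leave-genes-out strategy partitions perturbations rather than cells, so validation and test sets contain only perturbations never seen during training. A hedged sketch of such a split (the helper function and the `perturbation` column are illustrative; the actual logic lives in `scripts/split_data.py`):

```python
import numpy as np

def leave_genes_out_split(perturbations, frac_val=0.1, frac_test=0.1, seed=0):
    """Partition perturbation labels so val/test perturbations never occur in train."""
    rng = np.random.default_rng(seed)
    genes = np.array(sorted(set(perturbations)))
    rng.shuffle(genes)
    n_val, n_test = int(len(genes) * frac_val), int(len(genes) * frac_test)
    val = genes[:n_val]
    test = genes[n_val:n_val + n_test]
    train = genes[n_val + n_test:]
    return set(train), set(val), set(test)

# Usage: assign each cell to a split by its perturbation label, e.g.
# train_mask = adata.obs["perturbation"].isin(train)
```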
Customize processing via `config/datasets.yaml`:
```yaml
demo:
  input_path: "data_local/demo/replogle_subset.h5ad"
  output_dir: "results/demo"

  # QC thresholds
  min_genes: 200
  max_genes: 6000
  max_pct_mt: 15
  min_cells_per_gene: 3

  # Processing options
  normalize: true
  log_transform: true
  scale: true
  balance: true
  target_cells_per_perturbation: 100

  # Integration (for multi-batch data)
  batch_correction: false
  batch_key: "batch"
```

See docs/PIPELINE_GUIDE.md#configuration for all options.
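For intuition, `target_cells_per_perturbation` caps every class during balancing. A sketch of that downsampling with pandas (the `perturbation` obs column is an assumption; the real implementation is `scripts/balance.py`):

```python
import pandas as pd
import scanpy as sc

adata = sc.read_h5ad("results/demo/filtered.h5ad")  # post-QC output (path assumed)
cap = 100  # target_cells_per_perturbation from the config above

# Group cell barcodes by their perturbation label, then sample at most `cap` per class
cells = pd.Series(adata.obs_names.values, index=adata.obs["perturbation"].values)
keep = cells.groupby(level=0, group_keys=False).apply(
    lambda s: s.sample(n=min(len(s), cap), random_state=0)
)
balanced = adata[keep.values].copy()
print(balanced.obs["perturbation"].value_counts().head())
```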
Hardware requirements and approximate runtimes:

| Tier | CPU | RAM | GPU | Storage | Time |
|---|---|---|---|---|---|
| Minimum (demo) | 4 cores | 8GB | - | 20GB SSD | ~30 minutes |
| Recommended (demo) | AMD Ryzen 5 7535HS or equivalent (8 cores @ 3.55 GHz) | 16GB LPDDR5x-6400 | AMD Radeon 660M (optional, for ML training) | 50GB SSD | ~15 minutes |
| Full datasets | 16+ cores | 64GB+ | 16GB+ VRAM for deep learning | 200GB+ SSD | ~2-4 hours |

Note: The demo data is specifically optimized for laptop processing on an AMD Ryzen 5 7535HS system (16GB RAM).
Documentation:

- Quick Start Guide - Get running in 5 minutes
- Pipeline Guide - Complete pipeline documentation
- Troubleshooting - Common issues and solutions
- Demo Notebook - Interactive demo data exploration
- Demo Report - HTML version of the report for the demo data
- Contributing Guide - How to contribute
The `notebooks/demo_exploration.ipynb` notebook provides an interactive introduction to:
- Loading and inspecting processed data
- Visualizing QC metrics
- Exploring perturbation effects
- Dimensionality reduction (PCA, UMAP) - see the sketch after this list
- Comparing pipeline stages
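The dimensionality-reduction steps the notebook walks through reduce to a few Scanpy calls; a minimal sketch (the coloring column name is an assumption):

```python
import scanpy as sc

adata = sc.read_h5ad("results/demo/final.h5ad")

# Standard embedding workflow: PCA -> kNN graph -> UMAP
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.umap(adata)

# Color by perturbation to eyeball class structure (column name assumed)
sc.pl.umap(adata, color="perturbation")
```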
Launch notebook:
```bash
jupyter notebook notebooks/demo_exploration.ipynb
```

Run unit tests:

```bash
# All tests
pytest tests/

# Specific test
pytest tests/test_qc.py -v

# With coverage
pytest --cov=scripts tests/

# Test CI locally
./test_ci_locally.sh
```
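For reference, a toy test in the spirit of `tests/test_qc.py` (hypothetical; the repository's actual tests may assert different behavior):

```python
import anndata as ad
import numpy as np
import scanpy as sc

def test_min_genes_filter_drops_empty_cells():
    # Toy matrix: 3 cells x 5 genes, with one cell expressing nothing
    X = np.array([[1, 2, 0, 0, 1],
                  [0, 0, 0, 0, 0],
                  [3, 0, 1, 1, 0]], dtype=float)
    adata = ad.AnnData(X)
    sc.pp.filter_cells(adata, min_genes=1)
    assert adata.n_obs == 2  # the all-zero cell was removed
```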
Contributions welcome! Please:

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
See CONTRIBUTING.md for detailed guidelines.
Szymon Myrta (kontakt@actn3.pl)
Bioinformatics Specialist | R/Python Programmer | Data Scientist
Open to collaboration!
I'm passionate about applying data science and bioinformatics to advance pharmaceutical research and precision medicine. This repository reflects my commitment to continuous learning, knowledge sharing, and contributing to the open-source community.
- 9+ years of experience in bioinformatics in pharmaceutical and biotech settings
- NGS data analysis (RNA-seq, scRNA-seq, ChIP-seq, TCR/BCR-seq, WES/WGS, etc.)
- Functional genomics data analysis (CRISPR / ORF overexpression screens)
- Strong R & Python programming skills, including development of packages and web apps
- Developer of NGS data analysis pipelines and reproducible research workflows
- Data visualization and interpretation of results
- Background in computational biology, cancer genomics, immuno-oncology
- Co-author of multiple peer-reviewed scientific publications in top-tier journals
- Interested in multi-omics data integration, precision medicine, AI-powered analyses
Tech Stack: Python, R/Bioconductor, Snakemake, Scanpy, Seurat, scikit-learn, Quarto, Git, CI/CD
Feel free to reach out for project ideas, consulting, or joint research in bioinformatics and data science.
If you use this pipeline in your research, please cite:
```bibtex
@software{vcc_project_2025,
  author = {Szymon Myrta},
  title  = {VCC-project: Single-Cell CRISPR Perturbation Pipeline},
  year   = {2025},
  url    = {https://github.com/ACTN3Bioinformatics/VCC-project},
  doi    = {10.5281/zenodo.18004721}
}
```

Demo data: if you use the demo dataset, please also cite:

- Replogle et al. (2022). "Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq." *Cell*. DOI: 10.1016/j.cell.2022.05.013
This project is licensed under the MIT License - see the LICENSE file for details.
- Virtual Cell Challenge 2025 organizers
- Replogle et al. for public Perturb-seq data
- scPerturb database for curated datasets
- Scanpy and AnnData developers
- Snakemake community
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: kontakt@actn3.pl
- Portfolio: actn3.github.io/ACTN3