A reproducible computational pipeline for processing and analyzing single-cell RNA-seq data with CRISPR perturbations (Perturb-seq), designed for the Virtual Cell Challenge 2025. Features automated quality control, normalization, class balancing, and batch integration using Snakemake.

Single-Cell CRISPR Perturbation Pipeline

CI Python Snakemake License: MIT DOI

Reproducible Snakemake pipeline for ML-ready Perturb-seq datasets

Transforming scRNA-seq CRISPR screens into balanced, harmonized datasets for AI/ML training

📖 Documentation | 🚀 Quick Start | 💻 Live Demo


๐Ÿ› ๏ธ Tech Stack

Core Technologies

Python Snakemake Scanpy Jupyter

Analysis & ML

NumPy Pandas scikit-learn

DevOps & Documentation

GitHub Actions Quarto Conda

Key Libraries

Category                     Tools
---------------------------  -----------------------------------------------------------
Single-Cell Analysis         Scanpy ≥1.9.3, AnnData ≥0.9.0, scrublet (doublet detection)
Normalization & Integration  scikit-learn, harmonypy, bbknn
Feature Engineering          gseapy (pathway analysis), dorothea (TF regulons)
Workflow                     Snakemake ≥7.32.0, Mamba/Conda
Visualization                matplotlib, seaborn, Jupyter notebooks
Data Processing              numpy ≥1.23, pandas ≥2.0, scipy ≥1.10

🎯 Overview

This pipeline transforms raw single-cell RNA-seq data with genetic perturbations into high-quality, balanced datasets suitable for training ML/AI models to predict perturbation effects. It supports multiple dataset types and provides comprehensive quality control, normalization, and integration capabilities.

Pipeline report for the demo data.

Key Features

  • 🔄 Automated Workflow: Snakemake-based pipeline with dependency management
  • ✅ Quality Control: Comprehensive filtering of low-quality cells and genes
  • ⚖️ Class Balancing: Smart downsampling to prevent model bias
  • 🔗 Batch Integration: Harmonization of multiple datasets using Harmony/BBKNN
  • 🧬 Feature Engineering: Biological feature extraction (pathways, TFs, regulatory networks)
  • 📊 Cross-Validation: Leave-genes-out CV strategy for unseen perturbations
  • 🔬 Reproducibility: Fully reproducible with Conda environments and version control
  • 📓 Interactive Notebooks: Jupyter notebooks for data exploration and visualization

📋 Quick Start

Prerequisites

  • Python 3.9+
  • Conda/Mamba
  • Hardware: 4+ cores, 8GB+ RAM (16GB recommended)
  • Storage: ~20GB for demo data, ~100GB for full datasets

Installation

# Clone repository
git clone https://github.com/ACTN3Bioinformatics/VCC-project.git
cd VCC-project

# Create mamba environment
mamba env create -f environment.yml
mamba activate vcc2025

Demo Data Setup

Download demonstration data (optimized subset of Replogle et al. 2022):

# Automatic download and preparation
snakemake download_demo_data --cores 1

# This creates: data_local/demo/replogle_subset.h5ad (~500MB, 10k cells)

Running the Pipeline

# Run complete pipeline on demo data
snakemake --cores 4 --configfile config/datasets.yaml

# Run specific stages
snakemake results/demo/filtered.h5ad --cores 4          # QC only
snakemake results/demo/balanced.h5ad --cores 4          # Through balancing
snakemake results/demo/final.h5ad --cores 4             # Complete pipeline

# Dry-run to see execution plan
snakemake -n

# Test configuration
snakemake test --cores 1

# Generate workflow visualization
python scripts/generate_workflow_diagram.py
# Creates: docs/workflow_diagram.png (if graphviz installed)
#      or: docs/workflow_dag.txt (always works)

Explore with Jupyter Notebook

# Launch demo exploration notebook
jupyter notebook notebooks/demo_exploration.ipynb

# Or explore processed results
jupyter notebook

๐Ÿ“ Project Structure

VCC-project/
├── workflows/              # Snakemake workflow definitions
│   ├── Snakefile          # Main workflow entry point
│   └── rules/             # Individual pipeline rules
│       ├── download.smk   # Data acquisition
│       ├── qc.smk         # Quality control & filtering
│       ├── normalize.smk  # Normalization & scaling
│       ├── balance.smk    # Class balancing
│       ├── integrate.smk  # Batch integration (Harmony/BBKNN)
│       ├── split.smk      # Train/val/test splits
│       └── features.smk   # Feature engineering
├── scripts/               # Core Python modules
│   ├── download_demo_data.py
│   ├── filter_normalize.py
│   ├── balance.py
│   ├── integration.py
│   ├── split_data.py
│   ├── feature_engineering.py
│   └── utils.py
├── config/                # Configuration files
│   ├── datasets.yaml      # Dataset-specific parameters
│   └── config.yaml        # Global pipeline settings
├── data_local/            # Local data storage (NOT tracked in Git)
│   ├── demo/              # Demonstration datasets
│   ├── raw/               # Raw input data
│   └── processed/         # Intermediate outputs
├── results/               # Final pipeline outputs
│   └── demo/              # Demo results
├── reports/               # QC reports and visualizations
├── logs/                  # Snakemake and script logs
├── notebooks/             # Jupyter notebooks
│   └── demo_exploration.ipynb  # Interactive demo notebook
├── docs/                  # Extended documentation
│   ├── PIPELINE_GUIDE.md  # Detailed pipeline guide
│   ├── QUICKSTART.md      # 5-minute tutorial
│   └── TROUBLESHOOTING.md # Common issues
├── tests/                 # Unit tests
│   ├── test_qc.py
│   └── test_balance.py
├── environment.yml        # Conda environment specification
├── LICENSE                # MIT License
├── CITATION.cff           # Citation metadata
├── CONTRIBUTING.md        # Contribution guidelines
└── README.md              # This file

🔬 Pipeline Overview

The pipeline consists of modular stages executed by Snakemake:

  1. 📥 Data Acquisition - Download and prepare demo data
  2. 🔍 Quality Control - Filter low-quality cells and genes (Scanpy)
  3. 📊 Normalization - Count normalization and log transformation
  4. ⚖️ Class Balancing - Balance perturbation classes (scikit-learn)
  5. 🔗 Batch Integration - Harmonize datasets using Harmony or BBKNN (optional)
  6. 🧬 Feature Engineering - Extract biological features (gseapy, dorothea)
  7. ✂️ Data Splitting - Create train/val/test splits (leave-genes-out strategy)
  8. 📈 Benchmarking - Evaluate baseline models
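
The QC and normalization stages (2-3) can be sketched in plain NumPy. This is an illustration only: the pipeline itself uses Scanpy, and the matrix and thresholds below are toy values, not the config defaults (min_genes=200, max_pct_mt=15, min_cells_per_gene=3).

```python
import numpy as np

# Toy count matrix: 5 cells x 5 genes; the last gene stands in for the
# mitochondrial gene set. All values are illustrative.
counts = np.array([
    [0, 0, 0, 0, 0],   # empty droplet -> removed by QC
    [5, 3, 0, 2, 1],
    [4, 0, 6, 1, 2],
    [7, 2, 3, 0, 1],
    [1, 1, 1, 1, 9],   # ~69% mitochondrial reads -> removed by QC
], dtype=float)
mito = np.array([False, False, False, False, True])

# Stage 2: quality control (toy thresholds: >= 2 genes, <= 50% mito)
genes_per_cell = (counts > 0).sum(axis=1)
pct_mt = 100 * counts[:, mito].sum(axis=1) / np.maximum(counts.sum(axis=1), 1)
counts = counts[(genes_per_cell >= 2) & (pct_mt <= 50)]
counts = counts[:, (counts > 0).sum(axis=0) >= 2]   # drop rarely detected genes

# Stage 3: scale each cell to a common total, then log-transform
target_sum = 1e4   # counts-per-10k scaling
lognorm = np.log1p(counts / counts.sum(axis=1, keepdims=True) * target_sum)
print(lognorm.shape)   # (3, 5): two low-quality cells removed
```

After this step every retained cell has the same total count on the linear scale, which is what makes expression values comparable across cells downstream.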

For detailed information, see docs/PIPELINE_GUIDE.md.
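
The leave-genes-out strategy in stage 7 can be sketched as follows. This is a simplified stand-in for scripts/split_data.py; the function name and split fractions are illustrative assumptions, not the pipeline's actual API.

```python
import numpy as np

def leave_genes_out_split(target_genes, frac_val=0.15, frac_test=0.15, seed=0):
    """Partition perturbation target genes so no gene is shared between
    splits; every cell then follows its target gene into exactly one split,
    keeping test-set perturbations entirely unseen during training."""
    genes = np.array(sorted(set(target_genes)))
    rng = np.random.default_rng(seed)
    rng.shuffle(genes)
    n_test = int(len(genes) * frac_test)
    n_val = int(len(genes) * frac_val)
    return {
        "test": set(genes[:n_test]),
        "val": set(genes[n_test:n_test + n_val]),
        "train": set(genes[n_test + n_val:]),
    }

splits = leave_genes_out_split([f"GENE{i:02d}" for i in range(20)])
print({k: len(v) for k, v in splits.items()})   # {'test': 3, 'val': 3, 'train': 14}
```

Splitting by gene rather than by cell is the key design choice: a random per-cell split would leak every perturbation into the training set and overstate model performance on "unseen" perturbations.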


📊 Dataset Types

Dataset     Description            Size         Purpose                    Processing
----------  ---------------------  -----------  -------------------------  -------------------
Demo        Replogle K562 subset   ~10k cells   Testing/Learning           Full pipeline
Training    H1-hESC CRISPRi        ~300k cells  Model training             Full QC + balancing
Validation  H1-hESC validation     ~50k cells   Model selection            Same as training
Test        Unseen perturbations   ~50k cells   Final evaluation           Minimal processing
Public      External datasets      Variable     Pre-training/augmentation  Full integration

🔧 Configuration

Customize processing via config/datasets.yaml:

demo:
  input_path: "data_local/demo/replogle_subset.h5ad"
  output_dir: "results/demo"
  
  # QC thresholds
  min_genes: 200
  max_genes: 6000
  max_pct_mt: 15
  min_cells_per_gene: 3
  
  # Processing options
  normalize: true
  log_transform: true
  scale: true
  balance: true
  target_cells_per_perturbation: 100
  
  # Integration (for multi-batch data)
  batch_correction: false
  batch_key: "batch"

See docs/PIPELINE_GUIDE.md#configuration for all options.
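
The effect of balance: true with target_cells_per_perturbation can be sketched with pandas. This is a toy stand-in for scripts/balance.py; the column names, counts, and target value are illustrative assumptions.

```python
import pandas as pd

# Toy per-cell metadata with unbalanced perturbation classes.
obs = pd.DataFrame({
    "cell_id": [f"cell{i}" for i in range(12)],
    "perturbation": ["KLF1"] * 6 + ["GATA1"] * 4 + ["control"] * 2,
})

target = 3   # stands in for target_cells_per_perturbation

# Downsample over-represented classes; smaller classes are kept whole,
# so no perturbation is oversampled or duplicated.
balanced = (
    obs.groupby("perturbation", group_keys=False)
       .apply(lambda g: g.sample(n=min(len(g), target), random_state=0))
)
print(sorted(balanced["perturbation"].value_counts().to_dict().items()))
# [('GATA1', 3), ('KLF1', 3), ('control', 2)]
```

Capping rather than equalizing class sizes avoids discarding rare perturbations entirely while still preventing the most frequent classes from dominating training.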


💻 System Requirements

Minimum (Demo Data)

  • CPU: 4 cores
  • RAM: 8GB
  • Storage: 20GB SSD
  • Time: ~30 minutes

Recommended (Demo Data)

  • CPU: AMD Ryzen 5 7535HS or equivalent (8 cores @ 3.55 GHz)
  • RAM: 16GB LPDDR5x-6400
  • GPU: AMD Radeon 660M (optional, for ML training)
  • Storage: 50GB SSD
  • Time: ~15 minutes

Full VCC 2025 Dataset

  • CPU: 16+ cores
  • RAM: 64GB+
  • GPU: 16GB+ VRAM for deep learning
  • Storage: 200GB+ SSD
  • Time: ~2-4 hours

Note: The demo data is specifically sized for laptop processing on an AMD Ryzen 5 7535HS system (16GB RAM).


📚 Documentation

  • docs/PIPELINE_GUIDE.md: detailed pipeline guide
  • docs/QUICKSTART.md: 5-minute tutorial
  • docs/TROUBLESHOOTING.md: common issues

📓 Jupyter Notebooks

Demo Exploration Notebook

The notebooks/demo_exploration.ipynb provides an interactive introduction to:

  • Loading and inspecting processed data
  • Visualizing QC metrics
  • Exploring perturbation effects
  • Dimensionality reduction (PCA, UMAP)
  • Comparing pipeline stages
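
For readers outside the notebook, the PCA step can be reproduced directly with scikit-learn on synthetic data (the notebook itself goes through Scanpy's wrappers; the matrix shape here is arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic log-normalized expression: 100 cells x 50 genes.
X = rng.normal(size=(100, 50))

pca = PCA(n_components=10, random_state=0)
embedding = pca.fit_transform(X)   # per-cell coordinates in PC space

print(embedding.shape)   # (100, 10)
print(round(float(pca.explained_variance_ratio_.sum()), 3))
```

The PCA embedding is also what UMAP and neighbor-graph methods consume downstream, so reducing to a handful of components happens before any 2D visualization.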

Launch notebook:

jupyter notebook notebooks/demo_exploration.ipynb

🧪 Testing

Run unit tests:

# All tests
pytest tests/

# Specific test
pytest tests/test_qc.py -v

# With coverage
pytest --cov=scripts tests/

# Test CI locally
./test_ci_locally.sh

๐Ÿค Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

See CONTRIBUTING.md for detailed guidelines.


👤 Author

Szymon Myrta (kontakt@actn3.pl)

Bioinformatics Specialist | R/Python Programmer | Data Scientist

Open to collaboration!

I'm passionate about applying data science and bioinformatics to advance pharmaceutical research and precision medicine. This repository reflects my commitment to continuous learning, knowledge sharing, and contributing to the open-source community.

  • 9+ years of experience in bioinformatics in pharmaceutical and biotech settings
  • NGS data analysis (RNA-seq, scRNA-seq, ChIP-seq, TCR/BCR-seq, WES/WGS, etc.)
  • Functional genomics data analysis (CRISPR / ORF overexpression screens)
  • Strong R & Python programming skills, including development of packages and web apps
  • Developer of NGS data analysis pipelines and reproducible research workflows
  • Data visualization and interpretation of results
  • Background in computational biology, cancer genomics, immuno-oncology
  • Co-author of multiple peer-reviewed scientific publications in top-tier journals
  • Interested in multi-omics data integration, precision medicine, AI-powered analyses

Tech Stack: Python, R/Bioconductor, Snakemake, Scanpy, Seurat, scikit-learn, Quarto, Git, CI/CD

Feel free to reach out for project ideas, consulting, or joint research in bioinformatics and data science.


๐Ÿ“ Citation

If you use this pipeline in your research, please cite:

@software{vcc_project_2025,
  author = {Szymon Myrta},
  title = {VCC-project: Single-Cell CRISPR Perturbation Pipeline},
  year = {2025},
  url = {https://github.com/ACTN3Bioinformatics/VCC-project},
  doi = {10.5281/zenodo.18004721}
}

Demo data: If using the demo dataset, please also cite:

  • Replogle et al. (2022). "Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq." Cell. DOI: 10.1016/j.cell.2022.05.013

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


๐Ÿ™ Acknowledgments

  • Virtual Cell Challenge 2025 organizers
  • Replogle et al. for public Perturb-seq data
  • scPerturb database for curated datasets
  • Scanpy and AnnData developers
  • Snakemake community

📮 Contact

kontakt@actn3.pl

Status: ๐Ÿš€ Active Development | Version: 1.0.0 | Last Updated: December 2025
