This project implements a high-performance computational pipeline to identify and characterize heterogeneity in Parkinson’s Disease (PD) using single-cell RNA-seq data.
The pipeline operates in four phases:
- **Discovery (GPU):** Uses a custom parallel Genetic Algorithm (GA) to discover “Diverse Subgroup Sets”—combinatorial rules that define distinct cell states (a toy rule sketch follows this list).
- **Characterization (CPU):** Performs Differential Enrichment Analysis to determine the biological context (e.g., fibrosis, metabolic stress) of the discovered states.
- **Trajectory Inference (CPU):** Calculates transcriptomic distance from a healthy baseline to infer the chronological progression of disease states.
- **Validation (GPU/CPU):** Uses “Shadow Runs” (blinded re-discovery), external Pathway Analysis (GO/Enrichr), and Clustering comparisons to prove robustness.
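For intuition, a “Diverse Subgroup Set” rule is a small conjunction of binary gene conditions that selects a subset of cells. The sketch below is illustrative only; the chosen genes and the `index_col` layout of the binarized matrix are assumptions, not the pipeline’s actual rule encoding.

```python
import pandas as pd

# Illustrative only: assumes a binarized cells x genes matrix at
# sc_data/X_binary.csv with gene symbols as column names (index_col is a guess).
X = pd.read_csv("sc_data/X_binary.csv", index_col=0)

# A hypothetical two-condition rule: PALD1 expressed AND FTH1 silent.
mask = (X["PALD1"] == 1) & (X["FTH1"] == 0)
print(f"Rule selects {int(mask.sum())} of {len(X)} cells")
```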
To find the best possible solution and ensure its statistical significance, we employ a “random-restart” strategy combined with a massive permutation test.
- **The Array Job:** The `simple_ga.sh` script launches a SLURM job array of 100 tasks (Task 0 for real data and 9 permutations, Tasks 1-99 for permuted data).
- **The Time Budget:** Each task runs for a fixed wall-clock time (e.g., 36 hours for deep search, 10 hours for validation).
- **The “Random-Restart” Loop:** Inside the time window, the GA performs hundreds of independent restarts to explore the landscape globally.
- **The Final p-value:** We compare the fitness of the “Champion” rule set found in the real data against the distribution of champions found in the 999 null (permuted) datasets.
Key parameters:
- `population_size = 55`: Optimized to fit within the ~128GB VRAM of Intel PVC GPUs using `int8` tensor operations.
- `num_generations = 300`: Depth of a single local search.
- `JOB_TIME_MINS`: The “breadth” of the search (total duration).
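The final p-value described above can be computed with a standard empirical permutation test. This is a minimal sketch, assuming the gather step has collected one champion fitness per null task; the file name and values below are hypothetical, not the actual output of `gather_results.py`.

```python
import numpy as np

# Hypothetical inputs: champion fitness from the real run, and the champion
# fitness from each of the 999 permuted (null) runs collected by the gather job.
real_champion = 0.82                                        # placeholder value
null_champions = np.loadtxt("null_champion_fitness.txt")    # hypothetical file, length 999

# Empirical p-value with the standard +1 correction, so it can never be exactly zero:
# the real run is treated as one more draw from the null distribution.
p_value = (1 + np.sum(null_champions >= real_champion)) / (1 + len(null_champions))
print(f"Empirical p-value: {p_value:.4f} (floor = {1 / (1 + len(null_champions)):.4f})")
```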
Instead of traditional “black box” rule extraction, this phase determines the mechanism of the discovered subgroups.
We define a “Subgroup” based on the GA rule (e.g., PALD1=1) and compare it against the background population to find co-expressed genes.
- **Internal Validation (PD vs PD):** Compares Subgroup cells to other PD cells. This identifies the specific “Failure Mode” (e.g., why is this PD cell different from that PD cell?).
- **External Validation (CTRL vs CTRL):** Checks if the subgroup exists in healthy controls and whether it exhibits the same pathological markers (e.g., LINGO1/Fibrosis).
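A minimal sketch of the internal (PD vs PD) comparison is shown below, assuming the binarized matrix from `sc_data/X_binary.csv` plus a hypothetical metadata file with a condition column; the actual `analyze_subgroups_full.py` may use a different test, thresholds, or file layout.

```python
import pandas as pd
from scipy.stats import fisher_exact

# Assumptions: binarized cells x genes matrix, and a metadata table marking PD cells.
X = pd.read_csv("sc_data/X_binary.csv", index_col=0)
meta = pd.read_csv("sc_data/metadata.csv", index_col=0)        # hypothetical file
pd_cells = X[meta["condition"].eq("PD")]

in_sub = pd_cells["PALD1"] == 1    # subgroup: rule-positive PD cells (rule from the text)
background = ~in_sub               # background: all other PD cells

# Per-gene 2x2 Fisher test: is the gene more often "on" inside the subgroup?
# (Explicit loop: slow but transparent for a sketch.)
rows = []
for gene in pd_cells.columns:
    a = int((pd_cells.loc[in_sub, gene] == 1).sum())
    b = int((pd_cells.loc[in_sub, gene] == 0).sum())
    c = int((pd_cells.loc[background, gene] == 1).sum())
    d = int((pd_cells.loc[background, gene] == 0).sum())
    odds, p = fisher_exact([[a, b], [c, d]], alternative="greater")
    rows.append({"gene": gene, "odds_ratio": odds, "p_value": p})

enrichment = pd.DataFrame(rows).sort_values("p_value")
enrichment.to_csv("enrichment_pd_vs_pd_sketch.csv", index=False)
```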
We use gseapy to map the differentially enriched gene lists to the **Gene Ontology (GO)** and **Jensen DISEASES** databases. This provides objective, literature-backed confirmation of the biological state (e.g., “Mitochondrial Myopathy”, “Extracellular Matrix Organization”).
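The gseapy query can look roughly like the sketch below. `GO_Biological_Process_2021` and `Jensen_DISEASES` are standard Enrichr library identifiers, but the exact libraries, cutoffs, and gene lists used by `validate_biology.py` may differ; the gene symbols here are placeholders.

```python
import gseapy as gp

# Hypothetical input: top co-expressed genes from the enrichment CSVs produced above.
gene_list = ["LINGO1", "COL1A1", "FN1"]   # placeholder gene symbols

# Queries the Enrichr web service, so network access is required.
enr = gp.enrichr(
    gene_list=gene_list,
    gene_sets=["GO_Biological_Process_2021", "Jensen_DISEASES"],
    organism="human",
    outdir=None,              # return results in memory instead of writing a folder
)
print(enr.results.head())     # columns include Term, Adjusted P-value, Genes
```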
To synthesize the static states into a dynamic model:
- **Trajectory of Failure:** We calculate the Manhattan distance of every cell in a subgroup from the “Healthy Control Centroid.” Shorter distance implies an early/adaptive state; longer distance implies a terminal/divergent state (see the sketch after this list).
- **Orthogonality Check:** We map the discovered subgroups back onto the global UMAP (using R/Seurat) to demonstrate that these functional states cut across standard transcriptomic clusters.
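The distance-from-health ordering in the first bullet can be sketched as follows. The centroid definition (mean control profile), the metadata file, and its `condition`/`subgroup` columns are assumptions; `order_cell_states.py` may define these differently.

```python
import pandas as pd

# Assumptions: binarized expression matrix plus a metadata table with a
# condition column ("CTRL"/"PD") and a subgroup label derived from the GA rules.
X = pd.read_csv("sc_data/X_binary.csv", index_col=0)
meta = pd.read_csv("sc_data/metadata.csv", index_col=0)      # hypothetical file

# "Healthy Control Centroid": mean expression profile of the control cells.
healthy_centroid = X[meta["condition"] == "CTRL"].mean(axis=0)

# Manhattan (L1) distance of every cell from that centroid.
dist = (X - healthy_centroid).abs().sum(axis=1)

# Order the discovered subgroups by median distance:
# shorter = early/adaptive, longer = terminal/divergent.
order = dist.groupby(meta["subgroup"]).median().sort_values()
print(order)
```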
To test for biological redundancy and “Table Mountain” landscapes, we perform a **Shadow Run**.
- **Blinding:** We remove all genes associated with the primary discovery (e.g., PALD1, FTH1, NCOR2) from the dataset.
- **Re-Discovery:** We re-submit the GPU GA on this “scorched earth” dataset.
- **Inference:**
- If the GA fails to find a signal -> The primary discovery is unique (e.g., Fibrotic State).
- If the GA finds a new signal -> We have identified a secondary, redundant disease mechanism hidden by the primary signal (e.g., Metabolic State).
| File | Description |
| :--- | :--- |
| `simple_ga.sh` | Main wrapper. Launches the array and automatically submits the gather job. |
| `run_ga_array.slurm` | SLURM script for the main 100-job array. Sets up the `pytorch-gpu` environment. |
| `simple_ga_array.py` | Core Python GA worker. Runs the time-limited restart loop. |
| `ga_utils.py` | Shared utility functions for rule decoding and stats. |
| `run_gather.slurm` | Post-processing job. Collects results and calculates p-values. |
| `gather_results.py` | Analyzes raw results, generates histograms and subgroup plots. |
| File | Description |
| :--- | :--- |
| `analyze_subgroups_full.py` | Performs Differential Enrichment Analysis for both PD and Control populations. Generates CSVs of co-expressed genes. |
| `validate_biology.py` | Uses gseapy to query Gene Ontology databases for the gene lists found above. |
| `plot_mechanism_heatmap.py` | Generates the “Double Dissociation” heatmap (Fig 3) from enrichment CSVs. |
| `plot_pathway_validation.py` | Generates the GO bar chart (Fig 4) from validation CSVs. |
| `order_cell_states.py` | Calculates transcriptomic distance from health to infer the “Trajectory of Failure” (Fig 5). |
| `fast_plot_rule2.R` | R script to map Rule 2 cells onto the Seurat UMAP and check cluster prevalence (Fig 6). |
| File | Description |
| :--- | :--- |
| `shadow_ga.py` | Modified GA script that points to the `sc_data_blinded` directory. |
| `run_shadow.slurm` | SLURM script for the validation run (typically 100 jobs, 10 hours). |
```bash
# Launch the 36-hour production run (100 jobs)
./simple_ga.sh ./results_prod_run --time=36:00:00
```

Output: `final_summary.txt` (Rules), `fitness_histogram.png` (Stats), `subgroup_analysis.png`.
Run this on the login node or locally (CPU only). It requires `sc_data/X_binary.csv`.
```bash
# 1. Generate Enrichment Tables (PD and CTRL comparison)
python3 analyze_subgroups_full.py

# 2. Validate against Gene Ontology
python3 validate_biology.py

# 3. Generate Publication Figures
python3 plot_mechanism_heatmap.py
python3 plot_pathway_validation.py
```

Output: `enrichment_*.csv`, `pathway_validation_*.csv`, `mechanism_heatmap.png`, `pathway_validation_figure.png`.
```bash
# 1. Calculate Trajectory (Distance from Health)
python3 order_cell_states.py

# 2. Check Clustering Overlay (Requires R and Seurat Object)
Rscript fast_plot_rule2.R
```

Output: `trajectory_boxplot.png`, `rule2_umap_confetti.png`.
First, create the blinded dataset (removing top hits):
```bash
# (Run snippet to drop columns PALD1, FTH1, etc. and save to sc_data_blinded/)
```
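A minimal sketch of such a snippet, assuming pandas and that the blinded genes appear as columns of `X_binary.csv` (the exact gene list to drop should match the primary discovery):

```python
import os
import pandas as pd

# Genes tied to the primary discovery (from the text); extend as needed.
blinded_genes = ["PALD1", "FTH1", "NCOR2"]

X = pd.read_csv("sc_data/X_binary.csv", index_col=0)
X_blinded = X.drop(columns=blinded_genes, errors="ignore")

os.makedirs("sc_data_blinded", exist_ok=True)
X_blinded.to_csv("sc_data_blinded/X_binary.csv")
```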
Then submit the Shadow Run:

```bash
sbatch run_shadow.slurm
```

Output: A new set of results in `shadow_results/` proving the existence (or absence) of secondary mechanisms.