This project implements a high-performance computational pipeline to identify and characterize heterogeneity in Parkinson’s Disease (PD) using single-cell RNA-seq data.
The pipeline operates in four phases:
- **Discovery (GPU):** Uses a custom parallel Genetic Algorithm (GA) to discover “Diverse Subgroup Sets”—combinatorial rules that define distinct cell states (a toy rule sketch follows this list).
- **Characterization (CPU):** Performs Differential Enrichment Analysis to determine the biological context (e.g., fibrosis, metabolic stress) of the discovered states.
- **Trajectory Inference (CPU):** Calculates transcriptomic distance from a healthy baseline to infer the chronological progression of disease states.
- **Validation (GPU/CPU):** Uses “Shadow Runs” (blinded re-discovery), external Pathway Analysis (GO/Enrichr), and Clustering comparisons to prove robustness.
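For intuition, a “Diverse Subgroup Set” rule is a small conjunction of binary gene conditions that selects a subset of cells. The sketch below is illustrative only; the chosen genes and the `index_col` layout of the binarized matrix are assumptions, not the pipeline’s actual rule encoding.

```python
import pandas as pd

# Illustrative only: assumes a binarized cells x genes matrix at
# sc_data/X_binary.csv with gene symbols as column names (index_col is a guess).
X = pd.read_csv("sc_data/X_binary.csv", index_col=0)

# A hypothetical two-condition rule: PALD1 expressed AND FTH1 silent.
mask = (X["PALD1"] == 1) & (X["FTH1"] == 0)
print(f"Rule selects {int(mask.sum())} of {len(X)} cells")
```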
To find the best possible solution and ensure its statistical significance, we employ a “random-restart” strategy combined with a massive permutation test.
- **The Array Job:** The `simple_ga.sh` script launches a SLURM job array of 100 tasks (Task 0 for real data and 9 permutations, Tasks 1-99 for permuted data).
- **The Time Budget:** Each task runs for a fixed wall-clock time (e.g., 36 hours for deep search, 10 hours for validation).
- **The “Random-Restart” Loop:** Inside the time window, the GA performs hundreds of independent restarts to explore the landscape globally.
- **The Final p-value:** We compare the fitness of the “Champion” rule set found in the real data against the distribution of champions found in the 999 null (permuted) datasets.
Key parameters:
- `population_size = 55`: Optimized to fit within the ~128GB VRAM of Intel PVC GPUs using `int8` tensor operations.
- `num_generations = 300`: Depth of a single local search.
- `JOB_TIME_MINS`: The “breadth” of the search (total duration).
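The final p-value described above can be computed with a standard empirical permutation test. This is a minimal sketch, assuming the gather step has collected one champion fitness per null task; the file name and values below are hypothetical, not the actual output of `gather_results.py`.

```python
import numpy as np

# Hypothetical inputs: champion fitness from the real run, and the champion
# fitness from each of the 999 permuted (null) runs collected by the gather job.
real_champion = 0.82                                        # placeholder value
null_champions = np.loadtxt("null_champion_fitness.txt")    # hypothetical file, length 999

# Empirical p-value with the standard +1 correction, so it can never be exactly zero:
# the real run is treated as one more draw from the null distribution.
p_value = (1 + np.sum(null_champions >= real_champion)) / (1 + len(null_champions))
print(f"Empirical p-value: {p_value:.4f} (floor = {1 / (1 + len(null_champions)):.4f})")
```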
Instead of traditional “black box” rule extraction, this phase determines the mechanism of the discovered subgroups.
We define a “Subgroup” based on the GA rule (e.g., PALD1=1) and compare it against the background population to find co-expressed genes.
- **Internal Validation (PD vs PD):** Compares Subgroup cells to other PD cells. This identifies the specific “Failure Mode” (e.g., why is this PD cell different from that PD cell?).
- **External Validation (CTRL vs CTRL):** Checks if the subgroup exists in healthy controls and whether it exhibits the same pathological markers (e.g., LINGO1/Fibrosis).
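A minimal sketch of the internal (PD vs PD) comparison is shown below, assuming the binarized matrix from `sc_data/X_binary.csv` plus a hypothetical metadata file with a condition column; the actual `analyze_subgroups_full.py` may use a different test, thresholds, or file layout.

```python
import pandas as pd
from scipy.stats import fisher_exact

# Assumptions: binarized cells x genes matrix, and a metadata table marking PD cells.
X = pd.read_csv("sc_data/X_binary.csv", index_col=0)
meta = pd.read_csv("sc_data/metadata.csv", index_col=0)        # hypothetical file
pd_cells = X[meta["condition"].eq("PD")]

in_sub = pd_cells["PALD1"] == 1    # subgroup: rule-positive PD cells (rule from the text)
background = ~in_sub               # background: all other PD cells

# Per-gene 2x2 Fisher test: is the gene more often "on" inside the subgroup?
# (Explicit loop: slow but transparent for a sketch.)
rows = []
for gene in pd_cells.columns:
    a = int((pd_cells.loc[in_sub, gene] == 1).sum())
    b = int((pd_cells.loc[in_sub, gene] == 0).sum())
    c = int((pd_cells.loc[background, gene] == 1).sum())
    d = int((pd_cells.loc[background, gene] == 0).sum())
    odds, p = fisher_exact([[a, b], [c, d]], alternative="greater")
    rows.append({"gene": gene, "odds_ratio": odds, "p_value": p})

enrichment = pd.DataFrame(rows).sort_values("p_value")
enrichment.to_csv("enrichment_pd_vs_pd_sketch.csv", index=False)
```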
We use gseapy to map the differentially enriched gene lists to the **Gene Ontology (GO)** and **Jensen DISEASES** databases. This provides objective, literature-backed confirmation of the biological state (e.g., “Mitochondrial Myopathy”, “Extracellular Matrix Organization”).
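The gseapy query can look roughly like the sketch below. `GO_Biological_Process_2021` and `Jensen_DISEASES` are standard Enrichr library identifiers, but the exact libraries, cutoffs, and gene lists used by `validate_biology.py` may differ; the gene symbols here are placeholders.

```python
import gseapy as gp

# Hypothetical input: top co-expressed genes from the enrichment CSVs produced above.
gene_list = ["LINGO1", "COL1A1", "FN1"]   # placeholder gene symbols

# Queries the Enrichr web service, so network access is required.
enr = gp.enrichr(
    gene_list=gene_list,
    gene_sets=["GO_Biological_Process_2021", "Jensen_DISEASES"],
    organism="human",
    outdir=None,              # return results in memory instead of writing a folder
)
print(enr.results.head())     # columns include Term, Adjusted P-value, Genes
```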
To synthesize the static states into a dynamic model:
- **Trajectory of Failure:** We calculate the Manhattan distance of every cell in a subgroup from the “Healthy Control Centroid.” Shorter distance implies an early/adaptive state; longer distance implies a terminal/divergent state (see the sketch after this list).
- **Orthogonality Check:** We map the discovered subgroups back onto the global UMAP (using R/Seurat) to demonstrate that these functional states cut across standard transcriptomic clusters.
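The distance-from-health ordering in the first bullet can be sketched as follows. The centroid definition (mean control profile), the metadata file, and its `condition`/`subgroup` columns are assumptions; `order_cell_states.py` may define these differently.

```python
import pandas as pd

# Assumptions: binarized expression matrix plus a metadata table with a
# condition column ("CTRL"/"PD") and a subgroup label derived from the GA rules.
X = pd.read_csv("sc_data/X_binary.csv", index_col=0)
meta = pd.read_csv("sc_data/metadata.csv", index_col=0)      # hypothetical file

# "Healthy Control Centroid": mean expression profile of the control cells.
healthy_centroid = X[meta["condition"] == "CTRL"].mean(axis=0)

# Manhattan (L1) distance of every cell from that centroid.
dist = (X - healthy_centroid).abs().sum(axis=1)

# Order the discovered subgroups by median distance:
# shorter = early/adaptive, longer = terminal/divergent.
order = dist.groupby(meta["subgroup"]).median().sort_values()
print(order)
```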
To test for biological redundancy and “Table Mountain” landscapes, we perform a **Shadow Run**.
- **Blinding:** We remove all genes associated with the primary discovery (e.g., PALD1, FTH1, NCOR2) from the dataset.
- **Re-Discovery:** We re-submit the GPU GA on this “scorched earth” dataset.
- **Inference:**
- If the GA fails to find a signal -> The primary discovery is unique (e.g., Fibrotic State).
- If the GA finds a new signal -> We have identified a secondary, redundant disease mechanism hidden by the primary signal (e.g., Metabolic State).
| File | Description |
| :--- | :--- |
| `simple_ga.sh` | Main wrapper. Launches the array and automatically submits the gather job. |
| `run_ga_array.slurm` | SLURM script for the main 100-job array. Sets up the `pytorch-gpu` environment. |
| `simple_ga_array.py` | Core Python GA worker. Runs the time-limited restart loop. |
| `ga_utils.py` | Shared utility functions for rule decoding and stats. |
| `run_gather.slurm` | Post-processing job. Collects results and calculates p-values. |
| `gather_results.py` | Analyzes raw results, generates histograms and subgroup plots. |
| File | Description |
| :--- | :--- |
| `analyze_subgroups_full.py` | Performs Differential Enrichment Analysis for both PD and Control populations. Generates CSVs of co-expressed genes. |
| `validate_biology.py` | Uses gseapy to query Gene Ontology databases for the gene lists found above. |
| `plot_mechanism_heatmap.py` | Generates the “Double Dissociation” heatmap (Fig 3) from enrichment CSVs. |
| `plot_pathway_validation.py` | Generates the GO bar chart (Fig 4) from validation CSVs. |
| `order_cell_states.py` | Calculates transcriptomic distance from health to infer the “Trajectory of Failure” (Fig 5). |
| `fast_plot_rule2.R` | R script to map Rule 2 cells onto the Seurat UMAP and check cluster prevalence (Fig 6). |
| File | Description |
| :--- | :--- |
| `shadow_ga.py` | Modified GA script that points to the `sc_data_blinded` directory. |
| `run_shadow.slurm` | SLURM script for the validation run (typically 100 jobs, 10 hours). |
```bash
# Launch the 36-hour production run (100 jobs)
./simple_ga.sh ./results_prod_run --time=36:00:00
```

Output: `final_summary.txt` (Rules), `fitness_histogram.png` (Stats), `subgroup_analysis.png`.
Run this on the login node or locally (CPU only). It requires `sc_data/X_binary.csv`.
```bash
# 1. Generate Enrichment Tables (PD and CTRL comparison)
python3 analyze_subgroups_full.py

# 2. Validate against Gene Ontology
python3 validate_biology.py

# 3. Generate Publication Figures
python3 plot_mechanism_heatmap.py
python3 plot_pathway_validation.py
```

Output: `enrichment_*.csv`, `pathway_validation_*.csv`, `mechanism_heatmap.png`, `pathway_validation_figure.png`.
```bash
# 1. Calculate Trajectory (Distance from Health)
python3 order_cell_states.py

# 2. Check Clustering Overlay (Requires R and Seurat Object)
Rscript fast_plot_rule2.R
```

Output: `trajectory_boxplot.png`, `rule2_umap_confetti.png`.
First, create the blinded dataset (removing top hits):
```bash
# (Run snippet to drop columns PALD1, FTH1, etc. and save to sc_data_blinded/)
```
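A minimal sketch of such a snippet, assuming pandas and that the blinded genes appear as columns of `X_binary.csv` (the exact gene list to drop should match the primary discovery):

```python
import os
import pandas as pd

# Genes tied to the primary discovery (from the text); extend as needed.
blinded_genes = ["PALD1", "FTH1", "NCOR2"]

X = pd.read_csv("sc_data/X_binary.csv", index_col=0)
X_blinded = X.drop(columns=blinded_genes, errors="ignore")

os.makedirs("sc_data_blinded", exist_ok=True)
X_blinded.to_csv("sc_data_blinded/X_binary.csv")
```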
Then submit the Shadow Run:

```bash
sbatch run_shadow.slurm
```

Output: A new set of results in `shadow_results/` proving the existence (or absence) of secondary mechanisms.