Kompot Stability and Runtime Benchmarking Pipeline

A comprehensive stability and performance benchmarking pipeline for differential abundance (DA) and differential expression (DE) analysis using the kompot package on single-cell RNA-seq data. This pipeline provides systematic testing of parameter stability, subsampling robustness, and runtime scaling characteristics on SLURM-based compute clusters.

🚀 Quick Navigation

New Users:

  1. Installation Guide → Set up your environment
  2. Complete Workflow → Run your first analysis
  3. Documentation Library → Find specific guides

Existing Users:

Overview

This repository implements a production-ready benchmarking framework with five main components:

  1. 📋 Configuration Management: Centralized dataset and parameter configurations
  2. 🔬 Data Subsampling: Generate multiple rounds of subsampled data for robustness testing
  3. 📊 Stability Testing: Test stability across parameters (ls_factor, sigma, n_components) and cell counts (subsampling)
  4. 📈 Stability Analysis: Correlation analysis and publication-ready visualization of DA/DE stability
  5. ⚡ Runtime Analysis: Performance scaling analysis across parameters, cell counts, and gene counts

The pipeline supports:

  • Parameter stability: Test sensitivity of DA and DE results to kompot parameters
  • Subsampling stability: Measure robustness across different cell counts (100 to 400k+ cells)
  • Runtime profiling: Systematic measurement of computational performance
  • Multi-tool benchmarking: Compare kompot against other DA/DE methods
  • SLURM array jobs: Massive parallelization on compute clusters
  • Publication-ready outputs: Automated visualization and correlation analysis

Complete Workflow

Step 1: Configuration Setup

The pipeline uses centralized configuration management for all datasets and parameters.

# List all available datasets and parameters
python processing_config.py --list

# List subsampling configurations
python subsampling_config.py --list

Key configuration files:

  • processing_config.py: Dataset configs, parameter stability settings, SLURM parameters
  • subsampling_config.py: Subsampling parameters and resource allocation
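
For orientation, a dataset entry in processing_config.py can be pictured roughly as in the sketch below. The field names are illustrative assumptions, not the repository's exact schema; the real configuration combines the column mappings, conditions, and SLURM resources described in this README.

# Hypothetical sketch of a dataset entry (field names are illustrative, not the real schema)
DATASET_CONFIGS = {
    "aging": {
        "original_file": "data/raw/aging.h5ad",           # input for parameter stability runs
        "sample_column": "sample",                         # per-sample identifier
        "group_column": "age",                             # two-group comparison column
        "conditions": ("Young", "Old"),                    # condition pair used for DA/DE
        "slurm": {"mem": "32G", "time": "8:00:00", "cpus": 8},  # default resources
    },
}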

Step 2: Data Subsampling (For Cell Count Stability)

Generate subsampled datasets from the original files. This step is required for subsampling stability analysis, which tests robustness across different cell counts; parameter stability tests (ls_factor, sigma, n_components) use the original datasets directly.

# Submit subsampling for a single dataset
python subsampling_config.py --submit aging

# Submit multiple datasets
python subsampling_config.py --submit aging covid bcr-xl

# Submit all configured datasets
python subsampling_config.py --submit-all

# Monitor subsampling progress
squeue -u $USER --name=subsample_aging

What this creates:

data/subsampled/{dataset}/
├── round_001_n_8009.h5ad      # Multiple rounds of subsampled data
├── round_002_n_7988.h5ad      # Each with decreasing cell counts
├── round_003_n_7967.h5ad      # For testing stability across cell counts
└── ...

📖 Detailed documentation: README_SUBSAMPLING.md
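
To sanity-check the subsampled rounds before launching processing jobs, a quick look with anndata (already part of the environment) might look like this; the path follows the layout shown above.

# List each subsampled round and its cell count without loading the expression matrix.
from pathlib import Path
import anndata as ad

for path in sorted(Path("data/subsampled/aging").glob("round_*.h5ad")):
    adata = ad.read_h5ad(path, backed="r")   # backed mode keeps X on disk
    print(f"{path.name}: {adata.n_obs} cells x {adata.n_vars} genes")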

Step 3: Stability Testing

Test the stability of both DA and DE results across different parameters and cell counts. Runtime is recorded automatically for performance analysis.

# Parameter stability: Test sensitivity to kompot parameters
python processing_config.py --submit-stability aging ls_factor      # LS factor (201 values)
python processing_config.py --submit-stability aging sigma          # Sigma (201 values)
python processing_config.py --submit-stability aging n_components   # Diffusion components (1-100)
python processing_config.py --submit-stability aging n_landmarks    # GP landmarks (affects runtime!)
python processing_config.py --submit-stability aging topn_genes     # Gene count (runtime scaling)

# Subsampling stability: Test robustness across cell counts (requires Step 2)
python processing_config.py --submit-stability aging subsampling    # Cell count robustness

# Test all stability parameters for a dataset
python processing_config.py --submit-all-stability aging

# Preview what would be generated
python processing_config.py --preview aging ls_factor
python processing_config.py --preview aging subsampling

What this creates:

data/processed/{dataset}_{parameter}/
├── original_ls_factor_0.125.h5ad                        # Parameter sweeps
├── original_ls_factor_0.250.h5ad                        # DA and DE results
├── original_ls_factor_0.250.h5ad.runinfo.yml            # Runtime data
└── ...

data/processed/{dataset}_kompot_subsampling/
├── round_001_n_8009_subsampling_subsampled_round_001.h5ad      # Subsampling stability
├── round_002_n_7988_subsampling_subsampled_round_002.h5ad      # DA/DE + runtime data
└── ...
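
Because the swept parameter value is encoded in each output filename (as shown above), it can be recovered for ad-hoc checks. This sketch assumes the ls_factor sweep directory for the aging dataset.

# Map each parameter-sweep output file to its ls_factor value parsed from the filename.
import re
from pathlib import Path

out_dir = Path("data/processed/aging_ls_factor")
for path in sorted(out_dir.glob("original_ls_factor_*.h5ad")):
    value = float(re.search(r"ls_factor_([0-9.]+)\.h5ad$", path.name).group(1))
    print(f"{path.name}: ls_factor = {value}")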

Step 4: Stability Analysis and Visualization

Analyze stability of DA and DE results with correlation analysis and publication-ready plots.

# Run the stability analysis notebook
jupyter notebook plot_stability.ipynb

Or programmatically:

from gather_stability_columns import analyze_parameter_stability

# DA stability analysis
results = analyze_parameter_stability(
    stability_dir="data/processed/aging_ls_factor",
    column_name="kompot_da_log_fold_change_Young_to_Old",
    table_type="obs",
    save_plots=True,
    output_dir="plots/stability_analysis"
)

# DE stability analysis
results = analyze_parameter_stability(
    stability_dir="data/processed/aging_ls_factor",
    column_name="kompot_de_mahalanobis_Young_to_Old",
    table_type="var",
    save_plots=True,
    output_dir="plots/stability_analysis"
)

📖 Detailed documentation: README_GATHER_RESULTS.md

Step 5: Runtime Analysis

Analyze runtime scaling across parameters, cell counts, and gene counts for both DA and DE.

# Run the runtime analysis notebook
jupyter notebook runtime_analysis.ipynb

Or programmatically:

from runtime_utils import collect_all_runtime_data, plot_runtime_scaling
from pathlib import Path

# Collect all runtime data
all_runtime_data = collect_all_runtime_data()

# Plot DA runtime scaling across cell counts
plot_runtime_scaling(
    all_runtime_data['subsampling'],
    parameter_type='subsampling',
    analysis_type='da',
    output_dir=Path('runtime_analysis_plots'),
    show_legend_separately=True,
    min_param_value=101
)

# Plot DE runtime scaling across gene counts
plot_runtime_scaling(
    all_runtime_data['topn_genes'],
    parameter_type='topn_genes',
    analysis_type='de',
    output_dir=Path('runtime_analysis_plots'),
    show_legend_separately=True
)

Runtime analysis capabilities:

  • Parameter scaling: How runtime changes with ls_factor, sigma, n_components, n_landmarks
  • Cell count scaling: Performance from 100 to 400k+ cells (subsampling)
  • Gene count scaling: Performance across different numbers of genes (topn_genes)
  • CPU parallelization: Compare 1-CPU vs 16-CPU performance
  • DA vs DE performance: Separate analysis for differential abundance and expression

Key Features

Centralized Configuration Management

  • Unified parameter management across all datasets and analysis stages
  • Dynamic parameter sequence generation for stability testing
  • Support for multiple dataset "flavors" (e.g., with/without read counts)
  • SLURM resource optimization per dataset

Production-Ready Subsampling Infrastructure

  • Asynchronous I/O with configurable parallel writers
  • Cell-only or cell+read subsampling options
  • PCA computation and UMAP preservation
  • Comprehensive logging and progress monitoring
  • Built-in infinite loop protection

Comprehensive Stability Testing

  • Parameter stability: ls_factor (201 values), sigma (201 values), n_components (1-100), n_landmarks (131 values)
  • Subsampling stability: Original dataset vs. multiple subsampling rounds (100 to 400k+ cells)
  • Runtime profiling: Automatic timing for all DA and DE operations
  • Both DA and DE: Test stability for differential abundance and differential expression simultaneously

High-Performance Processing

  • SLURM array job support for massive parallelization
  • Memory-optimized processing with configurable disk storage
  • Automatic environment detection and activation
  • Intelligent error handling and recovery
  • CPU parallelization testing (1-CPU vs 16-CPU comparisons)

Advanced Analysis and Visualization

  • Stability analysis: Parallel data loading, multiple correlation methods (Spearman, Pearson, low-density)
  • Runtime analysis: Performance scaling plots across parameters, cell counts, and gene counts
  • Parameter-aware plot formatting with scientific notation
  • Automated heatmap generation for all dataset/parameter combinations
  • Publication-ready outputs for both DA and DE results

Data Structure

data/
├── raw/                                 # Original datasets (configured in processing_config.py)
├── subsampled/{dataset}/                # Step 2: Subsampled data
│   ├── round_001_n_8009.h5ad           # Multiple rounds with decreasing cell counts
│   ├── round_002_n_7988.h5ad
│   └── subsampling_aging_*.log         # Subsampling logs
├── processed/{dataset}/                 # Step 3: Standard processing results
│   ├── round_001.h5ad                  # Kompot analysis results
│   ├── round_001.h5ad.obs.csv          # Cell metadata with DA results
│   ├── round_001.h5ad.var.csv          # Gene metadata with DE results
│   └── round_001.h5ad.runinfo.yml      # Processing metadata and timing
└── processed/{dataset}_{parameter}/     # Step 4: Parameter stability results
    ├── original_ls_factor_0.125.h5ad   # Results for each parameter value
    ├── original_ls_factor_0.250.h5ad   # Using original dataset
    └── ...

plots/stability_heatmaps/               # Step 5: Analysis outputs
├── aging_ls_factor_obs_kompot_da_log_fold_change_Young_to_Old_spearman.png
├── analysis_summary.csv               # Comprehensive results summary
└── ...

logs/                                  # SLURM job logs
├── subsample_aging_*.{out,err}        # Subsampling logs
├── kompot_*.{out,err}                 # Processing logs
└── ...

Supported Datasets

Currently Configured

  • aging: Aging hematopoiesis dataset (Young vs Old, ~8000 cells → 398 subsampled files)
  • aging_with_reads: Aging dataset with read count normalization (314 subsampled files)
  • covid: COVID-19 PBMC dataset (healthy vs COVID-19, 790 subsampled files)
  • post-covid: COVID-19 PBMC dataset (healthy vs Post-COVID-19, 790 subsampled files)
  • bcr-xl: BCR-XL stimulation dataset (control vs BCR-XL)

Dataset Requirements

Each dataset configuration includes:

  • Original file path: For parameter stability testing
  • Output patterns: For subsampling and processing
  • Column mappings: Sample, grouping, cell type columns
  • Analysis conditions: Two-group comparisons
  • Resource requirements: Memory, time, CPU allocation per analysis type

Analysis Outputs

Differential Abundance Results (in .obs.csv)

  • kompot_da_log_fold_change_{cond1}_to_{cond2}: DA without sample variance
  • kompot_da_log_fold_change_zscore_{cond1}_to_{cond2}: DA z-score
  • kompot_da_neg_log10_fold_change_pvalue_{cond1}_to_{cond2}: DA p-value
  • kompot_da_log_fold_change_{cond1}_to_{cond2}_sample_var: DA with sample variance

Differential Expression Results (in .var.csv)

  • kompot_de_mahalanobis_{cond1}_to_{cond2}: DE Mahalanobis distance
  • kompot_de_mean_lfc_{cond1}_to_{cond2}: Mean log fold change
  • kompot_de_mahalanobis_{cond1}_to_{cond2}_sample_var: DE with sample variance
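
As a minimal, hand-rolled complement to analyze_parameter_stability, the DA and DE columns of two runs can also be compared directly from the processed .h5ad files. Paths and column names below follow the patterns documented above; this is a sketch, not the pipeline's own analysis.

# Spearman correlation of DA (per cell) and DE (per gene) scores between two parameter values.
import anndata as ad
from scipy.stats import spearmanr

a = ad.read_h5ad("data/processed/aging_ls_factor/original_ls_factor_0.125.h5ad")
b = ad.read_h5ad("data/processed/aging_ls_factor/original_ls_factor_0.250.h5ad")

da_col = "kompot_da_log_fold_change_Young_to_Old"
de_col = "kompot_de_mahalanobis_Young_to_Old"

rho_da, _ = spearmanr(a.obs[da_col], b.obs[da_col], nan_policy="omit")
rho_de, _ = spearmanr(a.var[de_col], b.var[de_col], nan_policy="omit")
print(f"DA stability (cells): rho = {rho_da:.3f}")
print(f"DE stability (genes): rho = {rho_de:.3f}")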

Runtime Data (in .runinfo.yml)

  • DA timing: da_no_sample_seconds, da_with_sample_seconds
  • DE timing: de_no_sample_seconds, de_with_sample_seconds
  • PCA/diffusion maps timing: Preprocessing step timings
  • Configuration: CPU count, batch size, null genes, node information
  • Dataset metadata: Number of cells, genes, parameter values
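
The timing keys listed above can be read straight from a .runinfo.yml file with PyYAML. This sketch assumes the keys sit at the top level of the file, which may not match the exact layout.

# Read one run's timing information (key names taken from the list above).
import yaml

with open("data/processed/aging_ls_factor/original_ls_factor_0.250.h5ad.runinfo.yml") as fh:
    runinfo = yaml.safe_load(fh)

print("DA, no sample variance:  ", runinfo.get("da_no_sample_seconds"), "s")
print("DA, with sample variance:", runinfo.get("da_with_sample_seconds"), "s")
print("DE, no sample variance:  ", runinfo.get("de_no_sample_seconds"), "s")
print("DE, with sample variance:", runinfo.get("de_with_sample_seconds"), "s")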

Stability Analysis Outputs

  • Correlation heatmaps: DA and DE parameter stability visualization
  • Correlation vs parameter plots: Stability trends across parameter ranges
  • Summary statistics: Mean/min correlations for DA and DE
  • Interactive analysis: Jupyter notebooks (plot_stability.ipynb, runtime_analysis.ipynb)

Runtime Analysis Outputs

  • Scaling plots: Performance vs parameters, cell counts, gene counts
  • DA vs DE comparison: Separate runtime analysis for each
  • CPU parallelization: 1-CPU vs 16-CPU performance comparison
  • Publication-ready figures: PNG plots with separate legends

Performance Characteristics

Resource Requirements by Stage

Based on analysis of 2,776 real processing jobs across all datasets:

Standard Processing (Recommended Production Settings)

sbatch --mem=32G --time=8:00:00 --cpus-per-task=8 submit_benchmark.sh DATASET

  • Success rate: 98-100% for datasets <200k cells
  • Typical runtime: 3.7-4.9 hours
  • Memory usage: 16-32GB sufficient for most datasets
  • Processing time breakdown:
    • PCA: 0s (often pre-computed)
    • Diffusion Maps: 2-65s
    • DA (no sample): 36-874s
    • DA (with sample): 76-1,732s
    • DE (no sample): 31-569s
    • DE (with sample): 13,137s (main bottleneck)

Large Dataset Processing (>400k cells)

sbatch --mem=64G --time=12:00:00 --cpus-per-task=8 submit_benchmark.sh DATASET TOPN_GENES=200

  • COVID datasets (422k cells): 92.7% failure rate with 16GB, requires optimization
  • Memory scaling: ~109GB per 1M cells (based on allocation errors)
  • Optimization required: Reduce TOPN_GENES, enable batching, consider subsampling
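
The ~109GB per 1M cells figure above can be turned into a rough planning heuristic; treat the sketch below as back-of-the-envelope guidance derived from that observation, not a guarantee.

# Rough peak-memory estimate based on the observed ~109 GB per 1M cells scaling.
def estimate_memory_gb(n_cells: int, gb_per_million_cells: float = 109.0) -> float:
    return n_cells / 1_000_000 * gb_per_million_cells

print(f"422k cells -> ~{estimate_memory_gb(422_000):.0f} GB peak memory")  # COVID-scale dataset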

Parameter Stability Testing

sbatch --array=1-201 --mem=32G --time=8:00:00 --cpus-per-task=8 submit_benchmark.sh DATASET_PARAM

  • Scale factor: 100-201x base processing requirements
  • Total compute: 201 jobs × 4-8 hours = 800-1,600 CPU-hours per dataset
  • Success pattern: Same as base processing, scales linearly

Resource Requirements by Dataset Size

Dataset Size                Recommended Memory   Time Limit   Success Rate   Common Issues
Small (<100k cells)         32GB                 8 hours      99.6%          Rare memory errors
Medium (100k-200k cells)    32GB                 8 hours      98-100%        Stable processing
Large (200k-400k cells)     32-64GB              12 hours     Variable       Monitor closely
Very Large (>400k cells)    64GB+                12+ hours    7-53%          Requires optimization

Memory Failure Analysis

  • Primary failure mode: DE without sample variance (16.3% overall failure rate)
  • Memory error pattern: "Unable to allocate X GiB for array with shape (cells, genes)"
  • Critical threshold: ~400k cells with standard settings
  • Mitigation strategies: Reduce TOPN_GENES, enable STORE_ARRAYS_ON_DISK, use batching

Scaling

  • Subsampling: 1 job per dataset (generates 50-790 files depending on dataset)
  • Processing: Array jobs (1 job per file, 100-790 parallel jobs)
  • Stability testing: Array jobs (1 job per parameter value, 100-201 parallel jobs)
  • Analysis: Local machine post-processing (minutes for correlation computation)

Prerequisites and Setup

⚠️ Critical Configuration Required

Before running any pipeline jobs, you MUST configure micromamba to disable lock files:

# REQUIRED: Prevent job failures in concurrent environments
micromamba config set use_lockfiles False

Why: Large SLURM array jobs (100-398 concurrent processes) will fail with mamba lock contention errors without this setting.

Environment Setup

📋 Complete setup instructions: INSTALLATION_GUIDE.md

Quick setup for kompot-only usage:

micromamba create -n kompot_v1 python=3.10 -c conda-forge
micromamba activate kompot_v1
micromamba install -c conda-forge -c bioconda pandas numpy scipy matplotlib seaborn anndata scanpy h5py pyyaml tqdm jupyter scikit-learn umap-learn
pip install kompot

Full setup for multi-tool benchmarking: See INSTALLATION_GUIDE.md for complete instructions including all tool environments.

Requirements

  • Micromamba/Mamba 1.4.0+ and Python 3.10+
  • SLURM cluster with array job support
  • Storage: ~500GB per dataset for full workflow

Documentation

📖 Complete Documentation Library

Getting Started

Specialized Workflows

Technical Documentation

Development

  • CLAUDE.md - AI assistant project instructions and codebase context

Key Scripts and Utilities

Configuration and Submission

  • subsampling_config.py - Subsampling job submission and configuration
  • processing_config.py - Processing job submission and parameter stability
  • submit_benchmark.sh - SLURM submission script for processing
  • submit_subsampling.sh - SLURM submission script for subsampling

Core Processing

  • process_single_file.py - Main processing script for individual H5AD files
  • subsample_dataset.py - Dataset subsampling implementation

Analysis and Visualization

  • gather_stability_columns.py - DA/DE stability analysis and correlation computation
  • runtime_utils.py - Runtime data extraction and performance plotting
  • plot_stability.ipynb - DA/DE parameter and subsampling stability analysis notebook
  • runtime_analysis.ipynb - Comprehensive runtime scaling analysis notebook
  • benchmark_plots_accuracy.ipynb - Multi-tool benchmarking accuracy analysis

Testing and Validation

  • test_tool_functionality.py - Test tool implementations
  • test_r_tools_with_renv.py - Test R-based tools with renv
  • run_comprehensive_tests.py - Complete testing suite
  • validate_result_format.py - Validate standardized result formats

Utilities and Inspection

  • count_project_results.py - Count all computational runs across project
  • inspect_subsampling.py - Monitor subsampling progress
  • inspect_processing.py - Monitor processing progress
  • analyze_resource_usage.py - Resource usage analysis from logs

Quick Start Examples

Complete Workflow for New Dataset

# 1. Add dataset to processing_config.py and subsampling_config.py

# 2. Generate subsampled data (for cell count stability)
python subsampling_config.py --submit new_dataset

# 3. Process subsampled data (generates DA, DE, and runtime data)
python processing_config.py --submit-default new_dataset

# 4. Test parameter stability (ls_factor, sigma, n_components, etc.)
python processing_config.py --submit-all-stability new_dataset

# 5. Analyze DA/DE stability
jupyter notebook plot_stability.ipynb

# 6. Analyze runtime scaling
jupyter notebook runtime_analysis.ipynb

Monitoring and Debugging

# Monitor all jobs
squeue -u $USER

# Check specific job types
squeue -u $USER --name=subsample_aging
squeue -u $USER --name=kompot_benchmark

# View logs
tail -f logs/subsample_aging_*.out
tail -f data/subsampled/aging/subsampling_*.log

# Inspect subsampling progress
python inspect_subsampling.py --dataset aging

Contributing

Adding New Datasets

  1. Configure subsampling: Add to subsampling_config.py with input path and parameters
  2. Configure processing: Add to processing_config.py with column mappings and conditions
  3. Test pipeline: Run subsampling → processing → stability testing
  4. Validate results: Check outputs and run correlation analysis

Adding New Parameters

  1. Define parameter range: Add to STABILITY_CONFIGS in processing_config.py (see the sketch after this list)
  2. Implement parameter application: Update process_single_file.py
  3. Update result parsing: Modify gather_stability_columns.py filename patterns
  4. Test stability analysis: Verify correlation plots work correctly
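
A new entry in STABILITY_CONFIGS (step 1 above) might be sketched as follows; the field names are assumptions for illustration and the real schema in processing_config.py may differ.

# Hypothetical STABILITY_CONFIGS entry for a new parameter sweep (illustrative field names).
import numpy as np

STABILITY_CONFIGS = {
    "my_new_param": {
        "values": np.linspace(0.1, 10.0, 101).tolist(),          # one value per array task
        "slurm": {"mem": "32G", "time": "8:00:00", "cpus": 8},    # per-job resources
    },
}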

Citation

Please cite the kompot package and relevant datasets when using this pipeline:

@software{kompot2024,
  title={kompot: Differential abundance and expression analysis for single-cell data},
  author={Setty Lab},
  year={2024},
  url={https://github.com/settylab/kompot}
}

Troubleshooting

Common Issues by Stage

Subsampling Issues:

  • Input file not found: Check file paths in subsampling_config.py
  • Memory errors: Increase SLURM memory or reduce max_write_workers
  • Infinite loops: Fixed in current version with proper stopping conditions

Processing Issues:

  • Environment activation failures: Ensure conda/mamba is properly configured
  • Missing kompot columns: Check if subsampling completed successfully
  • Parameter stability using wrong files: Verify ORIGINAL_INPUT_FILE environment variable

Analysis Issues:

  • Missing columns in stability analysis: Use quick_column_check() to verify available data
  • Correlation computation errors: Check for sufficient non-null data (>10% required)
  • Plot generation failures: Verify parameter value parsing and scientific notation handling

Getting Help

  • Check stage-specific documentation (README_SUBSAMPLING.md, README_GATHER_RESULTS.md)
  • Examine log files in logs/ directory and dataset subdirectories
  • Use --dry-run and --verbose flags for debugging
  • Inspect intermediate outputs with provided utility scripts

License

MIT License

Copyright (c) 2025 Setty Lab

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
