CVRCseq is a collection of commonly used pipelines integrated into a single workflow via Snakemake. Previously, these existed as individual Snakemake workflows. This unified workflow is designed to run on NYU's UltraViolet HPC, which uses Slurm and offers several node types.
Currently, there are 5 RNA-seq analysis pipelines available:
- RNAseq_PE
  Paired-end data: fastqc → fastp → STAR → featurecounts
- RNAseq_SE
  Single-end data: fastqc → fastp → STAR → featurecounts
- RNAseq_HISAT2_stringtie
  Paired-end data: fastqc → fastp → HISAT2 → stringtie
- RNAseq_HISAT2_stringtie_nvltrx
  Paired-end data: fastqc → fastp → HISAT2 → stringtie → novel transcript identification
- RNAseqTE_PE
  Paired-end data: fastqc → fastp → STAR → TEcount
Currently, there is 1 small RNA-seq analysis pipeline available, designed to work with the QIAseq miRNA Library Kit from Qiagen:
- sRNAseq_SE
  Single-end data: fastqc → umi-tools → STAR → featurecounts
Currently, there are 3 DNA binding/enrichment pipelines available:
- ChIPseq_PE
  Paired-end data: fastqc → fastp → bowtie2 → macs2
- CUT-RUN_PE
  Paired-end data: fastqc → fastp → bowtie2 → seacr & macs2
- ATACseq_PE
  Paired-end data: fastqc → fastp → bowtie2 → macs2
workflow/Snakefile - Launches individual pipelines located in workflow/rules
Tab-delimited file containing sample metadata:
- R1 and R2 fastq file names - As received from sequencing center (remove lane numbers L00X for multi-lane samples)
- Simple sample names - User-defined identifiers
- Condition - Experimental condition (e.g., diabetic vs non_diabetic)
- Replicate number - Biological replicate identifier
- Antibody column - Required for ChIPseq/CUT-RUN (specifies antibody vs control samples)
- Final sample ID - Concatenation of sample name, condition, replicate, and antibody columns
- Additional metadata - Can be added for downstream analysis
Notes:
- cat_rename.py handles concatenation of multi-lane fastq files and renaming based on this table.
- Paired samples - For ChIPseq/CUT-RUN, sample name/condition/replicate should be identical between antibody and control pairs
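
The final sample ID described above can be sketched in a few lines of Python. This is only an illustration and not the code used by cat_rename.py: the column names, the underscore separator, and the helper name `build_sample_id` are all assumptions, and the table values are made up.

```python
import csv
import io

# Hypothetical column names; the real header of samples_info.tab may differ.
ID_COLUMNS = ["Sample", "Condition", "Replicate", "Antibody"]


def build_sample_id(row, sep="_"):
    """Join the non-empty ID columns of one row into a final sample ID."""
    parts = [(row.get(col) or "").strip() for col in ID_COLUMNS]
    return sep.join(part for part in parts if part)


# Tiny in-memory table standing in for config/samples_info.tab (made-up values).
example_table = (
    "R1\tR2\tSample\tCondition\tReplicate\tAntibody\n"
    "a_R1.fastq.gz\ta_R2.fastq.gz\tliver\tdiabetic\t1\tH3K27ac\n"
    "b_R1.fastq.gz\tb_R2.fastq.gz\tliver\tdiabetic\t1\tIgG\n"
)

rows = list(csv.DictReader(io.StringIO(example_table), delimiter="\t"))
sample_ids = [build_sample_id(row) for row in rows]
print(sample_ids)
```

In this made-up example the antibody and control rows share sample name, condition, and replicate, and differ only in the antibody column, which matches the pairing rule in the notes.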
Contains general and workflow-specific configuration parameters:
- sample_file - Location of samples_info.tab (default: config/samples_info.tab)
- workflow - Name of workflow being used
- genome - Location of indexed genome:
  - RNAseq_PE/RNAseq_SE/sRNAseq_SE: STAR 2.7.7a index
  - HISAT2 workflows: HISAT2 index
  - ChIPseq/CUT-RUN/ATACseq: bowtie2 index
- GTF - Location of annotation file
CUT-RUN_PE:
- spike_genome - Spike-in genome index (bowtie2)
- chromosome_lengths - Required for spike-in normalization. This file can be found in the STAR genome index folder (chrLength.txt)
- effective_genome_size - For MACS2
ChIPseq_PE & ATACseq_PE:
- effective_genome_size - For MACS2
RNAseq_HISAT2_stringtie variants:
- prepDE_length - Average fragment length for stringtie prepDE script
RNAseqTE_PE:
- TE_GTF - GTF file with TE annotations (available from MGH lab)
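
As a rough aid for filling in the configuration, the parameters listed above can be checked against the selected workflow before launching. The sketch below is not part of the pipeline. It assumes the parameters are flat top-level keys in config.yaml and that the configuration has already been loaded into a dictionary; the helper name `missing_keys` and all paths and values are placeholders. Whether every workflow reads every general key has not been verified here.

```python
# General keys and workflow-specific keys, copied from the parameter list above.
GENERAL_KEYS = ["sample_file", "workflow", "genome", "GTF"]
WORKFLOW_KEYS = {
    "CUT-RUN_PE": ["spike_genome", "chromosome_lengths", "effective_genome_size"],
    "ChIPseq_PE": ["effective_genome_size"],
    "ATACseq_PE": ["effective_genome_size"],
    "RNAseq_HISAT2_stringtie": ["prepDE_length"],
    "RNAseq_HISAT2_stringtie_nvltrx": ["prepDE_length"],
    "RNAseqTE_PE": ["TE_GTF"],
}


def missing_keys(config):
    """Return the expected keys that are absent from a loaded config dict."""
    workflow = config.get("workflow", "")
    expected = GENERAL_KEYS + WORKFLOW_KEYS.get(workflow, [])
    return [key for key in expected if key not in config]


# Placeholder configuration; paths and numbers are illustrative only.
example_config = {
    "sample_file": "config/samples_info.tab",
    "workflow": "CUT-RUN_PE",
    "genome": "/path/to/bowtie2_index",
    "GTF": "/path/to/annotation.gtf",
    "effective_genome_size": 2700000000,
}

print(missing_keys(example_config))
```

For this placeholder CUT-RUN_PE configuration the check reports the two spike-in parameters that were left out.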
Defines default Slurm resources for each rule.
Preprocessing script that:
- Concatenates fastq files split across multiple sequencing lanes
- Renames fastq files from verbose sequencing center IDs to user-defined names
- Creates new files as sample_id_Rx.fastq.gz
- Executed automatically via snakemake_init.sh

Skip option: Use the -c flag with snakemake_init.sh to bypass this step.
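
The lane concatenation step can be illustrated with a short sketch. It is not the actual cat_rename.py: the lane pattern `_L00\d`, the helper names, and the file names are assumptions. It relies on the fact that gzip files appended byte-for-byte form a valid multi-member gzip file, so no decompression is needed.

```python
import gzip
import re
import shutil
import tempfile
from collections import defaultdict
from pathlib import Path

# Assumed Illumina-style lane token, e.g. "_L001"; adjust to the real file names.
LANE_TOKEN = re.compile(r"_L00\d")


def group_by_lane_stripped_name(fastq_names):
    """Map each lane-stripped file name to its per-lane files, in sorted order."""
    groups = defaultdict(list)
    for name in sorted(fastq_names):
        groups[LANE_TOKEN.sub("", name)].append(name)
    return dict(groups)


def concatenate(sources, destination):
    """Append gzip files byte-for-byte into one multi-member gzip file."""
    with open(destination, "wb") as out:
        for src in sources:
            with open(src, "rb") as handle:
                shutil.copyfileobj(handle, out)


# Demonstration on two small made-up lane files in a temporary directory.
workdir = Path(tempfile.mkdtemp())
lanes = {
    "s1_L001_R1.fastq.gz": b"@r1\nACGT\n+\nIIII\n",
    "s1_L002_R1.fastq.gz": b"@r2\nTTTT\n+\nIIII\n",
}
for name, payload in lanes.items():
    with gzip.open(workdir / name, "wb") as fh:
        fh.write(payload)

groups = group_by_lane_stripped_name(lanes)
merged = workdir / "sampleA_R1.fastq.gz"  # hypothetical sample_id_Rx.fastq.gz name
concatenate([workdir / n for n in groups["s1_R1.fastq.gz"]], merged)

with gzip.open(merged, "rb") as fh:
    merged_bytes = fh.read()
```

Sorting the lane files before appending keeps R1 and R2 reads in the same order, provided both mates are processed the same way.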
Main execution script that:
- Executes cat_rename.py
- Loads conda environment
- Launches Snakemake pipeline
- Runs MultiQC for quality control
Launches pipeline from compute node (recommended over login node). Edit the snakemake_init.sh command with desired parameters and submit via sbatch.
Sets environment variables and loads the conda environment.
Computes Fraction of Reads in Peaks (FRP) and outputs a summary table with:
- FRP values
- Total fragments
- Fragments within peaks
Combines counts from TEcount into a single .csv file.
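
One way such a merge could look is sketched below. It is not the pipeline's script. It assumes each per-sample table is tab-delimited with a single header line followed by feature and count columns, fills features missing from a sample with 0, and uses made-up feature names and counts.

```python
import csv
import io


def read_counts(handle):
    """Read a two-column, tab-delimited count table that has one header line."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line
    return {feature: int(count) for feature, count in reader}


def combine_counts(tables):
    """Turn {sample: {feature: count}} into rows of a wide table, filling gaps with 0."""
    samples = sorted(tables)
    features = sorted({f for counts in tables.values() for f in counts})
    rows = [["feature"] + samples]
    for feature in features:
        rows.append([feature] + [tables[s].get(feature, 0) for s in samples])
    return rows


# Made-up inputs standing in for per-sample count files.
inputs = {
    "sampleA": "gene/TE\tsampleA.bam\nGeneX\t10\nTE1:FamA:ClassA\t4\n",
    "sampleB": "gene/TE\tsampleB.bam\nGeneX\t7\nTE2:FamB:ClassB\t1\n",
}
tables = {name: read_counts(io.StringIO(text)) for name, text in inputs.items()}

# Write the combined table as CSV text (a real script would write to a file).
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(combine_counts(tables))
combined_csv = out.getvalue()
print(combined_csv)
```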
Contains conda environment specifications for the pipeline.
- Clone repository
  git clone https://github.com/mgildea87/CVRCseq.git
- Update sample information
  Edit config/samples_info.tab with fastq.gz file names, the desired sample, condition, and replicate names, and Antibody/IgG control status (if using)
- Configure workflow
  Update config.yaml with project-specific settings
- Customize parameters (optional)
  Set workflow-specific parameters (e.g. alignment parameters) in the appropriate workflow/rules .smk file if desired.
- Launch pipeline
  bash workflow/scripts/snakemake_init.sh

Description of parameters:
-h help
-d .fastq directory
-s parameters to pass to snakemake (e.g. --unlock)
-w workflow name (e.g. 'RNAseq_PE')
-c Skip cat_rename.py. Use to skip copying, concatenating, and renaming of .fastq files to the workflow/inputs/fastq/ local directory
snakemake, STAR, fastqc, fastp, subread - featurecounts, HISAT2, stringtie, TEcount, umi-tools, bowtie2, macs2, seacr