🐍 CVRCseq 🐍

CVRCseq is a collection of commonly used sequencing analysis pipelines integrated into a single Snakemake workflow. Previously, these existed as individual Snakemake workflows. The unified workflow is designed to run on NYU's UltraViolet HPC, which uses Slurm and offers a variety of node types.

Available Pipelines

🧬 RNA-seq Analysis

Currently, there are 5 RNA-seq analysis pipelines available:

RNAseq_PE
Paired-end data: fastqc → fastp → STAR → featurecounts
RNAseq_SE
Single-end data: fastqc → fastp → STAR → featurecounts
RNAseq_HISAT2_stringtie
Paired-end data: fastqc → fastp → HISAT2 → stringtie
RNAseq_HISAT2_stringtie_nvltrx
Paired-end data: fastqc → fastp → HISAT2 → stringtie → novel transcript identification
RNAseqTE_PE
Paired-end data: fastqc → fastp → STAR → TEcount

🧬 Small RNA-seq Analysis

Currently, there is 1 small RNA-seq analysis pipeline available, designed to work with the QIAseq miRNA Library Kit from Qiagen:

sRNAseq_SE
Single-end data: fastqc → umi-tools → STAR → featurecounts

🧬 DNA Binding/Enrichment Analysis

Currently, there are 3 DNA binding/enrichment pipelines available:

ChIPseq_PE
Paired-end data: fastqc → fastp → bowtie2 → macs2
CUT-RUN_PE
Paired-end data: fastqc → fastp → bowtie2 → seacr & macs2
ATACseq_PE
Paired-end data: fastqc → fastp → bowtie2 → macs2

📁 File Descriptions

Snakefiles

workflow/Snakefile - Launches individual pipelines located in workflow/rules

Configuration Files

`config/samples_info.tab`

Tab-delimited file containing sample metadata:

R1 and R2 fastq file names - As received from sequencing center (remove lane numbers L00X for multi-lane samples)
Simple sample names - User-defined identifiers
Condition - Experimental condition (e.g., diabetic vs non_diabetic)
Replicate number - Biological replicate identifier
Antibody column - Required for ChIPseq/CUT-RUN (specifies antibody vs control samples)
Final sample ID - Concatenation of sample name, condition, replicate, and antibody columns
Additional metadata - Can be added for downstream analysis

Notes:

cat_rename.py handles concatenation of multi-lane fastq files and renaming based on this table.
Paired samples - For ChIPseq/CUT-RUN, sample name/condition/replicate should be identical between antibody and control pairs
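As a rough illustration of how the final sample ID relates to the other columns, the sketch below joins the sample name, condition, replicate, and antibody values. The helper name `build_sample_id`, the underscore separator, and the choice to skip an empty antibody value are assumptions made for illustration; cat_rename.py may build the ID differently.

```python
def build_sample_id(sample, condition, replicate, antibody=""):
    """Join metadata columns into one sample ID (hypothetical helper)."""
    # Assumption: fields are joined with underscores and empty fields are skipped.
    parts = [sample, condition, str(replicate), antibody]
    return "_".join(part for part in parts if part)


# RNA-seq style row (no antibody column)
print(build_sample_id("S1", "diabetic", 1))
# ChIPseq/CUT-RUN style rows: antibody and control share name/condition/replicate
print(build_sample_id("S1", "diabetic", 1, "IgG"))
```

Under these assumptions, an antibody sample and its control differ only in the last field, which is one way the pairing described in the notes could be recovered.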

`config/config.yaml`

Contains general and workflow-specific configuration parameters:

Generic Requirements:

sample_file - Location of samples_info.tab (default: config/samples_info.tab)
workflow - Name of workflow being used
genome - Location of indexed genome:
- RNAseq_PE/RNAseq_SE/sRNAseq_SE: STAR 2.7.7a index
- HISAT2 workflows: HISAT2 index
- ChIPseq/CUT-RUN/ATACseq: bowtie2 index
GTF - Location of annotation file

Workflow-Specific config settings:

CUT-RUN_PE:

spike_genome - Spike-in genome index (bowtie2)
chromosome_lengths - Required for spike-in normalization. This file can be found in the STAR genome index folder (chrLength.txt)
effective_genome_size - For MACS2

ChIPseq_PE & ATACseq_PE:

effective_genome_size - For MACS2

RNAseq_HISAT2_stringtie variants:

prepDE_length - Average fragment length for stringtie prepDE script

RNAseqTE_PE:

TE_GTF - GTF file with TE annotations (available from MGH lab)
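The requirements above can be summarized as a small pre-flight check. This is a minimal sketch, not part of the repository: the function `missing_config_keys` is hypothetical, the key names are taken from this README and assumed to match config.yaml exactly, and the config is assumed to be already loaded into a plain dictionary.

```python
# Keys every workflow needs, as listed under "Generic Requirements".
GENERIC_KEYS = ["sample_file", "workflow", "genome", "GTF"]

# Extra keys per workflow, as listed under "Workflow-Specific config settings".
WORKFLOW_KEYS = {
    "CUT-RUN_PE": ["spike_genome", "chromosome_lengths", "effective_genome_size"],
    "ChIPseq_PE": ["effective_genome_size"],
    "ATACseq_PE": ["effective_genome_size"],
    "RNAseq_HISAT2_stringtie": ["prepDE_length"],
    "RNAseq_HISAT2_stringtie_nvltrx": ["prepDE_length"],
    "RNAseqTE_PE": ["TE_GTF"],
}


def missing_config_keys(config):
    """Return required keys that are absent or empty (hypothetical helper)."""
    required = GENERIC_KEYS + WORKFLOW_KEYS.get(config.get("workflow"), [])
    return [key for key in required if config.get(key) in (None, "")]


# Placeholder paths for illustration only.
example = {
    "sample_file": "config/samples_info.tab",
    "workflow": "CUT-RUN_PE",
    "genome": "/path/to/bowtie2_index",
    "GTF": "/path/to/annotation.gtf",
    "effective_genome_size": 1,
}
print(missing_config_keys(example))
```

Running a check like this before launching would flag, for example, a CUT-RUN_PE config that lacks the spike-in settings.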

`config/profile/config.yaml`

Defines default Slurm resources for each rule.

📝 Scripts

`workflow/scripts/cat_rename.py`

Preprocessing script that:

- Concatenates fastq files split across multiple sequencing lanes
- Renames fastq files from verbose sequencing center IDs to user-defined names
- Creates new files as sample_id_Rx.fastq.gz
- Is executed automatically via snakemake_init.sh

Skip option: Use -c flag with snakemake_init.sh to bypass this step.
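The lane-merging step can be sketched as follows. This is an illustration of the general technique, not the code in cat_rename.py: the function names are hypothetical, and the lane tag is assumed to look like `_L001` (Illumina-style naming). It relies on the fact that gzip files may be concatenated byte for byte into a valid multi-member gzip file.

```python
import re
import shutil
from collections import defaultdict

# Assumption: lanes appear in file names as _L001, _L002, ...
LANE_TOKEN = re.compile(r"_L00\d")


def group_by_lane_stripped_name(fastq_names):
    """Map each lane-free file name to the sorted per-lane files behind it."""
    groups = defaultdict(list)
    for name in sorted(fastq_names):
        groups[LANE_TOKEN.sub("", name, count=1)].append(name)
    return dict(groups)


def concatenate(parts, destination):
    """Append the files in `parts` byte for byte into `destination`."""
    with open(destination, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


names = [
    "A_S1_L002_R1_001.fastq.gz",
    "A_S1_L001_R1_001.fastq.gz",
    "A_S1_L001_R2_001.fastq.gz",
]
for merged_name, lane_files in group_by_lane_stripped_name(names).items():
    print(merged_name, "<-", lane_files)
```

The lane-free name produced here corresponds to the file name entered in samples_info.tab, which is why lane numbers are removed from that table for multi-lane samples.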

`workflow/scripts/snakemake_init.sh`

Main execution script that:

- Executes cat_rename.py
- Loads conda environment
- Launches Snakemake pipeline
- Runs MultiQC for quality control

`workflow/scripts/launch_sbatch.sh`

Launches pipeline from compute node (recommended over login node). Edit the snakemake_init.sh command with desired parameters and submit via sbatch.

`workflow/scripts/condaload_CVRCseq.sh`

Sets environment variables and loads the conda environment.

`workflow/scripts/FRP.py`

Computes Fraction of Reads in Peaks (FRP) and outputs a summary table with:

- FRP values
- Total fragments
- Fragments within peaks
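The three summary columns are related by a simple ratio: FRP is the number of fragments within peaks divided by the total number of fragments. The sketch below shows only that arithmetic; the function name is hypothetical and the way FRP.py counts fragments is not shown here.

```python
def fraction_of_reads_in_peaks(fragments_in_peaks, total_fragments):
    """Return fragments within peaks divided by total fragments."""
    if total_fragments <= 0:
        raise ValueError("total_fragments must be positive")
    return fragments_in_peaks / total_fragments


# Illustrative numbers only: 250 of 1000 fragments fall within peaks.
print(fraction_of_reads_in_peaks(250, 1000))
```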

`workflow/scripts/combine_TE_counts.py`

Combines counts from TEcount into a single .csv file.

Environment

`workflow/envs/CVRCseq.yml`

Contains conda environment specifications for the pipeline.

🚀 Usage Instructions

Getting Started

Clone repository
git clone https://github.com/mgildea87/CVRCseq.git
Update sample information
Edit config/samples_info.tab with fastq.gz file names and desired sample, condition, replicate names, and Antibody/IgG control status (if using)
Configure workflow
Update config.yaml with project-specific settings
Customize parameters (optional)
Set workflow-specific parameters (e.g., alignment parameters) in the appropriate workflow/rules .smk file if desired.
Launch pipeline
`bash workflow/scripts/snakemake_init.sh`
Description of parameters:
- `-h` help
- `-d` .fastq directory
- `-s` parameters to pass to snakemake (e.g. --unlock)
- `-w` workflow name (e.g. 'RNAseq_PE')
- `-c` Skip cat_rename.py. Use to skip copying, concatenating, and renaming of .fastq files to the workflow/inputs/fastq/ local directory
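To show how the flags combine, the sketch below assembles a launch command as an argument list. It is an illustration only: the helper `build_launch_command` is hypothetical, the fastq path is a placeholder, and it assumes that `-d`, `-s`, and `-w` each take one value while `-c` takes none, as the descriptions above suggest.

```python
import shlex


def build_launch_command(workflow, fastq_dir, snakemake_args=None, skip_cat_rename=False):
    """Assemble the snakemake_init.sh call as an argument list (hypothetical helper)."""
    cmd = ["bash", "workflow/scripts/snakemake_init.sh", "-w", workflow, "-d", fastq_dir]
    if snakemake_args:
        # Assumption: extra Snakemake options are passed as a single -s value.
        cmd += ["-s", snakemake_args]
    if skip_cat_rename:
        cmd.append("-c")
    return cmd


# Print a shell-ready command line; /path/to/fastq is a placeholder.
print(shlex.join(build_launch_command("RNAseq_PE", "/path/to/fastq", skip_cat_rename=True)))
```

The same command can be placed in launch_sbatch.sh and submitted via sbatch so that the pipeline is launched from a compute node.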

Software links

snakemake, STAR, fastqc, fastp, subread - featurecounts, HISAT2, stringtie, TEcount, umi-tools, bowtie2, macs2, seacr
