
Cascabel

Cascabel is a pipeline designed to run amplicon sequence analysis across single or multiple read libraries. The objective of this pipeline is to create different output files that allow the user to explore the data in a simple and meaningful way and that facilitate downstream analysis.

CASCABEL was designed for short read high-throughput sequence data. It covers quality control on the fastq files, assembling paired-end reads to fragments (it can also handle single end data), splitting the libraries into samples (optional), OTU picking and taxonomy assignment. Besides other output files, it will return an OTU table.

Our pipeline is implemented with Snakemake as workflow management engine and allows customizing the analyses by offering several choices for most of the steps. The pipeline can make use of multiple computing nodes and scales from personal computers to computing servers. The analyses and results are fully reproducible and documented in an html and optional pdf report.

Current version: 7.0.0

Installation

The easiest and recommended way to install Cascabel is via Conda. The fastest way to obtain Conda is to install Miniconda, a mini version of Anaconda that includes only conda and its dependencies.

Miniconda

To install conda or miniconda, please see the official installation tutorial (recommended) or, if you are working on a Linux OS, you can try the following:

Download the installer:


wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

Execute the installation script and follow the instructions.


bash Miniconda3-latest-Linux-x86_64.sh

Download CASCABEL

Once you have conda installed, you are ready to clone or download the project.

You can clone the project:


git clone https://github.com/AlejandroAb/CASCABEL.git

Or download it from this repository:


wget https://github.com/AlejandroAb/CASCABEL/archive/master.zip

After downloading or cloning the repository, cd into the "CASCABEL" directory and execute the following command to create CASCABEL's environment:


conda env create --name cascabel --file environment.yaml

Activate environment

Now you can activate your new environment.


conda activate cascabel

After activating the environment, if you already have the environment variable PERL5LIB configured, you may need to change it. To avoid any issues, configure the PERL5LIB path as follows:


export  PERL5LIB=/path/to/conda/.conda/envs/cascabel/perl5

Just make sure to replace /path/to/conda/ with the correct path on your system.

To identify this path, you can use the following command:


which snakemake
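For example, if which snakemake returns the location of the snakemake executable inside the environment, strip the trailing bin/snakemake and append perl5. The path below is a hypothetical example; adapt it to the output on your system:

# Hypothetical output of `which snakemake`:
#   /home/user/.conda/envs/cascabel/bin/snakemake
# The matching PERL5LIB would then be:
export PERL5LIB=/home/user/.conda/envs/cascabel/perl5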

Dada2

Some issues have been reported when installing dada2 within conda. If you are experiencing such issues, you need to perform one final step in order to install dada2.

Enter the R shell (just type R) and execute the following command:


BiocManager::install("dada2", version = "3.10")

*Please note that BiocManager should already be installed, so you only need to execute the previous command. You can also find more information at [dada2's installation guide](https://benjjneb.github.io/dada2/dada-installation.html).


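To verify the installation without opening an interactive R session, you can run the same check from the shell, for example:

R -e 'library(dada2); packageVersion("dada2")'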

Getting started

Required input files:

  • Forward raw reads (fastq or fastq.gz)
  • Reverse raw reads (fastq or fastq.gz) (only for paired-end layout)
  • File with barcode information (only needed for demultiplexing; see the barcode mapping file example below for the format)

Main expected output files for downstream analysis

  • Demultiplexed and trimmed reads
  • OTU or ASV table
  • Representative sequences fasta file
  • OTU taxonomy assignment
  • Taxonomy summary
  • Representative sequence alignment
  • Phylogenetic tree
  • CASCABEL Report

Run Cascabel

All the parameters and behavior of the workflow are specified through the configuration file; therefore, the easiest way to get the pipeline running is to fill in some required parameters in that file.

#------------------------------------------------------------------------------#
#                             Project Name                                     #
#------------------------------------------------------------------------------#
# The name of the project for which the pipeline will be executed. This should #
# be the same name used as the first parameter of the init_sample.sh script    #
# (if used for multiple libraries).                                            #
#------------------------------------------------------------------------------#
PROJECT: "My_CASCABEL_Project"

#------------------------------------------------------------------------------#
#                            LIBRARIES/SAMPLES                                 #
#------------------------------------------------------------------------------#
# SAMPLES/LIBRARIES you want to include in the analysis.                       #
# Use the same library names as with the init_sample.sh script.                #
# Include each library name surrounded by quotes, and comma separated.         #
# i.e LIBRARY:  ["LIB_1","LIB_2",..."LIB_N"]                                   #
# LIBRARY_LAYOUT: Configuration of the library; all the libraries/samples      #
#                 must have the same configuration; use:                       #
#                 "PE" for paired-end reads [Default].                         #
#                 "SE" for single-end reads.                                   #
#------------------------------------------------------------------------------#
LIBRARY: ["EXP1"]
LIBRARY_LAYOUT: "PE"

#------------------------------------------------------------------------------#
#                             INPUT FILES                                      #
#------------------------------------------------------------------------------#
# To run Cascabel for multiple libraries you can provide an input file, tab    #
# separated with the following columns:                                        #
# - Library: Name of the library (this has to match the values entered         #
#            in the LIBRARY variable described above).                         #
# - Forward reads: Full path to the forward reads.                             #
# - Reverse reads: Full path to the reverse reads (only for paired-end).       #
# - metadata:      Full path to the file with the information for              #
#                  demultiplexing the samples (only if needed).                #
# The full path of this file should be supplied in the input_files variable,   #
# otherwise, you have to enter the FULL PATH for both: the raw reads and the   #
# metadata file (barcode mapping file). The metadata file is only needed if    #
# you want to perform demultiplexing.                                          #
# If you want to avoid the creation of this file a third solution is available #
# using the script init_sample.sh. More info at the project Wiki:              #
# https://github.com/AlejandroAb/CASCABEL/wiki#21-input-files                  #
#                                                                              #
#-----------------------------       PARAMS       -----------------------------#
#                                                                              #
# - fw_reads:  Full path to the raw reads in forward direction (R1)            #
# - rw_reads:  Full path to the raw reads in reverse direction (R2)            #
# - metadata:  Full path to the metadata file with barcodes for each sample    #
#              to perform library demultiplexing                               #
# - input_files: Full path to a file with the information for the library(s)   #
#                                                                              #
# ** Please supply only one of the following:                                  #
#     - fw_reads, rv_reads and metadata                                        #
#     - input_files                                                            #
#     - or use init_sample.sh script directly                                  #
#------------------------------------------------------------------------------#
fw_reads: "/full/path/to/forward.reads.fq"
rv_reads: "/full/path/to/reverse.reads.fq"
metadata: "/full/path/to/metadata.barcodes.txt"
#or
input_files: "/full/path/to/input_reference.txt"

#------------------------------------------------------------------------------#
#  ASV_WF:             Binned qualities and Big data workflow                  #
#------------------------------------------------------------------------------#
# For fastq files with binned qualities (e.g. NovaSeq and NextSeq) the error   #
# learning process within dada2 can be affected, and some data scientists      #
# suggest that enforcing monotonicity could be beneficial for the analysis.    #
# In this section, you can modify key parameters to enforce monotonicity and   #
# also go through a big data workflow when the number of reads may exceed the  #
# physical memory limit.                                                       #
# More on binned qualities: https://www.illumina.com/content/dam/illumina-marketing/documents/products/technotes/technote_understanding_quality_scores.pdf
# You can also follow this excellent thread about binned qualities and Dada2: https://forum.qiime2.org/t/novaseq-and-dada2-incompatibility/25865/8
#------------------------------------------------------------------------------#
binned_q_scores: "F" #Binned quality scores. Set this to "T" if you want to enforce monotonicity
big_data_wf: "F" #Set this to "T" when your sequencing run contains more than 10^9 reads (depends on RAM availability!)


#------------------------------------------------------------------------------#
#                               RUN                                            #
#------------------------------------------------------------------------------#
# Name of the RUN - Only use alphanumeric characters and don't use spaces.     #
# This parameter helps the user to execute different runs (pipeline executions)#
# with the same input data but with different parameters (ideally).            #
# The RUN parameter can be set here or remain empty, in the latter case, the   #
# user must assign this value via the command line.                            #
# i.e:  --config RUN=run_name                                                  #
#------------------------------------------------------------------------------#
RUN: "My_First_run"

#------------------------------------------------------------------------------#
#                                 ANALYSIS TYPE                                #
# rules:                                                                       #
#------------------------------------------------------------------------------#
# Cascabel supports two main types of analysis:                                #
#  1) Analysis based on traditional OTUs (Operational Taxonomic Units) which   #
#     are mainly generated by clustering sequences based on a shared           #
#     similarity threshold.                                                    #
#  2) Analysis based on ASV (Amplicon sequence variant). This kind of analysis #
#     also deals with errors in the sequence reads such that true sequence     #
#     variants can be resolved, down to the level of single-nucleotide         #
#     differences.                                                             #
#                                                                              #
#-----------------------------       PARAMS       -----------------------------#
#                                                                              #
# - ANALYSIS_TYPE    "OTU" or "ASV". Defines the type of analysis              #
#------------------------------------------------------------------------------#
ANALYSIS_TYPE: "OTU"

For more information about how to supply this data, please see the detailed instructions in the project wiki (https://github.com/AlejandroAb/CASCABEL/wiki#21-input-files).
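As an illustration of the input_files option described above, a minimal tab-separated reference file could look like the sketch below (library names and paths are hypothetical; the columns are library, forward reads, reverse reads and metadata, separated by tabs):

EXP1    /full/path/to/EXP1_fw.fastq.gz    /full/path/to/EXP1_rv.fastq.gz    /full/path/to/EXP1.barcodes.txt
EXP2    /full/path/to/EXP2_fw.fastq.gz    /full/path/to/EXP2_rv.fastq.gz    /full/path/to/EXP2.barcodes.txt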

As you can see in the previous fragment of the configuration file (config.yaml), the required parameters for CASCABEL to start are: PROJECT, LIBRARY, RUN, fw_reads, rv_reads and metadata. After entering these parameters, take a few minutes to go through the rest of the config file and adjust settings according to your needs. Most values are already pre-configured. The config file explains itself by using meaningful headers before each rule, stating the aim of that rule and the different parameters the user can set. It is very important to keep the indentation of the file (don't change the tabs and spaces), as well as the names of the parameters. Once you have valid values for these entries, you are ready to run the pipeline (before starting CASCABEL, it is always good practice to make a "dry run"):
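A dry run lists the jobs Snakemake would execute without actually running them, for example:

snakemake --configfile config.yaml -np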

Also, please note the ANALYSIS TYPE section. Cascabel supports two main types of analysis, OTUs (Operational Taxonomic Units) and ASVs (Amplicon Sequence Variants); here you can select the target workflow that Cascabel will execute. For more information, please refer to the Analysis type section.

Run Cascabel

Once everything is in place, just run Cascabel with the following command:


snakemake --configfile config.yaml -j1 -c20

If you run Cascabel interactively, specify only one job: -j1, so that you can correctly interact with the pipeline at the designated breakpoints.

Adjust the option -c20 to designate the maximum number of cores to use within the pipeline rules. In this example, we are using 20.

Optionally, you can specify the same parameters* via the --config flag, rather than within the config.yaml file:


snakemake --configfile config.yaml --config PROJECT="My_CASCABEL_Project" RUN="My_First_run" fw_reads="/full/path/to/forward.reads.fq" rv_reads="/full/path/to/reverse.reads.fq" metadata="/full/path/to/metadata.barcodes.txt"

*Except for LIBRARY, as this is declared as an array and therefore must be filled in within the configuration file.

Create the report

The report is one of the most important parts of the pipeline, as it shows the executed commands, software versions, and read cleaning and read filtering results, and it points to and even embeds some of the main output files (such as the OTU table), making the results more portable and easy to back up after a completed run.

To create this report, stay in the same directory used for the run, use the same configuration file as for the run, and execute the following command:


snakemake --configfile config.yaml --report name_your_report.zip

The resulting file name_your_report.zip will contain everything you need.
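To inspect it, unpack the archive and open the HTML report it contains in a web browser, for example:

unzip name_your_report.zip -d name_your_report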

Configure pipeline

For a complete guide on how to set up and use CASCABEL, please visit the official project wiki.

Configuration files

We supply some "pre-filled" configuration files for the main possible setups, such as double- and single-barcoded paired-end reads for OTU and ASV analysis (an example invocation is shown after the list). We strongly advise making informed choices about parameter settings that match the individual needs of the experiment and data set.

  • config.otu.double_bc.yaml. Configuration file for paired-end data, barcodes on both reads, OTU analysis.
  • config.asv.double_bc.yaml. Configuration file for paired-end data, barcodes on both reads, ASV analysis.
  • config.otu.double_bc.unpaired.yaml. Configuration file for paired-end data, barcodes on both reads, OTU analysis, unpaired workflow, taxonomy assignment with RDP.
  • config.asv.double_bc.unpaired.yaml. Configuration file for paired-end data, barcodes on both reads, ASV analysis, unpaired workflow.
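To use one of these files, pass it to Snakemake instead of the default config.yaml, for example:

snakemake --configfile config.asv.double_bc.yaml -j1 -c20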

Test data

In order to test the pipeline, we also suggest trying to run it with CASCABEL's test data.

Barcode mapping file example
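A minimal sketch of such a mapping file, assuming a QIIME-style tab-separated format (the sample names, barcodes and primer sequence below are hypothetical placeholders):

#SampleID    BarcodeSequence    LinkerPrimerSequence    Description
Sample1    ACGTACGT    GTGYCAGCMGCCGCGGTAA    Sample1_description
Sample2    TGCATGCA    GTGYCAGCMGCCGCGGTAA    Sample2_description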

Citing

Cascabel: a scalable and versatile amplicon sequence data analysis pipeline delivering reproducible and documented results. Alejandro Abdala Asbun, Marc A Besseling, Sergio Balzano, Judith van Bleijswijk, Harry Witte, Laura Villanueva, Julia C Engelmann Front. Genet.; doi: https://doi.org/10.3389/fgene.2020.489357
