Tool comparison for detecting differentially expressed individual transposable elements.
This evaluation is based on raw count tables and supports those generated by SalmonTE, SQuIRE, TEtools, Telescope and TEtranscripts.
Detailed information can be found in the publication: Locus-specific expression analysis of transposable elements. Project-specific program calls can be found in Supplemental file 6.
The script simulation_polyester.R simulates a data set using the tool polyester. Its arguments let you tailor the simulated data set to your needs. The only required input is a FASTA file containing the sequences from which reads are simulated; all other arguments are optional.
Rscript simulation_polyester.R --fa <fasta> [--dete <percentage>] [--replicates <replicates>] [--setup <single/paired>] [--length <read length>] [--output <outdir>]
| Arguments | Definition |
|---|---|
| --fa | fasta file that contains reference sequences |
| --dete | defines the percentage of elements that are simulated as differentially expressed (default: 5) |
| --replicates | defines number of replicates per condition (default: 5) |
| --setup | defines if a single- or paired-end data set is simulated (options: single, paired; default: single) |
| --length | defines the read length (default: 100) |
| --output | defines the output directory (default: simulated_data_set) |
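If the simulation is launched from a larger pipeline, the call can be assembled programmatically. The following is a minimal Python sketch that uses only the flags listed in the table above; the function name `build_simulation_command` and the input file name `te_sequences.fa` are hypothetical and not part of this repository.

```python
import subprocess


def build_simulation_command(fasta, dete=None, replicates=None, setup=None,
                             length=None, output=None):
    """Assemble the Rscript call for simulation_polyester.R.

    Only --fa is required. Options left as None are omitted, so the
    script falls back to its documented defaults.
    """
    if setup is not None and setup not in ("single", "paired"):
        raise ValueError("setup must be 'single' or 'paired'")
    cmd = ["Rscript", "simulation_polyester.R", "--fa", str(fasta)]
    optional = {
        "--dete": dete,
        "--replicates": replicates,
        "--setup": setup,
        "--length": length,
        "--output": output,
    }
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    # Hypothetical input file name; replace it with your own FASTA file.
    cmd = build_simulation_command("te_sequences.fa", dete=10, setup="paired")
    print(" ".join(cmd))
    # Uncomment to actually start the simulation:
    # subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.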
The tools were run with their default settings; however, some adaptations were made for SalmonTE, TEtranscripts and TEtools, which are explained in more detail in the publication.
| Tool | Source | DOI |
|---|---|---|
| SalmonTE | https://github.com/LiuzLab/SalmonTE | 10.1142/9789813235533_0016 |
| Telescope | https://github.com/mlbendall/telescope | 10.1101/398172 |
| TEtranscripts | https://github.com/mhammell-laboratory/TEtranscripts | 10.1093/bioinformatics/btv422 |
| SQuIRE | https://github.com/wyang17/SQuIRE | 10.1093/nar/gky1301 |
| TEtools | https://github.com/douglasgscofield/TEtools | 10.1093/nar/gkw953 |
The paths of the directories where the results are located, the count tables (for SQuIRE, the common prefix), and the additional files have to be entered in dataInfo.csv. The evaluation needs a reference against which the results of the tools are compared. This reference has to be stored under Simulation.
Additional files:
- SQuIRE needs a 'dictionary' to translate TE ids. This file can be generated with generateDict.py and is explained further down.
- TEtools needs a file that lists the fastq files in the order of the original TEtools call, without the extension .fastq, e.g.: sample_1 sample_2 sample_3 ...
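The order file for TEtools could be produced with a few lines of Python. This is only a sketch: the helper names are hypothetical, and writing one sample name per line is an assumption, since the exact layout of the file is not specified here.

```python
from pathlib import Path


def sample_names(fastq_paths):
    """Return the file names without directory and without the .fastq
    extension, keeping the order of the input list."""
    names = []
    for path in fastq_paths:
        name = Path(path).name
        if name.endswith(".fastq"):
            name = name[: -len(".fastq")]
        names.append(name)
    return names


def write_order_file(fastq_paths, out_path):
    # Assumption: one sample name per line. Adjust the separator if the
    # evaluation expects a different layout.
    Path(out_path).write_text("\n".join(sample_names(fastq_paths)) + "\n")
```

The list passed to `sample_names` must be in the same order as the fastq files in the original TEtools call.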
Once dataInfo.csv is filled in, run Rscript TEdetectEval.R to start the evaluation. Afterwards, running Rscript figures.R and Rscript tables.R generates the figures and tables.
SQuIRE has a method to generate a .bed file in which every TE instance gets an identifier that is also used in the resulting count table. The identifier for each TE has the following format:
chr|start|end|TE-subfamily:TE-family:TE-repclass|score|strand
However, the TE identifier used in the simulated data set is assembled as follows:
chr|start|end|TE-repclass|TE-family|TE-subfamily|score|Kimura distance
For the evaluation, it is necessary to know which simulated TE was detected by SQuIRE, so a dictionary is generated to translate the TE ids. Since chr, start and end are unique for each TE, these three values are used to match the ids and to build a table of the TE ids that belong together.
This can be done with the helper script generateDict.py, which needs as input the bed-file of SQuIRE and your own.
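The matching idea can be illustrated with a short Python sketch. It is not the code of generateDict.py; the function names and the example ids are made up for illustration, and it assumes that both id formats are separated by `|` as shown above and that the coordinates are identical in both files.

```python
def position_key(te_id):
    """Return (chr, start, end) from a '|'-separated TE id."""
    chrom, start, end = te_id.split("|")[:3]
    return chrom, int(start), int(end)


def build_dictionary(squire_ids, simulated_ids):
    """Map each SQuIRE id to the simulated id at the same locus.

    SQuIRE ids without a counterpart in the simulated set are skipped.
    """
    by_position = {position_key(sim_id): sim_id for sim_id in simulated_ids}
    dictionary = {}
    for squire_id in squire_ids:
        key = position_key(squire_id)
        if key in by_position:
            dictionary[squire_id] = by_position[key]
    return dictionary


if __name__ == "__main__":
    # Made-up example ids that follow the two formats described above.
    squire = ["chr1|1000|1300|AluY:Alu:SINE|120|+"]
    simulated = ["chr1|1000|1300|SINE|Alu|AluY|120|10.5"]
    for squire_id, sim_id in build_dictionary(squire, simulated).items():
        print(squire_id, sim_id, sep="\t")
```

Using the position triple as a dictionary key keeps the lookup independent of how the remaining fields are ordered in either format.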
An .align file generated by RepeatMasker is needed to generate the reference library. The helper script alignToFasta.sh can be used for this. Besides the .align file, the reference genome in FASTA format is also needed.
bash alignToFasta.sh <.align-file> <referenceGenome.fa>