Analysis code and results for the long read benchmark project
Please refer to our preprint for more information
Wirbel*, Hickey*, et al. Discovering Broader Host Ranges and an IS-bound Prophage Class Through Long-Read Metagenomics bioRxiv 2025 https://www.biorxiv.org/content/10.1101/2025.05.09.652943v1
The goal of this repository is to document our analysis approach used in the above-mentioned publication. The analysis involved both raw data processing on a HPC and local data-wrangling/plotting scripts.
Due to the large size of the underlying raw data and computational requirements for many of the analysis steps, this repository does unfortunately not allow for a full, plug-and-play reproduction of the results. For example, the bam files used for identifying structural variants and for computing coverage plots are over 1 TB in size.
However, we tried to document our approach as best as possible and to
include all the raw data whenever we could. To reproduce some of the figures
in our publication, you can save the raw data tables (they are hosted in this
Zenodo repo) into the data folder.
The repo is organized in the following way:
cluster_scripts: Raw data analysis scripts run on our HPCdata: Data folder for large tables (raw data is hosted on Zenodo)src: Data wrangling and plotting scripts, mostly in R. Each of the subfolders in this folder has it's own README for more explanationraw_data: Dummy folder containing links to files/folders on the clustersfiles: Tables/files needed to reproduce our results or main result tablesfigures: Folder for the final figures
-
General preprocessing
This part of the repo describes how we got from raw reads to assemblies and phage predictions. The scripts for this part are in thecluster_scriptsfolder and link to our Nextflow pipeline. These scripts were run on our HPC cluster. -
Short-read vs long-read comparison
The scripts in thesrc/short_read_comparisonfolder are the starting point for many of the other analyses. They include some very general data collating scripts (sequencing_stats.R,count_phages.R,bin_comparison.R) that produce the tables needed as input for the other scripts, but also scripts to compare how well the phage annotation tools agree (compare_tools.R). The scripts in this folder will produce the display items found in Figure 1 and Supplementary Figures 1-3. -
Phage dynamics
The scripts in this folder compares the assemblies of T1 and T2 to evaluate whether phages are stably integrated into the same host or found in different host contexts over time. We used the CheckV companion scripts to cluster phages as well as their host regions within each individual donor. We then combined these clustering results to classify phages as stably integrated or dynamic. The results of these scripts are displayed in Figure 2 and Supplementary Figures 5 & 7. -
Structural variation analysis
In these scripts, we get all the structural variations (SVs) identified by Sniffles2 and check if they overlap phage predictions. Based on these exact boundaries, we can evalulate the base-pair prediction error of prediction tools. There is also a script to visualize read alignemnts in an IGV style within R, that is re-used in the phage dynamics and host range scripts. These scripts produce the results shown in Figure 2 and Supplementary Figure 6. -
Host range
The scripts in this folder dive deeper into phages that are found in multiple host contexts. For this, we clustered all phages across all individuals with standard cutoffs, using the CheckV companion scripts, and then checked into what type of bacterial hosts they were integrated. The scripts in this folder are the basis for Figure 3 and Supplementary Figures 4 & 8. -
IScream phages
The last folder concerns itself with our discovery of IS-bound phages, which are sandwiched by IS30 (like ice cream sandwiches :D ). The analyses here are again a mix of R data wrangling and HPC computing, e.g. for tree construction. These scripts produce the panels in Figure 4 and Supplementary Figures 9 & 10.
If you have any questions/issues, please feel free open an issue in this repo or contact us directly under wirbel[at]stanford.edu or angelah8[at]stanford.edu