long_read_benchmark

Analysis code and results for the long read benchmark project

Please refer to our preprint for more information

Wirbel*, Hickey*, et al. Discovering Broader Host Ranges and an IS-bound Prophage Class Through Long-Read Metagenomics bioRxiv 2025 https://www.biorxiv.org/content/10.1101/2025.05.09.652943v1

Repository organization

The goal of this repository is to document our analysis approach used in the above-mentioned publication. The analysis involved both raw data processing on a HPC and local data-wrangling/plotting scripts.

Due to the large size of the underlying raw data and computational requirements for many of the analysis steps, this repository does unfortunately not allow for a full, plug-and-play reproduction of the results. For example, the bam files used for identifying structural variants and for computing coverage plots are over 1 TB in size.

However, we tried to document our approach as best as possible and to include all the raw data whenever we could. To reproduce some of the figures in our publication, you can save the raw data tables (they are hosted in this Zenodo repo) into the data folder.

The repo is organized in the following way:

cluster_scripts: Raw data analysis scripts run on our HPC
data: Data folder for large tables (raw data is hosted on Zenodo)
src: Data wrangling and plotting scripts, mostly in R. Each of the subfolders in this folder has it's own README for more explanation
raw_data: Dummy folder containing links to files/folders on the clusters
files: Tables/files needed to reproduce our results or main result tables
figures: Folder for the final figures

Types of analyses

General preprocessing
This part of the repo describes how we got from raw reads to assemblies and phage predictions. The scripts for this part are in the cluster_scripts folder and link to our Nextflow pipeline. These scripts were run on our HPC cluster.
Short-read vs long-read comparison
The scripts in the src/short_read_comparison folder are the starting point for many of the other analyses. They include some very general data collating scripts (sequencing_stats.R, count_phages.R, bin_comparison.R) that produce the tables needed as input for the other scripts, but also scripts to compare how well the phage annotation tools agree (compare_tools.R). The scripts in this folder will produce the display items found in Figure 1 and Supplementary Figures 1-3.
Phage dynamics
The scripts in this folder compares the assemblies of T1 and T2 to evaluate whether phages are stably integrated into the same host or found in different host contexts over time. We used the CheckV companion scripts to cluster phages as well as their host regions within each individual donor. We then combined these clustering results to classify phages as stably integrated or dynamic. The results of these scripts are displayed in Figure 2 and Supplementary Figures 5 & 7.
Structural variation analysis
In these scripts, we get all the structural variations (SVs) identified by Sniffles2 and check if they overlap phage predictions. Based on these exact boundaries, we can evalulate the base-pair prediction error of prediction tools. There is also a script to visualize read alignemnts in an IGV style within R, that is re-used in the phage dynamics and host range scripts. These scripts produce the results shown in Figure 2 and Supplementary Figure 6.
Host range
The scripts in this folder dive deeper into phages that are found in multiple host contexts. For this, we clustered all phages across all individuals with standard cutoffs, using the CheckV companion scripts, and then checked into what type of bacterial hosts they were integrated. The scripts in this folder are the basis for Figure 3 and Supplementary Figures 4 & 8.
IScream phages
The last folder concerns itself with our discovery of IS-bound phages, which are sandwiched by IS30 (like ice cream sandwiches :D ). The analyses here are again a mix of R data wrangling and HPC computing, e.g. for tree construction. These scripts produce the panels in Figure 4 and Supplementary Figures 9 & 10.

Contact

If you have any questions/issues, please feel free open an issue in this repo or contact us directly under wirbel[at]stanford.edu or angelah8[at]stanford.edu

Name		Name	Last commit message	Last commit date
Latest commit History 121 Commits
cluster_scripts		cluster_scripts
data		data
figures		figures
files		files
raw_data		raw_data
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
long_read_benchmark.Rproj		long_read_benchmark.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

long_read_benchmark

Repository organization

Types of analyses

Contact

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

License

bhattlab/long_read_benchmark

Folders and files

Latest commit

History

Repository files navigation

long_read_benchmark

Repository organization

Types of analyses

Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages