ModkitOpt

ModkitOpt finds the best --mod-threshold and --filter-threshold parameters to use when running modkit pileup, and the best stoichiometry cutoff for filtering modkit's bedMethyl output, to maximise the precision and recall of your nanopore direct RNA modification calls.

Why use ModkitOpt?

By default, modkit filters out dorado's low-confidence modification calls using a heuristic that estimates a confidence threshold. This heuristic is not based on prediction accuracy. As a result, datasets dominated by low-confidence calls, such as those for rare modifications like pseudouridine, are assigned insufficiently stringent thresholds, which elevates the false discovery rate. Conversely, datasets dominated by high-confidence calls are filtered too stringently, which excludes true sites.

We show in our manuscript (referenced below) that the default modkit performance is frequently suboptimal, producing incorrect results, and that running modkit with systematically identified optimal thresholds rescues modkit performance and substantially improves modification-calling accuracy.

How ModkitOpt works

ModkitOpt takes as input a modBAM file containing dorado per-read modification calls, efficiently and systematically scans 36,000 combinations of modkit thresholds (--filter-threshold and --mod-threshold) and downstream stoichiometry cutoffs, and evaluates predicted sites against validated reference sites to quantify precision and recall. ModkitOpt identifies the optimal threshold combination, and corresponding stoichiometry cutoff, that maximises the F1 score ($F_1 = 2 \cdot \text{precision} \cdot \text{recall} / (\text{precision} + \text{recall})$).
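To make the selection step concrete, here is a minimal sketch of picking the combination with the highest F1 score from a table of results. The three rows and their precision and recall values are hypothetical, made up purely for illustration. The column layout is also an assumption and is not ModkitOpt's actual intermediate format.

```shell
# Hypothetical results, one combination per line:
# mod_threshold filter_threshold stoichiometry_cutoff precision recall
results='0.90 0.50 0.10 0.60 0.70
0.95 0.70 0.20 0.80 0.60
0.99 0.90 0.30 0.95 0.30'

# Compute F1 = 2 * precision * recall / (precision + recall) for each row
# and keep the row with the highest value.
best=$(printf '%s\n' "$results" | awk '
  {
    p = $4; r = $5
    f1 = (p + r > 0) ? 2 * p * r / (p + r) : 0
    if (NR == 1 || f1 > best_f1) { best_f1 = f1; best = $1 " " $2 " " $3 }
  }
  END { printf "%s %.4f\n", best, best_f1 }')

echo "$best"   # prints: 0.95 0.70 0.20 0.6857
```

In this toy table the most stringent row has the highest precision (0.95) but loses on F1 because its recall is low. This is the trade-off that the F1 score is meant to balance.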

Validated reference sites are supplied for mammalian N6-methyladenosine (m6A) and pseudouridine (pseU), which can be used for nanopore datasets that originate from a different biological sample, provided a subset of sites are shared with the reference. For other modification types, a reference set can be supplied by the user.
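As a rough illustration of how precision and recall could be derived from the overlap between predicted and reference sites, the sketch below treats both as plain lists of chromosome and position. The file names and coordinates are hypothetical. The simple set-overlap definition used here is an assumption, and ModkitOpt's own evaluation may differ, for example in how it treats predicted sites that fall outside the reference.

```shell
# Hypothetical site lists, one "chr<TAB>pos" per line.
printf 'chr1\t100\nchr1\t200\nchr2\t300\nchr2\t400\n' > reference_sites.tsv
printf 'chr1\t100\nchr2\t300\nchr3\t500\n' > predicted_sites.tsv

# precision_recall PREDICTED REFERENCE -> prints "precision recall"
precision_recall() {
  tmp=$(mktemp -d)
  sort -u "$1" > "$tmp/pred"
  sort -u "$2" > "$tmp/ref"
  # True positives are the sites present in both sorted lists.
  tp=$(comm -12 "$tmp/pred" "$tmp/ref" | wc -l)
  n_pred=$(wc -l < "$tmp/pred")
  n_ref=$(wc -l < "$tmp/ref")
  rm -rf "$tmp"
  awk -v tp="$tp" -v np="$n_pred" -v nr="$n_ref" 'BEGIN {
    printf "%.4f %.4f\n", (np + 0 > 0 ? tp / np : 0), (nr + 0 > 0 ? tp / nr : 0)
  }'
}

precision_recall predicted_sites.tsv reference_sites.tsv   # prints: 0.6667 0.5000
```

Here two of the three predicted sites are in the reference (precision 2/3), and two of the four reference sites are recovered (recall 2/4).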

Citation

If you use this software, please cite:

Sneddon, Prodic & Eyras. (2025). ModkitOpt: Systematic optimisation of modkit parameters for accurate nanopore-based RNA modification detection. bioRxiv preprint. DOI: 10.64898/2025.12.19.695383

Quick start

To run locally using the example modBAM file we provide in modkitopt/resources, simply:

1. Clone the repository

git clone https://github.com/comprna/modkitopt.git

2. Install dependencies

Nextflow will automatically install all other dependencies using conda (environment defined in modkitopt/env.yaml) the first time that modkitopt is run.
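Before the first run, it can help to confirm that Nextflow and conda are available on your PATH. The helper below is a small sketch and not part of ModkitOpt. The function name missing_tools is hypothetical.

```shell
# Print the name of each tool that is not on PATH.
# Returns non-zero if any tool is missing.
missing_tools() {
  status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  return "$status"
}

missing_tools nextflow conda || echo "Install or load the tools listed above before running ModkitOpt."
```

The modkit executable is passed explicitly with --modkit, so it does not need to be on your PATH.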

3. Run ModkitOpt

Run ModkitOpt on the example modBAM file, which contains m6A calls aligned to the GENCODE human reference transcriptome (GRCh38.p14, release 45). To run this example, you must supply the same reference transcriptome and its corresponding annotation, since these were used to create the modBAM file.

Important note: This example demonstrates how the Nextflow pipeline operates, but the modBAM file is too small for ModkitOpt to provide a meaningful output.

cd /path/to/modkitopt

nextflow run main.nf                                          \
  --modbam           ./resources/example.bam                  \
  --mod_type         m6A                                      \
  --modkit           /path/to/modkit                          \
  --fasta            /path/to/gencode.v45.transcripts.fa      \
  --annotation       /path/to/gencode.v45.annotation.gff3     \
  -profile local

Note: The first time that you run ModkitOpt, Nextflow will create a conda environment and install dependencies - be patient, this will take a few minutes.

Running in HPC environments

We recommend running ModkitOpt in an HPC environment, since modkit is called several times with different thresholds. Nextflow handles submitting modkit jobs so that they can run at the same time, reducing the overall execution time of ModkitOpt.

Dependencies

The modkit binary that you downloaded and extracted in Quick start can simply be copied to your HPC storage location.

Nextflow and conda are often already provided in HPC environments as modules that can simply be loaded. If not, they need to be installed following the guidelines for your system.

Running ModkitOpt

The first time you run ModkitOpt in an HPC environment

Before running ModkitOpt inside a job, first run it on a login node (or a node where internet is available) so that Nextflow can create the conda environment. Once the conda environment is created and the Nextflow pipeline starts executing, you can kill the pipeline and then proceed with submitting your ModkitOpt job.

Tested HPC environments

We have only tested ModkitOpt in a pbspro environment (NCI's gadi).

Your specific HPC system may use different Nextflow directives; these can be updated in modkitopt/profiles/pbspro.config.

While we have written profiles for pbs and slurm, these have not been tested. We welcome contributions from the community to improve these profiles, which can be found in modkitopt/profiles/.

If you aren't familiar with your system's expected Nextflow directives, you can also run ModkitOpt using -profile local.

Specifying your HPC environment details

When running in an HPC environment, you need to specify the following:

1. Your HPC environment profile

This tells Nextflow what type of workload manager it is dealing with. We currently support PBS, PBS Pro and Slurm systems. Specify this with the -profile flag, such as -profile pbs, -profile pbspro or -profile slurm. For NCI's gadi use -profile pbspro. Nextflow automatically handles creating and submitting jobs in each of these environments.

2. Your HPC queue name

You must specify the queue that Nextflow can schedule jobs to using the --hpc_queue flag, such as --hpc_queue normal. The pipeline needs only CPUs, plus up to 30GB of memory for some tasks, so the standard queue should suffice.

3. Your HPC project code

You must specify the HPC project code that Nextflow can schedule jobs to using the --hpc_project flag, such as --hpc_project ab12.

4. Your HPC storage location

You must specify your HPC storage location using the --hpc_storage flag. This should list all storage locations for your input files, conda environment, and the modkit repo, such as --hpc_storage gdata/ab12+gdata/cd34+scratch/ab12.

Command example

Briefly, the required input files are:

  1. modBAM file output by dorado
  2. FASTA and GTF/GFF3 annotation for modkit to use (the same FASTA as you provided to dorado, with corresponding GTF/GFF3 annotation).
  3. TSV file containing ground truth sites (optional if your nanopore dataset is mammalian and your modification type is m6A or pseU)

See Command details for more information.

nextflow run main.nf                                           \
  --modbam          /path/to/modbam.bam                        \
  --mod_type        m6A                                        \
  --modkit          /path/to/modkit                            \
  --fasta           /path/to/ref.fa                            \
  --annotation      /path/to/annotation.gff3                   \
  -profile          pbspro                                     \
  --hpc_queue       normal                                     \
  --hpc_project     ab12                                       \
  --hpc_storage     gdata/ab12

Nextflow crashing? Try running with local profile

If you aren't familiar with your system's expected Nextflow directives, or Nextflow is having trouble creating jobs, you can also run ModkitOpt using -profile local in a job script.

Using the local profile means that Nextflow won't spawn jobs to run processes in parallel, so it may take a little longer to run but will produce the same results. Using this approach, your job needs at least 8 CPUs, at least 30GB of RAM and at least 5GB of job filesystem disk space.

cd /path/to/modkitopt

nextflow run main.nf                                          \
  --modbam           ./resources/example.bam                  \
  --mod_type         m6A                                      \
  --modkit           /path/to/modkit                          \
  --fasta            /path/to/gencode.v45.transcripts.fa      \
  --annotation       /path/to/gencode.v45.annotation.gff3     \
  -profile local
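The command above can be wrapped in a job script that requests the resources listed in this section. The sketch below writes such a script for a PBS Pro system in the style used on NCI's gadi. The file name modkitopt_local.pbs and the two-hour walltime are assumptions, and directive names (particularly -P, jobfs and storage) vary between sites, so check your system's documentation before using it.

```shell
# Write a hypothetical PBS Pro job script that runs ModkitOpt with the local profile.
cat > modkitopt_local.pbs <<'EOF'
#!/bin/bash
#PBS -P ab12
#PBS -q normal
#PBS -l ncpus=8
#PBS -l mem=30GB
#PBS -l jobfs=5GB
#PBS -l walltime=02:00:00
#PBS -l storage=gdata/ab12

# Load or activate Nextflow and conda here, following your system's guidelines.

cd /path/to/modkitopt

nextflow run main.nf \
  --modbam      ./resources/example.bam \
  --mod_type    m6A \
  --modkit      /path/to/modkit \
  --fasta       /path/to/gencode.v45.transcripts.fa \
  --annotation  /path/to/gencode.v45.annotation.gff3 \
  -profile local
EOF

echo "Wrote modkitopt_local.pbs"
```

On a PBS system the script would then typically be submitted with qsub modkitopt_local.pbs.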

Resuming an interrupted run

If your run gets interrupted, Nextflow automatically supports checkpointing and resuming runs. Simply add -resume to the Nextflow command that didn't complete and run again!

For example:

nextflow run main.nf                                           \
  --modbam          /path/to/modbam.bam                        \
  --mod_type        m6A                                        \
  --modkit          /path/to/modkit                            \
  --fasta           /path/to/ref.fa                            \
  --annotation      /path/to/annotation.gff3                   \
  -profile          pbspro                                     \
  --hpc_queue       normal                                     \
  --hpc_project     ab12                                       \
  --hpc_storage     gdata/ab12                                 \
  -resume

Command details

Usage:
The typical command structure for running the pipeline is as follows:
nextflow run main.nf --modbam sample.bam
                      --mod_type <m6A|pseU|m5C|inosine>
                      --modkit /path/to/modkit
                      --fasta /path/to/transcriptome.fa
                      --annotation /path/to/annotation.gff3
                      -profile <local|pbs|pbspro|slurm>

Mandatory arguments:
  --modbam             .bam file containing per-read modification calls
  --mod_type           Modification type (options: m6A, pseU, m5C, inosine)
  --modkit             Path to modkit executable
  --fasta              Path to reference transcriptome
  --annotation         Path to corresponding reference annotation (.gtf or .gff3)
  -profile             Execution environment (options: local, pbs, pbspro, slurm)

Mandatory arguments if running on an HPC system (-profile is pbs, pbspro or slurm):
  --hpc_queue          Name of the queue that Nextflow can schedule jobs to (e.g., 'normal')
  --hpc_project        HPC project code that Nextflow can schedule jobs to (e.g., 'ab12')
  --hpc_storage        HPC storage location that outputs can be written to (e.g., 'gdata/ab12')

Optional arguments:
  --truth_sites        .tsv file containing known modification sites (genomic 1-based coordinates, expected columns 1 and 2: [chr, pos], mandatory if mod_type is m5C or inosine)
  --help               This usage statement
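To illustrate the --truth_sites format described above, the sketch below writes a tiny example file and runs a basic sanity check on it. The coordinates and the chromosome naming (chr1, chr2) are made up. The check assumes the file has no header row, which the usage statement does not specify, so adjust it if your file has one.

```shell
# Hypothetical truth sites: column 1 = chromosome, column 2 = 1-based genomic position.
printf 'chr1\t12345\nchr2\t67890\n' > truth_sites.tsv

# Check that every row has at least two tab-separated columns and that
# column 2 is a positive integer (1-based positions cannot be 0).
check_truth_sites() {
  awk -F'\t' '
    NF < 2 || $2 !~ /^[1-9][0-9]*$/ {
      printf "line %d looks malformed: %s\n", NR, $0
      bad = 1
    }
    END { exit bad }
  ' "$1"
}

check_truth_sites truth_sites.tsv && echo "truth_sites.tsv looks OK"
```

A file that passes this check could then be supplied with --truth_sites /path/to/truth_sites.tsv.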

Estimated run-time

We tested the execution time of ModkitOpt on an HPC system (NCI's gadi) with a PBSPro scheduler and the default resource settings:

  • 8 CPUs & 8GB RAM for samtools filter and samtools sort
  • 8 CPUs & 30GB RAM for modkit pileup
  • 4 CPUs & 8GB RAM for samtools index
  • 1 CPU, 30GB RAM and 5GB job filesystem disk space for converting transcriptomic to genomic coordinates
  • 1 CPU with 8GB RAM for all other tasks
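
| modBAM file size | Mod type | Cell line | Execution time |
|------------------|----------|-----------|----------------|
| 7.4GB            | pseU     | HeLa      | 22 mins        |
| 10.8GB           | pseU     | HepG2     | 27 mins        |
| 12.4GB           | pseU     | K562      | 30 mins        |
| 13.1GB           | m6A      | HepG2     | 29 mins        |
| 13.8GB           | m6A      | K562      | 31 mins        |
| 21GB             | m6A      | HEK293T   | 33 mins        |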

Tested environments and software versions

The following configurations have been tested:

| Execution environment | Modkit version | Nextflow version |
|-----------------------|----------------|------------------|
| pbspro (NCI's gadi)   | v0.6.0         | 24.04.5          |
| local                 | v0.6.0         | 25.10.0          |

Advanced use

Overriding the default Nextflow parameters

The default Nextflow parameters, contained in nextflow.config and profiles/, can be overridden on the command line.

For example, to increase the number of CPUs used for modkit pileup, add --pileup_cpus 16 to your Nextflow command.
