ModkitOpt finds the best --mod-threshold and --filter-threshold parameters to use when running modkit pileup, and the best stoichiometry cutoff for filtering modkit's bedMethyl output, to maximise the precision and recall of your nanopore direct RNA modification calls.
By default, modkit filters out dorado's low-confidence modification calls using a heuristic to estimate the confidence threshold. The heuristic is not based on prediction accuracy, so datasets dominated by low-confidence calls, such as those for rare modifications like pseudouridine, get assigned insufficiently stringent thresholds, resulting in elevated false discovery rates, while datasets dominated by high-confidence calls are filtered too stringently, excluding true sites.
We show in our manuscript (referenced below) that the default modkit performance is frequently suboptimal, producing incorrect results, and that running modkit with systematically identified optimal thresholds rescues modkit performance and substantially improves modification-calling accuracy.
ModkitOpt takes as input a modBAM file containing dorado per-read modification calls, efficiently and systematically scans 36,000 combinations of modkit thresholds (--filter-threshold and --mod-threshold) and downstream stoichiometry cutoffs, and evaluates predicted sites against validated reference sites to quantify precision and recall. ModkitOpt identifies the optimal threshold combination, and corresponding stoichiometry cutoff, that maximises the F1 score (the harmonic mean of precision and recall).
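Here, predicted sites are compared against the validated reference sites: a true positive (TP) is a predicted site present in the reference, a false positive (FP) is a predicted site absent from the reference, and a false negative (FN) is a reference site that is not predicted. The metrics then follow their standard definitions:

precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 * precision * recall / (precision + recall)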
Validated reference sites are supplied for mammalian N6-methyladenosine (m6A) and pseudouridine (pseU), which can be used for nanopore datasets that originate from a different biological sample, provided a subset of sites are shared with the reference. For other modification types, a reference set can be supplied by the user.
If you use this software, please cite:
Sneddon, Prodic & Eyras. (2025). ModkitOpt: Systematic optimisation of modkit parameters for accurate nanopore-based RNA modification detection. bioRxiv preprint. DOI: 10.64898/2025.12.19.695383
To run locally using the example modBAM file we provide in modkitopt/resources, simply:
1. Clone the repository
git clone https://github.com/comprna/modkitopt.git
2. Install dependencies
- modkit >= v0.6.0
  - Download modkit_vXYZ.tar.gz from the modkit release page
  - Extract the archive contents:
    tar -xvzf modkit_vXYZ.tar.gz
- conda (Miniconda installation guide)
- nextflow (installation guide)
Nextflow will automatically install all other dependencies using conda (environment defined in modkitopt/env.yaml) the first time that modkitopt is run.
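Optionally, before moving on you can check that each dependency is available from your shell (the modkit path below is a placeholder for wherever you extracted the binary):

# Quick sanity check that the dependencies are installed and on your PATH
/path/to/modkit --version
conda --version
nextflow -version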
3. Run ModkitOpt
Run ModkitOpt on an example modBAM file containing m6A calls aligned to the human reference transcriptome (GRCh38.p14, release 45). To run this example, you need to use the same GENCODE human reference transcriptome and corresponding annotation (GRCh38.p14, release 45) that was used to create the modBAM file.
Important note: This example demonstrates how the Nextflow pipeline operates, but the modBAM file is too small for ModkitOpt to provide a meaningful output.
cd /path/to/modkitopt
nextflow run main.nf \
--modbam ./resources/example.bam \
--mod_type m6A \
--modkit /path/to/modkit \
--fasta /path/to/gencode.v45.transcripts.fa \
--annotation /path/to/gencode.v45.annotation.gff3 \
-profile local
Note: The first time that you run ModkitOpt, Nextflow will create a conda environment and install dependencies - be patient, this will take a few minutes.
We recommend running ModkitOpt in an HPC environment, since modkit is called several times with different thresholds. Nextflow handles submitting modkit jobs so that they can run at the same time, reducing the overall execution time of ModkitOpt.
The modkit binary that you downloaded and extracted in Quick start can simply be copied to your HPC storage location.
Nextflow and conda are often already provided in HPC environments as modules that can simply be loaded. If not, they need to be installed following the guidelines for your system.
Before running ModkitOpt inside a job, first run it on a login node (or a node where internet is available) so that Nextflow can create the conda environment. Once the conda environment is created and the Nextflow pipeline starts executing, you can kill the pipeline and then proceed with submitting your ModkitOpt job.
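For example, using the Quick start command (substitute your own files as needed):

# On a login node (or any node with internet access), launch ModkitOpt so that
# Nextflow builds the conda environment defined in modkitopt/env.yaml
cd /path/to/modkitopt
nextflow run main.nf \
--modbam ./resources/example.bam \
--mod_type m6A \
--modkit /path/to/modkit \
--fasta /path/to/gencode.v45.transcripts.fa \
--annotation /path/to/gencode.v45.annotation.gff3 \
-profile local
# Once the conda environment exists and the pipeline starts executing, press Ctrl+C
# to kill the run; the cached environment is reused when you submit your ModkitOpt job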
We have only tested ModkitOpt in a pbspro environment (NCI's gadi).
Your specific HPC system may use different Nextflow directives; these can be updated in modkitopt/profiles/pbspro.config.
While we have written profiles for pbs and slurm, these have not been tested. We welcome contributions from the community to improve these profiles, which can be found in modkitopt/profiles/.
If you aren't familiar with your system's expected Nextflow directives, you can also run ModkitOpt using -profile local.
When running in an HPC environment, you need to specify these things:
1. Your HPC environment profile
This tells Nextflow what type of workload manager it is dealing with. We currently support PBS, PBS Pro and Slurm systems. Specify this with the -profile flag, such as -profile pbs, -profile pbspro or -profile slurm. For NCI's gadi use -profile pbspro. Nextflow automatically handles creating and submitting jobs in each of these environments.
2. Your HPC queue name
You must specify the queue that Nextflow can schedule jobs to using the --hpc_queue flag, such as --hpc_queue normal. Since ModkitOpt only requires CPUs (no GPUs), and up to 30GB of memory for some tasks, the standard queue should suffice.
3. Your HPC project code
You must specify the HPC project code that Nextflow can schedule jobs to using the --hpc_project flag, such as --hpc_project ab12.
4. Your HPC storage location
You must specify your HPC storage location using the --hpc_storage flag. This should list all storage locations for your input files, conda environment, and the modkit repo, such as --hpc_storage gdata/ab12+gdata/cd34+scratch/ab12.
Briefly, the required input files are:
- modBAM file output by dorado
- FASTA and GTF/GFF3 annotation for modkit to use (the same FASTA as you provided to dorado, with corresponding GTF/GFF3 annotation).
- TSV file containing ground truth sites (optional if your nanopore dataset is mammalian and your modification type is m6A or pseU)
See Command details for more information.
nextflow run main.nf \
--modbam /path/to/modbam.bam \
--mod_type m6A \
--modkit /path/to/modkit \
--fasta /path/to/ref.fa \
--annotation /path/to/annotation.gff3 \
-profile pbspro \
--hpc_queue normal \
--hpc_project ab12 \
--hpc_storage gdata/ab12
If you aren't familiar with your system's expected Nextflow directives, or Nextflow is having trouble creating jobs, you can also run ModkitOpt using -profile local in a job script.
Using the local profile means that Nextflow won't spawn jobs to run processes in parallel, so it may take a little longer to run but will produce the same results. Using this approach, your job needs at least 8 CPUs, at least 30GB of RAM and at least 5GB of job filesystem disk space.
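As a rough sketch only (assuming a PBS Pro scheduler such as NCI's gadi; the queue, project code, storage list, walltime and module names are placeholders that you will need to adapt to your system), a job script for this approach could begin as follows, and then run the ModkitOpt command shown below:

#!/bin/bash
#PBS -q normal
#PBS -P ab12
#PBS -l ncpus=8
#PBS -l mem=30GB
#PBS -l jobfs=5GB
#PBS -l walltime=04:00:00
#PBS -l storage=gdata/ab12
# Load Nextflow and conda if your site provides them as environment modules
# (module names vary between systems)
module load nextflow
# ...then run the ModkitOpt command below with -profile local so that all
# processes execute inside this single job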
cd /path/to/modkitopt
nextflow run main.nf \
--modbam ./resources/example.bam \
--mod_type m6A \
--modkit /path/to/modkit \
--fasta /path/to/gencode.v45.transcripts.fa \
--annotation /path/to/gencode.v45.annotation.gff3 \
-profile local
If your run gets interrupted, Nextflow automatically supports checkpointing and resuming runs. Simply add -resume to the Nextflow command that didn't complete and run again!
For example:
nextflow run main.nf \
--modbam /path/to/modbam.bam \
--mod_type m6A \
--modkit /path/to/modkit \
--fasta /path/to/ref.fa \
--annotation /path/to/annotation.gff3 \
-profile pbspro \
--hpc_queue normal \
--hpc_project ab12 \
--hpc_storage gdata/ab12 \
-resume
Usage:
The typical command structure for running the pipeline is as follows:
nextflow run main.nf --modbam sample.bam \
--mod_type <m6A|pseU|m5C|inosine> \
--modkit /path/to/modkit \
--fasta /path/to/transcriptome.fa \
--annotation /path/to/annotation.gff3 \
-profile <local|pbs|pbspro|slurm>
Mandatory arguments:
--modbam .bam file containing per-read modification calls
--mod_type Modification type (options: m6A, pseU, m5C, inosine)
--modkit Path to modkit executable
--fasta Path to reference transcriptome
--annotation Path to corresponding reference annotation (.gtf or .gff3)
-profile Execution environment (options: local, pbs, pbspro, slurm)
Mandatory arguments if running on an HPC system (-profile is pbs, pbspro or slurm):
--hpc_queue Name of the queue that Nextflow can schedule jobs to (e.g., 'normal')
--hpc_project HPC project code that Nextflow can schedule jobs to (e.g., 'ab12')
--hpc_storage HPC storage location that outputs can be written to (e.g., 'gdata/ab12')
--help This usage statement
Optional arguments:
--truth_sites .tsv file containing known modification sites (genomic 1-based coordinates; expected columns 1 and 2: [chr, pos]; mandatory if --mod_type is m5C or inosine)
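For example, a --truth_sites file could look like the following (illustrative placeholder rows; tab-separated, with the chromosome in column 1 and the 1-based genomic position in column 2):

chr1	1234567
chr2	2345678
chrX	3456789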
We tested the execution time of ModkitOpt on an HPC system (NCI's gadi) with a PBSPro scheduler and the default resource settings:
- 8 CPUs & 8GB RAM for samtools filter and samtools sort
- 8 CPUs & 30GB RAM for modkit pileup
- 4 CPUs & 8GB RAM for samtools index
- 1 CPU, 30GB RAM and 5GB job filesystem disk space for converting transcriptomic to genomic coordinates
- 1 CPU with 8GB RAM for all other tasks
| modBAM file size | Mod type | Cell line | Execution time |
|---|---|---|---|
| 7.4GB | pseU | HeLa | 22 mins |
| 10.8GB | pseU | HepG2 | 27 mins |
| 12.4GB | pseU | K562 | 30 mins |
| 13.1GB | m6A | HepG2 | 29 mins |
| 13.8GB | m6A | K562 | 31 mins |
| 21GB | m6A | HEK293T | 33 mins |
The following configurations have been tested:
| Execution environment | Modkit version | Nextflow version |
|---|---|---|
| pbspro (NCI's gadi) | v0.6.0 | 24.04.5 |
| local | v0.6.0 | 25.10.0 |
The default Nextflow parameters, contained in nextflow.config and profiles/, can be overridden on the command line.
For example, to increase the number of CPUs used for modkit pileup, you simply add --pileup_cpus 16 to your Nextflow command.
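For instance, extending the HPC command from above (--pileup_cpus is the override illustrated in this README; the names of other tunable parameters can be found in nextflow.config and profiles/):

nextflow run main.nf \
--modbam /path/to/modbam.bam \
--mod_type m6A \
--modkit /path/to/modkit \
--fasta /path/to/ref.fa \
--annotation /path/to/annotation.gff3 \
-profile pbspro \
--hpc_queue normal \
--hpc_project ab12 \
--hpc_storage gdata/ab12 \
--pileup_cpus 16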
