This tool removes duplicate read pairs from paired-end FASTQ files. It can handle a random index incorporated into the PCR product during library construction.
It supports multiple backends for duplicate detection: in-memory, Bloom filter, and SQLite database.
- Handles paired-end FASTQ files (gzipped).
- Supports duplicate detection using:
  - In-memory hash set (fastest, but high RAM usage).
  - Bloom filter (low memory, allows false positives).
  - SQLite (disk-based, low memory).
- Barcode handling:
  - From a separate index FASTQ file (`--index`).
  - From the read name itself (`--barcode-in-name`).
  - Pasted to one of the reads.
You need the following libraries:

- zlib (for .gz FASTQ support)
- sqlite3
- openssl (for SHA-256 hashing)
- a C++17 compiler (`g++` or `clang++`)

On Debian/Ubuntu:

```
sudo apt-get update
sudo apt-get install g++ make zlib1g-dev libsqlite3-dev libssl-dev
make
```

On macOS (with Homebrew):

```
brew install zlib sqlite openssl
make
```

Usage:

```
./dedup --read1 R1.fastq.gz --read2 R2.fastq.gz [options]
```

Options:

- `--read1 <read1file>`: Input FASTQ (read 1, gzipped).
- `--read2 <read2file>`: Input FASTQ (read 2, gzipped).
- `--index <indexfile>`: Optional index FASTQ file.
- `--barcode-in-name`: Extract the barcode from the sequence name in read 1.
- `--use-memory`: Use an in-memory hash set (fast, high RAM).
- `--use-bloom`: Use a Bloom filter (low RAM, some false positives).
- `--use-sqlite`: Use an SQLite database (default; low RAM, disk usage).
Here is an example of usage with the barcodes in the read names, using a Bloom filter:

```
./dedup --read1 R1.fastq.gz --read2 R2.fastq.gz --barcode-in-name --use-bloom
```

Output files will be written as:

```
nodup_<read1file>.fastq.gz
nodup_<read2file>.fastq.gz
```
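Under the hood, duplicate detection reduces each read pair (plus its barcode, if any) to a single fingerprint; the requirements above mention OpenSSL's SHA-256 for this. A minimal sketch of the idea, with `std::hash` standing in for SHA-256 (the function name and separator are illustrative, not the tool's actual code):

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Sketch only: the real tool hashes with SHA-256 via OpenSSL; std::hash
// stands in here so the example needs only the standard library.
std::size_t pair_fingerprint(const std::string& seq1,
                             const std::string& seq2,
                             const std::string& barcode) {
    // Join with a separator so ("AB","C") and ("A","BC") hash differently.
    return std::hash<std::string>{}(seq1 + "\x1f" + seq2 + "\x1f" + barcode);
}
```

Two pairs are considered duplicates when their fingerprints match, which is why adding the random barcode to the fingerprint lets PCR duplicates be told apart from distinct molecules with identical sequences.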
The program can take into account random nucleotide barcodes incorporated at the PCR step of library construction to identify PCR duplicates, as in the 3RAD protocol. There are two options for feeding this information to the program.
If the random index is in a separate FASTQ file (with reads in the same order as in the files that contain the reads), you can run the program using this command:

```
./dedup --read1 R1.fastq.gz --read2 R2.fastq.gz --index <indexfile>
```

Alternatively, the random index can be placed in the read name. For instance, in one sequence of the read 1 FASTQ file:
```
@A01114:199:HGJMGDSXF:2:1101:1949:1016:GTGGGGGG 1:N:0:TGAGGTGT
ANCGTTGGCTAGACTGAAATAACTAGACGTCTAAGTCTAGGTCTTCTCTAGGTCGTCTTCAGGTGAACAACGAGGTCCTACAGAAGATGTTGAGATAAGAGAGGTATAAAACCGAAATAATGATTTAGAACCCGCAAAAGTTTTTGAAATA
+
F#:F:,F:FFF,F,F,FFF,FF:F,:FFF:F:F:FFFFFFFFF:FFFFF:FFFF,FFF,FFF,,FFF:FFF:F:FFFFFF:FFFF,FF,F,FFFF:FFFFFF,:FFFFFFFF,FFFF:FFFFF:FFF:FF:F::FFFFF,F::FFFF,FFF
```
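A minimal sketch of extracting such a barcode from the name line (the helper name is hypothetical, not the tool's actual code):

```cpp
#include <string>

// Extract the barcode appended to a FASTQ name line: the field after the
// last ':' in the portion of the name before the first space.
std::string barcode_from_name(const std::string& header) {
    // Keep only the part before the first space (npos keeps the whole string).
    std::string name = header.substr(0, header.find(' '));
    std::size_t pos = name.rfind(':');
    return (pos == std::string::npos) ? "" : name.substr(pos + 1);
}
```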
The index is before the space in the sequence name (the line beginning with @). In this example, the index is GTGGGGGG. If the barcodes are presented this way in your read 1 (forward) files, you can run the program like this:

```
./dedup --read1 R1.fastq.gz --read2 R2.fastq.gz --barcode-in-name
```

Three options are available for the deduplication method.
With this option, the reads and indexes are stored in memory, and each new sequence is checked against this set to see whether it is a duplicate (i.e., is already in memory). This approach is fast and exact (it makes no errors), but uses ~50–100 bytes per read stored; it should be used with datasets of up to ~50 M reads. Usage:

```
./dedup --read1 R1.fastq.gz --read2 R2.fastq.gz --barcode-in-name --use-memory
```

The Bloom filter is an approach often used in bioinformatics. It is a probabilistic method that trades exactness for probabilistic membership (false positives are allowed, but there are no false negatives). This saves memory, but a few unique reads may be wrongly flagged as duplicates. The false positive rate is fixed at 0.1% in the program, but this can be adjusted in the code. Usage:
```
./dedup --read1 R1.fastq.gz --read2 R2.fastq.gz --barcode-in-name --use-bloom
```

If memory is really a problem, it is possible to store the reads in a database on disk. This requires very little RAM, but it makes the program run much more slowly. Usage:

```
./dedup --read1 R1.fastq.gz --read2 R2.fastq.gz --barcode-in-name --use-sqlite
```

- Memory mode: Exact and fastest, but uses ~50–100 bytes per read stored. Works up to ~50 M reads.
- Bloom filter: Memory efficient, but trades exactness for probabilistic membership (false positives allowed, no false negatives); a few unique reads may be wrongly flagged as duplicates. The false positive rate is adjustable (default 0.1%).
- SQLite: Saves the reads in an SQLite database on disk. This is safe for very large datasets, but slower due to disk I/O.
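To illustrate the Bloom filter trade-off, here is a toy sketch that sizes the bit array from the standard formulas m = -n ln p / (ln 2)² and k = (m/n) ln 2 for n expected reads at false-positive rate p. The tool itself relies on Arash Partow's `bloom_filter.hpp`, so everything below (class name, salted `std::hash` in place of proper hash functions) is illustrative only:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy Bloom filter: k salted hashes over an m-bit array.
class Bloom {
public:
    Bloom(std::size_t n, double p) {
        // m = ceil(-n ln p / (ln 2)^2), k = round((m/n) ln 2)
        m_ = static_cast<std::size_t>(std::ceil(
            -static_cast<double>(n) * std::log(p) / (std::log(2.0) * std::log(2.0))));
        k_ = std::max<std::size_t>(
            1, static_cast<std::size_t>(std::lround(
                   static_cast<double>(m_) / static_cast<double>(n) * std::log(2.0))));
        bits_.assign(m_, false);
    }

    // Insert a key; returns true if it was (probably) already present.
    bool test_and_insert(const std::string& key) {
        bool seen = true;
        for (std::size_t i = 0; i < k_; ++i) {
            // Crude salting stands in for k independent hash functions.
            std::size_t h = std::hash<std::string>{}(key + static_cast<char>('A' + i)) % m_;
            if (!bits_[h]) { seen = false; bits_[h] = true; }
        }
        return seen;  // never false for a key already inserted (no false negatives)
    }

private:
    std::size_t m_, k_;
    std::vector<bool> bits_;
};
```

The "no false negatives" property follows from the fact that bits are only ever set, never cleared: a key that was inserted will always find all of its k bits set.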
Simon Joly, 2025, for the main program. Developed with support from ChatGPT, but the whole program was validated by the author.
Arash Partow, 2000, for the Open Bloom Filter (bloom_filter.hpp)
Distributed under the MIT License. Feel free to use, modify, and distribute.