TRAKD is a high-performance bioinformatics toolkit for analyzing repetitive sequences in genomes using k-mer decomposition. It identifies tandem repeats, dispersed repeats, transposable elements, and analyzes their distribution patterns across genomic sequences. The tools work synergistically to distinguish between different types of repetitive DNA.
- Fast k-mer indexing with multi-threaded processing
- Tandem repeat detection using locus clustering and distance patterns
- Dispersed repeat identification for transposable elements and mobile DNA
- Distance entropy analysis to distinguish repeat types
- Intelligent filtering using cross-tool information
- Context-based similarity for grouping repeat families
- Memory-efficient processing of large genomes
- Cross-platform support (Linux, macOS, Windows)
- Visualization tools for karyotype and circular plots
- C++17 compatible compiler (g++ 7.0+ or clang++ 5.0+)
- POSIX threads (pthread)
- Make
# Clone the repository
git clone https://github.com/aglabx/TRAKD.git
cd TRAKD
# Build all tools
make
# Or build with debug symbols
make debug
The compiled executables will be in the bin/ directory.
# Analyze a genome for all repeat types
make
bin/kmer_analyzer genome.fasta kmers_top100.txt kmers.bin contigs.cidx
samtools faidx genome.fasta
bin/LocusBedGeneratorDetailed kmers.bin genome.fasta.fai tandem_repeats.bed
bin/distance_analyzer_detailed kmers.bin distances.tsv
bin/dispersed_repeat_finder kmers.bin contigs.cidx dispersed_repeats.bed distances.tsv tandem_repeats.bed
# Visualize results
python visualize_karyotype.py tandem_repeats.bed genome.fasta.fai karyotype.png
TRAKD consists of five complementary tools that work together in a pipeline. The tools can be used independently or in combination for enhanced results:
Analyzes k-mer frequencies and positions in FASTA sequences.
bin/kmer_analyzer <input.fasta> <output_top100.txt> <output_full.bin> <output_contigs.cidx> [cache.kidx] [threads]
Parameters:
- input.fasta: Multi-FASTA file to analyze
- output_top100.txt: Text file with top 100 most frequent k-mers
- output_full.bin: Binary file containing all k-mer data
- output_contigs.cidx: Contig index file
- cache.kidx: (Optional) Cache file for faster re-analysis
- threads: (Optional) Number of threads (default: all available)
Output format (top100.txt):
AAAAAAAAAAAAA 150000 100,100,100,200,300...
TTTTTTTTTTTTT 120000 50,50,100,150...
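The top-100 file can be read back with a few lines of Python. The sketch below assumes whitespace-separated columns (k-mer, total frequency, comma-separated integer list) exactly as in the example above; the function name `parse_top_kmer_line` is hypothetical, and the third column is returned as plain integers without interpreting what they measure.

```python
def parse_top_kmer_line(line):
    """Parse one 'kmer tf v1,v2,...' line into (kmer, tf, values).

    Assumes the three-column layout shown in the README example.
    """
    parts = line.split()
    kmer, tf = parts[0], int(parts[1])
    value_field = parts[2] if len(parts) > 2 else ""
    # Drop a trailing ellipsis such as the one in the README example.
    value_field = value_field.rstrip(".")
    values = [int(v) for v in value_field.split(",") if v]
    return kmer, tf, values


example = "AAAAAAAAAAAAA 150000 100,100,100,200,300..."
print(parse_top_kmer_line(example))
```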
Identifies genomic loci where k-mers cluster together.
bin/locus_bed_generator <input.bin> <reference.fai> <output.bed> [min_tf] [locus_gap] [min_locus_kmers] [threads]
Parameters:
- input.bin: Binary file from kmer_analyzer
- reference.fai: FASTA index file (.fai)
- output.bed: Output BED file with loci
- min_tf: (Optional) Minimum k-mer frequency (default: 100)
- locus_gap: (Optional) Maximum gap within a locus (default: 10000)
- min_locus_kmers: (Optional) Minimum k-mers per locus (default: 2)
- threads: (Optional) Number of threads
Output format (BED):
track name="KmerLoci" description="K-mer loci generated by TRAKD"
chr1 1000 2000 locus_1_len_1000
chr1 5000 7500 locus_2_len_2500
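As a quick sanity check on a loci file, the covered length per chromosome can be summed from the first three BED columns. This is a minimal sketch, assuming whitespace- or tab-separated columns and a leading track line as in the example above; `locus_span_by_chrom` is a hypothetical helper, not part of TRAKD.

```python
from collections import defaultdict


def locus_span_by_chrom(lines):
    """Sum (end - start) per chromosome, skipping BED header lines."""
    totals = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end = line.split()[:3]
        totals[chrom] += int(end) - int(start)
    return dict(totals)


example = [
    'track name="KmerLoci" description="K-mer loci generated by TRAKD"',
    "chr1 1000 2000 locus_1_len_1000",
    "chr1 5000 7500 locus_2_len_2500",
]
print(locus_span_by_chrom(example))
```

Overlapping loci would be counted twice by this sum; merge intervals first if that matters for your data.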
Advanced locus detection with detailed statistics and intelligent merging.
bin/LocusBedGeneratorDetailed <input.bin> <reference.fai> <output.bed> [min_tf] [locus_gap] [min_locus_kmers] [inter_overlap] [jaccard_thresh] [min_length] [threads]
Additional parameters:
- inter_overlap: Strong overlap threshold for merging (default: 0.9)
- jaccard_thresh: Jaccard similarity threshold (default: 0.5)
- min_length: Minimum locus length to output (default: 100)
Output format (BED with detailed annotations):
track name="DetailedKmerLoci" description="Detailed k-mer loci with statistics"
chr1 1000 2000 AAAAAAAAAAAAA|tf:150000|local_tf:1000|entropy:2.5|loci:10|max_size:5000|n_kmers:25|jaccard_min:0.3|med:0.5|max:0.8
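The annotation column packs a k-mer and several `key:value` statistics into one pipe-separated string. A minimal sketch for unpacking it follows; it assumes every field after the first has the `key:value` shape shown in the example, and `parse_detailed_name` is a hypothetical name. The keys are passed through unchanged rather than interpreted.

```python
def parse_detailed_name(name):
    """Split 'KMER|key:value|...' into (kmer, {key: float(value)})."""
    kmer, *pairs = name.split("|")
    stats = {}
    for pair in pairs:
        key, _, value = pair.partition(":")
        stats[key] = float(value)
    return kmer, stats


name = ("AAAAAAAAAAAAA|tf:150000|local_tf:1000|entropy:2.5|loci:10|"
        "max_size:5000|n_kmers:25|jaccard_min:0.3|med:0.5|max:0.8")
kmer, stats = parse_detailed_name(name)
print(kmer, stats["tf"], stats["entropy"])
```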
Analyzes distance distributions between k-mer occurrences.
bin/distance_analyzer_detailed <input.bin> <output.tsv> [min_tf] [min_frac] [locus_gap] [threads]
Parameters:
- input.bin: Binary file from kmer_analyzer
- output.tsv: Output TSV file with analysis
- min_tf: (Optional) Minimum k-mer frequency (default: 0)
- min_frac: (Optional) Minimum distance fraction (default: 0.0)
- locus_gap: (Optional) Locus gap threshold (default: 10000)
- threads: (Optional) Number of threads
Output format (TSV):
kmer kmer_tf dist_entropy locus_count max_locus_size distances_summary
AAAAAAAAAAAAA 150000 2.5 10 5000 100:5000:0.033|200:3000:0.020|...
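The `distances_summary` column appears to hold `distance:count:fraction` triples separated by `|`. That reading is an assumption based on the example row, where 5000 / 150000 is about 0.033. The sketch below parses the column under that assumption; `parse_distances_summary` is a hypothetical helper.

```python
def parse_distances_summary(field):
    """Parse 'd:count:frac|d:count:frac|...' into a list of tuples.

    Items that are not three-part triples (such as a trailing '...')
    are skipped.
    """
    triples = []
    for item in field.split("|"):
        parts = item.split(":")
        if len(parts) != 3:
            continue
        distance, count, fraction = parts
        triples.append((int(distance), int(count), float(fraction)))
    return triples


summary = "100:5000:0.033|200:3000:0.020|..."
for distance, count, fraction in parse_distances_summary(summary):
    print(distance, count, fraction)
```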
Identifies dispersed repeats with similar k-mer contexts (e.g., transposable elements). Can use outputs from other tools for better seed selection and region exclusion.
bin/dispersed_repeat_finder <input.bin> <contigs.cidx> <output.bed> [distance.tsv] [exclusion.bed] [min_tf] [window] [similarity] [min_inst] [max_entropy] [threads]
Parameters:
- input.bin: Binary file from kmer_analyzer
- contigs.cidx: Contig index file from kmer_analyzer
- output.bed: Output BED file with dispersed repeats
- distance.tsv: (Optional) Distance analysis file from distance_analyzer_detailed, used for entropy filtering (use 'none' to skip)
- exclusion.bed: (Optional) BED file of tandem repeat regions from LocusBedGeneratorDetailed to exclude (use 'none' to skip)
- min_tf: (Optional) Minimum k-mer frequency (default: 100)
- window: (Optional) Context window size in bp (default: 500)
- similarity: (Optional) Minimum context similarity, 0-1 (default: 0.7)
- min_inst: (Optional) Minimum instances per repeat (default: 5)
- max_entropy: (Optional) Maximum distance entropy for seed k-mers (default: 1.5)
- threads: (Optional) Number of threads
Enhanced features:
- Entropy filtering: Uses distance analysis to select k-mers with high entropy (dispersed pattern)
- Region exclusion: Excludes k-mers in tandem repeat regions identified by LocusBedGeneratorDetailed
- Smart seed selection: Prioritizes k-mers likely to be mobile elements rather than tandem repeats
Output format (BED):
track name="DispersedRepeats" description="Dispersed repeats with similar k-mer contexts"
chr1 1000 1013 repeat_1_instance_1;seed=AAAAAAAAAAAAA;instances=10;similarity=0.85;upstream_context=5;downstream_context=7 850 .
chr2 5000 5013 repeat_1_instance_2;seed=AAAAAAAAAAAAA;instances=10;similarity=0.85;upstream_context=5;downstream_context=7 850 .
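The name column of the dispersed-repeat BED combines an instance label with `key=value` attributes separated by semicolons. A minimal sketch for splitting it is shown here; it assumes the `repeat_<n>_instance_<m>` label pattern from the example, and `parse_repeat_name` is a hypothetical helper. Attribute values are kept as strings.

```python
def parse_repeat_name(name):
    """Split 'label;key=value;...' into (family, label, attributes)."""
    label, *pairs = name.split(";")
    attrs = dict(pair.split("=", 1) for pair in pairs if "=" in pair)
    # 'repeat_1_instance_2' -> family 'repeat_1'
    family = label.rsplit("_instance_", 1)[0]
    return family, label, attrs


name = ("repeat_1_instance_2;seed=AAAAAAAAAAAAA;instances=10;"
        "similarity=0.85;upstream_context=5;downstream_context=7")
family, label, attrs = parse_repeat_name(name)
print(family, attrs["seed"], float(attrs["similarity"]))
```

Grouping rows by `family` gives one record per repeat family with all of its genomic instances.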
Here's a typical workflow for analyzing tandem repeats in a genome:
# Step 1: Index k-mers in your genome
bin/kmer_analyzer genome.fasta top100_kmers.txt kmers.bin contigs.cidx
# Step 2: Generate FASTA index if you don't have one
samtools faidx genome.fasta
# Step 3: Identify loci with clustered k-mers
bin/locus_bed_generator kmers.bin genome.fasta.fai loci.bed 100 10000 2
# Step 4: Get detailed locus analysis (optional)
bin/LocusBedGeneratorDetailed kmers.bin genome.fasta.fai detailed_loci.bed 100 10000 2 0.9 0.5
# Step 5: Analyze distance patterns (optional)
bin/distance_analyzer_detailed kmers.bin distances.tsv 100 0.01
# Step 6: Find dispersed repeats (optional)
bin/dispersed_repeat_finder kmers.bin contigs.cidx dispersed_repeats.bed
# Step 7: Visualize results (optional)
python visualize_karyotype.py detailed_loci.bed genome.fasta.fai karyotype.png --amplification 2.0
python visualize_repeats_circular.py detailed_loci.bed genome.fasta.fai circular.png --center-image logo.png
python visualize_repeats.py detailed_loci.bed genome.fasta.fai linear.png --min-tf 1000
For more accurate identification of different repeat types:
# Step 1: K-mer analysis
bin/kmer_analyzer genome.fasta top100_kmers.txt kmers.bin contigs.cidx cache.kidx
# Step 2: Generate FASTA index
samtools faidx genome.fasta
# Step 3: Detailed locus analysis (for tandem repeats)
bin/LocusBedGeneratorDetailed kmers.bin genome.fasta.fai tandem_repeats.bed 100 10000 2 0.9 0.5 100
# Step 4: Distance pattern analysis
bin/distance_analyzer_detailed kmers.bin distances.tsv 100 0.01
# Step 5: Find dispersed repeats with intelligent filtering
# Uses entropy scores from distance analysis to select dispersed k-mers
# Excludes regions identified as tandem repeats
bin/dispersed_repeat_finder kmers.bin contigs.cidx dispersed_repeats.bed distances.tsv tandem_repeats.bed 100 500 0.7 5 1.5
# Step 6: Visualize both repeat types
python visualize_karyotype.py tandem_repeats.bed genome.fasta.fai tandem_karyotype.png --title "Tandem Repeats"
python visualize_karyotype.py dispersed_repeats.bed genome.fasta.fai dispersed_karyotype.png --title "Dispersed Repeats"
# Use high entropy threshold to focus on dispersed elements
bin/dispersed_repeat_finder kmers.bin contigs.cidx transposons.bed distances.tsv tandem_repeats.bed 500 1000 0.8 10 2.0 8
# Use detailed locus generator with strict parameters
bin/LocusBedGeneratorDetailed kmers.bin genome.fasta.fai satellites.bed 1000 5000 5 0.95 0.8 500 8
The kmer_analyzer identifies all 13-mers in your sequences and tracks:
- Total frequency (tf): How many times the k-mer appears
- Positions: Where each occurrence is located
- Distances: Spacing between consecutive occurrences
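The three tracked quantities can be illustrated with a small pure-Python sketch. It is only a conceptual model of k-mer indexing, not the kmer_analyzer implementation (which is multi-threaded C++ with a binary output format); the names `index_kmers` and `spacing` are hypothetical, and skipping k-mers that contain N is an assumption.

```python
from collections import defaultdict


def index_kmers(sequence, k=13):
    """Map each k-mer to the sorted list of its start positions."""
    sequence = sequence.upper()
    positions = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if "N" in kmer:  # assumption: ambiguous bases are skipped
            continue
        positions[kmer].append(i)
    return positions


def spacing(positions):
    """Distances between consecutive occurrences of one k-mer."""
    return [b - a for a, b in zip(positions, positions[1:])]


# A toy tandem repeat, indexed with k=4 to keep the example short.
index = index_kmers("ACGTACGTACGTACGTACGT", k=4)
hits = index["ACGT"]
print("tf:", len(hits), "positions:", hits, "distances:", spacing(hits))
```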
Loci are regions where multiple k-mers cluster together, typically indicating:
- Tandem repeats
- Transposable element clusters
- Other repetitive elements
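The idea of a locus can be sketched as gap-based clustering of occurrence positions: consecutive positions closer than `locus_gap` join the same locus. This is a simplified illustration under that assumption; the actual tools also apply `min_locus_kmers` and, in the detailed generator, overlap- and Jaccard-based merging, none of which is modelled here. `cluster_positions` is a hypothetical name.

```python
def cluster_positions(positions, locus_gap=10000):
    """Group positions into (start, end) loci separated by > locus_gap."""
    loci = []
    for pos in sorted(positions):
        if loci and pos - loci[-1][1] <= locus_gap:
            loci[-1][1] = pos  # extend the current locus
        else:
            loci.append([pos, pos])  # start a new locus
    return [tuple(locus) for locus in loci]


positions = [100, 300, 900, 50000, 50200, 200000]
print(cluster_positions(positions))
```

A smaller `locus_gap` splits loci more aggressively; a larger one merges nearby clusters into a single region.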
The distance analyzer helps identify:
- Regular spacing patterns (low entropy)
- Random distributions (high entropy)
- Clustered vs. dispersed repeat patterns
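One common way to quantify how regular a spacing pattern is uses the Shannon entropy of the distance distribution. The sketch below uses base-2 logarithms over exact distance values; whether TRAKD uses this exact definition (log base, binning) is an assumption, and `distance_entropy` is a hypothetical name.

```python
import math
from collections import Counter


def distance_entropy(distances):
    """Shannon entropy (bits) of the empirical distance distribution."""
    counts = Counter(distances)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts.values())


regular = [100] * 8                 # one repeated spacing
irregular = [100, 200, 300, 400]    # every spacing different
print(distance_entropy(regular), distance_entropy(irregular))
```

A k-mer that always recurs at the same distance scores 0, while one with many equally likely distances scores higher, which matches the low-entropy/high-entropy distinction above.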
The dispersed repeat finder identifies:
- Transposable elements with conserved flanking sequences
- Repeats with similar k-mer contexts
- Families of mobile elements
- Context similarity scores between instances
Enhanced with other tool outputs:
- Uses entropy scores to select dispersed (not tandem) k-mers
- Excludes regions already identified as tandem repeats
- Focuses on true mobile elements and transposons
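To make "similar k-mer contexts" concrete, the sketch below compares the k-mer sets found in windows around two occurrences of a seed, using Jaccard similarity. The README does not specify the metric used by dispersed_repeat_finder, so Jaccard is a stand-in chosen for illustration, and `kmer_set`, `jaccard`, and `context_similarity` are hypothetical names.

```python
def kmer_set(seq, k=13):
    """All distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def context_similarity(genome, pos_a, pos_b, window=500, k=13):
    """Compare k-mer content around two seed positions."""
    def context(pos):
        start = max(0, pos - window)
        return kmer_set(genome[start:pos + k + window], k)
    return jaccard(context(pos_a), context(pos_b))


# Toy genome: the same 12 bp unit twice, so both contexts are identical.
genome = "ACGTTGCATCGA" * 2
print(context_similarity(genome, 4, 16, window=2, k=3))
```

Instances whose similarity exceeds the `similarity` threshold would then be candidates for the same repeat family.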
TRAKD tools are designed to work together for comprehensive repeat analysis:
- kmer_analyzer → Creates the foundation index for all other tools
- distance_analyzer_detailed → Provides entropy scores to distinguish repeat types:
- Low entropy (< 1.5): Regular spacing, likely tandem repeats
- High entropy (> 1.5): Irregular spacing, likely dispersed repeats
- LocusBedGeneratorDetailed → Identifies tandem repeat regions with detailed statistics
- dispersed_repeat_finder → Uses entropy scores and exclusion regions for accurate transposon detection
- Visualization tools → Create publication-ready figures for both repeat types
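The entropy threshold above can be applied directly to the distance analyzer's TSV. This sketch assumes a tab-separated file whose header matches the column names shown in the output format section; the second data row is made up for illustration, `classify_kmers` is a hypothetical helper, and treating an entropy of exactly 1.5 as dispersed-like is an arbitrary choice since the boundary case is not specified.

```python
import csv
import io


def classify_kmers(tsv_text, threshold=1.5):
    """Label each k-mer by comparing dist_entropy to the threshold."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    labels = {}
    for row in reader:
        entropy = float(row["dist_entropy"])
        labels[row["kmer"]] = (
            "tandem-like" if entropy < threshold else "dispersed-like"
        )
    return labels


tsv_text = (
    "kmer\tkmer_tf\tdist_entropy\tlocus_count\tmax_locus_size\tdistances_summary\n"
    "AAAAAAAAAAAAA\t150000\t2.5\t10\t5000\t100:5000:0.033|200:3000:0.020\n"
    # Hypothetical row, not from TRAKD output:
    "ACGTACGTACGTA\t900\t0.4\t3\t1200\t13:850:0.944\n"
)
print(classify_kmers(tsv_text))
```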
- Use caching: The .kidx cache file speeds up re-analysis of the same genome
- Adjust thread count: Use fewer threads if memory is limited
- Filter by frequency: Higher min_tf values focus on more repetitive elements
- Tune locus parameters: Adjust locus_gap based on your repeat characteristics
- Combine tools for best results:
- Use distance_analyzer_detailed to calculate entropy scores
- Use LocusBedGeneratorDetailed to identify tandem repeat regions
- Feed both outputs to dispersed_repeat_finder for accurate transposon detection
- Low entropy k-mers (< 1.5) are likely tandem repeats
- High entropy k-mers (> 1.5) are good candidates for dispersed repeats
- Use
- Tune parameters based on your organism:
- Birds/reptiles: Consider larger window sizes for dispersed repeats (1000bp)
- Mammals: Standard parameters work well
- Plants: May need higher frequency thresholds due to polyploidy
TRAKD is useful for:
- Genome annotation: Comprehensive repeat identification and classification
- Tandem repeat analysis: Satellites, microsatellites, centromeric/telomeric repeats
- Transposable element discovery: LINEs, SINEs, DNA transposons, LTR retrotransposons
- Genome assembly quality: Identifying repetitive regions that may cause assembly issues
- Comparative genomics: Analyzing repeat evolution across species
- Population genetics: Studying repeat polymorphisms
- Clinical genetics: Identifying pathogenic repeat expansions
If you use TRAKD in your research, please cite:
TRAKD: Tandem Repeat Analysis with K-mer Decomposition
https://github.com/aglabx/TRAKD
TRAKD is released under the MIT License. See LICENSE file for details.
TRAKD includes Python scripts for visualizing repeat patterns:
Creates a karyotype-style visualization showing k-mer repeats on paired chromosomes.
python visualize_karyotype.py detailed_loci.bed genome.fai output.png [options]
Options:
--amplification FLOAT Horizontal signal amplification (default: 1.0)
--min-tf INT Minimum k-mer frequency to display (default: 0)
--width FLOAT Figure width in inches (default: 16)
--height FLOAT Figure height in inches (default: 12)
--title TEXT Custom plot title
Features:
- Maternal/paternal chromosome pairing
- Signal amplification for better visibility
- All k-mer types shown in legend
- Automatic filtering of mitochondrial and rDNA sequences
Creates circular (Circos-style) or heatmap visualizations.
python visualize_repeats_circular.py detailed_loci.bed genome.fai output.png [options]
Options:
--style {circular,heatmap} Visualization style (default: circular)
--size FLOAT Figure size in inches (default: 14)
--min-tf INT Minimum k-mer frequency (default: 0)
--center-image PATH Image to place in center of circular plot
--show-legend Show k-mer legend (default: hidden)
--title TEXT Custom plot title
Features:
- Maternal/paternal chromosomes on different radii
- Option to add central image (logos, etc.)
- K-mer legend hidden by default for cleaner presentation
- Heatmap mode for density visualization
Creates linear chromosome visualization with all chromosomes stacked.
python visualize_repeats.py detailed_loci.bed genome.fai output.png [options]
Options:
--width FLOAT Figure width in inches (default: 16)
--min-tf INT Minimum k-mer frequency (default: 0)
--chromosomes LIST Specific chromosomes to plot
--title TEXT Custom plot title
Generates statistical analysis and summary plots.
python analyze_repeats_stats.py detailed_loci.bed --output-prefix analysis
Outputs:
- analysis_summary.txt: Comprehensive text report
- analysis_distributions.png: Distribution plots
- analysis_composition.png: K-mer composition analysis
- analysis_kmer_stats.csv: Detailed k-mer statistics
- analysis_chrom_stats.csv: Chromosome-level statistics
Features shared by the visualization scripts:
- K-mer grouping: Reverse complements are automatically grouped
- Color coding: Unique colors for each k-mer type using golden ratio distribution
- Signal enhancement: Amplification options for better visibility
- Multiple formats: PNG, PDF, SVG output supported
- Customization: Titles, filtering, and display options
- Attribution: All plots include "Created by TRAKD" signature
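Two of these features, reverse-complement grouping and golden-ratio color assignment, can be sketched briefly. This is an illustration of the general techniques, not the scripts' actual code: choosing the lexicographically smaller strand as the group key, the saturation and value settings, and the names `canonical` and `kmer_colors` are all assumptions.

```python
import colorsys

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
GOLDEN_RATIO_CONJUGATE = 0.618033988749895


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_colors(kmers, saturation=0.7, value=0.9):
    """Assign one RGB color per canonical k-mer.

    Hues step around the color wheel by the golden ratio conjugate,
    which keeps successive colors well separated.
    """
    colors = {}
    for kmer in kmers:
        key = canonical(kmer)
        if key not in colors:
            hue = (len(colors) * GOLDEN_RATIO_CONJUGATE) % 1.0
            colors[key] = colorsys.hsv_to_rgb(hue, saturation, value)
    return colors


colors = kmer_colors(["AAAAAAAAAAAAA", "TTTTTTTTTTTTT", "ACGTACGTACGTA"])
for kmer, rgb in colors.items():
    print(kmer, tuple(round(c, 2) for c in rgb))
```

The poly-A and poly-T k-mers share one entry because they are reverse complements of each other.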
Contributions are welcome! Please feel free to submit issues or pull requests on GitHub.
For questions or support, please open an issue on the GitHub repository.