TRAKD - Tandem Repeat Analysis with K-mer Decomposition

TRAKD is a high-performance bioinformatics toolkit for analyzing repetitive sequences in genomes using k-mer decomposition. It identifies tandem repeats, dispersed repeats, transposable elements, and analyzes their distribution patterns across genomic sequences. The tools work synergistically to distinguish between different types of repetitive DNA.

Features

Fast k-mer indexing with multi-threaded processing
Tandem repeat detection using locus clustering and distance patterns
Dispersed repeat identification for transposable elements and mobile DNA
Distance entropy analysis to distinguish repeat types
Intelligent filtering using cross-tool information
Context-based similarity for grouping repeat families
Memory-efficient processing of large genomes
Cross-platform support (Linux, macOS, Windows)
Visualization tools for karyotype and circular plots

Installation

Requirements

C++17 compatible compiler (g++ 7.0+ or clang++ 5.0+)
POSIX threads (pthread)
Make

Building from source

# Clone the repository
git clone https://github.com/aglabx/TRAKD.git
cd TRAKD

# Build all tools
make

# Or build with debug symbols
make debug

The compiled executables will be in the bin/ directory.

Quick Start

# Analyze a genome for all repeat types
make
bin/kmer_analyzer genome.fasta kmers_top100.txt kmers.bin contigs.cidx
samtools faidx genome.fasta
bin/LocusBedGeneratorDetailed kmers.bin genome.fasta.fai tandem_repeats.bed
bin/distance_analyzer_detailed kmers.bin distances.tsv
bin/dispersed_repeat_finder kmers.bin contigs.cidx dispersed_repeats.bed distances.tsv tandem_repeats.bed

# Visualize results
python visualize_karyotype.py tandem_repeats.bed genome.fasta.fai karyotype.png

Tools Overview

TRAKD consists of five complementary tools that work together in a pipeline. The tools can be used independently or in combination for enhanced results:

1. kmer_analyzer

Analyzes k-mer frequencies and positions in FASTA sequences.

bin/kmer_analyzer <input.fasta> <output_top100.txt> <output_full.bin> <output_contigs.cidx> [cache.kidx] [threads]

Parameters:

input.fasta: Multi-FASTA file to analyze
output_top100.txt: Text file with top 100 most frequent k-mers
output_full.bin: Binary file containing all k-mer data
output_contigs.cidx: Contig index file
cache.kidx: (Optional) Cache file for faster re-analysis
threads: (Optional) Number of threads (default: all available)

Output format (top100.txt):

AAAAAAAAAAAAA   150000  100,100,100,200,300...
TTTTTTTTTTTTT   120000  50,50,100,150...

2. locus_bed_generator

Identifies genomic loci where k-mers cluster together.

bin/locus_bed_generator <input.bin> <reference.fai> <output.bed> [min_tf] [locus_gap] [min_locus_kmers] [threads]

Parameters:

input.bin: Binary file from kmer_analyzer
reference.fai: FASTA index file (.fai)
output.bed: Output BED file with loci
min_tf: (Optional) Minimum k-mer frequency (default: 100)
locus_gap: (Optional) Maximum gap within a locus (default: 10000)
min_locus_kmers: (Optional) Minimum k-mers per locus (default: 2)
threads: (Optional) Number of threads

Output format (BED):

track name="KmerLoci" description="K-mer loci generated by TRAKD"
chr1    1000    2000    locus_1_len_1000
chr1    5000    7500    locus_2_len_2500

3. LocusBedGeneratorDetailed

Advanced locus detection with detailed statistics and intelligent merging.

bin/LocusBedGeneratorDetailed <input.bin> <reference.fai> <output.bed> [min_tf] [locus_gap] [min_locus_kmers] [inter_overlap] [jaccard_thresh] [min_length] [threads]

Additional parameters:

inter_overlap: Strong overlap threshold for merging (default: 0.9)
jaccard_thresh: Jaccard similarity threshold (default: 0.5)
min_length: Minimum locus length to output (default: 100)

Output format (BED with detailed annotations):

track name="DetailedKmerLoci" description="Detailed k-mer loci with statistics"
chr1    1000    2000    AAAAAAAAAAAAA|tf:150000|local_tf:1000|entropy:2.5|loci:10|max_size:5000|n_kmers:25|jaccard_min:0.3|med:0.5|max:0.8

4. distance_analyzer_detailed

Analyzes distance distributions between k-mer occurrences.

bin/distance_analyzer_detailed <input.bin> <output.tsv> [min_tf] [min_frac] [locus_gap] [threads]

Parameters:

input.bin: Binary file from kmer_analyzer
output.tsv: Output TSV file with analysis
min_tf: (Optional) Minimum k-mer frequency (default: 0)
min_frac: (Optional) Minimum distance fraction (default: 0.0)
locus_gap: (Optional) Locus gap threshold (default: 10000)
threads: (Optional) Number of threads

Output format (TSV):

kmer    kmer_tf    dist_entropy    locus_count    max_locus_size    distances_summary
AAAAAAAAAAAAA    150000    2.5    10    5000    100:5000:0.033|200:3000:0.020|...

5. dispersed_repeat_finder

Identifies dispersed repeats with similar k-mer contexts (e.g., transposable elements). Can use outputs from other tools for better seed selection and region exclusion.

bin/dispersed_repeat_finder <input.bin> <contigs.cidx> <output.bed> [distance.tsv] [exclusion.bed] [min_tf] [window] [similarity] [min_inst] [max_entropy] [threads]

Parameters:

input.bin: Binary file from kmer_analyzer
contigs.cidx: Contig index file from kmer_analyzer
output.bed: Output BED file with dispersed repeats
distance.tsv: (Optional) Distance analysis file from distance_analyzer_detailed for entropy filtering (use 'none' to skip)
exclusion.bed: (Optional) BED file with tandem repeat regions to exclude from LocusBedGeneratorDetailed (use 'none' to skip)
min_tf: (Optional) Minimum k-mer frequency (default: 100)
window: (Optional) Context window size in bp (default: 500)
similarity: (Optional) Minimum context similarity 0-1 (default: 0.7)
min_inst: (Optional) Minimum instances per repeat (default: 5)
max_entropy: (Optional) Maximum distance entropy for seed k-mers (default: 1.5)
threads: (Optional) Number of threads

Enhanced features:

Entropy filtering: Uses distance analysis to select k-mers with high entropy (dispersed pattern)
Region exclusion: Excludes k-mers in tandem repeat regions identified by LocusBedGeneratorDetailed
Smart seed selection: Prioritizes k-mers likely to be mobile elements rather than tandem repeats

Output format (BED):

track name="DispersedRepeats" description="Dispersed repeats with similar k-mer contexts"
chr1    1000    1013    repeat_1_instance_1;seed=AAAAAAAAAAAAA;instances=10;similarity=0.85;upstream_context=5;downstream_context=7    850    .
chr2    5000    5013    repeat_1_instance_2;seed=AAAAAAAAAAAAA;instances=10;similarity=0.85;upstream_context=5;downstream_context=7    850    .

Usage Examples

Basic Workflow

Here's a typical workflow for analyzing tandem repeats in a genome:

# Step 1: Index k-mers in your genome
bin/kmer_analyzer genome.fasta top100_kmers.txt kmers.bin contigs.cidx

# Step 2: Generate FASTA index if you don't have one
samtools faidx genome.fasta

# Step 3: Identify loci with clustered k-mers
bin/locus_bed_generator kmers.bin genome.fasta.fai loci.bed 100 10000 2

# Step 4: Get detailed locus analysis (optional)
bin/LocusBedGeneratorDetailed kmers.bin genome.fasta.fai detailed_loci.bed 100 10000 2 0.9 0.5

# Step 5: Analyze distance patterns (optional)
bin/distance_analyzer_detailed kmers.bin distances.tsv 100 0.01

# Step 6: Find dispersed repeats (optional)
bin/dispersed_repeat_finder kmers.bin contigs.cidx dispersed_repeats.bed

# Step 7: Visualize results (optional)
python visualize_karyotype.py detailed_loci.bed genome.fasta.fai karyotype.png --amplification 2.0
python visualize_repeats_circular.py detailed_loci.bed genome.fasta.fai circular.png --center-image logo.png
python visualize_repeats.py detailed_loci.bed genome.fasta.fai linear.png --min-tf 1000

Advanced Workflow - Integrated Analysis

For more accurate identification of different repeat types:

# Step 1: K-mer analysis
bin/kmer_analyzer genome.fasta top100_kmers.txt kmers.bin contigs.cidx cache.kidx

# Step 2: Generate FASTA index
samtools faidx genome.fasta

# Step 3: Detailed locus analysis (for tandem repeats)
bin/LocusBedGeneratorDetailed kmers.bin genome.fasta.fai tandem_repeats.bed 100 10000 2 0.9 0.5 100

# Step 4: Distance pattern analysis
bin/distance_analyzer_detailed kmers.bin distances.tsv 100 0.01

# Step 5: Find dispersed repeats with intelligent filtering
# Uses entropy scores from distance analysis to select dispersed k-mers
# Excludes regions identified as tandem repeats
bin/dispersed_repeat_finder kmers.bin contigs.cidx dispersed_repeats.bed distances.tsv tandem_repeats.bed 100 500 0.7 5 1.5

# Step 6: Visualize both repeat types
python visualize_karyotype.py tandem_repeats.bed genome.fasta.fai tandem_karyotype.png --title "Tandem Repeats"
python visualize_karyotype.py dispersed_repeats.bed genome.fasta.fai dispersed_karyotype.png --title "Dispersed Repeats"

Targeted Analysis Examples

Finding Transposable Elements

# Use high entropy threshold to focus on dispersed elements
bin/dispersed_repeat_finder kmers.bin contigs.cidx transposons.bed distances.tsv tandem_repeats.bed 500 1000 0.8 10 2.0 8

Finding Satellite DNA

# Use detailed locus generator with strict parameters
bin/LocusBedGeneratorDetailed kmers.bin genome.fasta.fai satellites.bed 1000 5000 5 0.95 0.8 500 8

Understanding the Output

K-mer Frequencies

The kmer_analyzer identifies all 13-mers in your sequences and tracks:

Total frequency (tf): How many times the k-mer appears
Positions: Where each occurrence is located
Distances: Spacing between consecutive occurrences

Loci Detection

Loci are regions where multiple k-mers cluster together, typically indicating:

Tandem repeats
Transposable element clusters
Other repetitive elements

Distance Analysis

The distance analyzer helps identify:

Regular spacing patterns (low entropy)
Random distributions (high entropy)
Clustered vs. dispersed repeat patterns

Dispersed Repeats

The dispersed repeat finder identifies:

Transposable elements with conserved flanking sequences
Repeats with similar k-mer contexts
Families of mobile elements
Context similarity scores between instances

Enhanced with other tool outputs:

Uses entropy scores to select dispersed (not tandem) k-mers
Excludes regions already identified as tandem repeats
Focuses on true mobile elements and transposons

Tool Synergy

TRAKD tools are designed to work together for comprehensive repeat analysis:

kmer_analyzer → Creates the foundation index for all other tools
distance_analyzer_detailed → Provides entropy scores to distinguish repeat types:
- Low entropy (< 1.5): Regular spacing, likely tandem repeats
- High entropy (> 1.5): Irregular spacing, likely dispersed repeats
LocusBedGeneratorDetailed → Identifies tandem repeat regions with detailed statistics
dispersed_repeat_finder → Uses entropy scores and exclusion regions for accurate transposon detection
Visualization tools → Create publication-ready figures for both repeat types

Performance Tips

Use caching: The .kidx cache file speeds up re-analysis of the same genome
Adjust thread count: Use fewer threads if memory is limited
Filter by frequency: Higher min_tf values focus on more repetitive elements
Tune locus parameters: Adjust locus_gap based on your repeat characteristics
Combine tools for best results:
- Use distance_analyzer_detailed to calculate entropy scores
- Use LocusBedGeneratorDetailed to identify tandem repeat regions
- Feed both outputs to dispersed_repeat_finder for accurate transposon detection
- Low entropy k-mers (< 1.5) are likely tandem repeats
- High entropy k-mers (> 1.5) are good candidates for dispersed repeats
Tune parameters based on your organism:
- Birds/reptiles: Consider larger window sizes for dispersed repeats (1000bp)
- Mammals: Standard parameters work well
- Plants: May need higher frequency thresholds due to polyploidy

Applications

TRAKD is useful for:

Genome annotation: Comprehensive repeat identification and classification
Tandem repeat analysis: Satellites, microsatellites, centromeric/telomeric repeats
Transposable element discovery: LINEs, SINEs, DNA transposons, LTR retrotransposons
Genome assembly quality: Identifying repetitive regions that may cause assembly issues
Comparative genomics: Analyzing repeat evolution across species
Population genetics: Studying repeat polymorphisms
Clinical genetics: Identifying pathogenic repeat expansions

Citation

If you use TRAKD in your research, please cite:

TRAKD: Tandem Repeat Analysis with K-mer Decomposition
https://github.com/aglabx/TRAKD

License

TRAKD is released under the MIT License. See LICENSE file for details.

Visualization Tools

TRAKD includes Python scripts for visualizing repeat patterns:

visualize_karyotype.py

Creates a karyotype-style visualization showing k-mer repeats on paired chromosomes.

python visualize_karyotype.py detailed_loci.bed genome.fai output.png [options]

Options:
  --amplification FLOAT  Horizontal signal amplification (default: 1.0)
  --min-tf INT          Minimum k-mer frequency to display (default: 0)
  --width FLOAT         Figure width in inches (default: 16)
  --height FLOAT        Figure height in inches (default: 12)
  --title TEXT          Custom plot title

Features:

Maternal/paternal chromosome pairing
Signal amplification for better visibility
All k-mer types shown in legend
Automatic filtering of mitochondrial and rDNA sequences

visualize_repeats_circular.py

Creates circular (Circos-style) or heatmap visualizations.

python visualize_repeats_circular.py detailed_loci.bed genome.fai output.png [options]

Options:
  --style {circular,heatmap}  Visualization style (default: circular)
  --size FLOAT               Figure size in inches (default: 14)
  --min-tf INT               Minimum k-mer frequency (default: 0)
  --center-image PATH        Image to place in center of circular plot
  --show-legend              Show k-mer legend (default: hidden)
  --title TEXT               Custom plot title

Features:

Maternal/paternal chromosomes on different radii
Option to add central image (logos, etc.)
K-mer legend hidden by default for cleaner presentation
Heatmap mode for density visualization

visualize_repeats.py

Creates linear chromosome visualization with all chromosomes stacked.

python visualize_repeats.py detailed_loci.bed genome.fai output.png [options]

Options:
  --width FLOAT         Figure width in inches (default: 16)
  --min-tf INT         Minimum k-mer frequency (default: 0)
  --chromosomes LIST   Specific chromosomes to plot
  --title TEXT         Custom plot title

analyze_repeats_stats.py

Generates statistical analysis and summary plots.

python analyze_repeats_stats.py detailed_loci.bed --output-prefix analysis

Outputs:
  - analysis_summary.txt: Comprehensive text report
  - analysis_distributions.png: Distribution plots
  - analysis_composition.png: K-mer composition analysis
  - analysis_kmer_stats.csv: Detailed k-mer statistics
  - analysis_chrom_stats.csv: Chromosome-level statistics

Visualization Features

K-mer grouping: Reverse complements are automatically grouped
Color coding: Unique colors for each k-mer type using golden ratio distribution
Signal enhancement: Amplification options for better visibility
Multiple formats: PNG, PDF, SVG output supported
Customization: Titles, filtering, and display options
Attribution: All plots include "Created by TRAKD" signature

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests on GitHub.

Contact

For questions or support, please open an issue on the GitHub repository.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
src		src
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
analyze_repeats_stats.py		analyze_repeats_stats.py
visualize_karyotype.py		visualize_karyotype.py
visualize_repeats.py		visualize_repeats.py
visualize_repeats_circular.py		visualize_repeats_circular.py

License

aglabx/TRAKD

Folders and files

Latest commit

History

Repository files navigation

TRAKD - Tandem Repeat Analysis with K-mer Decomposition

Features

Installation

Requirements

Building from source

Quick Start

Tools Overview

1. kmer_analyzer

2. locus_bed_generator

3. LocusBedGeneratorDetailed

4. distance_analyzer_detailed

5. dispersed_repeat_finder

Usage Examples

Basic Workflow

Advanced Workflow - Integrated Analysis

Targeted Analysis Examples

Finding Transposable Elements

Finding Satellite DNA

Understanding the Output

K-mer Frequencies

Loci Detection

Distance Analysis

Dispersed Repeats

Tool Synergy

Performance Tips

Applications

Citation

License

Visualization Tools

visualize_karyotype.py

visualize_repeats_circular.py

visualize_repeats.py

analyze_repeats_stats.py

Visualization Features

Contributing

Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages