This repository is based on the HTR-VT model from the paper "HTR-VT: Handwritten Text Recognition with Vision Transformer". We have extended the original implementation with:
- Enhanced noise handling capabilities
- Post-processing using Symspell for improved accuracy
We retrained the model from scratch with augmented data that includes:
- Horizontal lines (simulating check fields)
- Various backgrounds
- Smudges
- Salt and pepper noise
A 50:50 ratio of clean and noisy images was used during training to ensure the model learns both clean character structure and noise handling.
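The augmentations listed above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the repository's actual pipeline: the function names, the nested-list image representation (rows of 0-255 grayscale values), and the default noise amount are all assumptions made for the example. A real implementation would more likely operate on NumPy arrays or tensors.

```python
import random


def add_salt_and_pepper(img, amount=0.05, rng=random):
    """Return a copy of a grayscale image with random pixels forced to 0 or 255."""
    out = [row[:] for row in img]
    for row in out:
        for x in range(len(row)):
            if rng.random() < amount:
                row[x] = rng.choice((0, 255))
    return out


def add_horizontal_line(img, y, thickness=1, value=0):
    """Return a copy with a dark horizontal rule at row y, like a check field line."""
    out = [row[:] for row in img]
    for yy in range(max(0, y), min(len(out), y + thickness)):
        out[yy] = [value] * len(out[yy])
    return out


def maybe_augment(img, noisy_prob=0.5, rng=random):
    """With probability noisy_prob return a noisy variant, else the clean image.

    noisy_prob=0.5 corresponds to the 50:50 clean/noisy mix (hypothetical helper).
    """
    if rng.random() >= noisy_prob:
        return img
    line_y = rng.randrange(len(img))
    return add_salt_and_pepper(add_horizontal_line(img, line_y), rng=rng)
```

Applying `maybe_augment` per sample at load time, rather than pre-generating noisy copies, means each image can appear both clean and noisy across epochs.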
We've integrated Symspell for fast and accurate spell-checking, specifically optimized for bank check processing:
- Custom dictionary for numeric words commonly found in check amounts
- Fast lookup based on Damerau-Levenshtein edit distance
- Compound spell checking for sentence-level text
- Processing time on the order of milliseconds
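To illustrate the edit-distance lookup, here is a small sketch of the restricted Damerau-Levenshtein distance (optimal string alignment) together with a brute-force nearest-term search. It is a stand-in for explanation only: Symspell itself gets its speed from a precomputed delete index rather than scanning the vocabulary, and the names `osa_distance` and `closest_term` are hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance allowing insert, delete, substitute, and adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "su" <-> "us"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def closest_term(word, vocabulary, max_distance=2):
    """Return the nearest vocabulary term, or word unchanged if none is close enough."""
    best, best_dist = word, max_distance + 1
    for term in vocabulary:
        dist = osa_distance(word.lower(), term)
        if dist < best_dist:
            best, best_dist = term, dist
    return best
```

For example, `closest_term("thosuand", ["hundred", "thousand"])` resolves the swapped letters with a single transposition, which plain Levenshtein distance would count as two edits.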
The spell-checker includes commonly used terms in bank checks:
- Basic numbers: one, two, ..., ten
- Teens: eleven, twelve, ..., nineteen
- Tens: twenty, thirty, ..., ninety
- Magnitudes: hundred, thousand, lakhs, million
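As a rough sketch, the vocabulary above could be assembled and written out as "term count" lines, a layout that frequency dictionaries for spell-checkers commonly use. The constant names, the helper functions, and the uniform count of 1 are assumptions for illustration; the repository's actual dictionary file may differ.

```python
UNITS = ["one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine", "ten"]
TEENS = ["eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
MAGNITUDES = ["hundred", "thousand", "lakhs", "million"]


def check_vocabulary():
    """Return every number word expected in a written check amount."""
    return UNITS + TEENS + TENS + MAGNITUDES


def to_frequency_lines(terms, count=1):
    """Format terms as 'term count' lines (hypothetical uniform frequency)."""
    return [f"{term} {count}" for term in terms]
```

If the `symspellpy` package is the Symspell binding in use (an assumption), a file of such lines can be loaded with `SymSpell.load_dictionary(path, term_index=0, count_index=1)`, and multi-word amounts corrected with `lookup_compound`.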
Here are examples of our noisy training data that simulate real-world conditions:
Sample 1: Handwritten text with horizontal lines and background noise
Sample 2: French text with a dotted-line background, demonstrating the model's ability to handle different writing styles and line patterns
Sample 3: Another French text sample with dotted-line patterns, showing consistent handling of structured backgrounds
The model processes these noisy inputs and produces clean text output, which is then enhanced through Symspell post-processing. Output images can be found in the output/predictions/ directory after running inference.
The model achieves strong performance on test data even before post-processing:
- Character Error Rate (CER): 6.5%
- Word Error Rate (WER): 16.7%
These metrics are further improved after applying Symspell post-processing, particularly for numeric text in check amounts.
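For reference, CER and WER are both edit distances normalized by the reference length, computed over characters and words respectively. The sketch below shows the standard per-sample definitions; how this repository aggregates them over the test set is not shown here, so treat the function names and the per-sample normalization as assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate; assumes a non-empty reference."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate on whitespace-split tokens; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

A single wrong character costs one full word in WER, which is why WER is typically much higher than CER and why dictionary-based correction tends to help WER the most.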
- Python 3.x
- PyTorch
- Symspell (for post-processing)
- Other dependencies listed in requirements.txt
- Create a virtual environment:
python -m venv venv
source venv/bin/activate # On Unix/macOS
- Install dependencies:
pip install -r requirements.txt
To run inference on test images:
# For Apple Silicon Macs (M1/M2)
export PYTORCH_ENABLE_MPS_FALLBACK=1
python test.py --out-dir ./output --exp-name iam IAM --test-data-list ./Test_Data/ --show-images True
- For Apple Silicon (M1/M2) Macs, the model uses MPS (Metal Performance Shaders) backend with CPU fallback for certain operations
- Prediction images are saved in the output/predictions/ directory
- Post-processing with Symspell significantly improves Character Error Rate (CER) and Word Error Rate (WER)
- No pre-processing required for noisy images
The model is trained on a combination of:
- IAM Dataset (English)
  - 13,350 line-level samples
  - 657 different writers
- RIMES Dataset (French characters)
  - Additional line-level samples
  - Over 1,300 participants
To make the model robust for real-world applications, both datasets are augmented with horizontal lines (simulating check fields), various backgrounds, smudges, and salt and pepper noise, using a 50:50 ratio of clean and noisy images.
This work builds upon the HTR-VT project by Yuting Li et al. We extend our gratitude to the original authors for their foundational work in handwritten text recognition using Vision Transformers.


