FractalBench is a research benchmark for evaluating the ability of Vision Language Models (VLMs) to generate fractal code from images. It serves a dual purpose: an educational fractal library and an AI code-generation benchmark. The project provides 12 clean, self-contained fractal implementations together with a complete evaluation pipeline for systematically measuring how AI models interpret fractal images and generate the corresponding mathematical code.
- 📚 12 fractals (Sierpiński, Koch, Dragons, and more) as standalone Python files
- 🤖 Test set of 610 images across 5 colors and progressive depths
- 🌐 Evaluate GPT-4o, Claude, Gemini, or Qwen (free) via OpenRouter
- 📊 Jaccard Index comparison plus 9 code complexity metrics
- ⚡ Custom turtle graphics in just ~80 lines, no external dependencies
- 🔒 LLM code runs sandboxed with timeouts and auto-imported modules
- 🔬 Outputs LaTeX tables and statistical tests ready for papers
- 🛠️ Uses uv, async API calls, and reproducible pipeline stages
git clone https://github.com/NaiveNeuron/FractalBench.git
cd FractalBench
uv sync
uv run python fractals/sierpinski_gasket.py  # Creates sierpinski_gasket.png

# Complete AI evaluation (estimated time: 2-3 hours)
export OPENROUTER_API_KEY='your_key_here'
uv run python scripts/generate_test_set.py # Generate 610 reference images (~5 min)
uv run python scripts/analyze_fractals.py # LLM analysis (~45-90 min)
uv run python scripts/execute_llm_code.py # Execute generated code (~10 min)
uv run python scripts/evaluate_images.py # Pixel-level evaluation (~5 min)
uv run python scripts/analyze_code_complexity.py # Code complexity analysis (~2-5 min)
uv run python scripts/generate_latex_tables.py  # Statistical analysis (~1 min)

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│    Test Set     │    │   LLM Analysis   │    │ Code Execution  │
│   Generation    │───▶│   (VLM → Code)   │───▶│  & Validation   │
│   610 images    │    │  3 prompts × 4   │    │ Syntax + Runtime│
└─────────────────┘    │ models = 12 runs │    └─────────────────┘
                       └──────────────────┘             │
                                                        ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│     Results     │    │ Image Evaluation │    │    Generated    │
│    Analysis     │◀───│   & Comparison   │◀───│     Images      │
│  LaTeX Tables   │    │  Jaccard Index   │    │   PNG Output    │
└─────────────────┘    └──────────────────┘    └─────────────────┘
Each fractal is implemented as a standalone .py file in /fractals/ (see the sketch after this list):
- cantor_set.py - Classic 1D Cantor set with recursive middle-third removal
- cantor_dust.py - 2D version of Cantor set using square subdivision
- koch_curve.py - Koch curve and snowflake with equilateral triangle bumps
- koch_snowflake.py - Standalone Koch snowflake implementation
- sierpinski_gasket.py - Sierpiński triangle with recursive subdivision
- sierpinski_carpet.py - 2D Sierpiński carpet with square grid subdivision
- sierpinski_pentagon.py - Pentagonal version using golden ratio scaling
- heighway_dragon.py - Space-filling dragon curve from paper folding
- levy_dragon.py - Lévy dragon with 45° turn patterns
- mcworter_pentigree.py - Tree-like fractal with 5-way branching
- pythagoras_tree.py - Recursive squares forming tree structure
- symmetric_binary_tree.py - Binary tree with symmetric branches
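To give a feel for what such a standalone file boils down to, here is a minimal sketch of the Koch-curve recursion. It is illustrative only: the real modules draw with the repo's minimal turtle and renderer, while the function below (koch_segments, a hypothetical name) just returns line segments.

```python
# Hypothetical sketch of the recursion behind a Koch curve; not the repo's API.
import math


def koch_segments(p0, p1, depth):
    """Recursively expand one segment into the four Koch sub-segments."""
    if depth == 0:
        return [(p0, p1)]
    (x0, y0), (x1, y1) = p0, p1
    dx, dy = (x1 - x0) / 3.0, (y1 - y0) / 3.0
    a = (x0 + dx, y0 + dy)            # one-third point
    b = (x0 + 2 * dx, y0 + 2 * dy)    # two-thirds point
    # Apex of the equilateral bump: the middle third rotated by +60 degrees.
    c, s = math.cos(math.radians(60)), math.sin(math.radians(60))
    apex = (a[0] + dx * c - dy * s, a[1] + dx * s + dy * c)
    points = [p0, a, apex, b, p1]
    segments = []
    for start, end in zip(points, points[1:]):
        segments.extend(koch_segments(start, end, depth - 1))
    return segments


if __name__ == "__main__":
    print(len(koch_segments((0.0, 0.0), (1.0, 0.0), depth=3)))  # 4**3 = 64 segments
```

The depth parameter here plays the same role as the progressive depths used when generating the test set.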
- Python 3.10+
- uv (recommended) or pip
- OpenRouter API key (for LLM evaluation)
git clone https://github.com/NaiveNeuron/FractalBench.git
cd FractalBench
uv sync

# Required for LLM evaluation pipeline
export OPENROUTER_API_KEY='your_key_here'
# Optional configuration
export MAX_CONCURRENT_REQUESTS=5 # API rate limiting
export OUTPUT_DIR='custom_output'  # Custom output directory

# Generate single fractals
uv run python fractals/sierpinski_gasket.py # Creates sierpinski_gasket.png
uv run python fractals/koch_snowflake.py # Creates koch_snowflake.png
uv run python fractals/heighway_dragon.py # Creates heighway_dragon.png
# All fractals in test set format
uv run python scripts/generate_test_set.py  # Creates data/test_set/ with 610 images

# Step 1: Generate reference dataset (610 images, ~5 minutes)
uv run python scripts/generate_test_set.py
# Step 2: LLM analysis (3 prompts × 4 models, ~45-90 minutes)
uv run python scripts/analyze_fractals.py
# Step 3: Execute generated code (~10 minutes)
uv run python scripts/execute_llm_code.py
# Step 4: Pixel-level evaluation (~5 minutes)
uv run python scripts/evaluate_images.py
# Step 5: Statistical analysis and LaTeX output (~1 minute)
uv run python scripts/generate_latex_tables.py

| Model | Provider | Cost (1M tokens) | Capability |
|---|---|---|---|
| GPT-4o | OpenAI | ~$15 | Best vision understanding |
| Claude 3.7 Sonnet | Anthropic | ~$3 | Strong reasoning & code |
| Gemini 2.5 Flash | Google | ~$0.30 | Cost-effective |
| Qwen 2.5-VL | Qwen | Free | Open-source model |
- direct_code: Skip reasoning, generate code directly from image
- reason_then_code: Analyze fractal structure, then implement
- recursive_focus: Emphasize recursive patterns and base cases
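The actual prompt texts are defined in scripts/analyze_fractals.py; the snippet below is only a hypothetical illustration of how the three strategies differ in emphasis, not the benchmark's real wording:

```python
# Illustrative stand-ins for the three prompt strategies (not the repo's prompts).
PROMPTS = {
    "direct_code": (
        "Write Python code that reproduces the fractal shown in this image."
    ),
    "reason_then_code": (
        "First describe the fractal's structure (shape, scaling, repetition), "
        "then write Python code that reproduces it."
    ),
    "recursive_focus": (
        "Identify the recursive rule and base case of this fractal, "
        "then implement it as a recursive Python function."
    ),
}
```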
- Full Evaluation (610 images × 3 prompts): $8-25 depending on model choice
- Free Tier Available: Qwen model provides no-cost evaluation option
uv run python scripts/generate_test_set.py

- Output: 610 images (122 per color × 5 colors)
- Organization: data/test_set/{color}/{fractal}_{depth}_{size}.png (see the sketch below)
- Features: Progressive depth sequences, automatic size optimization
- Time: ~5 minutes
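A minimal sketch of how that layout could be enumerated; the color names beyond black and blue, the fractal subset, the depth range, and the image size are assumptions for illustration, and the real logic lives in scripts/generate_test_set.py:

```python
# Illustrative enumeration of data/test_set/{color}/{fractal}_{depth}_{size}.png;
# the lists below are assumptions, not the benchmark's exact configuration.
from pathlib import Path

COLORS = ["black", "blue", "red", "green", "purple"]                   # assumed color set
FRACTALS = ["sierpinski_gasket", "koch_snowflake", "heighway_dragon"]  # subset for illustration
DEPTHS = range(1, 6)                                                   # assumed depth range
SIZE = 512                                                             # assumed image size


def test_set_paths(root="data/test_set"):
    """Yield the expected output path for every (color, fractal, depth) combination."""
    for color in COLORS:
        for fractal in FRACTALS:
            for depth in DEPTHS:
                yield Path(root) / color / f"{fractal}_{depth}_{SIZE}.png"


if __name__ == "__main__":
    for path in list(test_set_paths())[:3]:
        print(path)
```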
uv run python scripts/analyze_fractals.py

- Input: Select prompt strategy and model(s)
- Processing: Concurrent API calls with rate limiting (see the sketch below)
- Output: data/analysis_results/{prompt}/{model}/{color}/{fractal}.json
- Features: Progress tracking, error handling, usage statistics
- Time: 45-90 minutes (varies by model and concurrency)
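A minimal sketch of the bounded-concurrency pattern, using the MAX_CONCURRENT_REQUESTS variable from the configuration section; call_model is a hypothetical stand-in for the actual OpenRouter request in scripts/analyze_fractals.py:

```python
# Sketch of rate-limited async calls; call_model is a placeholder, not the repo's API.
import asyncio
import os

MAX_CONCURRENT = int(os.environ.get("MAX_CONCURRENT_REQUESTS", "5"))


async def call_model(image_path: str) -> str:
    await asyncio.sleep(0.1)          # stands in for the real HTTP request
    return f"generated code for {image_path}"


async def analyze_all(image_paths):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async def bounded(path):
        async with semaphore:         # at most MAX_CONCURRENT requests in flight
            return await call_model(path)

    return await asyncio.gather(*(bounded(p) for p in image_paths))


if __name__ == "__main__":
    responses = asyncio.run(analyze_all([f"img_{i}.png" for i in range(12)]))
    print(len(responses), "responses")
```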
uv run python scripts/execute_llm_code.py

- Function: Extract Python code from LLM responses and make it executable
- Safety: Syntax validation, sandboxed execution (see the sketch below)
- Output: data/generated_images/ with rendered fractal images
- Features: Error logging, execution time tracking
- Time: ~10 minutes
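A minimal sketch of timeout-bounded execution in a separate process; the actual sandboxing and auto-imports in scripts/execute_llm_code.py may differ, and run_generated is a hypothetical helper name:

```python
# Sketch of running generated code in a subprocess with a timeout (illustrative only).
import subprocess
import sys


def run_generated(script_path: str, timeout_s: int = 30) -> dict:
    """Execute a generated script in a child process and report success/stderr."""
    try:
        proc = subprocess.run(
            [sys.executable, script_path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        return {"ok": proc.returncode == 0, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"ok": False, "stderr": f"timed out after {timeout_s}s"}
```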
uv run python scripts/evaluate_images.py

- Method: Pixel-level comparison using Jaccard Index (IoU) (see the sketch below)
- Processing: Binary image conversion, similarity calculation
- Output: results/evaluation_results.json with detailed metrics
- Metrics: Success rates, similarity scores, failure analysis
- Time: ~5 minutes
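For reference, the Jaccard Index on binarized images is the ratio of the intersection of drawn pixels to their union. A minimal sketch follows; the thresholding and preprocessing are assumptions, not necessarily what evaluate_images.py does:

```python
# Sketch of IoU between two binarized images; threshold choice is illustrative.
import numpy as np
from PIL import Image


def jaccard_index(path_a: str, path_b: str, threshold: int = 128) -> float:
    """Binarize both images (dark pixels = drawn) and return |A ∩ B| / |A ∪ B|.

    Assumes both images have the same dimensions.
    """
    a = np.array(Image.open(path_a).convert("L")) < threshold
    b = np.array(Image.open(path_b).convert("L")) < threshold
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # both images empty: treat as a perfect match
    return np.logical_and(a, b).sum() / union
```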
Analyzes structural complexity of LLM-generated code across various metrics (LOC, turtle calls, function definitions, loops, conditionals, etc.).
uv run python scripts/analyze_code_complexity.py

- Output: JSON summaries + PDF visualizations in results/code_complexity/
- Metrics: Lines of code, function calls, control flow structures, etc. (see the sketch below)
- Time: ~2-5 minutes
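A minimal sketch of how such metrics can be collected with Python's ast module; the actual metric definitions in analyze_code_complexity.py may differ:

```python
# Sketch of AST-based code metrics (illustrative; not the repo's exact definitions).
import ast


def complexity_metrics(source: str) -> dict:
    """Count a few structural features of a piece of generated Python code."""
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))
    return {
        "loc": sum(1 for line in source.splitlines() if line.strip()),
        "functions": sum(isinstance(n, ast.FunctionDef) for n in nodes),
        "loops": sum(isinstance(n, (ast.For, ast.While)) for n in nodes),
        "conditionals": sum(isinstance(n, ast.If) for n in nodes),
        "calls": sum(isinstance(n, ast.Call) for n in nodes),
    }
```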
uv run python scripts/generate_latex_tables.py

- Analysis: Model comparison, prompt effectiveness, fractal difficulty
- Output: LaTeX tables, statistical visualizations
- Features: Significance testing, distribution analysis (see the sketch below)
- Formats: Ready for research papers and presentations
- Time: ~1 minute
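As an illustration of the kind of significance testing involved, the sketch below compares two models' per-image similarity scores with a Mann-Whitney U test; the actual test(s) used by generate_latex_tables.py may differ:

```python
# Sketch of a pairwise model comparison on Jaccard scores (test choice is an assumption).
from scipy.stats import mannwhitneyu


def compare_models(scores_a, scores_b):
    """Two-sided Mann-Whitney U test on two lists of per-image similarity scores."""
    statistic, p_value = mannwhitneyu(scores_a, scores_b, alternative="two-sided")
    return {"U": statistic, "p_value": p_value}
```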
After running the complete pipeline, your directory will contain:
FractalBench/
├── fractals/ # 12 fractal implementations
│ ├── sierpinski_gasket.py # Triangle fractal
│ ├── koch_snowflake.py # Classic snowflake
│ └── ...
├── utils/ # Core infrastructure
│ ├── minimal_turtle.py # Turtle graphics implementation
│ ├── minimal_renderer.py # PNG rendering
│ └── fractal_props.py # Depth calculation
├── scripts/ # Pipeline scripts
│ ├── generate_test_set.py # Reference image generation
│ ├── analyze_fractals.py # LLM analysis
│ ├── execute_llm_code.py # Code execution
│ ├── evaluate_images.py # Image comparison
│ ├── analyze_code_complexity.py # Code metrics
│ ├── generate_latex_tables.py # Statistical output
│ └── demo.py # Demo fractal gallery
├── static/ # Static assets
│ └── fractal_gallery.png # Overview image
├── data/ # All benchmark data
│ ├── test_set/ # Reference images (610 total)
│ │ ├── black/ # Black fractals
│ │ ├── blue/ # Blue fractals
│ │ └── ... # 5 colors total
│ ├── analysis_results/ # LLM-generated code
│ │ ├── direct_code/ # Direct prompt strategy
│ │ ├── reason_then_code/ # Reasoning prompt strategy
│ │ └── recursive_focus/ # Recursive prompt strategy
│ └── generated_images/ # LLM-rendered images
└── results/ # Evaluation outputs
├── evaluation_results.json # Similarity metrics
├── latex_tables.tex # Research output
└── code_complexity/ # Code complexity analysis
├── by_prompt/ # Group by prompt type
├── by_fractal/ # Group by fractal type
└── *_summary.json # Complete metrics
If you use FractalBench in your research, please cite:
@inproceedings{ondras2025fractalbench,
title={FractalBench: Diagnosing Visual-Mathematical Reasoning Through Recursive Program Synthesis},
author={Jan Ondras and Marek Suppa},
booktitle={The 5th Workshop on Mathematical Reasoning and AI at NeurIPS 2025},
year={2025},
url={https://openreview.net/forum?id=DxsvO2iHnz}
}

MIT License
