FractalBench is a research benchmark for evaluating the ability of Vision Language Models (VLMs) to generate fractal code from images. It serves a dual purpose: an educational fractal library and an AI code-generation benchmark. The project provides 12 clean, self-contained fractal implementations together with a complete evaluation pipeline for systematically measuring how AI models interpret fractal images and generate the corresponding mathematical code.
- 📚 12 fractals (Sierpiński, Koch, Dragons, and more) as standalone Python files
- 🤖 Test set of 610 images across 5 colors and progressive depths
- 🌐 Evaluate GPT-4o, Claude, Gemini, or Qwen (free) via OpenRouter
- 📊 Jaccard Index comparison plus 9 code complexity metrics
- ⚡ Custom turtle graphics in just ~80 lines, no external dependencies
- 🔒 LLM code runs sandboxed with timeouts and auto-imported modules
- 🔬 Outputs LaTeX tables and statistical tests ready for papers
- 🛠️ Uses uv, async API calls, and reproducible pipeline stages
git clone https://github.com/NaiveNeuron/FractalBench.git
cd FractalBench
uv sync
uv run python fractals/sierpinski_gasket.py  # Creates sierpinski_gasket.png

# Complete AI evaluation (estimated time: 2-3 hours)
export OPENROUTER_API_KEY='your_key_here'
uv run python scripts/generate_test_set.py # Generate 610 reference images (~5 min)
uv run python scripts/analyze_fractals.py # LLM analysis (~45-90 min)
uv run python scripts/execute_llm_code.py # Execute generated code (~10 min)
uv run python scripts/evaluate_images.py # Pixel-level evaluation (~5 min)
uv run python scripts/analyze_code_complexity.py # Code complexity analysis (~2-5 min)
uv run python scripts/generate_latex_tables.py  # Statistical analysis (~1 min)

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│    Test Set     │    │   LLM Analysis   │    │ Code Execution  │
│   Generation    │───▶│   (VLM → Code)   │───▶│  & Validation   │
│   610 images    │    │  3 prompts × 4   │    │ Syntax + Runtime│
└─────────────────┘    │ models = 12 runs │    └─────────────────┘
                       └──────────────────┘             │
                                                        ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│     Results     │    │ Image Evaluation │    │    Generated    │
│    Analysis     │◀───│   & Comparison   │◀───│     Images      │
│  LaTeX Tables   │    │  Jaccard Index   │    │   PNG Output    │
└─────────────────┘    └──────────────────┘    └─────────────────┘
Each fractal is implemented as a standalone .py file in /fractals/ (see the sketch after this list):
- cantor_set.py - Classic 1D Cantor set with recursive middle-third removal
- cantor_dust.py - 2D version of Cantor set using square subdivision
- koch_curve.py - Koch curve and snowflake with equilateral triangle bumps
- koch_snowflake.py - Standalone Koch snowflake implementation
- sierpinski_gasket.py - Sierpiński triangle with recursive subdivision
- sierpinski_carpet.py - 2D Sierpiński carpet with square grid subdivision
- sierpinski_pentagon.py - Pentagonal version using golden ratio scaling
- heighway_dragon.py - Space-filling dragon curve from paper folding
- levy_dragon.py - Lévy dragon with 45° turn patterns
- mcworter_pentigree.py - Tree-like fractal with 5-way branching
- pythagoras_tree.py - Recursive squares forming tree structure
- symmetric_binary_tree.py - Binary tree with symmetric branches
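To give a feel for what such a standalone file boils down to, here is a minimal sketch of the Koch-curve recursion. It is illustrative only: the real modules draw with the repo's minimal turtle and renderer, while the function below (koch_segments, a hypothetical name) just returns line segments.

```python
# Hypothetical sketch of the recursion behind a Koch curve; not the repo's API.
import math


def koch_segments(p0, p1, depth):
    """Recursively expand one segment into the four Koch sub-segments."""
    if depth == 0:
        return [(p0, p1)]
    (x0, y0), (x1, y1) = p0, p1
    dx, dy = (x1 - x0) / 3.0, (y1 - y0) / 3.0
    a = (x0 + dx, y0 + dy)            # one-third point
    b = (x0 + 2 * dx, y0 + 2 * dy)    # two-thirds point
    # Apex of the equilateral bump: the middle third rotated by +60 degrees.
    c, s = math.cos(math.radians(60)), math.sin(math.radians(60))
    apex = (a[0] + dx * c - dy * s, a[1] + dx * s + dy * c)
    points = [p0, a, apex, b, p1]
    segments = []
    for start, end in zip(points, points[1:]):
        segments.extend(koch_segments(start, end, depth - 1))
    return segments


if __name__ == "__main__":
    print(len(koch_segments((0.0, 0.0), (1.0, 0.0), depth=3)))  # 4**3 = 64 segments
```

The depth parameter here plays the same role as the progressive depths used when generating the test set.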
- Python 3.10+
- uv (recommended) or pip
- OpenRouter API key (for LLM evaluation)
git clone https://github.com/NaiveNeuron/FractalBench.git
cd FractalBench
uv sync

# Required for LLM evaluation pipeline
export OPENROUTER_API_KEY='your_key_here'
# Optional configuration
export MAX_CONCURRENT_REQUESTS=5 # API rate limiting
export OUTPUT_DIR='custom_output'  # Custom output directory

# Generate single fractals
uv run python fractals/sierpinski_gasket.py # Creates sierpinski_gasket.png
uv run python fractals/koch_snowflake.py # Creates koch_snowflake.png
uv run python fractals/heighway_dragon.py # Creates heighway_dragon.png
# All fractals in test set format
uv run python scripts/generate_test_set.py  # Creates data/test_set/ with 610 images

# Step 1: Generate reference dataset (610 images, ~5 minutes)
uv run python scripts/generate_test_set.py
# Step 2: LLM analysis (3 prompts × 4 models, ~45-90 minutes)
uv run python scripts/analyze_fractals.py
# Step 3: Execute generated code (~10 minutes)
uv run python scripts/execute_llm_code.py
# Step 4: Pixel-level evaluation (~5 minutes)
uv run python scripts/evaluate_images.py
# Step 5: Statistical analysis and LaTeX output (~1 minute)
uv run python scripts/generate_latex_tables.py

| Model | Provider | Cost (1M tokens) | Capability |
|---|---|---|---|
| GPT-4o | OpenAI | ~$15 | Best vision understanding |
| Claude 3.7 Sonnet | Anthropic | ~$3 | Strong reasoning & code |
| Gemini 2.5 Flash | Google | ~$0.30 | Cost-effective |
| Qwen 2.5-VL | Qwen | Free | Open-source model |
- direct_code: Skip reasoning, generate code directly from image
- reason_then_code: Analyze fractal structure, then implement
- recursive_focus: Emphasize recursive patterns and base cases
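The actual prompt texts are defined in scripts/analyze_fractals.py; the snippet below is only a hypothetical illustration of how the three strategies differ in emphasis, not the benchmark's real wording:

```python
# Illustrative stand-ins for the three prompt strategies (not the repo's prompts).
PROMPTS = {
    "direct_code": (
        "Write Python code that reproduces the fractal shown in this image."
    ),
    "reason_then_code": (
        "First describe the fractal's structure (shape, scaling, repetition), "
        "then write Python code that reproduces it."
    ),
    "recursive_focus": (
        "Identify the recursive rule and base case of this fractal, "
        "then implement it as a recursive Python function."
    ),
}
```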
- Full Evaluation (610 images × 3 prompts): $8-25 depending on model choice
- Free Tier Available: Qwen model provides no-cost evaluation option
uv run python scripts/generate_test_set.py

- Output: 610 images (122 per color × 5 colors)
- Organization: data/test_set/{color}/{fractal}_{depth}_{size}.png (see the sketch below)
- Features: Progressive depth sequences, automatic size optimization
- Time: ~5 minutes
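A minimal sketch of how that layout could be enumerated; the color names beyond black and blue, the fractal subset, the depth range, and the image size are assumptions for illustration, and the real logic lives in scripts/generate_test_set.py:

```python
# Illustrative enumeration of data/test_set/{color}/{fractal}_{depth}_{size}.png;
# the lists below are assumptions, not the benchmark's exact configuration.
from pathlib import Path

COLORS = ["black", "blue", "red", "green", "purple"]                   # assumed color set
FRACTALS = ["sierpinski_gasket", "koch_snowflake", "heighway_dragon"]  # subset for illustration
DEPTHS = range(1, 6)                                                   # assumed depth range
SIZE = 512                                                             # assumed image size


def test_set_paths(root="data/test_set"):
    """Yield the expected output path for every (color, fractal, depth) combination."""
    for color in COLORS:
        for fractal in FRACTALS:
            for depth in DEPTHS:
                yield Path(root) / color / f"{fractal}_{depth}_{SIZE}.png"


if __name__ == "__main__":
    for path in list(test_set_paths())[:3]:
        print(path)
```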
uv run python scripts/analyze_fractals.py

- Input: Select prompt strategy and model(s)
- Processing: Concurrent API calls with rate limiting (see the sketch below)
- Output: data/analysis_results/{prompt}/{model}/{color}/{fractal}.json
- Features: Progress tracking, error handling, usage statistics
- Time: 45-90 minutes (varies by model and concurrency)
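A minimal sketch of the bounded-concurrency pattern, using the MAX_CONCURRENT_REQUESTS variable from the configuration section; call_model is a hypothetical stand-in for the actual OpenRouter request in scripts/analyze_fractals.py:

```python
# Sketch of rate-limited async calls; call_model is a placeholder, not the repo's API.
import asyncio
import os

MAX_CONCURRENT = int(os.environ.get("MAX_CONCURRENT_REQUESTS", "5"))


async def call_model(image_path: str) -> str:
    await asyncio.sleep(0.1)          # stands in for the real HTTP request
    return f"generated code for {image_path}"


async def analyze_all(image_paths):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async def bounded(path):
        async with semaphore:         # at most MAX_CONCURRENT requests in flight
            return await call_model(path)

    return await asyncio.gather(*(bounded(p) for p in image_paths))


if __name__ == "__main__":
    responses = asyncio.run(analyze_all([f"img_{i}.png" for i in range(12)]))
    print(len(responses), "responses")
```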
uv run python scripts/execute_llm_code.py

- Function: Extract Python code from LLM responses and make it executable
- Safety: Syntax validation, sandboxed execution (see the sketch below)
- Output: data/generated_images/ with rendered fractal images
- Features: Error logging, execution time tracking
- Time: ~10 minutes
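A minimal sketch of timeout-bounded execution in a separate process; the actual sandboxing and auto-imports in scripts/execute_llm_code.py may differ, and run_generated is a hypothetical helper name:

```python
# Sketch of running generated code in a subprocess with a timeout (illustrative only).
import subprocess
import sys


def run_generated(script_path: str, timeout_s: int = 30) -> dict:
    """Execute a generated script in a child process and report success/stderr."""
    try:
        proc = subprocess.run(
            [sys.executable, script_path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        return {"ok": proc.returncode == 0, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"ok": False, "stderr": f"timed out after {timeout_s}s"}
```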
uv run python scripts/evaluate_images.py

- Method: Pixel-level comparison using Jaccard Index (IoU) (see the sketch below)
- Processing: Binary image conversion, similarity calculation
- Output: results/evaluation_results.json with detailed metrics
- Metrics: Success rates, similarity scores, failure analysis
- Time: ~5 minutes
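For reference, the Jaccard Index on binarized images is the ratio of the intersection of drawn pixels to their union. A minimal sketch follows; the thresholding and preprocessing are assumptions, not necessarily what evaluate_images.py does:

```python
# Sketch of IoU between two binarized images; threshold choice is illustrative.
import numpy as np
from PIL import Image


def jaccard_index(path_a: str, path_b: str, threshold: int = 128) -> float:
    """Binarize both images (dark pixels = drawn) and return |A ∩ B| / |A ∪ B|.

    Assumes both images have the same dimensions.
    """
    a = np.array(Image.open(path_a).convert("L")) < threshold
    b = np.array(Image.open(path_b).convert("L")) < threshold
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # both images empty: treat as a perfect match
    return np.logical_and(a, b).sum() / union
```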
Analyzes structural complexity of LLM-generated code across various metrics (LOC, turtle calls, function definitions, loops, conditionals, etc.).
uv run python scripts/analyze_code_complexity.py

- Output: JSON summaries + PDF visualizations in results/code_complexity/
- Metrics: Lines of code, function calls, control flow structures, etc. (see the sketch below)
- Time: ~2-5 minutes
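A minimal sketch of how such metrics can be collected with Python's ast module; the actual metric definitions in analyze_code_complexity.py may differ:

```python
# Sketch of AST-based code metrics (illustrative; not the repo's exact definitions).
import ast


def complexity_metrics(source: str) -> dict:
    """Count a few structural features of a piece of generated Python code."""
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))
    return {
        "loc": sum(1 for line in source.splitlines() if line.strip()),
        "functions": sum(isinstance(n, ast.FunctionDef) for n in nodes),
        "loops": sum(isinstance(n, (ast.For, ast.While)) for n in nodes),
        "conditionals": sum(isinstance(n, ast.If) for n in nodes),
        "calls": sum(isinstance(n, ast.Call) for n in nodes),
    }
```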
uv run python scripts/generate_latex_tables.py

- Analysis: Model comparison, prompt effectiveness, fractal difficulty
- Output: LaTeX tables, statistical visualizations
- Features: Significance testing, distribution analysis (see the sketch below)
- Formats: Ready for research papers and presentations
- Time: ~1 minute
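As an illustration of the kind of significance testing involved, the sketch below compares two models' per-image similarity scores with a Mann-Whitney U test; the actual test(s) used by generate_latex_tables.py may differ:

```python
# Sketch of a pairwise model comparison on Jaccard scores (test choice is an assumption).
from scipy.stats import mannwhitneyu


def compare_models(scores_a, scores_b):
    """Two-sided Mann-Whitney U test on two lists of per-image similarity scores."""
    statistic, p_value = mannwhitneyu(scores_a, scores_b, alternative="two-sided")
    return {"U": statistic, "p_value": p_value}
```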
After running the complete pipeline, your directory will contain:
FractalBench/
├── fractals/ # 12 fractal implementations
│ ├── sierpinski_gasket.py # Triangle fractal
│ ├── koch_snowflake.py # Classic snowflake
│ └── ...
├── utils/ # Core infrastructure
│ ├── minimal_turtle.py # Turtle graphics implementation
│ ├── minimal_renderer.py # PNG rendering
│ └── fractal_props.py # Depth calculation
├── scripts/ # Pipeline scripts
│ ├── generate_test_set.py # Reference image generation
│ ├── analyze_fractals.py # LLM analysis
│ ├── execute_llm_code.py # Code execution
│ ├── evaluate_images.py # Image comparison
│ ├── analyze_code_complexity.py # Code metrics
│ ├── generate_latex_tables.py # Statistical output
│ └── demo.py # Demo fractal gallery
├── static/ # Static assets
│ └── fractal_gallery.png # Overview image
├── data/ # All benchmark data
│ ├── test_set/ # Reference images (610 total)
│ │ ├── black/ # Black fractals
│ │ ├── blue/ # Blue fractals
│ │ └── ... # 5 colors total
│ ├── analysis_results/ # LLM-generated code
│ │ ├── direct_code/ # Direct prompt strategy
│ │ ├── reason_then_code/ # Reasoning prompt strategy
│ │ └── recursive_focus/ # Recursive prompt strategy
│ └── generated_images/ # LLM-rendered images
└── results/ # Evaluation outputs
├── evaluation_results.json # Similarity metrics
├── latex_tables.tex # Research output
└── code_complexity/ # Code complexity analysis
├── by_prompt/ # Group by prompt type
├── by_fractal/ # Group by fractal type
└── *_summary.json # Complete metrics
If you use FractalBench in your research, please cite:
@inproceedings{ondras2025fractalbench,
title={FractalBench: Diagnosing Visual-Mathematical Reasoning Through Recursive Program Synthesis},
author={Jan Ondras and Marek Suppa},
booktitle={The 5th Workshop on Mathematical Reasoning and AI at NeurIPS 2025},
year={2025},
url={https://openreview.net/forum?id=DxsvO2iHnz}
}

MIT License
