FractalBench: Diagnosing Visual-Mathematical Reasoning Through Recursive Program Synthesis

arXiv · OpenReview · License: MIT

Overview

FractalBench is a research benchmark for evaluating the ability of Vision Language Models to generate fractal code from images. It serves a dual purpose: an educational fractal library and an AI code-generation benchmark. The project provides 12 clean, self-contained fractal implementations, together with a complete evaluation pipeline for systematically measuring how AI models interpret fractal images and generate the corresponding mathematical code.

FractalBench Gallery

Key Features

  • 📚 12 fractals (Sierpiński, Koch, dragon curves, and more) as standalone Python files
  • 🤖 Test set of 610 images across 5 colors and progressive recursion depths
  • 🌐 Evaluate GPT-4o, Claude, Gemini, or Qwen (free) via OpenRouter
  • 📊 Jaccard Index comparison plus 9 code complexity metrics
  • ⚡ Custom turtle graphics in just ~80 lines, with no external dependencies
  • 🔒 LLM code runs sandboxed, with timeouts and auto-imported modules
  • 🔬 Outputs LaTeX tables and statistical tests ready for papers
  • 🛠️ Uses uv, async API calls, and reproducible pipeline stages

Quick Start

Simple Fractal Generation

git clone https://github.com/NaiveNeuron/FractalBench.git
cd FractalBench
uv sync
uv run python fractals/sierpinski_gasket.py  # Creates sierpinski_gasket.png

Full Evaluation Pipeline

# Complete AI evaluation (estimated time: 2-3 hours)
export OPENROUTER_API_KEY='your_key_here'
uv run python scripts/generate_test_set.py      # Generate 610 reference images (~5 min)
uv run python scripts/analyze_fractals.py       # LLM analysis (~45-90 min)
uv run python scripts/execute_llm_code.py       # Execute generated code (~10 min)
uv run python scripts/evaluate_images.py        # Pixel-level evaluation (~5 min)
uv run python scripts/analyze_code_complexity.py # Code complexity analysis (~2-5 min)
uv run python scripts/generate_latex_tables.py  # Statistical analysis (~1 min)

Evaluation Pipeline Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│ Test Set        │    │ LLM Analysis     │    │ Code Execution  │
│ Generation      │───▶│ (VLM → Code)     │───▶│ & Validation    │
│ 610 images      │    │ 3 prompts × 4    │    │ Syntax + Runtime│
└─────────────────┘    │ models = 12 runs │    └─────────────────┘
                       └──────────────────┘              │
                                                         ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│ Results         │    │ Image Evaluation │    │ Generated       │
│ Analysis        │◀───│ & Comparison     │◀───│ Images          │
│ LaTeX Tables    │    │ Jaccard Index    │    │ PNG Output      │
└─────────────────┘    └──────────────────┘    └─────────────────┘

Supported Fractals

Each fractal is implemented as a standalone .py file in /fractals/ (a minimal sketch of the shared pattern follows the list):

  1. cantor_set.py - Classic 1D Cantor set with recursive middle-third removal
  2. cantor_dust.py - 2D version of Cantor set using square subdivision
  3. koch_curve.py - Koch curve and snowflake with equilateral triangle bumps
  4. koch_snowflake.py - Standalone Koch snowflake implementation
  5. sierpinski_gasket.py - Sierpiński triangle with recursive subdivision
  6. sierpinski_carpet.py - 2D Sierpiński carpet with square grid subdivision
  7. sierpinski_pentagon.py - Pentagonal version using golden ratio scaling
  8. heighway_dragon.py - Space-filling dragon curve from paper folding
  9. levy_dragon.py - Lévy dragon with 45° turn patterns
  10. mcworter_pentigree.py - Tree-like fractal with 5-way branching
  11. pythagoras_tree.py - Recursive squares forming tree structure
  12. symmetric_binary_tree.py - Binary tree with symmetric branches
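
All twelve files differ in detail, but they share one shape: a tiny turtle-style drawing surface plus one recursive function. The sketch below is illustrative only; MiniTurtle and its segment list are stand-ins for the repo's ~80-line minimal turtle, which also renders PNG output.

import math

class MiniTurtle:
    """Records line segments; stand-in for the repo's minimal turtle."""
    def __init__(self):
        self.x = self.y = 0.0
        self.heading = 0.0              # degrees, 0 = east
        self.segments = []              # [(x0, y0, x1, y1), ...]

    def forward(self, dist):
        nx = self.x + dist * math.cos(math.radians(self.heading))
        ny = self.y + dist * math.sin(math.radians(self.heading))
        self.segments.append((self.x, self.y, nx, ny))
        self.x, self.y = nx, ny

    def left(self, angle):
        self.heading += angle

def koch(t, length, depth):
    """Koch curve: replace each segment with four at 1/3 scale."""
    if depth == 0:
        t.forward(length)
        return
    for turn in (60, -120, 60, 0):      # the classic _/\_ motif
        koch(t, length / 3, depth - 1)
        t.left(turn)

t = MiniTurtle()
koch(t, 300, 4)
print(len(t.segments))                  # 4**4 = 256 segments

Depth 4 already produces 4**4 = 256 segments; this exponential growth is what makes recursion depth a natural difficulty axis for the benchmark.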

Installation & Dependencies

Prerequisites

  • Python 3.10+
  • uv (recommended) or pip
  • OpenRouter API key (for LLM evaluation)

Quick Setup

git clone https://github.com/NaiveNeuron/FractalBench.git
cd FractalBench
uv sync

Environment Configuration

# Required for LLM evaluation pipeline
export OPENROUTER_API_KEY='your_key_here'

# Optional configuration
export MAX_CONCURRENT_REQUESTS=5    # API rate limiting
export OUTPUT_DIR='custom_output'   # Custom output directory

Usage Modes

Educational Use: Individual Fractals

# Generate single fractals
uv run python fractals/sierpinski_gasket.py   # Creates sierpinski_gasket.png
uv run python fractals/koch_snowflake.py      # Creates koch_snowflake.png
uv run python fractals/heighway_dragon.py     # Creates heighway_dragon.png

# All fractals in test set format
uv run python scripts/generate_test_set.py    # Creates data/test_set/ with 610 images

Research Use: Complete Evaluation Pipeline

# Step 1: Generate reference dataset (610 images, ~5 minutes)
uv run python scripts/generate_test_set.py

# Step 2: LLM analysis (3 prompts × 4 models, ~45-90 minutes)
uv run python scripts/analyze_fractals.py

# Step 3: Execute generated code (~10 minutes)
uv run python scripts/execute_llm_code.py

# Step 4: Pixel-level evaluation (~5 minutes)
uv run python scripts/evaluate_images.py

# Step 5: Code complexity analysis (~2-5 minutes)
uv run python scripts/analyze_code_complexity.py

# Step 6: Statistical analysis and LaTeX output (~1 minute)
uv run python scripts/generate_latex_tables.py

Supported Models & Prompts

LLM Models

Model              Provider   Cost (1M tokens)  Capability
GPT-4o             OpenAI     ~$15              Best vision understanding
Claude 3.7 Sonnet  Anthropic  ~$3               Strong reasoning & code
Gemini 2.5 Flash   Google     ~$0.30            Cost-effective
Qwen 2.5-VL        Qwen       Free              Open-source model

Prompt Strategies

  • direct_code: Skip reasoning, generate code directly from image
  • reason_then_code: Analyze fractal structure, then implement
  • recursive_focus: Emphasize recursive patterns and base cases
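
The exact prompt texts ship with the repository; purely as an illustration of the direct_code style, a prompt of roughly this shape (hypothetical wording):

# Hypothetical wording; the benchmark's real prompts live in the repo.
DIRECT_CODE_PROMPT = (
    "Here is an image of a fractal. Write a complete, self-contained Python "
    "program that reproduces it using turtle-style drawing commands. "
    "Respond with code only, no explanation."
)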

Cost Estimation

  • Full Evaluation (610 images × 3 prompts = 1,830 requests per model): $8-25 depending on model choice
  • Free Tier Available: Qwen model provides no-cost evaluation option

Evaluation Pipeline Details

Stage 1: Test Set Generation

uv run python scripts/generate_test_set.py
  • Output: 610 images (122 per color × 5 colors)
  • Organization: data/test_set/{color}/{fractal}_{depth}_{size}.png
  • Features: Progressive depth sequences, automatic size optimization
  • Time: ~5 minutes

Stage 2: LLM Analysis

uv run python scripts/analyze_fractals.py
  • Input: Select prompt strategy and model(s)
  • Processing: Concurrent API calls with rate limiting
  • Output: data/analysis_results/{prompt}/{model}/{color}/{fractal}.json
  • Features: Progress tracking, error handling, usage statistics
  • Time: 45-90 minutes (varies by model and concurrency)
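
A minimal sketch of this stage's concurrency pattern, assuming OpenRouter's OpenAI-compatible chat-completions endpoint and the openai Python client (the repo's actual client code, prompt plumbing, and model identifiers may differ):

import asyncio, base64, os
from openai import AsyncOpenAI          # OpenRouter speaks the OpenAI API

client = AsyncOpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)
limiter = asyncio.Semaphore(int(os.environ.get("MAX_CONCURRENT_REQUESTS", "5")))

async def analyze(image_path: str, prompt: str, model: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    async with limiter:                 # cap concurrent in-flight requests
        resp = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ]}],
        )
    return resp.choices[0].message.content

The semaphore enforces the MAX_CONCURRENT_REQUESTS limit described under Environment Configuration.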

Stage 3: Code Execution

uv run python scripts/execute_llm_code.py
  • Function: Extract Python code from LLM responses, make executable
  • Safety: Syntax validation, sandboxed execution
  • Output: data/generated_images/ with rendered fractal images
  • Features: Error logging, execution time tracking
  • Time: ~10 minutes
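
The sandboxing idea, as a minimal sketch: syntax-check first, then run the code in a child process under a hard timeout. This illustrates the pattern only; the repo's actual sandbox also auto-imports modules and collects the rendered PNG.

import subprocess, sys, tempfile
from pathlib import Path

def run_generated(code: str, timeout: int = 30) -> tuple[bool, str]:
    """Syntax-check, then execute untrusted code in an isolated process."""
    try:
        compile(code, "<llm>", "exec")  # cheap syntax validation up front
    except SyntaxError as e:
        return False, f"syntax error: {e}"
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "snippet.py"
        script.write_text(code)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=tmp,                # any output files land in the temp dir
                capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False, "timed out"
        return proc.returncode == 0, proc.stderr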

Stage 4: Image Evaluation

uv run python scripts/evaluate_images.py
  • Method: Pixel-level comparison using Jaccard Index (IoU)
  • Processing: Binary image conversion, similarity calculation
  • Output: results/evaluation_results.json with detailed metrics
  • Metrics: Success rates, similarity scores, failure analysis
  • Time: ~5 minutes
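
The metric itself is simple. A minimal sketch of a Jaccard (IoU) computation over binarized images, assuming a dark fractal on a light background and equally sized inputs (the repo's threshold and preprocessing may differ):

import numpy as np
from PIL import Image

def jaccard_index(ref_path: str, gen_path: str, thresh: int = 200) -> float:
    """IoU of the 'inked' pixel sets of two equally sized images."""
    ref = np.asarray(Image.open(ref_path).convert("L")) < thresh
    gen = np.asarray(Image.open(gen_path).convert("L")) < thresh
    union = np.logical_or(ref, gen).sum()
    if union == 0:
        return 1.0                      # two blank images agree perfectly
    return np.logical_and(ref, gen).sum() / union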

Stage 5: Code Complexity Analysis

Analyzes structural complexity of LLM-generated code across various metrics (LOC, turtle calls, function definitions, loops, conditionals, etc.).

uv run python scripts/analyze_code_complexity.py
  • Output: JSON summaries + PDF visualizations in results/code_complexity/
  • Metrics: Lines of code, function calls, control flow structures, etc.
  • Time: ~2-5 minutes
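
Counts like these fall out naturally from Python's ast module. A sketch (the TURTLE_CALLS set and metric names are illustrative, not the repo's exact definitions):

import ast

# Illustrative call set; the benchmark's own definition may differ.
TURTLE_CALLS = {"forward", "backward", "left", "right", "goto", "penup", "pendown"}

def complexity_metrics(code: str) -> dict:
    """Count simple structural features of a Python source string."""
    nodes = list(ast.walk(ast.parse(code)))
    return {
        "loc": sum(1 for line in code.splitlines() if line.strip()),
        "functions": sum(isinstance(n, ast.FunctionDef) for n in nodes),
        "loops": sum(isinstance(n, (ast.For, ast.While)) for n in nodes),
        "conditionals": sum(isinstance(n, ast.If) for n in nodes),
        # attribute-style calls such as t.forward(...)
        "turtle_calls": sum(
            isinstance(n, ast.Call)
            and isinstance(n.func, ast.Attribute)
            and n.func.attr in TURTLE_CALLS
            for n in nodes
        ),
    }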

Stage 6: Statistical Analysis

uv run python scripts/generate_latex_tables.py
  • Analysis: Model comparison, prompt effectiveness, fractal difficulty
  • Output: LaTeX tables, statistical visualizations
  • Features: Significance testing, distribution analysis
  • Formats: Ready for research papers and presentations
  • Time: ~1 minute
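
As a sketch of the kind of significance test this stage might run, here is a non-parametric comparison of two models' per-image similarity scores (the JSON keys and model identifiers are hypothetical; the real schema of evaluation_results.json may differ):

import json
from scipy.stats import mannwhitneyu

with open("results/evaluation_results.json") as f:
    records = json.load(f)              # hypothetical: flat list of records

def scores(model):
    # "model" and "jaccard" keys are assumed, not the repo's actual schema
    return [r["jaccard"] for r in records if r["model"] == model]

# Are the two score samples drawn from the same distribution?
stat, p = mannwhitneyu(scores("gpt-4o"), scores("qwen-2.5-vl"),
                       alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.4g}")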

Directory Structure

After running the complete pipeline, your directory will contain:

FractalBench/
├── fractals/                    # 12 fractal implementations
│   ├── sierpinski_gasket.py     # Triangle fractal
│   ├── koch_snowflake.py        # Classic snowflake
│   └── ...
├── utils/                       # Core infrastructure
│   ├── minimal_turtle.py        # Turtle graphics implementation
│   ├── minimal_renderer.py      # PNG rendering
│   └── fractal_props.py         # Depth calculation
├── scripts/                     # Pipeline scripts
│   ├── generate_test_set.py     # Reference image generation
│   ├── analyze_fractals.py      # LLM analysis
│   ├── execute_llm_code.py      # Code execution
│   ├── evaluate_images.py       # Image comparison
│   ├── analyze_code_complexity.py # Code metrics
│   ├── generate_latex_tables.py # Statistical output
│   └── demo.py                  # Demo fractal gallery
├── static/                      # Static assets
│   └── fractal_gallery.png      # Overview image
├── data/                        # All benchmark data
│   ├── test_set/                # Reference images (610 total)
│   │   ├── black/               # Black fractals
│   │   ├── blue/                # Blue fractals
│   │   └── ...                  # 5 colors total
│   ├── analysis_results/        # LLM-generated code
│   │   ├── direct_code/         # Direct prompt strategy
│   │   ├── reason_then_code/    # Reasoning prompt strategy
│   │   └── recursive_focus/     # Recursive prompt strategy
│   └── generated_images/        # LLM-rendered images
└── results/                     # Evaluation outputs
    ├── evaluation_results.json  # Similarity metrics
    ├── latex_tables.tex         # Research output
    └── code_complexity/         # Code complexity analysis
        ├── by_prompt/           # Group by prompt type
        ├── by_fractal/          # Group by fractal type
        └── *_summary.json       # Complete metrics

Citation

If you use FractalBench in your research, please cite:

@inproceedings{ondras2025fractalbench,
  title={FractalBench: Diagnosing Visual-Mathematical Reasoning Through Recursive Program Synthesis},
  author={Jan Ondras and Marek Suppa},
  booktitle={The 5th Workshop on Mathematical Reasoning and AI at NeurIPS 2025},
  year={2025},
  url={https://openreview.net/forum?id=DxsvO2iHnz}
}

License

MIT License
