FractalBench: Recursive Program Synthesis Benchmark
- FractalBench is a benchmark that uses Iterated Function Systems (IFS) to generate fractal images and tests whether models can synthesize the recursive programs behind them from visual input alone.
- It evaluates models on syntactic validity, functional correctness, and abstraction by measuring execution success, IoU, and recursion depth.
- The design bridges visual perception with mathematical abstraction, exposing limitations in multi-scale inference and branch-specific recursive reasoning.
FractalBench is a contamination-resistant evaluation benchmark designed to diagnose the visual-mathematical reasoning capabilities of multimodal large language models (MLLMs) through the task of recursive program synthesis from images. It targets the core challenge of bridging visual perception with mathematical abstraction by requiring models to infer executable recursive rules that generate complex fractal structures, given only synthetic images as input. The benchmark leverages the theory of Iterated Function Systems (IFS), a formalism for fractal generation, to create objective, code-based test cases that assess models on symbolic generalization, semantic correctness, and abstraction, with a particular focus on their handling of self-similarity and branching recursion (Ondras et al., 9 Nov 2025).
1. Formal Basis: Iterated Function Systems and Fractal Program Synthesis
FractalBench is grounded in the mathematical formalism of Iterated Function Systems (IFS). An IFS on $\mathbb{R}^2$ is a finite set of contractive similarities $\{f_i\}_{i=1}^{N}$, each of the form

$$f_i(x) = r_i R_i x + t_i,$$

where $r_i$ is the contraction ratio, $R_i$ is a rotation or reflection matrix, and $t_i$ is a translation vector. The contraction property is imposed by

$$0 < r_i < 1.$$

Hutchinson's theorem guarantees the existence of a unique nonempty compact attractor

$$A = \bigcup_{i=1}^{N} f_i(A),$$

representing the fractal set. The practical construction of fractals utilizes recursive application of these maps, which may also be sampled stochastically via the "chaos game" algorithm.
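The chaos game mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's rendering code: it samples the Sierpiński gasket using three maps with $r_i = 1/2$ and identity rotation, and the vertex coordinates, starting point, burn-in length, and function names (`apply_map`, `chaos_game`) are all illustrative assumptions.

```python
import random

# Assumed triangle vertices for the gasket (illustrative coordinates only).
VERTICES = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.75 ** 0.5)]
RATIO = 0.5  # contraction ratio r_i shared by all three maps


def apply_map(i, point):
    """Apply f_i(x) = RATIO * x + (1 - RATIO) * v_i, a similarity that fixes vertex v_i."""
    vx, vy = VERTICES[i]
    x, y = point
    return (RATIO * x + (1 - RATIO) * vx, RATIO * y + (1 - RATIO) * vy)


def chaos_game(n_points, seed=0, burn_in=20):
    """Sample points near the attractor by applying randomly chosen maps repeatedly."""
    rng = random.Random(seed)
    point = (0.3, 0.3)  # any starting point inside the triangle works
    samples = []
    for step in range(n_points + burn_in):
        point = apply_map(rng.randrange(len(VERTICES)), point)
        if step >= burn_in:  # discard early iterates that are still far from the attractor
            samples.append(point)
    return samples


points = chaos_game(5000)
```

Because every map is a contraction toward a vertex, the iterates stay inside the triangle and accumulate on the attractor regardless of the starting point; plotting `points` gives the familiar gasket.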
Synthesis of recursive programs for fractals requires models to recognize and implement the underlying IFS structure given only rendered images, thus explicitly testing their ability to "abstract symbolic rules from visual patterns," a central problem in mathematical AI (Ondras et al., 9 Nov 2025).
2. Canonical Fractal Types and Mathematical Specifications
FractalBench defines twelve canonical fractals, grouped by mathematical type and characterized by their explicit IFS maps:
| Type | Examples | Characteristic Maps |
|---|---|---|
| Cantor-type | Cantor set, Cantor dust | $x \mapsto x/3$, $x \mapsto x/3 + 2/3$, etc. |
| Koch-type | Koch curve, Koch snowflake | Four maps with $r = 1/3$, including rotation by $\pm 60^\circ$ |
| Sierpiński-type | Gasket, carpet, pentagon | Gasket: $f_i(x) = \tfrac{1}{2}x + t_i$, $i = 1, 2, 3$ |
| Dragon-curves | Heighway dragon, Lévy C-curve | Maps with $r = 1/\sqrt{2}$, rotations by multiples of $45^\circ$ |
| Tree fractals | McWorter’s pentigree, Pythagoras, symmetric binary | Recursive, often branching structure with varying $r$, $\theta$ |
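Each fractal is parameterized so that image features remain visually resolvable at all recursion depths. For example, for a chosen minimal resolvable feature size $\epsilon$ (measured relative to the base figure size) and scale factor $r$, the maximal drawing depth is given by

$$d_{\max} = \left\lfloor \frac{\log \epsilon}{\log r} \right\rfloor,$$

the largest depth $d$ for which $r^{d} \geq \epsilon$.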
This ensures evaluated images fully exercise both the perceptual and abstraction limits of models across a spectrum of self-similar and branching structures.
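The depth bound can be illustrated with a short helper. This is a sketch under stated assumptions rather than the benchmark's own code: it assumes the feature size at depth 0 equals the base size and shrinks by the scale factor at every level, and the name `max_depth` and its parameters are hypothetical. It iterates instead of taking logarithms to avoid floating-point rounding at exact boundaries.

```python
def max_depth(size, ratio, min_feature):
    """Largest depth d such that size * ratio**d is still >= min_feature.

    Assumes 0 < ratio < 1 and that features shrink by `ratio` per recursion level.
    """
    if not 0 < ratio < 1:
        raise ValueError("ratio must lie strictly between 0 and 1")
    depth = 0
    feature = size
    # Descend one level at a time while the next level stays resolvable.
    while feature * ratio >= min_feature:
        feature *= ratio
        depth += 1
    return depth


# Toy numbers: a 1024-pixel base, halving per level, 4-pixel minimum feature.
depth_limit = max_depth(1024, 0.5, 4)
```

With these toy inputs the feature sizes run 512, 256, and so on down to 4 pixels, so the helper stops at depth 8.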
3. Dataset Construction, Task Protocol, and Prompting Strategies
The benchmark comprises 610 unique images (122 renderings, each in five color variants) rendered at 1024×1024 px and 128 DPI, with recursion depth varied to ensure feature clarity. The color randomization protocol prevents models from exploiting canonical color patterns, thus ensuring "zero-contamination" evaluation.
Each test instance presents a single image, with three prompt styles:
- Direct Code Generation (DCG): explicitly requests Python code.
- Reasoning then Code (RTC): requires natural language derivation followed by code.
- Recursive Structure Focus (RSF): emphasizes recursion, base cases, and parameter updates.
All generated code must utilize a restricted MinimalTurtle graphics API and produce a self-contained script of the form:
```python
if __name__ == '__main__':
    turtle = your_fractal(depth=…, size=…)
    render_turtle(turtle, 'out.png')
```
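To make the expected script shape concrete, here is a sketch of what a recursive solution for the Koch curve might look like. The MinimalTurtle API is not specified here, so `MiniTurtle` below is a hypothetical stand-in that merely records line segments; its methods (`forward`, `left`, `right`) and the functions `draw_koch` and `koch_curve` are assumptions for illustration, and rendering is omitted.

```python
import math


class MiniTurtle:
    """Hypothetical stand-in for the benchmark's turtle; it only records segments."""

    def __init__(self):
        self.x, self.y, self.heading = 0.0, 0.0, 0.0  # heading in degrees
        self.segments = []

    def forward(self, distance):
        rad = math.radians(self.heading)
        nx = self.x + distance * math.cos(rad)
        ny = self.y + distance * math.sin(rad)
        self.segments.append(((self.x, self.y), (nx, ny)))
        self.x, self.y = nx, ny

    def left(self, angle):
        self.heading = (self.heading + angle) % 360.0

    def right(self, angle):
        self.left(-angle)


def draw_koch(t, depth, size):
    """Draw one Koch segment: four sub-segments at scale 1/3 with turns of +60, -120, +60."""
    if depth == 0:  # base case: a straight line
        t.forward(size)
        return
    for turn in (60, -120, 60, 0):
        draw_koch(t, depth - 1, size / 3)
        t.left(turn)


def koch_curve(depth, size):
    """Entry point in the style of `your_fractal(depth, size)`; returns the turtle."""
    t = MiniTurtle()
    draw_koch(t, depth, size)
    return t


turtle = koch_curve(depth=3, size=300.0)
```

The recursion mirrors the four-map IFS for the Koch curve: each call spawns four copies at one third of the size, so depth 3 yields 4³ = 64 segments and the pen ends at distance `size` along the starting direction.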
4. Objective Evaluation Metrics and Structural Diagnostics
Model outputs are scored by strict black-box criteria:
- Syntactic validity (Run %): percentage of scripts that execute successfully within 30 seconds in a controlled sandbox.
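- Functional correctness (Accuracy, Acc %): for runnable code, the fraction of outputs matching the ground-truth image with Intersection over Union $\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|} \geq 0.95$, where $P$ and $G$ are the predicted and ground-truth pixel sets.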
- Overall success: product of Run % and Acc %.
Additionally, IoU distributions are stratified by fractal family, and generated code is analyzed for recursion depth, code complexity, and the presence of correct geometric and recursive abstractions (e.g., scale invariance, branching logic).
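The three headline metrics can be sketched as follows. This is an illustrative reading of the definitions above, not the benchmark's scoring code: masks are assumed to be equally sized binary grids, a failed run is represented by `None`, the empty-union convention is an arbitrary choice, and the names `iou` and `score` are hypothetical. The numbers in the usage lines are toy values, not benchmark results.

```python
def iou(pred, truth):
    """Intersection over Union of two equally sized binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention (an assumption): two empty masks count as a perfect match.
    return inter / union if union else 1.0


def score(results, threshold=0.95):
    """Return (run rate, accuracy among runnable scripts, overall success).

    `results` holds one IoU value per script that ran, or None if it failed to run.
    """
    ran = [r for r in results if r is not None]
    run_rate = len(ran) / len(results)
    accuracy = sum(r >= threshold for r in ran) / len(ran) if ran else 0.0
    return run_rate, accuracy, run_rate * accuracy


# Toy example: four submissions, one of which failed to execute.
toy_iou = iou([[1, 1], [0, 0]], [[1, 0], [1, 0]])
run_rate, accuracy, overall = score([0.97, 0.5, None, 0.2])
```

Taking the product of run rate and accuracy means overall success equals the share of all submissions, runnable or not, that clear the IoU threshold.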
5. Empirical Findings and Fractal-Type Hierarchy
Extensive evaluation of four state-of-the-art MLLMs (Gemini 2.5 Flash, Claude 3.7 Sonnet, Qwen 2.5-VL, and GPT-4o) shows a pronounced gap between basic code generation capabilities and genuine mathematical abstraction:
- Syntactic validity: 76% of scripts run successfully.
- Semantic correctness: Only 4% achieve IoU ≥ 0.95, capturing true fractal structure.
Aggregated performance metrics:
| Model | Overall Accuracy (%) | Mean IoU |
|---|---|---|
| Gemini 2.5 Flash | 14.8 | 0.204 |
| Claude 3.7 Sonnet | 5.2 | 0.095 |
| Qwen 2.5-VL | 5.0 | 0.082 |
| GPT-4o | 3.5 | 0.088 |
Fractal family analysis reveals a clear hierarchy of difficulty:
- Highest success: Koch snowflake (20.8%), Sierpiński carpet (18.5%), Koch curve (17.2%).
- Moderate: Sierpiński pentagon (8.5%), Cantor dust (6.4%).
- Lowest: Tree-like structures (McWorter’s pentigree: 4.8%, Pythagoras tree: 1.8%, symmetric binary tree: 0.9%), dragon curves (Heighway: 1.6%, Lévy: 1.9%).
Systematic outcome patterns include:
- Success for local geometric transformation: Models outperform on fractals characterized by local similarity with simple scaling and rotation.
- Failure for multi-scale abstraction and branching recursion: Most models fail to infer correct recursion depth, parameter passing, or maintain independent state across branches, often degenerating to linear or hardcoded motifs.
- Prompting effects: Direct code generation (DCG) prompts outperform reasoning-based (RTC, RSF), suggesting that explicit chain-of-thought may actually disrupt the focus needed for correct geometric coding.
- Abstraction phase transition: Across recursion depths, there is an abrupt code length reduction when models discover recursive patterns, marking a threshold in abstraction ability.
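The branching failure mode is easier to see next to a version that gets it right. The sketch below draws a symmetric binary tree by passing position and heading as arguments, so each branch works on its own copy of the state and cannot corrupt its sibling. The function name, the 30° branching angle, and the 0.7 length ratio are illustrative assumptions, not benchmark parameters.

```python
import math


def binary_tree(x, y, heading, length, depth, angle=30.0, ratio=0.7, out=None):
    """Collect the segments of a symmetric binary tree.

    Each recursive call receives its own (x, y, heading), so sibling branches
    never share mutable drawing state.
    """
    if out is None:
        out = []
    rad = math.radians(heading)
    nx = x + length * math.cos(rad)
    ny = y + length * math.sin(rad)
    out.append(((x, y), (nx, ny)))
    if depth > 0:
        # Both children start from the same branch tip with mirrored headings.
        for delta in (angle, -angle):
            binary_tree(nx, ny, heading + delta, length * ratio,
                        depth - 1, angle, ratio, out)
    return out


segments = binary_tree(0.0, 0.0, 90.0, 100.0, depth=4)
```

A turtle-based version would need to save and restore position and heading around each branch to get the same effect; omitting that restore step is one way a program collapses into a single linear path instead of a tree.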
6. Limitations, Diagnostic Value, and Comparative Analysis
FractalBench highlights a pronounced 76% to 4% syntactic-semantic gap in current MLLMs, emphasizing that the ability to combine local geometric operations does not entail true mathematical abstraction or recursive synthesis. The contamination-resistant design, strict functional metrics, and detailed structural diagnostics provide a unique lens through which to characterize the boundaries of model reasoning.
Notable limitations uncovered include:
- Inability to synthesize branch-specific recursion or multi-agent state without explicit tree-structured computation graphs.
- Systematic mis-estimation of geometric parameters (scale factor $r$, rotation angle $\theta$) in multi-scale settings.
- General difficulty mapping from visual self-similarity to symbolic generative rules purely from raw images.
FractalBench further demonstrates that code complexity and prompt engineering alone cannot compensate for underlying abstraction deficits in current MLLMs.
7. Implications for Recursive Program Synthesis and Directions for Future Research
The benchmark points toward a pressing research agenda in program synthesis and symbolic reasoning for multimodal AI:
- Tree-structured computation: Incorporation of architectures or modules explicitly maintaining independent state across recursive branches.
- Robust geometric inference: Improved model components for parameter induction from visual data.
- Hybrid neuro-symbolic pipelines: Joint approaches that fuse deep perception with explicit IFS or L-system discovery mechanisms.
- Refined diagnostics: Development of finer-grained metrics (e.g., branch-count accuracy, angle-matching) to inform both model training and evaluation.
These conclusions extend beyond fractals to other areas involving symbolic rule inference from data, including L-systems, formal languages, and branching physical processes. FractalBench, by offering robust, contamination-resistant, and strictly quantified diagnostics, delineates clear limitations in current models and prescribes a trajectory for the realization of abstract visual-mathematical reasoning systems (Ondras et al., 9 Nov 2025).