FractalBench: Recursive Program Synthesis Benchmark

Updated 31 March 2026

FractalBench is a benchmark that uses fractals generated by Iterated Function Systems (IFS) to test recursive program synthesis from visual inputs.
It evaluates models on syntactic validity, functional correctness, and abstraction by measuring execution success, IoU, and recursion depth.
The design bridges visual perception with mathematical abstraction, exposing limitations in multi-scale inference and branch-specific recursive reasoning.

FractalBench is a contamination-resistant evaluation benchmark designed to diagnose the visual-mathematical reasoning capabilities of multimodal large language models (MLLMs) through the task of recursive program synthesis from images. It targets the core challenge of bridging visual perception with mathematical abstraction: models must infer executable recursive rules that generate complex fractal structures, given only synthetic images as input. The benchmark builds on the theory of Iterated Function Systems (IFS), a formalism for fractal generation, to create objective, code-based test cases that assess models on symbolic generalization, semantic correctness, and abstraction, with a particular focus on their handling of self-similarity and branching recursion (Ondras et al., 9 Nov 2025).

1. Formal Basis: Iterated Function Systems and Fractal Program Synthesis

FractalBench is grounded in the mathematical formalism of Iterated Function Systems (IFS). An IFS on $\mathbb{R}^d$ is a finite set of contractive similarities $f_i : \mathbb{R}^d \rightarrow \mathbb{R}^d$, each of the form

$x_{k+1} = f_i(x_k) = s_i R_i x_k + t_i$

where $0 < s_i < 1$ is the contraction ratio, $R_i \in O(d)$ is an orthogonal matrix (a rotation or reflection), and $t_i \in \mathbb{R}^d$ is a translation vector. Because $R_i$ preserves distances, each map is a contraction:

$\|f_i(x) - f_i(y)\| = s_i \|x - y\| < \|x - y\| \quad \forall x \neq y.$

Hutchinson's theorem guarantees the existence of a unique nonempty compact attractor

$K = \bigcup_{i=1}^m f_i(K),$

representing the fractal set. The practical construction of fractals utilizes recursive application of these maps, which may also be sampled stochastically via the "chaos game" algorithm.
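The stochastic sampling mentioned above can be illustrated with a minimal sketch of the chaos game for the Sierpiński gasket. The vertex coordinates, starting point, burn-in length, and function names are illustrative assumptions and are not taken from the benchmark.

```python
import random

# Vertices of an equilateral triangle (an illustrative choice, not from the benchmark).
VERTICES = [(0.0, 0.0), (1.0, 0.0), (0.5, 3 ** 0.5 / 2)]


def gasket_map(i, point):
    """Apply f_i(x) = (x - v_i) / 2 + v_i, a similarity with ratio 1/2 toward vertex v_i."""
    vx, vy = VERTICES[i]
    x, y = point
    return ((x - vx) / 2 + vx, (y - vy) / 2 + vy)


def chaos_game(n_points, burn_in=20, seed=0):
    """Sample points near the attractor by applying randomly chosen maps."""
    rng = random.Random(seed)
    point = (0.3, 0.3)  # arbitrary starting point inside the triangle
    samples = []
    for step in range(n_points + burn_in):
        point = gasket_map(rng.randrange(len(VERTICES)), point)
        if step >= burn_in:  # discard early iterates that may lie far from the attractor
            samples.append(point)
    return samples


if __name__ == "__main__":
    pts = chaos_game(10_000)
    print(len(pts), pts[:3])
```

Because every map halves distances, the iterates approach the attractor geometrically, which is why a short burn-in is usually enough in a sketch like this.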

Synthesis of recursive programs for fractals requires models to recognize and implement the underlying IFS structure given only rendered images, thus explicitly testing their ability to "abstract symbolic rules from visual patterns," a central problem in mathematical AI (Ondras et al., 9 Nov 2025).

2. Canonical Fractal Types and Mathematical Specifications

FractalBench defines twelve canonical fractals, grouped by mathematical type and characterized by their explicit IFS maps:

Type	Examples	Characteristic Maps
Cantor-type	Cantor set, Cantor dust	$f_1(x) = \frac{1}{3}x$, $f_2(x) = \frac{1}{3}x+\frac{2}{3}$, etc.
Koch-type	Koch curve, Koch snowflake	Four maps with $r=1/3$, including rotation by $\pm\pi/3$
Sierpiński-type	Gasket, carpet, pentagon	Gasket: $f_i(x) = \frac{1}{2}(x-v_i)+v_i$
Dragon curves	Heighway dragon, Lévy C-curve	Maps with $r=1/\sqrt{2}$, rotations by $\pm\pi/4$
Tree fractals	McWorter’s pentigree, Pythagoras, symmetric binary	Recursive, often branching structure with varying $r$, $\theta$
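As a worked instance of one table row, the following sketch writes the four Koch-curve maps ($r=1/3$, rotations by $0$ and $\pm\pi/3$) as complex-valued similarities on the unit segment and applies them repeatedly. The complex-number encoding, the translation values, and the names `KOCH_MAPS` and `koch_segments` are illustrative assumptions rather than the benchmark's own parameterization.

```python
import cmath
import math

# Each map is f(z) = (1/3) * exp(i * theta) * z + t, stored as (theta, t).
APEX = complex(0.5, math.sqrt(3) / 6)  # tip of the triangular bump
KOCH_MAPS = [
    (0.0, 0 + 0j),             # first third of the segment
    (math.pi / 3, 1 / 3 + 0j),  # rising side of the bump
    (-math.pi / 3, APEX),       # falling side of the bump
    (0.0, 2 / 3 + 0j),          # last third of the segment
]


def apply_map(theta, t, z):
    """Scale by 1/3, rotate by theta, then translate by t."""
    return cmath.exp(1j * theta) * z / 3 + t


def koch_segments(depth):
    """Return the ordered (start, end) segments after `depth` applications of the IFS."""
    segments = [(0 + 0j, 1 + 0j)]
    for _ in range(depth):
        segments = [
            (apply_map(theta, t, a), apply_map(theta, t, b))
            for theta, t in KOCH_MAPS
            for a, b in segments
        ]
    return segments


if __name__ == "__main__":
    segs = koch_segments(3)
    print(len(segs), sum(abs(b - a) for a, b in segs))
```

Each application multiplies the number of segments by four and divides their length by three, so the total curve length grows by a factor of $4/3$ per level.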

Each fractal is parameterized so that image features remain visually resolvable at all recursion depths. For example, for a chosen minimal resolvable feature size $s_0$ (relative to the depth-0 feature size) and scale factor $0 < r < 1$, features at depth $d$ have relative size $r^d$, so the condition $r^d \geq s_0$ gives the maximal drawing depth

$d_{max} = \left\lfloor \frac{\ln(1/s_0)}{\ln(1/r)} \right\rfloor.$

This ensures evaluated images fully exercise both the perceptual and abstraction limits of models across a spectrum of self-similar and branching structures.
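The depth bound can be checked numerically with a short sketch. It assumes that $s_0$ is expressed as a fraction of the depth-0 feature size and that features shrink by the factor $r$ at every level, so the requirement $r^d \geq s_0$ gives $d \leq \ln(1/s_0)/\ln(1/r)$; the function name `max_depth` is hypothetical.

```python
import math


def max_depth(s0, r):
    """Largest integer d with r**d >= s0, assuming 0 < s0 <= 1 and 0 < r < 1."""
    if not (0 < s0 <= 1 and 0 < r < 1):
        raise ValueError("expected 0 < s0 <= 1 and 0 < r < 1")
    return math.floor(math.log(1 / s0) / math.log(1 / r))


if __name__ == "__main__":
    # Illustrative values: features must stay above 1% of the base size.
    for r in (1 / 3, 1 / 2, 1 / math.sqrt(2)):
        print(round(r, 4), max_depth(0.01, r))
```

Smaller scale factors exhaust the resolvable range sooner, which is why fractals with $r=1/3$ admit fewer drawable levels than those with $r=1/\sqrt{2}$ under the same threshold.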

3. Dataset Construction, Task Protocol, and Prompting Strategies

The benchmark comprises 610 unique images (122 base renderings, each in five color variants) rendered at 1024×1024 px and 128 DPI, with recursion depth varied so that features remain clear. Color randomization prevents models from exploiting canonical color patterns, supporting "zero-contamination" evaluation.

Each test instance presents a single image, with three prompt styles:

Direct Code Generation (DCG): explicitly requests Python code.
Reasoning then Code (RTC): requires natural language derivation followed by code.
Recursive Structure Focus (RSF): emphasizes recursion, base cases, and parameter updates.

All generated code must utilize a restricted MinimalTurtle graphics API and produce a self-contained script of the form:

if __name__=='__main__':
    turtle = your_fractal(depth=…,size=…)
    render_turtle(turtle,'out.png')

No external geometry or drawing libraries are permitted. The full evaluation comprises 7,320 model runs: each of the 610 images (which already include the five color variants) is presented to four MLLMs under three prompt styles (610 × 4 × 3 = 7,320).
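To make the expected shape of a submission concrete, the sketch below draws a Koch curve with a recursive function. The MinimalTurtle interface is not specified here, so `SketchTurtle` is a hypothetical stand-in that only records line segments, its method names (`forward`, `left`, `right`) are assumptions, and the rendering step is omitted.

```python
import math


class SketchTurtle:
    """Hypothetical stand-in for MinimalTurtle; it records segments instead of drawing."""

    def __init__(self):
        self.x, self.y, self.heading = 0.0, 0.0, 0.0  # heading in degrees
        self.segments = []

    def forward(self, distance):
        rad = math.radians(self.heading)
        nx = self.x + distance * math.cos(rad)
        ny = self.y + distance * math.sin(rad)
        self.segments.append(((self.x, self.y), (nx, ny)))
        self.x, self.y = nx, ny

    def left(self, angle):
        self.heading += angle

    def right(self, angle):
        self.heading -= angle


def koch(turtle, depth, size):
    """Base case draws a line; recursive case draws four copies at one third the size."""
    if depth == 0:
        turtle.forward(size)
        return
    for turn in (60, -120, 60, 0):  # turns applied after each of the four sub-curves
        koch(turtle, depth - 1, size / 3)
        turtle.left(turn)


def koch_curve(depth=3, size=300.0):
    turtle = SketchTurtle()
    koch(turtle, depth, size)
    return turtle


if __name__ == "__main__":
    t = koch_curve(depth=3, size=300.0)
    print(len(t.segments), round(t.x, 6), round(t.y, 6))
```

The three ingredients emphasized by the RSF prompt style appear explicitly: a base case (`depth == 0`), a recursive call, and a parameter update (`size / 3`).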

4. Objective Evaluation Metrics and Structural Diagnostics

Model outputs are scored by strict black-box criteria:

Syntactic validity (Run %): percentage of scripts that execute successfully within 30 seconds in a controlled sandbox.
Functional correctness (Accuracy, Acc %): among runnable scripts, the percentage whose rendered output matches the ground-truth image, where a match requires an Intersection over Union of at least 0.95:

$\mathrm{IoU} = \frac{|\mathcal{B}_{\mathrm{gt}} \cap \mathcal{B}_{\mathrm{out}}|}{|\mathcal{B}_{\mathrm{gt}} \cup \mathcal{B}_{\mathrm{out}}|} \geq 0.95.$

Overall success: product of Run % and Acc %.
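The three scores above can be sketched in a few lines of pure Python. The mask representation (nested lists of 0/1), the convention that two empty masks have IoU 1.0, and the names `iou` and `summarize` are assumptions made for illustration; the benchmark's actual scoring code is not shown in this article.

```python
def iou(mask_gt, mask_out):
    """Intersection over Union of two equally sized binary masks (nested lists of 0/1)."""
    inter = union = 0
    for row_gt, row_out in zip(mask_gt, mask_out):
        for g, o in zip(row_gt, row_out):
            inter += 1 if (g and o) else 0
            union += 1 if (g or o) else 0
    # Assumption: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def summarize(results, threshold=0.95):
    """Aggregate scores; `results` holds an IoU per script, or None if the script failed to run."""
    ran = [r for r in results if r is not None]
    run_rate = len(ran) / len(results)
    acc = sum(r >= threshold for r in ran) / len(ran) if ran else 0.0
    return {"run": run_rate, "acc": acc, "overall": run_rate * acc}


if __name__ == "__main__":
    gt = [[1, 1], [0, 0]]
    out = [[1, 0], [1, 0]]
    print(iou(gt, out))
    print(summarize([1.0, 0.5, None, 0.96]))
```

Because accuracy is computed only over runnable scripts, multiplying it by the run rate yields the fraction of all attempts that both execute and match the ground truth.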

Additionally, IoU distributions are stratified by fractal family, and generated code is analyzed for recursion depth, code complexity, and the presence of correct geometric and recursive abstractions (e.g., scale invariance, branching logic).

5. Empirical Findings and Fractal-Type Hierarchy

Extensive evaluation of four state-of-the-art MLLMs—Gemini 2.5 Flash, Claude 3.7 Sonnet, Qwen 2.5-VL, and GPT-4o—shows a pronounced gap between basic code generation capabilities and genuine mathematical abstraction:

Syntactic validity: 76% of scripts run successfully.
Semantic correctness: Only 4% achieve IoU ≥ 0.95, capturing true fractal structure.

Aggregated performance metrics:

Model	Overall Accuracy (%)	Mean IoU
Gemini 2.5 Flash	14.8	0.204
Claude 3.7 Sonnet	5.2	0.095
Qwen 2.5-VL	5.0	0.082
GPT-4o	3.5	0.088

Fractal family analysis reveals a clear hierarchy of difficulty:

Highest success: Koch snowflake (20.8%), Sierpiński carpet (18.5%), Koch curve (17.2%).
Moderate: Sierpiński pentagon (8.5%), Cantor dust (6.4%).
Lowest: Tree-like structures (McWorter’s pentigree: 4.8%, Pythagoras tree: 1.8%, symmetric binary tree: 0.9%), dragon curves (Heighway: 1.6%, Lévy: 1.9%).

Systematic outcome patterns include:

Success for local geometric transformation: Models outperform on fractals characterized by local similarity with simple scaling and rotation.
Failure for multi-scale abstraction and branching recursion: Most models fail to infer correct recursion depth, parameter passing, or maintain independent state across branches, often degenerating to linear or hardcoded motifs.
Prompting effects: Direct code generation (DCG) prompts outperform reasoning-based (RTC, RSF), suggesting that explicit chain-of-thought may actually disrupt the focus needed for correct geometric coding.
Abstraction phase transition: Across recursion depths, there is an abrupt code length reduction when models discover recursive patterns, marking a threshold in abstraction ability.
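The branching failure mode described above concerns keeping each branch's state independent. A minimal sketch of one way to do this for a symmetric binary tree is to pass position, heading, and length explicitly into every recursive call, so that no branch can disturb a sibling. The angle, ratio, and function name are illustrative assumptions, not benchmark parameters.

```python
import math


def binary_tree(x, y, heading, length, depth, angle=30.0, ratio=0.6):
    """Return the segments of a symmetric binary tree; heading is in degrees."""
    rad = math.radians(heading)
    end_x = x + length * math.cos(rad)
    end_y = y + length * math.sin(rad)
    segments = [((x, y), (end_x, end_y))]
    if depth > 0:
        # Both children start from the same branch tip with their own heading,
        # so the left subtree cannot alter the state seen by the right subtree.
        for turn in (angle, -angle):
            segments += binary_tree(
                end_x, end_y, heading + turn, length * ratio, depth - 1, angle, ratio
            )
    return segments


if __name__ == "__main__":
    segs = binary_tree(0.0, 0.0, 90.0, 1.0, depth=4)
    print(len(segs))
```

A stateful turtle would need to save and restore position and heading around each recursive call to get the same effect; omitting that restoration produces the linear, chained motifs noted in the findings.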

6. Limitations, Diagnostic Value, and Comparative Analysis

FractalBench highlights a pronounced 76% to 4% syntactic-semantic gap in current MLLMs, emphasizing that the ability to combine local geometric operations does not entail true mathematical abstraction or recursive synthesis. The contamination-resistant design, strict functional metrics, and detailed structural diagnostics provide a unique lens through which to characterize the boundaries of model reasoning.

Notable limitations uncovered include:

Inability to synthesize branch-specific recursion or multi-agent state without explicit tree-structured computation graphs.
Systematic mis-estimation of geometric parameters (scale factor $r$, rotation angle $\theta$) in multi-scale settings.
General difficulty mapping from visual self-similarity to symbolic generative rules purely from raw images.

FractalBench further demonstrates that code complexity and prompt engineering alone cannot compensate for underlying abstraction deficits in current MLLMs.

7. Implications for Recursive Program Synthesis and Directions for Future Research

The benchmark points toward a pressing research agenda in program synthesis and symbolic reasoning for multimodal AI:

Tree-structured computation: Incorporation of architectures or modules explicitly maintaining independent state across recursive branches.
Robust geometric inference: Improved model components for parameter induction from visual data.
Hybrid neuro-symbolic pipelines: Joint approaches that fuse deep perception with explicit IFS or L-system discovery mechanisms.
Refined diagnostics: Development of finer-grained metrics (e.g., branch-count accuracy, angle-matching) to inform both model training and evaluation.

These conclusions extend beyond fractals to other areas involving symbolic rule inference from data, including L-systems, formal languages, and branching physical processes. FractalBench, by offering robust, contamination-resistant, and strictly quantified diagnostics, delineates clear limitations in current models and prescribes a trajectory for the realization of abstract visual-mathematical reasoning systems (Ondras et al., 9 Nov 2025).

References

Ondras et al. (9 Nov 2025). FractalBench: Diagnosing Visual-Mathematical Reasoning Through Recursive Program Synthesis.