FractalBench: Diagnostic Benchmark Suite
- FractalBench is a benchmark suite and methodology for evaluating visual-mathematical reasoning in AI and for optimizing numerical algorithms on fractal point sets.
- It employs recursive program synthesis and dimension-tunable fractal generation to isolate core symbolic and computational capabilities.
- Empirical analysis reveals limitations in multi-branch recursion in MLLMs and supports cost-efficient parameter tuning in adaptive Fast Multipole Methods.
FractalBench is both a benchmark suite and a methodology for diagnosing visual-mathematical reasoning in AI systems and optimizing numerical algorithms on fractal sets. The term encompasses two distinct applications: (1) probing multimodal LLMs for symbolic abstraction of fractal structure via recursive program synthesis from imagery (Ondras et al., 9 Nov 2025); (2) generating dimension-tunable fractal point sets and measurement protocols for adaptive Fast Multipole Method (FMM) scalability studies (Pouransari et al., 2015). In both contexts, FractalBench exploits the mathematical properties of fractals—such as exact self-similarity and scale invariance—to isolate core reasoning or computational capabilities without confounds from natural-image complexity or uniform point clouds.
1. Foundations: Fractals as Analytical and Diagnostic Objects
Fractals are self-similar constructs defined via Iterated Function Systems (IFS): a finite set of contraction maps $f_i(x) = A_i x + b_i$, each with a matrix $A_i$ of spectral radius less than one and a translation vector $b_i$. The attractor $\mathcal{A}$ of an IFS satisfies $\mathcal{A} = \bigcup_{i} f_i(\mathcal{A})$ (the fixed point of the Hutchinson operator), representing the infinite recursive process manifested by finite-level numerical or visual evidence. In benchmark construction, fractals provide recursive, compositional, and branching structures that challenge both symbolic inference and hierarchical computation.
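To make the IFS definition concrete, the following is a minimal sketch (not code from either paper) that samples an IFS attractor with the chaos-game algorithm; the function name `chaos_game` and the three affine maps, which encode the Sierpinski triangle, are chosen purely for illustration.

```python
import numpy as np

def chaos_game(maps, n_points=50_000, seed=0):
    """Sample an IFS attractor by repeatedly applying randomly chosen contraction maps."""
    rng = np.random.default_rng(seed)
    x = np.zeros(2)
    points = np.empty((n_points, 2))
    for k in range(n_points):
        A, b = maps[rng.integers(len(maps))]
        x = A @ x + b          # contraction step: x -> A x + b with spectral radius < 1
        points[k] = x
    return points

# Sierpinski triangle: three maps, each scaling by 1/2 toward one vertex.
half = 0.5 * np.eye(2)
sierpinski = [(half, np.array([0.0, 0.0])),
              (half, np.array([0.5, 0.0])),
              (half, np.array([0.25, 0.5]))]

pts = chaos_game(sierpinski)
print(pts.shape)  # (50000, 2) points lying (approximately) on the attractor
```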
For adaptive numerical analysis, fractal point sets such as the generalized Cantor set, Cantor dust (a Cartesian product of Cantor sets), and the Menger sponge span a tunable range of Hausdorff dimensions, enabling complexity scaling studies in spatially non-uniform domains (Pouransari et al., 2015). Generalized Cantor set construction depends on a gap ratio parameter $g \in (0,1)$: at each level, every interval of length $\ell$ is replaced by two subintervals of length $\ell(1-g)/2$ separated by a gap of length $\ell g$, giving Hausdorff dimension $d = \log 2 / \log\!\big(2/(1-g)\big)$, with the three-dimensional extension obtained as the product set $C \times C \times C$ of dimension $3d$.
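As a quick sanity check of the dimension formula (under the gap-ratio parametrization assumed above), the classical middle-thirds case works out as follows:

```latex
% Middle-thirds Cantor set: gap ratio g = 1/3, so each child interval has
% relative length (1 - g)/2 = 1/3.
d \;=\; \frac{\log 2}{\log\!\big(2/(1-g)\big)}
  \;=\; \frac{\log 2}{\log 3} \;\approx\; 0.6309,
\qquad
d_{\,C \times C \times C} \;=\; 3d \;\approx\; 1.893 .
```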
2. Benchmark Suite Construction: Visual and Point-Set Paradigms
FractalBench (Ondras et al., 9 Nov 2025) for multimodal LLMs (MLLMs) comprises 610 images (PNG, 128 DPI) across 12 canonical fractals, rendered at multiple recursion depths and in five line-color variants (black, red, blue, green, purple). This contamination-resistant design prevents trivial memorization of fractal forms, and the images’ naming convention encodes fractal type, parameters, and color. Models must synthesize executable Python code that reproduces the visual fractal using the MinimalTurtle graphics library, which provides elementary movement and drawing primitives.
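For intuition about the synthesis target, here is a minimal, hypothetical sketch of recursive turtle-style code for a depth-$n$ Koch curve. It uses a tiny hand-rolled `Turtle` class rather than the benchmark's actual MinimalTurtle API, whose exact primitives are not reproduced here.

```python
import math

class Turtle:
    """Tiny stand-in for a turtle-graphics API: forward/left/right, collecting line segments."""
    def __init__(self):
        self.x, self.y, self.heading = 0.0, 0.0, 0.0
        self.segments = []
    def forward(self, dist):
        nx = self.x + dist * math.cos(math.radians(self.heading))
        ny = self.y + dist * math.sin(math.radians(self.heading))
        self.segments.append(((self.x, self.y), (nx, ny)))
        self.x, self.y = nx, ny
    def left(self, angle):
        self.heading += angle
    def right(self, angle):
        self.heading -= angle

def koch(t, length, depth):
    """Recursive Koch curve: four sub-curves at 1/3 scale joined by 60-degree turns."""
    if depth == 0:
        t.forward(length)
        return
    koch(t, length / 3, depth - 1)
    t.left(60)
    koch(t, length / 3, depth - 1)
    t.right(120)
    koch(t, length / 3, depth - 1)
    t.left(60)
    koch(t, length / 3, depth - 1)

t = Turtle()
koch(t, length=300, depth=3)
print(len(t.segments))  # 4**3 = 64 line segments
```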
For FMM benchmarking (Pouransari et al., 2015), FractalBench includes point-set generators for fractals of tunable Hausdorff dimension. The construction follows:
- Interval decomposition, e.g., Cantor set, via iterated removal of gaps.
- A point count $N$ set by the recursion depth $k$ (e.g., $2^k$ surviving intervals for the one-dimensional Cantor set and $8^k$ cells for its three-dimensional product).
- 3D distributions via product sets ($C \times C \times C$). A stepwise recipe (with pseudocode) enables systematic generation, adaptive octree construction, parameter tuning, and cost measurement; a sketch of the point-set generator follows this list.
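The following is a minimal sketch of such a generator, assuming the gap-ratio construction described in Section 1; the function and parameter names (`cantor_intervals`, `cantor_dust_points`, `gap`) are illustrative rather than taken from the reference implementation.

```python
import itertools
import numpy as np

def cantor_intervals(depth, gap=1/3):
    """Left endpoints and common length of the 2**depth intervals of a
    generalized Cantor set on [0, 1] with the given gap ratio."""
    r = (1 - gap) / 2              # relative length of each child interval
    lefts = np.array([0.0])
    length = 1.0
    for _ in range(depth):
        # each interval keeps a left and a right child, dropping the middle gap
        lefts = np.concatenate([lefts, lefts + (1 - r) * length])
        length *= r
    return lefts, length

def cantor_dust_points(depth, gap=1/3):
    """One point per cell of the 3D product set C x C x C (8**depth points)."""
    lefts, length = cantor_intervals(depth, gap)
    centers = lefts + length / 2   # sample each surviving interval at its midpoint
    return np.array(list(itertools.product(centers, repeat=3)))

pts = cantor_dust_points(depth=3)
print(pts.shape)  # (512, 3): 8**3 points on the Cantor-dust lattice
```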
3. Evaluation Protocols and Performance Metrics
Visual-Mathematical Reasoning (MLLMs)
Each MLLM is tasked to generate code that, when run in a sandbox, reconstructs the input fractal image at full resolution. Prompt strategies:
- Direct Code Generation (DCG): raw image-to-code mapping.
- Reasoning Then Code (RTC): explicit textual analysis step.
- Recursive Structure Focus (RSF): explicit base case and parameter update prompts.
Metrics:
- Runnable%: proportion of syntactically valid, error-free code.
- Accuracy%: semantic correctness among runnable outputs, judged by Intersection over Union (IoU) between the rendered output and the reference image.
- Overall Success%: product of Runnable% and Accuracy%.
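A minimal sketch of how such a pixelwise IoU score can be computed with numpy and pillow; the helper names (`binary_mask`, `iou`), the file paths, and the thresholding choice are illustrative, and the benchmark's exact acceptance criterion is not reproduced here.

```python
import numpy as np
from PIL import Image

def binary_mask(path, threshold=250):
    """Load an image in grayscale and mark any non-white (drawn) pixel as True."""
    gray = np.asarray(Image.open(path).convert("L"))
    return gray < threshold

def iou(pred_path, ref_path):
    """Intersection over Union of the drawn-pixel masks of two same-size renderings."""
    pred, ref = binary_mask(pred_path), binary_mask(ref_path)
    inter = np.logical_and(pred, ref).sum()
    union = np.logical_or(pred, ref).sum()
    return inter / union if union else 0.0

# Example usage: score a model-generated rendering against a benchmark image.
# score = iou("generated.png", "reference.png")
```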
Adaptive FMM Scalability (Numerical Algorithms)
FMM complexity is dictated by the distribution’s fractal dimension. FractalBench’s empirical protocol records timings of key FMM operators (P2M, M2L, L2P, P2P, etc.) under varying tree depth and point-per-box thresholds:
- Single-threshold: stop subdividing a box once it contains no more than a prescribed number of points.
- Double-threshold: subdivision is governed by two thresholds rather than one, and a box is split only when both criteria are met.
Optimal threshold values and scaling laws for the number of tree levels $L$ and the total cost are modeled directly as functions of $N$ and the fractal dimension $d$; by a box-counting argument, the number of occupied boxes grows like $2^{dL}$, so a points-per-box threshold $q$ yields roughly $L \approx \tfrac{1}{d}\log_2(N/q)$ levels. A sketch of the adaptive subdivision and this level-count estimate follows.
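Below is a minimal sketch of single-threshold adaptive octree subdivision over a fractal point set, together with the box-counting estimate of the resulting depth. It is an illustration under the assumptions above, not the reference FMM implementation; `adaptive_depth`, `estimated_depth`, and `q` are illustrative names.

```python
import math
import numpy as np

def adaptive_depth(points, q, max_depth=30):
    """Subdivide the unit cube until every non-empty box holds at most q points;
    return the depth of the deepest leaf (single-threshold rule)."""
    def recurse(pts, level):
        if len(pts) <= q or level == max_depth:
            return level
        # child index from the next binary digit of each scaled coordinate
        bits = np.floor(pts * (2 ** (level + 1))).astype(int) % 2
        octant = bits[:, 0] * 4 + bits[:, 1] * 2 + bits[:, 2]
        depth = level
        for child in range(8):
            mask = octant == child
            if mask.any():
                depth = max(depth, recurse(pts[mask], level + 1))
        return depth
    return recurse(points, 0)

def estimated_depth(n_points, q, d):
    """Box-counting estimate: occupied boxes grow like 2**(d * level)."""
    return math.ceil(math.log2(n_points / q) / d)

# pts = cantor_dust_points(depth=5)   # hypothetical generator sketched in Section 2
# print(adaptive_depth(pts, q=32), estimated_depth(len(pts), q=32, d=1.89))
```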
4. Empirical Outcomes and Analysis
AI Reasoning Performance
Across 7,320 runs (610 images × 3 prompts × 4 models), 76.1% of programs were runnable but only 4.2% were visually correct. The four models compared under the DCG prompt were:
- Gemini 2.5 Flash
- GPT-4o
- Claude 3.7 Sonnet
- Qwen 2.5-VL
Task-type breakdown by fractal family:
- Koch curve/snowflake (iterative geometric transformations).
- Sierpinski structures.
- Cantor sets/dust (linear recursion).
- Dragon curves.
- Branching trees (Pythagoras, binary, McWorter, pentigree), where results expose the models' inability to capture multi-branching recursion.
Prompting for explicit reasoning disrupts parameter inference: direct code generation (DCG) outperforms the reasoning-first RTC and RSF prompts.
FMM Numerical Performance
For fractal sets of dimension $d = 1$, $2$, and $3$, across the range of problem sizes $N$ tested:
- The double-threshold rule reduces total FMM cycle cost by roughly 25% for $d = 1$ and $d = 2$ (see the table below), and its effect is negligible for $d = 3$.
- All cost curves are linear in $N$ (cycles per point remain constant).
- The cost splits between the near-field (P2P) and far-field (M2L) operators at the optimal parameter values, the two contributions trading off against each other as the threshold varies (see the cost-balance sketch after this list).
- Modifications avoiding expansion storage on singleton chains prevent quadratic scaling on degenerate adaptive trees.
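The trade-off can be sketched with a standard, simplified FMM cost model; the constants $\alpha$ and $\beta$ below are illustrative and absorb neighbor-list sizes and expansion orders, so this is a heuristic rather than the paper's exact model. With a points-per-box threshold $q$:

```latex
% Near field (P2P): ~N/q leaf boxes, each pairing its ~q points with O(1) neighbor boxes.
% Far field (M2L):  ~N/q boxes in total, each performing a bounded number of translations.
T(q) \;\approx\; \underbrace{\alpha\, N\, q}_{\text{P2P}}
      \;+\; \underbrace{\beta\, \tfrac{N}{q}}_{\text{M2L}},
\qquad
q^{*} \;=\; \sqrt{\beta/\alpha},
\qquad
T(q^{*}) \;=\; 2\sqrt{\alpha\beta}\; N .
```

At $q^{*}$ the two terms are equal, which is the sense in which the near- and far-field shares balance at the optimum, and the total cost remains $O(N)$.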
| Fractal dimension $d$ | Method | Tree levels $L$ | Threshold $q$ | Cost (Gcycles) |
|---|---|---|---|---|
| 1.0 | single-threshold | 19 | 30 | 0.48 |
| 1.0 | double-threshold | 20 | 1 | 0.36 |
| 2.0 | single-threshold | 10 | 100 | 0.62 |
| 2.0 | double-threshold | 11 | 1 | 0.45 |
| 3.0 | both | 8 | 1024 | 0.85 |
5. Mathematical and Algorithmic Implications
FractalBench demonstrates that simple geometric abstraction (rotations, scalings) is increasingly accessible to current MLLMs, yet true recursive abstraction, particularly branching recursion with independent per-branch state, remains unsolved in practice. Models frequently substitute iteration or single-branch recursion for multi-branch IFS patterns, and accumulated numerical errors in parameters (angle, scale) cause exponential divergence from the correct structure. This suggests a fundamental gap in compositional mathematical reasoning that persists across prompting strategies.
In numerical algorithms, systematic control of fractal dimension and subdivision heuristics (double-threshold) enables optimal O(N) complexity for adaptive FMM on point distributions far from uniform or manifold-like. Explicit dimension-dependent parameter tuning, rigorously modeled and empirically validated, is essential for efficient performance.
6. Reproducibility and Resources
FractalBench for both reasoning and numerical experimentation is open-source:
- Visual-Mathematical Reasoning Benchmark: 610 images, MinimalTurtle graphics library, prompt templates, execution and evaluation scripts.
- Requirements: Python 3.9+, numpy, pillow. Evaluation is invoked via

```
python evaluate.py --model <model_api> --prompt DCG
```

with usage instructions in the repository README.
- For adaptive FMM analysis, point set generation and parameter tuning protocols are provided alongside pseudocode specifying interval decomposition, octree construction, and cost measurement.
7. Future Directions and Research Opportunities
Advancing visual-mathematical reasoning requires integration of domain-specific geometric inference tools, specialized training objectives, and structure-aware evaluation metrics (IFS parameter matching, branch counts, recursion depth) rather than solely pixelwise similarity. Hybrid neuro-symbolic synthesis—explicit parameter search plus neural perception—may augment current approaches. In numerical analysis, extension to other fractal types, aggregation kernels, and further parameter-space exploration remains open for systematic study.
A plausible implication is that FractalBench, by bridging symbolic reasoning and scale-invariant numerical modeling, provides a unified testbed for measuring abstraction in both cognitive AI and adaptive algorithms.