
FractalBench: Diagnostic Benchmark Suite

Updated 16 November 2025
  • FractalBench is a benchmark suite and methodology that evaluates visual-mathematical reasoning in AI and optimizes numerical algorithms on fractal point sets.
  • It employs recursive program synthesis and dimension-tunable fractal generation to isolate core symbolic and computational capabilities.
  • Empirical analysis reveals limitations in multi-branch recursion in MLLMs and supports cost-efficient parameter tuning in adaptive Fast Multipole Methods.

FractalBench is both a benchmark suite and a methodology for diagnosing visual-mathematical reasoning in AI systems and optimizing numerical algorithms on fractal sets. The term encompasses two distinct applications: (1) probing multimodal LLMs for symbolic abstraction of fractal structure via recursive program synthesis from imagery (Ondras et al., 9 Nov 2025); (2) generating dimension-tunable fractal point sets and measurement protocols for adaptive Fast Multipole Method (FMM) scalability studies (Pouransari et al., 2015). In both contexts, FractalBench exploits the mathematical properties of fractals—such as exact self-similarity and scale invariance—to isolate core reasoning or computational capabilities without confounds from natural-image complexity or uniform point clouds.

1. Foundations: Fractals as Analytical and Diagnostic Objects

Fractals are self-similar constructs defined via Iterated Function Systems (IFS): a finite set of contraction maps w_i(x) = A_i x + b_i, with A_i a matrix of spectral radius < 1 and b_i a translation vector. The attractor K of an IFS satisfies K = w_1(K) ∪ ⋯ ∪ w_n(K) (the Hutchinson operator), representing the infinite recursive process manifested in finite-level numerical or visual evidence. In benchmark construction, fractals provide recursive, compositional, and branching structures that challenge both symbolic inference and hierarchical computation.
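As a concrete illustration of these definitions, the attractor K can be approximated by the classic chaos game: random iteration of the contraction maps. A minimal sketch in Python, using the standard Sierpinski-triangle IFS (the affine coefficients here are illustrative, not taken from the benchmark):

```python
import random

def chaos_game(maps, n_points=10000, seed=0):
    """Approximate the attractor K of an IFS by random iteration:
    repeatedly apply a randomly chosen contraction w_i to the current point."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    points = []
    for i in range(n_points):
        # each map is (a, b, c, d, e, f) for w(x, y) = (a x + b y + e, c x + d y + f)
        a, b, c, d, e, f = rng.choice(maps)
        x, y = a * x + b * y + e, c * x + d * y + f
        if i > 100:  # discard the transient before the orbit settles onto K
            points.append((x, y))
    return points

# Sierpinski triangle: three contractions with ratio 1/2.
sierpinski = [
    (0.5, 0.0, 0.0, 0.5, 0.0, 0.0),
    (0.5, 0.0, 0.0, 0.5, 0.5, 0.0),
    (0.5, 0.0, 0.0, 0.5, 0.25, 0.5),
]
pts = chaos_game(sierpinski)
```

Because every map here has contraction ratio 1/2 and fixed points inside the unit square, all iterates stay within [0, 1]².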

For adaptive numerical analysis, fractal point sets such as the generalized Cantor set, Cantor dust (C_γ^3), and Menger sponge span Hausdorff dimensions d_H ∈ (0, 3), enabling complexity-scaling studies in spatially non-uniform domains (Pouransari et al., 2015). The Cantor set construction depends on a gap-ratio parameter γ,

d_H = -\frac{\log 2}{\log\left(\frac{1-\gamma}{2}\right)},

with the three-dimensional extension d_H(C_γ^3) = 3·d_H(C_γ).
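The dimension formula can be checked directly; a small sketch (for the classic middle-thirds case, γ = 1/3, it recovers the familiar log 2 / log 3 ≈ 0.6309):

```python
import math

def cantor_dimension(gamma):
    """Hausdorff dimension of the generalized Cantor set with gap ratio gamma:
    each refinement keeps two intervals of relative length (1 - gamma) / 2."""
    return -math.log(2) / math.log((1 - gamma) / 2)

d1 = cantor_dimension(1 / 3)       # classic middle-thirds Cantor set
d3 = 3 * d1                        # Cantor dust C_gamma^3 (product set)
print(round(d1, 4), round(d3, 4))  # → 0.6309 1.8928
```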

2. Benchmark Suite Construction: Visual and Point-Set Paradigms

FractalBench (Ondras et al., 9 Nov 2025) for multimodal LLMs (MLLMs) comprises 610 images (PNG, 1024 × 1024 px, 128 DPI) across 12 canonical fractals, rendered at multiple recursion depths and in five line-color variants (black, red, blue, green, purple). This contamination-resistant design prevents trivial memorization of fractal forms. Each image's filename encodes the fractal type, parameters, and color. Models must synthesize executable Python code that reproduces the visual fractal using the MinimalTurtle graphics library, which provides elementary movement and drawing primitives.

For FMM benchmarking (Pouransari et al., 2015), FractalBench includes point-set generators for fractals of tunable Hausdorff dimension. The construction follows:

  • Interval decomposition, e.g., Cantor set, via iterated removal of gaps.
  • N = 2^k samples at recursion depth k.
  • 3D distributions via product sets (C_γ^3).

A stepwise recipe (with pseudocode) enables systematic generation, adaptive octree construction, parameter tuning, and cost measurement.
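The interval-decomposition and product-set steps above can be sketched as follows (a minimal illustration, not the paper's reference generator; here sample points are taken as interval centers):

```python
import itertools

def cantor_points_1d(k, gamma=1/3):
    """Centers of the 2**k level-k intervals of the generalized Cantor set
    on [0, 1]; the gap ratio gamma controls the Hausdorff dimension."""
    r = (1 - gamma) / 2                 # child-interval length ratio
    intervals = [(0.0, 1.0)]
    for _ in range(k):
        nxt = []
        for lo, hi in intervals:
            w = (hi - lo) * r
            nxt.append((lo, lo + w))    # left child interval
            nxt.append((hi - w, hi))    # right child interval
        intervals = nxt
    return [0.5 * (lo + hi) for lo, hi in intervals]

def cantor_dust_3d(k, gamma=1/3):
    """Cantor dust C_gamma^3 as the Cartesian product of three 1-D sets."""
    pts = cantor_points_1d(k, gamma)
    return list(itertools.product(pts, pts, pts))

pts1 = cantor_points_1d(3)   # 2**3 = 8 points on [0, 1]
pts3 = cantor_dust_3d(2)     # (2**2)**3 = 64 points in [0, 1]^3
```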

3. Evaluation Protocols and Performance Metrics

Visual-Mathematical Reasoning (MLLMs)

Each MLLM is tasked to generate code that, when run in a sandbox, reconstructs the input fractal image at full resolution. Prompt strategies:

  • Direct Code Generation (DCG): raw image-to-code mapping.
  • Reasoning Then Code (RTC): explicit textual analysis step.
  • Recursive Structure Focus (RSF): explicit base case and parameter update prompts.

Metrics:

  • Runnable%: proportion of syntactically valid, error-free code.
  • Accuracy%: semantic correctness among runnable outputs, judged by Intersection over Union (IoU) ≥ 95%.
  • Overall Success%: product of Runnable% and Accuracy%.
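The three metrics compose as described; a minimal sketch of the computation, assuming hypothetical per-run records with a `runs` flag and an `iou` score (not the benchmark's actual schema):

```python
def overall_success(results):
    """Overall Success% = Runnable% x Accuracy%, where Accuracy% is judged
    only among runnable outputs (IoU >= 0.95 against the target image)."""
    n = len(results)
    runnable = [r for r in results if r["runs"]]
    correct = [r for r in runnable if r["iou"] >= 0.95]
    runnable_pct = len(runnable) / n
    accuracy_pct = len(correct) / len(runnable) if runnable else 0.0
    return runnable_pct, accuracy_pct, runnable_pct * accuracy_pct

# e.g. Gemini 2.5 Flash (DCG): 23.8% runnable x 48.3% accuracy ≈ 11.5% overall
```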

Adaptive FMM Scalability (Numerical Algorithms)

FMM complexity is dictated by the distribution’s fractal dimension. FractalBench’s empirical protocol records timings of key FMM operators (P2M, M2L, L2P, P2P, etc.) under varying tree depth and point-per-box thresholds:

  • Single-threshold: stop subdivision when a box holds ≤ t points.
  • Double-threshold: subdivide only if depth < m_2 and #points > m_1.
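The two stopping rules can be expressed as simple predicates (an illustrative sketch; the actual FMM code applies these during octree construction):

```python
def should_subdivide_single(n_points, t):
    """Single-threshold rule: keep splitting while a box holds more than t points."""
    return n_points > t

def should_subdivide_double(n_points, depth, m1, m2):
    """Double-threshold rule: split only if the box is both populated enough
    (> m1 points) and shallow enough (depth < m2)."""
    return n_points > m1 and depth < m2
```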

Optimal values and scaling laws for the tree depth l_opt and total cost are modeled directly:

l_{\mathrm{opt}} \approx \frac{\log_2 N}{d_H} - 0.5,

\log \mathrm{COST}_{\mathrm{opt}} \approx \log N + 1.44\, d_H + \mathrm{const}.
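These empirical scaling laws are easy to evaluate numerically; a small sketch (function and symbol names are illustrative):

```python
import math

def l_opt(n, d_h):
    """Empirical optimal tree depth for an N-point set of fractal dimension d_H."""
    return math.log2(n) / d_h - 0.5

def log_cost_opt(n, d_h, const=0.0):
    """Empirical model of log optimal FMM cost, up to an additive constant."""
    return math.log(n) + 1.44 * d_h + const

print(round(l_opt(2**20, 1.0), 1))  # → 19.5 levels for N = 2**20, d_H = 1
```

For N = 2^20 this predicts roughly 19-20 levels at d_H = 1 and 9-10 levels at d_H = 2, consistent with the measured optima tabulated below.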

4. Empirical Outcomes and Analysis

AI Reasoning Performance

Across 7,320 runs (610 images, spanning 5 color variants, × 3 prompt strategies × 4 models), 76.1% of programs were runnable but only 4.2% were visually correct. Model comparison (DCG prompt):

  • Gemini 2.5 Flash: 23.8% runnable, 48.3% accuracy → 11.5% overall.
  • GPT-4o: 94.3% runnable, 9.6% accuracy → 9.0% overall.
  • Claude 3.7 Sonnet: 82.0% runnable, 9.0% accuracy → 7.4% overall.
  • Qwen 2.5-VL: 99.2% runnable, 3.3% accuracy → 3.3% overall.

Task type breakdown:

  • Koch curve/snowflake: 17–21% accuracy (iterative geometric transformations).
  • Sierpinski structures: 3–18%.
  • Cantor sets/dust: 3–6% (linear recursion).
  • Dragon curves: ~1.6–1.9%.
  • Branching trees (Pythagoras, binary, McWorter, pentigree): <2%, exposing an inability to capture multi-branch recursion.

Prompting for explicit reasoning disrupts parameter inference (DCG outperforms RTC and RSF).

FMM Numerical Performance

For fractal sets with d_H = 1.0, 2.0, 3.0 and N up to 10^6:

  • The double-threshold rule reduces total FMM cycle cost by ~25–30% for d_H ∈ [1, 2]; the gain is negligible as d_H → 3.
  • All cost curves are linear in N (cycles per point constant).
  • P2P and M2L operator shares are ~50:50 at optimal parameter values.
  • Modifications that avoid storing expansions on singleton chains prevent quadratic scaling on degenerate adaptive trees.
Fractal d_H | Method     | l_opt | t_max | Cost (Gcycles)
1.0         | single-thr |    19 |    30 | 0.48
1.0         | double-thr |    20 |     1 | 0.36
2.0         | single-thr |    10 |   100 | 0.62
2.0         | double-thr |    11 |     1 | 0.45
3.0         | both       |     8 |  1024 | 0.85

5. Mathematical and Algorithmic Implications

FractalBench demonstrates that simple geometric abstraction (rotations, scalings) is increasingly accessible to current MLLMs, yet true recursive abstraction, particularly branching recursion with independent state per branch, remains unsolved in practice. Models frequently substitute iteration or single-branch recursion for multi-branch IFS patterns, and accumulated numerical errors in parameters (angle, scale) cause exponential divergence from the correct structure. This suggests a fundamental gap in compositional mathematical reasoning that persists across prompting strategies.

In numerical algorithms, systematic control of fractal dimension and subdivision heuristics (double-threshold) enables optimal O(N) complexity for adaptive FMM on point distributions far from uniform or manifold-like. Explicit dimension-dependent parameter tuning, rigorously modeled and empirically validated, is essential for efficient performance.

6. Reproducibility and Resources

FractalBench for both reasoning and numerical experimentation is open-source:

  • Visual-Mathematical Reasoning Benchmark: 610 images, MinimalTurtle graphics library, prompt templates, execution and evaluation scripts.
  • Requirements: Python 3.9+, numpy, pillow. Evaluation is invoked via

    python evaluate.py --model <model_api> --prompt DCG

    with usage instructions in the repository README.
  • For adaptive FMM analysis, point set generation and parameter tuning protocols are provided alongside pseudocode specifying interval decomposition, octree construction, and cost measurement.

7. Future Directions and Research Opportunities

Advancing visual-mathematical reasoning requires integration of domain-specific geometric inference tools, specialized training objectives, and structure-aware evaluation metrics (IFS parameter matching, branch counts, recursion depth) rather than solely pixelwise similarity. Hybrid neuro-symbolic synthesis—explicit parameter search plus neural perception—may augment current approaches. In numerical analysis, extension to other fractal types, aggregation kernels, and further parameter-space exploration remains open for systematic study.

A plausible implication is that FractalBench, by bridging symbolic reasoning and scale-invariant numerical modeling, provides a unified testbed for measuring abstraction in both cognitive AI and adaptive algorithms.
