Visual Math Benchmarks Overview
- Visual Math Benchmarks are rigorous evaluation protocols that assess multimodal models’ capacity to extract symbolic structure, perform quantitative calculations, and generate code from visual inputs.
- They employ contamination-resistant designs, well-defined tasks, and parameterizable generation pipelines to probe arithmetic, geometric, and compositional reasoning.
- The benchmarks reveal critical challenges in model robustness, including issues with multi-image integration, noisy real-world inputs, and inadequate symbolic abstraction.
Visual Math Benchmarks are a class of evaluation protocols, datasets, and metrics specifically designed to assess the capacity of vision-language and multimodal models to reason mathematically from visual input. These benchmarks go beyond simple visual recognition, probing models’ abilities to extract symbolic structure, perform quantitative calculations, generate code, and fuse visual and textual cues, often under conditions of limited or ambiguous input. Available benchmarks target distinct cognitive, perceptual, and symbolic reasoning challenges, ranging from K–12 math in real-world scenarios to compositional program synthesis, multi-image context integration, and fine-grained diagram parsing.
1. Benchmark Landscape and Taxonomy
Visual math benchmarks can be classified along several axes, including the nature of mathematical content (arithmetic, geometry, algebra, combinatorics, calculus), the visual modality (static diagram, photo, stylized rendering, scene sequence, video), the reasoning objective (direct answer, multi-step explanation, code synthesis), and the evaluation focus (accuracy, procedural consistency, symbolic abstraction, robustness to variation).
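These axes can be recorded as a lightweight catalogue schema, which also makes explicit that a single benchmark may span several values on one axis. The sketch below is purely illustrative; the enum values and field names are assumptions mirroring the axes above, not part of any benchmark's released tooling.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Content(Enum):
    ARITHMETIC = auto()
    GEOMETRY = auto()
    ALGEBRA = auto()
    COMBINATORICS = auto()
    CALCULUS = auto()

class Modality(Enum):
    STATIC_DIAGRAM = auto()
    PHOTO = auto()
    STYLIZED_RENDERING = auto()
    SCENE_SEQUENCE = auto()
    VIDEO = auto()

class Objective(Enum):
    DIRECT_ANSWER = auto()
    MULTI_STEP_EXPLANATION = auto()
    CODE_SYNTHESIS = auto()

@dataclass
class BenchmarkEntry:
    name: str
    content: set[Content]
    modality: set[Modality]
    objective: set[Objective]
    eval_focus: str   # e.g. "accuracy", "procedural consistency", "robustness"

# Example record, reflecting the FractalBench description given below.
fractalbench = BenchmarkEntry(
    name="FractalBench",
    content={Content.GEOMETRY},
    modality={Modality.STYLIZED_RENDERING},
    objective={Objective.CODE_SYNTHESIS},
    eval_focus="structural correctness of rendered output",
)
```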
Prominent exemplars include:
- FractalBench: Diagnoses whether models can abstract recursive generative rules (e.g., Iterated Function Systems) from fractal images by requiring pixel-perfect code generation that reconstructs canonical fractals at arbitrary depth (Ondras et al., 9 Nov 2025); an illustrative IFS sketch appears after this list.
- MathSticks: Probes compositional visual-symbolic reasoning through matchstick puzzles, unifying perception, symbolic manipulation, and arithmetic under strict conservation laws (Ji et al., 1 Oct 2025).
- GSM8K-V: Translates classic text-based arithmetic word problems into multi-scene comic-style images, evaluating multi-image, perception-grounded reasoning (Yuan et al., 29 Sep 2025).
- MaRVL-QA: Focuses on topological and geometric reasoning over mathematical surface plots, imposing tasks such as counting extrema and recognizing global transformations (Pande et al., 24 Aug 2025).
- MathReal, MathScape: Capture the challenges of real-scene (photographed or noisy) school-level problems, requiring joint understanding of the presented text and the embedded diagrams (Feng et al., 8 Aug 2025, Zhou et al., 14 Aug 2024).
- We-Math, We-Math 2.0, MM-MATH, MathVerse, MathBookEval: Provide taxonomized, multi-step, process-aware assessment, with layered knowledge decomposition and stepwise evaluation (Qiao et al., 1 Jul 2024, Qiao et al., 14 Aug 2025, Sun et al., 7 Apr 2024, Zhang et al., 21 Mar 2024).
- VisioMath, MV-MATH, VCBENCH, VideoMathQA: Address multi-image, multi-choice, and video-based scenarios to test integration and temporal reasoning (Li et al., 7 Jun 2025, Wang et al., 28 Feb 2025, Wang et al., 24 Apr 2025, Rasheed et al., 5 Jun 2025).
- MathOPEval: Evaluates code-based visual operations—generation, deletion, modification, annotation—on diagrams and plots, using both free-form code and MCQ protocols (Li et al., 24 Jul 2025).
- Kangaroo Math, MATH-V, VisAidMath: Focus on multilingual, competition-grade, or explicit visual-aid mathematical scenarios (Sáez et al., 9 Jun 2025, Wang et al., 22 Feb 2024, Ma et al., 30 Oct 2024).
These benchmarks collectively supply a multidimensional testbed for the field.
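To make the FractalBench-style vision-to-code task concrete, the sketch below shows the kind of deterministic, depth-parameterized program a model is expected to emit: the tiling loop is equivalent to applying the eight contraction maps of the Sierpinski carpet's iterated function system. The choice of fractal, raster resolution, and function name are illustrative assumptions, not the benchmark's actual task specification.

```python
import numpy as np

def sierpinski_carpet(depth: int) -> np.ndarray:
    """Binary raster of the Sierpinski carpet at the given recursion depth.

    Each level tiles the current pattern 3x3 and blanks the centre tile,
    which is the deterministic form of the carpet's 8-map IFS.
    """
    img = np.ones((1, 1), dtype=bool)
    for _ in range(depth):
        n = img.shape[0]
        tiled = np.zeros((3 * n, 3 * n), dtype=bool)
        for i in range(3):
            for j in range(3):
                if (i, j) != (1, 1):                 # drop the centre copy
                    tiled[i * n:(i + 1) * n, j * n:(j + 1) * n] = img
        img = tiled
    return img

if __name__ == "__main__":
    canvas = sierpinski_carpet(depth=4)              # 81x81 boolean raster
    print(canvas.shape, round(canvas.mean(), 3))     # fill ratio -> (8/9)**4
```

Scoring such output pixel-wise against a reference rendering (see the IoU sketch in Section 2) is what separates merely runnable code from structurally correct code.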
2. Core Methodological Principles
Visual math benchmarks implement methodological rigor through:
- Contamination-Resistant Design: Use of color-variant images, multiple recursion depths, non-canonical renderings, and other transformations to prevent retrieval or memorization from pretraining (e.g., FractalBench, MaRVL-QA).
- Formal Task Specification: Precisely defined inputs (e.g., an image, sequence of images, or video frames), outputs (integer, short answer, step sequence, code), and constraints (e.g., only producing code in a strict API, or answering via MCQ with precise option mapping).
- Evaluation Metrics (a minimal sketch of the FractalBench-style IoU check appears after this list):

| Benchmark | Task | Primary Metric | Unique Metric Feature |
|------------------|----------------------|----------------------------------------|------------------------------------------------|
| FractalBench | Vision-to-code | IoU ≥ 95% for rendered fractal | Distinguishes code validity vs. structural correctness |
| MathSticks | Matchstick VSCR | Accuracy (by level/move) | Symbolic/visual, operator-flip diagnosis |
| VCBENCH, MV-MATH | Multi-image MCQ | Accuracy (overall/domain) | Error-type breakdown (visual, logic, etc.) |
| VideoMathQA | Video QA | MCQ, step-score (0–10), MBin accuracy | Reasoning-type breakdown (direct/transfer) |
| MathOPEval | Code-generation/edit | MCQ, CoT-based code similarity | Four visual-operation sub-tasks |
| We-Math | Multi-step | IK, IG, CM, RM (reasoning process) | Allows fine-grained procedural attribution |
- Automatic Judging Pipelines: Use of strong LLMs or custom parsers to extract, normalize, and validate answers and intermediate steps, and to score process-level metrics (e.g., MathVerse, MM-MATH); a minimal answer-normalization sketch also follows this list.
- Difficulty and Domain Stratification: Explicit tagging by domain, concept, problem type, and difficulty level, enabling domain-wise performance analysis.
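As referenced in the table above, FractalBench's primary criterion compares the raster rendered from model-generated code against a reference fractal and requires an IoU of at least 95%. A minimal sketch of such a check on binary images follows; the function names and the handling of empty images are assumptions, not the benchmark's released scorer.

```python
import numpy as np

def binary_iou(pred: np.ndarray, ref: np.ndarray) -> float:
    """Intersection-over-union of two equal-shape boolean rasters."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    union = np.logical_or(pred, ref).sum()
    if union == 0:                       # both rasters empty: count as a perfect match
        return 1.0
    return float(np.logical_and(pred, ref).sum()) / float(union)

def passes_structural_check(pred: np.ndarray, ref: np.ndarray,
                            threshold: float = 0.95) -> bool:
    """FractalBench-style criterion: the rendered output must overlap the
    reference fractal with IoU >= threshold."""
    return binary_iou(pred, ref) >= threshold
```

A metric of this kind is what allows the validity/correctness gap discussed in Section 3 (code that runs vs. code that reconstructs the right structure) to be measured at all.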
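Automatic judging pipelines typically begin with a deterministic extraction and normalization pass before any LLM-based grading is invoked. The sketch below shows a plausible first stage for numeric answers; the regular expressions, tolerance, and fallback behaviour are assumptions and do not reproduce any specific benchmark's parser.

```python
import re
from typing import Optional

_ANSWER_PATTERNS = [
    r"final answer\s*[:=]?\s*(-?\d+(?:\.\d+)?)",   # "Final answer: 42"
    r"\\boxed\{(-?\d+(?:\.\d+)?)\}",               # LaTeX \boxed{42}
    r"(-?\d+(?:\.\d+)?)\s*$",                      # bare number at the very end
]

def extract_numeric_answer(response: str) -> Optional[float]:
    """Pull a numeric final answer out of a free-form model response."""
    text = response.strip().lower().replace(",", "")
    for pattern in _ANSWER_PATTERNS:
        match = re.search(pattern, text)
        if match:
            return float(match.group(1))
    return None                                    # defer to an LLM judge or manual review

def is_correct(response: str, gold: float, tol: float = 1e-6) -> bool:
    pred = extract_numeric_answer(response)
    return pred is not None and abs(pred - gold) <= tol
```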
3. Empirical Patterns and Failure Modes
Benchmark results consistently reveal major limitations in current multimodal models:
- Syntactic Competence vs. Mathematical Structure: Models frequently generate syntactically valid, but mathematically incorrect, code (e.g., 76% runnable code vs. 4% structural correctness in FractalBench) (Ondras et al., 9 Nov 2025).
- Perceptual Bottlenecks: Visual perception and parsing errors (e.g., miscounting, misreading labels or axes, poor OCR under real-scene noise) dominate, accounting for some 40–60% or more of failures in VCBENCH, MathReal, and MM-MATH.
- Poor Generalization: Systematic performance drops arise under:
  - Multi-step chains (accuracy falls steeply as the number of reasoning steps grows; see We-Math, MathBookEval).
  - Multi-image or multi-scene inputs (a drop of roughly 40 percentage points on VCBENCH when images are presented separately rather than merged into a single image).
  - Minor diagrammatic or textual variants (DynaMath, where worst-case accuracy across variants is less than half of average-case accuracy; a sketch of these two aggregates follows this list).
- Inadequate Symbolic Abstraction: Branching recursion in program synthesis, compositional edits (MathSticks), and function-graph operations remain especially challenging.
- Neglect of Visual-Modal Input: Ablation studies show that models often perform comparably—or even better—on text-only variants (MathSight, MathVerse). This suggests that current architectures may circumvent diagram understanding by relying on linguistic priors or redundant text.
- Process Disintegration: Reasoning chains often derail at the first perceptual or logical step. Stepwise attribution reveals that models may answer composite problems correctly while failing on the required sub-concepts (We-Math, which reports high rates of Rote Memorization).
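The DynaMath-style robustness gap above can be quantified with two aggregates over a set of generated variants of each seed problem: average-case accuracy (the mean over all variant instances) and worst-case accuracy (a seed counts only if every one of its variants is answered correctly). A minimal sketch, assuming results are available as (seed, variant, correctness) triples:

```python
from collections import defaultdict

def robustness_scores(results: list[tuple[str, str, bool]]) -> tuple[float, float]:
    """results: (seed_id, variant_id, is_correct) triples.

    Returns (average_case, worst_case). Average-case is the mean over all
    variant instances; worst-case credits a seed problem only if all of its
    variants are answered correctly.
    """
    by_seed: dict[str, list[bool]] = defaultdict(list)
    for seed_id, _variant_id, correct in results:
        by_seed[seed_id].append(correct)

    all_answers = [c for answers in by_seed.values() for c in answers]
    average_case = sum(all_answers) / len(all_answers)
    worst_case = sum(all(answers) for answers in by_seed.values()) / len(by_seed)
    return average_case, worst_case

# Toy usage: one seed solved on both variants, another on only one of two.
demo = [("q1", "v1", True), ("q1", "v2", True),
        ("q2", "v1", True), ("q2", "v2", False)]
print(robustness_scores(demo))   # (0.75, 0.5)
```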
4. Design Innovations and Contamination Controls
Leading benchmarks implement the following design innovations to ensure faithful measurement of visual-mathematical reasoning:
- Metalabels for Ambiguity and Dependency: Explicit labels for image dependence (mutually dependent vs. independent images; MV-MATH), visual noise categorization (MathReal), or visual-aid requirements (VisAidMath).
- Parameterizable Generation Pipelines: Automated rendering of variants under programmatic control, enabling robustness/consistency probing (DynaMath, MaRVL-QA, FractalBench); a minimal variant-generator sketch follows this list.
- Process-Oriented Ground Truth: Annotation of every solution step with knowledge-point mappings; scoring both answer and derivation (MathBookEval, MM-MATH).
- Cross-Modality Isolation: Multi-version problem sets removing or embedding different text/visual conditions (MathVerse; vision-only, text-only, vision-intensive).
- Multilingual and Cross-Dataset Experiments: Kangaroo Math (Sáez et al., 9 Jun 2025) and MATH-V (Wang et al., 22 Feb 2024) provide cross-lingual, competition-grade evaluation with language-matched prompting.
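A minimal sketch of what a parameterizable generation pipeline looks like in practice: a seed template with free numeric parameters is instantiated into many concrete variants, each with a programmatically recomputed ground truth, so that consistency can be measured across surface-level changes (the same parameters can also drive figure rendering). The template wording, parameter ranges, and names below are illustrative assumptions in the spirit of DynaMath, not a released pipeline.

```python
import random
from dataclasses import dataclass

@dataclass
class Variant:
    question: str
    answer: float
    params: dict

def rectangle_area_variants(n: int, seed: int = 0) -> list[Variant]:
    """Instantiate n variants of one seed geometry problem.

    Each variant changes the numeric parameters (a figure renderer could
    equally vary colours, axis labels, or style), and the ground-truth
    answer is recomputed from those parameters rather than stored by hand.
    """
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        w, h = rng.randint(2, 12), rng.randint(2, 12)
        variants.append(Variant(
            question=(f"A rectangle in the figure is {w} cm wide and {h} cm tall. "
                      f"What is its area in square centimetres?"),
            answer=float(w * h),
            params={"width": w, "height": h},
        ))
    return variants

for v in rectangle_area_variants(3):
    print(v.question, "->", v.answer)
```

Worst-case and average-case accuracy over such variant sets (see the sketch in Section 3) then provide the robustness signal that a single static test set cannot.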
5. Benchmark Impact and Community Directions
These benchmarks are shaping research and practical directions in multiple ways:
- Driving Model Development: Results have motivated vision-centric pretraining, architecture modifications for enhanced visual-token parsing, and multi-image context fusion modules.
- Revealing Critical Gaps: The scale and persistence of model error relative to humans, even on elementary-school-level problems and pure visual-reasoning tasks, show that current state-of-the-art models fall well short of robust diagram understanding, visual compositionality, and true symbol grounding.
- Procedural/Process Supervision: Recent protocols encourage feedback on explicit derivational steps or intermediate representations (e.g., CoT scoring, chain-of-visual-operations), rather than answer-only supervision.
- Preventing Shortcuts: Removal or shuffling of redundant descriptive text, contamination-resistant colormaps, and construction of unseen variants ensure results reflect genuine abstraction, not memorized regularities.
- Extensible, Modular Design: Public releases of code, images, and evaluation scripts with detailed documentation (e.g., FractalBench, MathSticks, VisAidMath, DynaMath) facilitate ongoing community benchmarking and extension.
6. Challenges, Limitations, and Future Research
Despite substantial progress, several universal challenges remain:
- Scarcity of Realistic, Noisy Data: Most datasets rely on synthetic or clean visualizations; MathReal and MathScape demonstrate that authentic, handheld captures introduce highly nontrivial perception difficulties (Feng et al., 8 Aug 2025, Zhou et al., 14 Aug 2024).
- Robustness to Visual Variation: Systematic study of model instability under parametrized variations is new but essential (DynaMath).
- Multi-step, Multi-modal Integration: VideoMathQA is among the first to require temporal reasoning over long video segments, pushing evaluations into cross-modal, cross-time complexity (Rasheed et al., 5 Jun 2025).
- Semantic Evaluation of Visual Aids/Constructions: Current n-gram or format-based metrics may miss semantic equivalence or structural similarity (VisAidMath).
- Comprehensiveness: While some new benchmarks (MathBookEval) approach exhaustive high-school knowledge coverage, extending to full university-level domains, multi-problem compositions, and multilingual contexts remains incomplete.
Ongoing research is focused on combining improved visual perception modules, structured reasoning induction, explicit program synthesis, and process-level supervision in end-to-end architectures. The release and adoption of increasingly sophisticated visual math benchmarks are essential to advancing these directions and closing the gap to human-level visual-mathematical reasoning.
Key References:
- "FractalBench: Diagnosing Visual-Mathematical Reasoning Through Recursive Program Synthesis" (Ondras et al., 9 Nov 2025)
- "GSM8K-V: Can Vision LLMs Solve Grade School Math Word Problems in Visual Contexts" (Yuan et al., 29 Sep 2025)
- "MathSticks: A Benchmark for Visual Symbolic Compositional Reasoning with Matchstick Puzzles" (Ji et al., 1 Oct 2025)
- "MathSight: A Benchmark Exploring Have Vision-LLMs Really Seen in University-Level Mathematical Reasoning?" (Wang et al., 28 Nov 2025)
- "MaRVL-QA: A Benchmark for Mathematical Reasoning over Visual Landscapes" (Pande et al., 24 Aug 2025)
- "VisioMath: Benchmarking Figure-based Mathematical Reasoning in LMMs" (Li et al., 7 Jun 2025)
- "Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency" (Wang et al., 24 Apr 2025)
- "VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos" (Rasheed et al., 5 Jun 2025)
- "MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal LLMs" (Feng et al., 8 Aug 2025)
- "MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?" (Zhang et al., 21 Mar 2024)
- "We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning" (Qiao et al., 14 Aug 2025)
- "MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts" (Wang et al., 28 Feb 2025)
- "DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision LLMs" (Zou et al., 29 Oct 2024)
- "Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset" (Wang et al., 22 Feb 2024)
- "VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning" (Ma et al., 30 Oct 2024)
- "MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification" (Sun et al., 7 Apr 2024)
- "We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?" (Qiao et al., 1 Jul 2024)
- "MathScape: Evaluating MLLMs in multimodal Math Scenarios through a Hierarchical Benchmark" (Zhou et al., 14 Aug 2024)
- "MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning" (Li et al., 24 Jul 2025)
- "Evaluating Visual Mathematics in Multimodal LLMs: A Multilingual Benchmark Based on the Kangaroo Tests" (Sáez et al., 9 Jun 2025)