MathVL-test: Multimodal Math Reasoning Benchmark

Updated 14 November 2025

MathVL-test is a benchmark designed to assess mathematical and spatial reasoning via controlled synthetic surface plots.
It uses rigorous topological counting and transformation recognition tasks to isolate abstract mathematical analysis from visual semantics.
Evaluations reveal that even top-performing models struggle with increased complexity and stylistic variance, highlighting critical limitations.

MathVL-test, also cited as MaRVL-QA-Mini, is a standardized, multimodal benchmark specifically constructed to probe the mathematical and spatial reasoning ability of vision-LLMs. Unlike typical visual mathematical datasets that are confounded by semantic cues or object references, MathVL-test focuses on surface plot analysis, isolating the model’s pure capacity for mathematical abstraction from image data. It has become a reference testbed for evaluating advancements in multimodal reasoning and neural model interpretability in the mathematical sciences.

1. Purpose, Motivation, and Design Philosophy

MathVL-test was conceived to address the limitations of conventional visual mathematics evaluation, where semantic information (object classes, text labels) in natural images or diagrammatic problems can inadvertently assist models via pattern recognition rather than genuine spatial or topological reasoning. By using synthetically rendered 3D surface plots of the form $z = f(x, y)$, with function families drawn from mathematical analysis and physics, the benchmark eliminates non-mathematical confounders and forces models to reason about features such as maxima, minima, and geometric transformations based strictly on mathematical structure.

The specific objectives are:

To quantify and dissect the reasoning skills of multimodal large language models (MLLMs) beyond semantic object recognition.
To provide a controlled testbed free of domain overlap with natural-image pretraining.
To support systematic error analysis and to guide model and algorithmic improvements in multimodal mathematical reasoning (Pande et al., 24 Aug 2025).

2. Core Task Definitions and Mathematical Formalism

MathVL-test comprises two principal tasks:

2.1 Topological Counting

Given an image of a surface plot representing $z = f(x, y)$ over a domain $D \subset \mathbb{R}^2$, the task is to count the strict local maxima or minima. Extremal points are defined by the usual analytic conditions:

Critical points: $(x_0, y_0)$ where $\nabla f(x_0, y_0) = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right)\Big|_{(x_0, y_0)} = (0, 0)$.
The Hessian matrix at $(x, y)$ is $H_f = \begin{pmatrix} f_{xx} & f_{xy} \\ f_{xy} & f_{yy} \end{pmatrix}$.
Strict local maximum: $\nabla f(x_0, y_0) = 0$ and $H_f(x_0, y_0)$ negative-definite (both eigenvalues $< 0$).
Strict local minimum: $\nabla f(x_0, y_0) = 0$ and $H_f(x_0, y_0)$ positive-definite (both eigenvalues $> 0$).
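The eigenvalue test above can be sketched in a few lines. This is a minimal illustration, not the benchmark's own code: the function name `classify_critical_point` and the tolerance `tol` are assumptions introduced here.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-8):
    """Classify a critical point from the eigenvalue signs of its 2x2 Hessian."""
    # eigvalsh assumes a symmetric matrix and returns eigenvalues in ascending order.
    eigvals = np.linalg.eigvalsh(np.asarray(hessian, dtype=float))
    if np.all(eigvals < -tol):
        return "strict local maximum"   # negative-definite
    if np.all(eigvals > tol):
        return "strict local minimum"   # positive-definite
    if eigvals[0] < -tol and eigvals[-1] > tol:
        return "saddle"                 # indefinite
    return "degenerate"                 # at least one (near-)zero eigenvalue

# f(x, y) = -(x**2 + y**2) has Hessian diag(-2, -2) at the origin.
print(classify_critical_point([[-2.0, 0.0], [0.0, -2.0]]))
```

Points with a near-zero eigenvalue are reported separately because the second-derivative test is inconclusive there.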

Critical points are numerically detected with the following pipeline:

Sample $f$ on a $2000 \times 2000$ grid to estimate candidate peaks/valleys.
Refine each seed using numerical root-finding (e.g., Newton’s method) to high precision.
Reject points where $|\nabla f| > \varepsilon$ or that lie within $\delta/2$ of the domain boundary, where $\delta$ is the minimum separation introduced in the next step.
Enforce a minimum separation $\delta \approx 0.1 \times$ (domain size) between extrema.
Certify classification as maxima/minima via Hessian eigenvalue analysis.
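The five steps above might look roughly like the following sketch for maxima. It rests on several assumptions that are not in the source: analytic `grad` and `hess` callables are supplied by the caller, the grid is coarser than $2000 \times 2000$ to keep the example fast, and points that fail a check are simply skipped (the benchmark may instead discard the whole surface). All function names are hypothetical; the test surface is the two-Gaussian function from Section 6.1.

```python
import numpy as np

CENTERS = [(1.0, 1.0), (-1.0, -1.0)]  # two-Gaussian surface used for illustration

def f(x, y):
    return sum(np.exp(-((x - a) ** 2 + (y - b) ** 2)) for a, b in CENTERS)

def grad(x, y):
    g = np.zeros(2)
    for a, b in CENTERS:
        e = np.exp(-((x - a) ** 2 + (y - b) ** 2))
        g += -2.0 * e * np.array([x - a, y - b])
    return g

def hess(x, y):
    H = np.zeros((2, 2))
    for a, b in CENTERS:
        u, v = x - a, y - b
        e = np.exp(-(u ** 2 + v ** 2))
        H += e * np.array([[4 * u * u - 2, 4 * u * v], [4 * u * v, 4 * v * v - 2]])
    return H

def find_strict_maxima(f, grad, hess, d=3.0, n=400, eps=1e-4, delta_frac=0.1):
    """Sketch of the detection pipeline on [-d, d]^2 (maxima only)."""
    delta = delta_frac * 2 * d                  # minimum separation
    xs = np.linspace(-d, d, n)
    X, Y = np.meshgrid(xs, xs, indexing="ij")
    Z = f(X, Y)
    # Step 1: seeds are interior grid points not lower than any of their 8 neighbours
    # (ties are deduplicated later by the separation check).
    centre = Z[1:-1, 1:-1]
    is_peak = np.ones_like(centre, dtype=bool)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di or dj:
                is_peak &= centre >= Z[1 + di:n - 1 + di, 1 + dj:n - 1 + dj]
    accepted = []
    for i, j in np.argwhere(is_peak):
        p = np.array([xs[i + 1], xs[j + 1]])
        # Step 2: Newton iteration on grad f = 0.
        for _ in range(50):
            g = grad(*p)
            if np.linalg.norm(g) < 1e-12:
                break
            try:
                p = p - np.linalg.solve(hess(*p), g)
            except np.linalg.LinAlgError:
                break
        # Step 3: reject poorly converged points and points near the boundary.
        if np.linalg.norm(grad(*p)) > eps or np.any(np.abs(p) > d - delta / 2):
            continue
        # Step 4: enforce the minimum separation between accepted extrema.
        if any(np.linalg.norm(p - q) < delta for q in accepted):
            continue
        # Step 5: certify with the Hessian (negative-definite means strict maximum).
        if np.all(np.linalg.eigvalsh(hess(*p)) < 0):
            accepted.append(p)
    return accepted

maxima = find_strict_maxima(f, grad, hess, d=3.0)
print(len(maxima), [np.round(p, 4) for p in maxima])
```

Counting minima would follow the same pattern with the neighbour comparison and the eigenvalue sign reversed.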

2.2 Transformation Recognition

Given two surface plots $A$ and $B$ depicting the same function (possibly transformed), the model must identify which of the following group actions maps $A \to B$:

Rotations: $R_\theta(x, y) = (x\cos\theta - y\sin\theta, x\sin\theta + y\cos\theta)$ for $\theta \in \{90^\circ, 180^\circ\}$ (clockwise).
Translations: $T_v(x, y) = (x - v_x, y - v_y)$, with $|v_x|$ or $|v_y| \in [0.15, 0.25] \cdot$ (domain range), applied along a single axis.
“No Change” is included as an option.

The answer must be one of {No Change, Rotate 90°, Rotate 180°, Translate $+x$, Translate $+y$}. Stringent ambiguity filtering ensures that surface symmetry does not confound the mapping: functions exhibiting transformation symmetry at the required parameters are rejected.
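As a rough illustration, the coordinate maps above can be written as small Python closures and composed with a surface function. The helper names (`rotation`, `translation`, `sample_shift`, `transformed`) are hypothetical, and the sketch assumes plot $B$ shows $f \circ T$ with $R_\theta$ applied literally as written; the benchmark's actual direction convention may differ.

```python
import numpy as np

def rotation(theta_deg):
    """R_theta(x, y) exactly as in the formula above."""
    t = np.deg2rad(theta_deg)
    c, s = np.cos(t), np.sin(t)
    return lambda x, y: (x * c - y * s, x * s + y * c)

def translation(vx=0.0, vy=0.0):
    """T_v(x, y) = (x - vx, y - vy)."""
    return lambda x, y: (x - vx, y - vy)

def sample_shift(domain_range, rng):
    """Draw a shift magnitude in [0.15, 0.25] * domain range."""
    return rng.uniform(0.15, 0.25) * domain_range

def transformed(f, transform):
    """Return the surface g(x, y) = f(T(x, y))."""
    return lambda x, y: f(*transform(x, y))

f = lambda x, y: np.sin(x) * np.cos(y)
g_rot = transformed(f, rotation(90))

rng = np.random.default_rng(0)
shift = sample_shift(2 * np.pi, rng)           # domain [-pi, pi] has range 2*pi
g_shift = transformed(f, translation(vx=shift))

print(g_rot(0.3, 0.7), g_shift(0.3, 0.7))
```

Because each transform is an ordinary function of `(x, y)`, the same helpers work on scalars and on NumPy meshgrids.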

3. Dataset Construction and Ambiguity Filtering

3.1 Function Library and Sampling

The source function families number 32, organized into five tiers of complexity (planes → quadrics → periodic surfaces → Gaussian mixtures → “specials”).
Parameterization (amplitudes, frequencies, centers, etc.) is sampled uniformly per family.
Each surface is rendered over a carefully selected $[-d, d]^2$ domain, ensuring the centrality of key features.

3.2 Rendering Styles and Robustness

Three visual styles for every instance: heatmap, contour, and overlay; colormaps are chosen from {viridis, plasma, inferno, magma}.
Plots always include axis tick labels, which are held constant for paired images in Transformation tasks.
For additional robustness testing, style-mismatch pairs are included in the Transformation Recognition subset.

3.3 Ambiguity and Confounder Elimination

Ambiguity filtering is essential to preclude visually indistinct cases or ones with ambiguous ground truth:

Extrema separation: $\| (x_i, y_i) - (x_j, y_j)\| > \delta$ (with $\delta \approx 0.1 \times$ domain size).
Numerics: Only accept points with $|\nabla f| < \varepsilon \approx 10^{-4}$.
Discard surface instances with extrema near the boundary.
For Transformations: explicitly reject functions invariant under the tested transformation; compare pixelwise $L_2$ distances to eliminate cases where, e.g., a rotation is visually indistinguishable from a translation.
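One way to picture the transformation filter is to compare every pair of candidate answers and flag those whose images are nearly identical. The sketch below is a simplified stand-in: it compares sampled function values rather than rendered pixels, uses a root-mean-square (normalized $L_2$) distance, and the names `sample`, `ambiguous_pairs`, `shift_frac`, and `tol` are assumptions, as is the threshold value.

```python
import numpy as np

def sample(f, transform, d, n=128):
    """Evaluate f(T(x, y)) on an n x n grid over [-d, d]^2."""
    xs = np.linspace(-d, d, n)
    X, Y = np.meshgrid(xs, xs, indexing="ij")
    return f(*transform(X, Y))

def ambiguous_pairs(f, d, shift_frac=0.2, tol=1e-3, n=128):
    """List pairs of candidate answers that produce near-identical images."""
    v = shift_frac * 2 * d  # shift within [0.15, 0.25] * domain range
    candidates = {
        "No Change": lambda x, y: (x, y),
        "Rotate 90": lambda x, y: (-y, x),      # R_theta with theta = 90 degrees
        "Rotate 180": lambda x, y: (-x, -y),    # R_theta with theta = 180 degrees
        "Translate +x": lambda x, y: (x - v, y),
        "Translate +y": lambda x, y: (x, y - v),
    }
    images = {name: sample(f, t, d, n) for name, t in candidates.items()}
    names = list(images)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            rms = np.sqrt(np.mean((images[a] - images[b]) ** 2))
            if rms < tol:
                pairs.append((a, b))
    return pairs

radial = lambda x, y: np.exp(-(x ** 2 + y ** 2))                    # rotation-symmetric
off_centre = lambda x, y: np.exp(-((x - 1.0) ** 2 + (y - 0.5) ** 2))

print(ambiguous_pairs(radial, d=3.0))      # non-empty: this surface would be rejected
print(ambiguous_pairs(off_centre, d=3.0))  # empty: all five answers are distinguishable
```

A surface is kept only when the returned list is empty, so that exactly one answer option matches the image pair.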

4. Benchmark Size, Splits, and Evaluation Protocol

4.1 Dataset Statistics

Subset	Number of Examples	Comments
Topological Counting	1,548	All unique, ambiguity-filtered plots
Transformation Recognition	1,200	300 per transformation type (Rotate 90°, Rotate 180°, Translate $+x$, Translate $+y$), split into 600 same-style and 600 style-mismatched pairs

Ground-truth answers are always a single integer (count or option number).

4.2 Evaluation and Scoring

Strict zero-shot prompting is enforced, requiring models to emit a single answer in an XML schema: <final_answer>…</final_answer>.
Outputs are verified by an LLM (GPT-4.1) against the standard answer key.
The main metric is answer accuracy (percent correct), with confidence intervals of approximately $\pm 0.5\%$ due to test set size ($n = 2{,}748$).
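A minimal scoring sketch under stated assumptions: the source verifies outputs with an LLM (GPT-4.1), whereas the code below substitutes a plain regular expression for illustration. The helper names `extract_answer` and `accuracy` are hypothetical, and taking the last tag when several are present is a design choice made here, not a documented rule.

```python
import re

# Matches the required <final_answer>...</final_answer> wrapper.
_TAG = re.compile(r"<final_answer>\s*(.*?)\s*</final_answer>", re.DOTALL)

def extract_answer(output):
    """Return the integer inside the last final_answer tag, or None if unusable."""
    matches = _TAG.findall(output)
    if not matches:
        return None
    try:
        return int(matches[-1])
    except ValueError:
        return None

def accuracy(outputs, answers):
    """Percent of outputs whose extracted integer equals the ground truth."""
    correct = sum(extract_answer(o) == a for o, a in zip(outputs, answers))
    return 100.0 * correct / len(answers)

outputs = [
    "Two peaks are visible. <final_answer>2</final_answer>",
    "I think the answer is 3",                      # missing tag counts as wrong
    "<final_answer> 4 </final_answer>",
    "<final_answer>four</final_answer>",            # non-integer counts as wrong
]
print(accuracy(outputs, [2, 3, 4, 4]))
```

Outputs without a parsable integer are scored as incorrect rather than skipped, which keeps the denominator equal to the number of test items.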

5. Experimental Results and Failure Analysis

5.1 Topological Counting Performance

The top model (o4-mini) achieves 58.91% overall accuracy (maxima: 60.91%, minima: 57.14%).
Models consistently perform better when counting maxima than minima; minima (“valleys”) have lower contrast and are less salient.
There is a steep decline in accuracy as count increases: o4-mini scores 71.2% for count < 7 but only 14.8% for count $\geq$ 13.
Typical errors include over- or under-counting by 1–3 when peaks/valleys cluster or when signal contrast is low.
Breakdown is near-complete (accuracy approaching zero) for very high-count or dark/low-contrast cases.

5.2 Transformation Recognition Performance

Best accuracy: o4-mini at 67.92%, o3 at 67.0%.
Translations (mean $\sim$ 83%) are much easier for models than rotations (51–54%).
When uncertain, many models default to “No Change.”
Some architectures (e.g., LLaVA–34b) degenerate to a single-heuristic guessing pattern, always choosing a specific transformation.
Style-robustness is generally high, but some models paradoxically score higher on style-mismatched inputs, suggesting exploitation of superficial visual features.

6. Representative Task Examples

6.1 Topological Counting

Given $f(x, y) = e^{-((x-1)^2 + (y-1)^2)} + e^{-((x+1)^2 + (y+1)^2)}$ on $D = [-3, 3]^2$ (contour+heatmap), the “How many local maxima?” question is solved by:

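Numerically solving $\nabla f = 0$ yields three critical points: one near $(1, 1)$, one near $(-1, -1)$, and one at the origin.
The two outer points have negative-definite Hessians (strict maxima) and are well separated and away from the boundary; the origin has an indefinite Hessian (a saddle) and is not counted.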
No other extrema survive the filtering.
Answer: <final_answer>2</final_answer>.
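For readers who want to check the count by hand, the derivatives of this example can be worked out directly. The block below is a verification sketch added here, with $a_1 = 1$ and $a_2 = -1$ denoting the two Gaussian centers; it evaluates the Hessian at $(1, 1)$ as an approximation, since the true maximum is displaced toward the origin by less than $10^{-3}$.

```latex
\nabla f(x, y) = -2 \sum_{k=1}^{2}
  \begin{pmatrix} x - a_k \\ y - a_k \end{pmatrix}
  e^{-((x - a_k)^2 + (y - a_k)^2)}, \qquad a_1 = 1,\; a_2 = -1

% Origin: the two contributions cancel, so it is a critical point.
\nabla f(0, 0) = 2e^{-2}\begin{pmatrix} 1 \\ 1 \end{pmatrix}
               - 2e^{-2}\begin{pmatrix} 1 \\ 1 \end{pmatrix} = 0,
\qquad
H_f(0, 0) = e^{-2}\begin{pmatrix} 4 & 8 \\ 8 & 4 \end{pmatrix},
\quad \lambda = 12e^{-2},\; -4e^{-2} \quad \text{(saddle)}

% Near (1, 1): the second Gaussian contributes only O(e^{-8}) corrections.
H_f(1, 1) = \begin{pmatrix} -2 + 14e^{-8} & 16e^{-8} \\ 16e^{-8} & -2 + 14e^{-8} \end{pmatrix},
\quad \lambda = -2 + 30e^{-8},\; -2 - 2e^{-8} \quad \text{(both } < 0\text{)}
```

By the symmetry $f(x, y) = f(-x, -y)$, the same Hessian applies at $(-1, -1)$, which gives the two strict maxima.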

6.2 Transformation Recognition

Given $f(x, y) = \sin(x)\cos(y)$ on $D = [-\pi, \pi]^2$, and two heatmaps $A$, $B$, with tick labels unchanged, $B$ shows ridges along $x=0$ rather than $y=0$. This implicates a $90^\circ$ clockwise rotation $(x, y) \to (y, -x)$. The correct answer: <final_answer>2</final_answer>.
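The ridge observation can be checked numerically. The sketch below assumes $B$ is $A$ evaluated at the mapped coordinates $(y, -x)$ stated above, and uses a hypothetical helper `line_strength` that averages $|z|$ along the lines $y = 0$ and $x = 0$.

```python
import numpy as np

n = 201                                   # odd, so index n // 2 sits at coordinate 0
xs = np.linspace(-np.pi, np.pi, n)
X, Y = np.meshgrid(xs, xs, indexing="ij")

f = lambda x, y: np.sin(x) * np.cos(y)
A = f(X, Y)
B = f(Y, -X)                              # coordinate map (x, y) -> (y, -x)

def line_strength(Z):
    """Mean |z| along the line y = 0 and along the line x = 0."""
    mid = n // 2
    along_y0 = np.mean(np.abs(Z[:, mid]))   # vary x, hold y = 0
    along_x0 = np.mean(np.abs(Z[mid, :]))   # vary y, hold x = 0
    return along_y0, along_x0

print(line_strength(A))   # structure concentrated on y = 0
print(line_strength(B))   # structure concentrated on x = 0
```

In $A$ the extrema sit on the line $y = 0$, while in $B$ they sit on $x = 0$, which is the visual cue the example relies on.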

7. Impact, Use, and Limitations

MathVL-test (MaRVL-QA-Mini) provides a rigorous and highly controlled paradigm for evaluating the mathematical and spatial reasoning of MLLMs in abstraction from semantic object recognition and pattern-matching biases. Its curated, ambiguity-filtered design ensures that performance reflects true conceptual and visual reasoning, not lexical or dataset artifacts. Key takeaways from its deployment:

Even state-of-the-art models perform well below expert human levels: the best reported accuracies are 58.91% on Topological Counting and 67.92% on Transformation Recognition.
Performance rapidly deteriorates with increasing task complexity (feature count, non-salient structures).
Some models are sensitive to visualization style and class imbalance, exploiting stylistic or plot-based heuristics rather than performing genuine mathematical analysis.
The benchmark is agnostic to most common pretraining sets, revealing “blind spots” in neural visual understanding.

A plausible implication is that MathVL-test reveals profound limitations in current neural architectures’ ability to generalize mathematical spatial reasoning from images. Its methodology is now broadly referenced for work seeking to decompose perceptual, heuristic, and conceptual errors in visual mathematical AI (Pande et al., 24 Aug 2025).

References (1)

MaRVL-QA: A Benchmark for Mathematical Reasoning over Visual Landscapes (2025)
