
Multimodal Math-Reasoning Benchmarks

Updated 24 November 2025
  • Benchmarks in this area rigorously assess large multimodal models’ abilities to process integrated visual and textual mathematical data.
  • They employ diverse question formats, including multi-step, diagram-based, and open-ended tasks, to quantify fine-grained perception and logical reasoning.
  • Evaluations use detailed taxonomies and robustness tests to reveal the gap between human and current model performance, guiding future improvements.

Multimodal math-reasoning benchmarks are standardized evaluation suites that rigorously assess the mathematical and visual reasoning capabilities of large multimodal models (LMMs/MLLMs), which jointly process text and images or video. These benchmarks cover diverse mathematical domains, require visual perception, symbolic manipulation, spatial reasoning, and logical inference, and increasingly demand robustness in real-world conditions. The field has advanced from early synthetic figure reasoning tasks to large-scale, taxonomy-driven, multi-image, and multi-modal scenarios involving photographs, composite table layouts, and even videos. Although recent LMMs exhibit notable progress, evaluations consistently reveal substantial gaps between model and human performance, especially in fine-grained perception, cross-modal integration, and robust multi-step reasoning.

1. Benchmark Scope, Taxonomies, and Dataset Construction

Multimodal math-reasoning benchmarks are characterized by broad coverage along several dimensions:

  • Domain Coverage and Granularity: Benchmarks range from elementary arithmetic and geometric reasoning (VCBench (Wang et al., 24 Apr 2025), We-Math (Qiao et al., 1 Jul 2024)) to advanced Olympiad or competition-level geometry (MathLens (Chung et al., 2 Oct 2025), MathCheck-GEO (Zhou et al., 11 Jul 2024)) and solid geometry (SolidGeo (Wang et al., 27 May 2025)). Leading datasets incorporate up to 16 subfields (MATH-Vision (Wang et al., 22 Feb 2024)), 67 fine-grained knowledge concepts (We-Math), or thousands of labeled knowledge points (CMMaTH (Li et al., 28 Jun 2024)).
  • Visual and Linguistic Modalities: Problems require reasoning over diagrams, charts, function plots, tables, block diagrams, and integrated text-image pages or video sequences (VideoMathQA (Rasheed et al., 5 Jun 2025)), with some datasets emphasizing noisy real-scene photos (MathReal (Feng et al., 8 Aug 2025), MathScape (Zhou et al., 14 Aug 2024)) or multi-visual-panel structure (MV-MATH (Wang et al., 28 Feb 2025)), and others focusing on fine-grained graphical answer options (VisioMath (Li et al., 7 Jun 2025)).
  • Question Formats: Multiple-choice (MC), fill-in-the-blank, open-ended numeric/symbolic, multi-step chains-of-thought, proof/derivation, and outcome/process judgment tasks are represented, with increasing emphasis on step-wise or subproblem decomposition (We-Math, MathScape, MM-MATH (Sun et al., 7 Apr 2024)).
  • Annotation and Structure: Modern benchmarks annotate each item with taxonomy fields—problem type, knowledge, grade level, visual tags, and difficulty—enabling detailed analytics (MathScape, CMMaTH, MATH-Vision).
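
As a concrete illustration of such taxonomy-driven annotation, the following sketch shows one plausible item schema; the field names are illustrative placeholders rather than the release format of any particular benchmark.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class BenchmarkItem:
    """Illustrative schema for a taxonomy-annotated multimodal math problem."""
    item_id: str
    question: str                        # problem statement, possibly with image placeholders
    image_paths: List[str]               # one or more diagrams, photos, or chart renders
    answer: str                          # gold answer: choice label, numeric, or symbolic
    question_format: str                 # e.g. "multiple_choice", "fill_in_blank", "open_ended"
    math_domain: str                     # e.g. "plane_geometry", "solid_geometry", "statistics"
    knowledge_concepts: List[str] = field(default_factory=list)  # fine-grained concept tags
    grade_level: Optional[str] = None    # e.g. "grade_8", "competition"
    difficulty: Optional[int] = None     # ordinal difficulty label
    solution_steps: List[str] = field(default_factory=list)      # reference chain of thought
```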

Many datasets undergo multi-stage curation, including expert selection from contest archives (MATH-Vision), synthetic or LLM-based distractor/variant generation (MathCheck-GEO), and real-world photo post-processing (MathReal, MathScape). The table below summarizes salient dataset dimensions:

| Benchmark | Problems | Modalities | Math Domains |
|---|---|---|---|
| MATH-Vision | 3,040 | Text + diagrams/charts | 16 |
| MM-MATH | 5,929 | Text + diagrams (grades 7–9, open-ended) | Geometry-centric |
| VisioMath | 1,800 | Figures, MC image options | Geometry, data vis. |
| SolidGeo | 3,113 | 3D diagrams, projections | Solid geometry |
| MathLens | 926 | Rendered diagrams + text | Geometry |
| MV-MATH | 2,009 | 2–8 images per problem | 11 |
| MathReal | 2,000 | Mobile photos (OCR required) | 5 K-12 domains |
| PolyMATH | 5,000 | Text + diverse cognitive diagrams | 10 skills |

2. Evaluation Protocols and Metrics

Each benchmark adopts quantitative metrics that emphasize accuracy, but modern evaluations incorporate much finer granularity:

  • Aggregate Accuracy: The standard metric is $\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^N \mathbf{1}[\hat{y}_i = y_i]$ for choice or boxed-answer tasks (Wang et al., 22 Feb 2024, Wang et al., 24 Apr 2025, Zhou et al., 14 Aug 2024). Some datasets distinguish between loose and strict grading (MathReal); a scoring sketch follows this list.
  • Stepwise/Process Metrics: MM-MATH and MathScape introduce process-level or sub-answer segmentation, scoring each intermediate step for correctness. MM-MATH uses LLM-as-a-judge to classify first-step error types (diagram misinterpretation, logical reasoning, calculation, condition misunderstanding) (Sun et al., 7 Apr 2024). We-Math quantifies Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM) via sub-problem outcome grids.
  • Robustness and Consistency: MathLens and MathCheck-GEO assess robustness via semantic diagram perturbations (flip, rotate, clutter injection) and multiple variants per seed problem. Consistency Rate (CR) captures answer stability across variants. MathCheck-GEO’s 4×4 matrix evaluates problem-solving, answerable/outcome/process judgments, and four robust variants per problem.
  • Multimodal Table and Video Reasoning: MMTBench measures EM/F1/subspan accuracy on multimodal table QA—including numeric, visual, and chart integration—via multiple baselines: table-as-image, image-caption, entity-replacement, and interleaved scenarios (Titiya et al., 27 May 2025). VideoMathQA adopts MCQ/MBin and step-aligned CoT scoring (Rasheed et al., 5 Jun 2025).
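
The following sketch illustrates the two aggregate quantities above, assuming predictions have already been normalized to canonical answer strings; the consistency-rate function reflects one reasonable reading of answer stability (all variants of a seed problem answered correctly), not any benchmark's exact scoring script.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def aggregate_accuracy(preds: List[str], golds: List[str]) -> float:
    """Acc = (1/N) * sum over i of 1[pred_i == gold_i]."""
    assert len(preds) == len(golds) and golds
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def consistency_rate(variant_results: List[Tuple[str, bool]]) -> float:
    """Fraction of seed problems whose perturbed variants are all answered correctly.

    `variant_results` holds (seed_id, is_correct) pairs, one per variant
    (e.g. flipped, rotated, or re-labelled versions of the same diagram).
    """
    by_seed: Dict[str, List[bool]] = defaultdict(list)
    for seed_id, correct in variant_results:
        by_seed[seed_id].append(correct)
    return sum(all(v) for v in by_seed.values()) / len(by_seed)

# Example: three items for accuracy; two seed problems with three variants each for CR.
print(aggregate_accuracy(["B", "7", "B"], ["B", "7", "C"]))            # ~0.667
print(consistency_rate([("p1", True), ("p1", True), ("p1", True),
                        ("p2", True), ("p2", False), ("p2", True)]))   # 0.5
```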

3. Characteristic Reasoning Tasks and Error Analysis

Benchmarks demand a range of subskills:

  • Fine-Grained Visual Perception: Identification and discrimination of geometric features, chart axes, symbolic annotations, and subtle context-dependent cues. For example, VisioMath requires distinguishing slight variations in parabola orientation or angle-bisector placements among answer images (Li et al., 7 Jun 2025); MM-MATH shows 61% of errors are diagram misinterpretation (Sun et al., 7 Apr 2024).
  • Mathematical Deduction and Integration: Reasoning over algebraic, geometric, and logical relationships given partial or ambiguous visual data. PolyMATH tasks demand higher-order transformations, spatial visualization, programmatic pattern extrapolation, and chain-of-inference deduction (Gupta et al., 6 Oct 2024). Multi-step reasoning (MM-MATH, MathScape, We-Math) exposes “failure cascades” from early vision errors to downstream logical mistakes.
  • Cross-Modal Integration: Coordinated understanding of visual and textual information, particularly in multi-image or multi-panel contexts (MV-MATH (Wang et al., 28 Feb 2025), VCBench (Wang et al., 24 Apr 2025)). MathLens decomposes overall error into perception (visual extraction), reasoning (textual deduction), and integration (cross-modal grounding), with findings that integration remains the dominant unresolved failure mode (Chung et al., 2 Oct 2025).
  • Robustness to Real-World Conditions and Multi-Visual Inputs: MathReal and MathScape highlight severe performance degradation on noisy, perspective-distorted photos and multi-source imagery, with OCR and labeling errors accounting for up to 40–50% of total mistakes (Feng et al., 8 Aug 2025, Zhou et al., 14 Aug 2024).

| Error Type | Typical Fraction | Benchmarks Highlighting It |
|---|---|---|
| Diagram/vision misread | 40–61% | MM-MATH, MathReal, PolyMATH |
| Logical reasoning error | 18–42% | MM-MATH, PolyMATH, MATH-Vision |
| Calculation error | 1–11% | MM-MATH, PolyMATH, MathReal |
| Condition/label error | ~10% | MM-MATH, MathReal |
| Rote memorization/shortcut | up to 75% | We-Math, MathLens |
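
A minimal sketch of how error-type fractions like those in the table above can be tabulated, assuming each failed item has already been assigned a first-error label (for example by an LLM judge); the label set mirrors the MM-MATH categories described earlier but is otherwise illustrative.

```python
from collections import Counter
from typing import Dict, List

ERROR_TYPES = [
    "diagram_misinterpretation",   # visual perception failure
    "logical_reasoning",           # wrong deduction despite a correct reading
    "calculation",                 # arithmetic or algebraic slip
    "condition_misunderstanding",  # missed or misread problem condition
]

def error_fractions(first_error_labels: List[str]) -> Dict[str, float]:
    """Share of failed items attributed to each first-error type."""
    counts = Counter(first_error_labels)
    total = sum(counts.values())
    return {e: counts.get(e, 0) / total for e in ERROR_TYPES}

# Example: ten judged failures, six of them perception errors.
labels = (["diagram_misinterpretation"] * 6 +
          ["logical_reasoning"] * 3 +
          ["calculation"])
print(error_fractions(labels))  # {'diagram_misinterpretation': 0.6, ...}
```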

4. Model and Training Paradigms: Findings and Limitations

Comprehensive benchmarking studies consistently show a substantial gap between human and SOTA LMM performance:

  • Aggregate Outcomes: Human accuracy ranges from 77% to 95% depending on the task (MATH-Vision, SolidGeo, MathReal), while closed-source models (Claude-3.5, GPT-4o, Gemini-2.5) top out at 30–55% accuracy, with the best results on multi-image geometry (e.g., Gemini2-Flash on VCBench), and open-source LMMs commonly score under 30% (Wang et al., 24 Apr 2025, Wang et al., 22 Feb 2024, Wang et al., 27 May 2025).
  • Multi-Step and Robustness Declines: All models exhibit sharp performance decrements as problem complexity increases (multi-step in We-Math, MathScape, MV-MATH), or as image conditions degrade. In We-Math, each required reasoning step reduces composite accuracy by ~14–16 percentage points.
  • Training Paradigm Insights:
    • Reinforcement learning (RL) predominantly improves perception—especially with a strong text-SFT foundation; integration lags regardless of pretraining (MathLens (Chung et al., 2 Oct 2025)).
    • Knowledge concept augmentation (KCA) reduces insufficient knowledge errors but does not improve generalization (We-Math).
    • Chain-of-thought (CoT) and process evaluation reveal that many models can reach the correct answer yet fail to solve aligned subproblems, exposing reliance on superficial patterns or “shortcut” memorization rather than compositional reasoning.
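
A minimal sketch of the We-Math-style four-way diagnosis, assuming each composite problem is paired with the outcomes of its aligned one-step sub-problems; the decision rule here is a simplified reading of the IK/IG/CM/RM definitions, not We-Math's released scorer.

```python
from typing import List

def wemath_category(composite_correct: bool, sub_correct: List[bool]) -> str:
    """Simplified four-way diagnosis from composite and sub-problem outcomes.

    CM: composite and every sub-problem solved (complete mastery).
    RM: composite solved but some sub-problem missed (rote/shortcut signal).
    IG: every sub-problem solved but the composite missed (inadequate generalization).
    IK: composite missed along with at least one sub-problem (insufficient knowledge).
    """
    all_subs_correct = all(sub_correct)
    if composite_correct:
        return "CM" if all_subs_correct else "RM"
    return "IG" if all_subs_correct else "IK"

# Example: a correct final answer with a missed sub-step is flagged as rote memorization.
print(wemath_category(True, [True, False]))   # RM
print(wemath_category(False, [True, True]))   # IG
```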

5. Emerging Directions in Benchmark and Model Design

Recent research identifies several axes along which future benchmarks and models are expected to evolve:

  • Multi-Image and Dynamic Contexts: Datasets such as VCBench and MV-MATH move beyond single-image scenarios, challenging models to integrate cues across 2–18 panels per problem.
  • Real-Scene and Noisy Data: Emphasis is shifting to real-world photographs (MathReal, MathScape), video-based instruction with cross-modal temporal reasoning (VideoMathQA), and diagrammatic OCR under severe noise (MathReal).
  • Process-Level and Generalization-Focused Evaluation: Benchmarks such as MathLens, MathCheck, and We-Math explicitly score substeps, integration, and generalization capacity, de-emphasizing end-to-end accuracy in favor of fine-grained behavioral diagnostics.
  • Robustness and Consistency: Stress-testing via synthetic and annotated variants (MathCheck-GEO, MathLens), adversarial perturbations (e.g., flipping, relabeling, clutter), and checklist paradigms (MathCheck) offer a more faithful probe of true mathematical understanding as opposed to surface-level pattern-matching.
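
A minimal sketch of generating flip/rotate/clutter variants of a diagram with Pillow, in the spirit of the stress-testing setups above; note that real benchmark construction re-verifies (or re-annotates) the answer for every variant, since flips and rotations can invalidate orientation- or label-dependent questions.

```python
import random
from PIL import Image, ImageDraw  # pip install Pillow

def perturb_diagram(path: str, out_prefix: str, seed: int = 0) -> None:
    """Write flipped, rotated, and clutter-injected variants of one diagram."""
    rng = random.Random(seed)
    img = Image.open(path).convert("RGB")

    # 1. Horizontal flip.
    img.transpose(Image.Transpose.FLIP_LEFT_RIGHT).save(f"{out_prefix}_flip.png")

    # 2. Small random rotation on a white canvas.
    img.rotate(rng.uniform(-15, 15), expand=True, fillcolor="white").save(
        f"{out_prefix}_rotate.png")

    # 3. Clutter injection: a few gray distractor segments.
    cluttered = img.copy()
    draw = ImageDraw.Draw(cluttered)
    w, h = cluttered.size
    for _ in range(5):
        draw.line([rng.randint(0, w), rng.randint(0, h),
                   rng.randint(0, w), rng.randint(0, h)], fill="gray", width=2)
    cluttered.save(f"{out_prefix}_clutter.png")
```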

6. Comparative Analysis and Thematic Insights

Benchmarks differ along technical and philosophical lines:

  • Single-Image vs. Multi-Image vs. Table/Video: Figure-based benchmarks (VisioMath) probe fine-grained discrimination, multi-panel sets (VCBench, MV-MATH) test information aggregation, table benchmarks (MMTBench) challenge integration of structured data with visual cues, and video-based suites (VideoMathQA) demand temporal fusion.
  • Language and Culture: Both English- and Chinese-language datasets now exist and are rapidly expanding in scale and scope (CMMaTH (Li et al., 28 Jun 2024), CMM-Math (Liu et al., 4 Sep 2024)).
  • Error Taxonomy and Diagnostic Utility: Advanced benchmarks (MathLens, MM-MATH, We-Math, PolyMATH) offer process-level labels for error attribution, revealing models’ deficiencies in subskill mastery, compositional logic, or visual grounding.
  • Limitations of Existing LMMs: Across all domains—geometry, chart reading, solid modeling, table reasoning, noisy-photo question answering—models are consistently limited by visual perception, diagram parsing, cross-modal binding, and multi-step reasoning under ambiguity.

7. Future Challenges and Recommendations

Findings from these benchmarks converge on several open challenges and research directions:

  • Visual Encoding and Diagram Parsing: Further progress will require dedicated vision modules trained on geometric primitives, graph-based scene encodings, or pixel-level supervision specific to mathematical diagrams (Sun et al., 7 Apr 2024, Wang et al., 27 May 2025).
  • Cross-Modal Integration: Architectural advances for dynamic, context-aware multi-modal attention and explicit scene grounding are needed (MathLens, VCBench).
  • Robust Process Reasoning: Models must improve at verifying each logical step, perhaps by integrating external symbolic tools or verifier modules, reducing rote pattern exploitation and misaligned answer selection (Chung et al., 2 Oct 2025; We-Math); a minimal verifier sketch follows this list.
  • Data Augmentation and Pre-Training: Benchmarks recommend augmentation with noise, multi-panel, and dynamic contexts (e.g., blurred/rotated figures, interleaved charts/tables) and targeted pre-training on fine-grained mathematical and visual tasks (Titiya et al., 27 May 2025, Feng et al., 8 Aug 2025).
  • Fine-Grained Benchmarking: Adoption of process-based evaluation, robust annotation taxonomies, and consensus-based LLM judging is critical for diagnosing and improving genuine mathematical reasoning in future models (Sun et al., 7 Apr 2024; CMMaTH; MathCheck).
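
As a concrete instance of the external-verifier idea in the process-reasoning recommendation above, the sketch below uses SymPy to check whether a model's claimed algebraic rewrite is actually an equivalence; a production verifier would also need to handle geometric, numeric, and inequality claims.

```python
import sympy as sp

def step_is_valid(lhs: str, rhs: str) -> bool:
    """Check whether a claimed rewrite `lhs -> rhs` is algebraically sound."""
    try:
        return sp.simplify(sp.sympify(lhs) - sp.sympify(rhs)) == 0
    except (sp.SympifyError, TypeError):
        return False  # unparsable claims are treated as unverified

# Example: two intermediate steps extracted from a model's chain of thought.
print(step_is_valid("(x + 1)**2", "x**2 + 2*x + 1"))  # True
print(step_is_valid("(x + 1)**2", "x**2 + 1"))        # False
```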

In summary, multimodal math-reasoning benchmarks represent a rapidly developing foundation for empirical progress in multimodal AI; they reveal systematic deficiencies in visual perception, knowledge integration, and robust reasoning, and they provide crucial guidance for both pretraining pipelines and architectural research directions driving the next generation of trustworthy, mathematically competent vision-LLMs.
