MathVerse: Visual Math Reasoning Benchmark
- MathVerse is a comprehensive multimodal benchmark designed to evaluate both visual and textual mathematical reasoning in MLLMs.
- It features 2,612 problems across plane geometry, solid geometry, and functions, each presented in six tailored versions to dissect modality-specific skills.
- It employs a two-stage chain-of-thought scoring protocol that provides granular feedback on diagram interpretation and step-by-step reasoning.
MathVerse is a comprehensive multimodal mathematics benchmark explicitly designed to diagnose and advance the visual mathematical reasoning abilities of Multi-modal Large Language Models (MLLMs). Developed to remedy key shortcomings of prior evaluation suites, MathVerse combines a large-scale, meticulously curated dataset, a tightly controlled multi-version problem design that separates visual from textual skills, and a granular, chain-of-thought-based evaluation protocol. Its central aim is to test rigorously whether MLLMs genuinely interpret and reason over the diagrams in mathematical problems, rather than relying on textual shortcuts.
1. Composition and Design of the Benchmark
MathVerse comprises 2,612 high-quality visual math problems, each paired with a diagram and spanning three major subjects: plane geometry (e.g., triangles, circles, polygons), solid geometry (e.g., cubes, spheres), and functions (including coordinate-based and analytic reasoning). Each problem is manually classified into one of twelve subfields to enable fine-grained capability assessment.
A distinctive aspect is the multi-version transformation of each problem, compelling models to variously rely on text, vision, or both:
- Text-dominant: The text retains all descriptive information, implicit properties, and essential conditions; the diagram is present but largely redundant.
- Text-lite: Text omits redundant descriptive details available in the diagram, shifting more “extractive” work to the model’s visual perception.
- Text-only: The diagram is removed; all information is purely textual, serving as a test of models’ non-visual (symbolic) reasoning.
- Vision-intensive: Some implicit properties are omitted textually and must be visually inferred (e.g., parallelism, function monotonicity).
- Vision-dominant: Certain essential conditions (e.g., key values, labels) are omitted from text and instead encoded solely in the diagram, requiring the model to recover exact semantic content visually.
- Vision-only: All content, from problem statement to data, is diagrammatic; the text is minimal or missing, presenting a maximal test of diagram understanding.
This yields 15,672 evaluation instances in total (2,612 problems × 6 versions), applying differential stress to visual versus textual abilities and enabling explicit measurement of modality-specific reasoning.
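To make the version design concrete, the sketch below shows one plausible way to represent the six-version expansion programmatically. Only the six version labels come from MathVerse itself; the field names (descriptive_info, implicit_property, essential_condition) and the make_versions helper are illustrative assumptions, not the benchmark's released schema.

```python
from dataclasses import dataclass

# Hypothetical record for one source problem; field names are illustrative,
# not the benchmark's released data format.
@dataclass
class MathProblem:
    question: str             # core question sentence
    descriptive_info: str     # text restating what the diagram already shows
    implicit_property: str    # properties inferable from the diagram (e.g., parallelism)
    essential_condition: str  # key values/labels needed to solve the problem
    diagram: str              # path to the rendered diagram image

def make_versions(p: MathProblem) -> dict:
    """Expand one problem into the six MathVerse versions by reallocating
    information between the text prompt and the diagram."""
    full_text = " ".join([p.descriptive_info, p.implicit_property,
                          p.essential_condition, p.question])
    return {
        # All information in text; the diagram is largely redundant.
        "text_dominant":    {"text": full_text, "image": p.diagram},
        # Redundant descriptive text removed; model must perceive it visually.
        "text_lite":        {"text": " ".join([p.implicit_property,
                                               p.essential_condition, p.question]),
                             "image": p.diagram},
        # No diagram at all: a purely symbolic control condition.
        "text_only":        {"text": full_text, "image": None},
        # Implicit properties dropped from text; must be inferred from the diagram.
        "vision_intensive": {"text": " ".join([p.essential_condition, p.question]),
                             "image": p.diagram},
        # Essential conditions moved into a re-rendered, annotated diagram.
        "vision_dominant":  {"text": p.question, "image": p.diagram},
        # Everything, including the question itself, rendered inside the diagram.
        "vision_only":      {"text": "", "image": p.diagram},
    }

# 2,612 problems x 6 versions = 15,672 evaluation instances.
```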
2. Evaluation Method: Multi-Step Chain-of-Thought Scoring
Instead of conventional answer-based accuracy metrics, MathVerse adopts a two-stage chain-of-thought (CoT) evaluation strategy for fine-grained insight:
- Key-Step Extraction: For each produced solution, GPT-4 (text-only) is prompted to extract the core reasoning steps and final answer from the MLLM output. This is intentionally independent of ground-truth pathways, allowing diverse, valid reasoning routes.
- Step-Wise Scoring: GPT-4V, receiving the original problem (with visual content) and ground-truth, scores each extracted step and the final answer as correct (1) or incorrect (0). For function problems, relevant diagram annotations are provided to ensure scoring fairness.
The final score for each sample is computed as a weighted combination of the step scores and the answer score, with the step scores weighted more heavily to place emphasis on reasoning quality over mere answer correctness.
This protocol provides explicit feedback on intermediate reasoning, identifies precise points of failure (perception, logic, calculation), and supports detailed error analysis.
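The two-stage protocol can be sketched as follows. The extractor and scorer are stubbed out, since MathVerse implements those stages with GPT-4 and GPT-4V prompts; the 0.7 step weight is an assumed placeholder showing the weighting idea, not the paper's exact coefficient.

```python
from typing import List, Tuple

def extract_key_steps(model_output: str) -> Tuple[List[str], str]:
    """Stage 1 (stub): a text-only judge (GPT-4 in MathVerse) extracts the key
    reasoning steps and the final answer from the MLLM's raw output."""
    raise NotImplementedError  # replaced by an actual LLM call in practice

def score_steps(problem, steps: List[str], answer: str) -> Tuple[List[int], int]:
    """Stage 2 (stub): a multimodal judge (GPT-4V in MathVerse) sees the original
    problem, diagram, and ground truth, and scores each step and the final
    answer as correct (1) or incorrect (0)."""
    raise NotImplementedError

def sample_score(problem, model_output: str, step_weight: float = 0.7) -> float:
    """Weighted combination of reasoning quality and answer correctness.
    A step_weight above 0.5 emphasizes intermediate reasoning; the value 0.7
    is an assumption for illustration only."""
    steps, answer = extract_key_steps(model_output)
    step_scores, answer_score = score_steps(problem, steps, answer)
    mean_step = sum(step_scores) / len(step_scores) if step_scores else 0.0
    return step_weight * mean_step + (1.0 - step_weight) * answer_score
```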
3. Dataset Sourcing, Annotation, and Problem Generation
The benchmark’s problems are sourced from existing datasets (e.g., GeoQA, GEOS, Geometry3K) and wide-reaching public repositories, supplemented through new manual collection and annotation. Of the 2,612 total, 1,236 problems are newly curated—especially emphasizing solid geometry and functions, which were underrepresented in previous work.
Each problem is reviewed by a qualified annotator (a senior undergraduate or graduate student). Problems are excluded if they are trivial, excessively difficult, ambiguous, dependent on external knowledge, or have corrupted diagrams. Function-type questions are additionally annotated with their core properties to facilitate fair multimodal grading.
All diagrams undergo controlled, version-specific transformation using toolchains such as Mathpix, Matplotlib, and PowerPoint to ensure consistent information distribution between text and vision inputs.
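As a small illustration of the kind of diagram re-rendering involved, the Matplotlib sketch below draws a function plot twice: once as a plain curve (suitable when the key value stays in the text) and once with the essential condition annotated directly on the figure, as in the Vision-dominant setting. The specific function and annotation are invented for illustration; the benchmark's actual pipelines also use Mathpix and PowerPoint.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 200)
y = x**2 - 1  # illustrative function; not taken from the benchmark

fig, (ax_text, ax_vision) = plt.subplots(1, 2, figsize=(8, 3))

# Plain diagram: the essential condition (here, the y-intercept) stays in the text.
ax_text.plot(x, y)
ax_text.set_title("condition kept in text")

# Vision-dominant style diagram: the same condition is encoded in the figure,
# so the model must read it off visually.
ax_vision.plot(x, y)
ax_vision.annotate("(0, -1)", xy=(0, -1), xytext=(1.0, 2.0),
                   arrowprops=dict(arrowstyle="->"))
ax_vision.set_title("condition moved into the diagram")

plt.tight_layout()
plt.savefig("function_versions.png")
```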
4. Key Findings and Implications
MathVerse’s diagnostic protocol reveals several critical findings about the current state of MLLMs:
- Most MLLMs predominantly exploit textual shortcuts; in some cases, accuracy increases when diagrams are omitted (e.g., Qwen-VL-Max and InternLM-XComposer2 improve by over 5% in the Text-only version).
- Genuine visual mathematical reasoning—especially precise extraction of values, relationships, or quantifiers from diagrams—remains weak: visual perception failures dominate error patterns in vision-dominant versions.
- GPT-4V, and to a lesser extent ShareGPT4V, perform best on diagram-dependent tasks but remain significantly below human proficiency.
- CoT analysis reveals that partial, incorrect, or “lucky” reasoning is common; final answer accuracy alone masks such partial progress or missteps.
- The subject-specific breakdown identifies function property/coordinate inference as a persistent weak point, and diagram-heavy item types as performance bottlenecks even for the strongest models.
A plausible implication is that training regimens must target multi-modal information fusion, not purely symbolic or generalized visual grounding, and that benchmarks with “leaky” or redundant text fail to expose true diagrammatic reasoning deficiencies.
5. Resources and Usage
MathVerse provides extensive resources for benchmarking and further research:
- All data (problems, diagrams, all six versions), scripts (for question answering and CoT assessment), transformation protocols, error annotations, and supporting documentation are openly available through the project’s web page: https://mathverse-cuhk.github.io
- A taxonomy of categories, subfields, and annotation procedures is provided to facilitate comparison and extension.
The platform enables rigorous, repeatable measurement of both model-wide progress and fine-grained diagnosis of visual mathematical reasoning.
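For instance, once per-sample scores are available, the kind of fine-grained breakdown MathVerse reports (per version and per subfield) can be tabulated with a few lines of standard tooling. The record fields below (version, subfield, score) are hypothetical placeholders rather than the released file format.

```python
from collections import defaultdict

def breakdown(records):
    """Average per-sample scores by (version, subfield).
    Each record is assumed to be a dict with 'version', 'subfield', and 'score';
    these field names are placeholders, not the released schema."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in records:
        key = (r["version"], r["subfield"])
        sums[key] += r["score"]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Example with dummy records:
records = [
    {"version": "text_dominant", "subfield": "length", "score": 1.0},
    {"version": "vision_only",   "subfield": "length", "score": 0.3},
]
print(breakdown(records))
```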
6. Broader Impact and Future Directions
MathVerse highlights the need for visual-mathematical encoder innovation: progress stalls largely due to poor diagram understanding, not symbolic reasoning. Effective improvement will likely require:
- New model architectures that fuse perception and symbolic processing, specifically tuned for mathematical diagrams;
- Targeted training data incorporating high-quality, visually-annotated math content, rather than generic vision-language data;
- Supervision and feedback paradigms that reward step-wise reasoning and explicitly penalize misperception or partial logic.
By closing the gap between text and vision capabilities—and by raising the bar for evaluation beyond “shortcutable” benchmarks—MathVerse serves as a central reference for developing MLLMs with authentic, human-like visual mathematical skill. It enables both rigorous performance evaluation and targeted analysis to guide subsequent architectural, data, and training innovations in multimodal AI research.
| Aspect | MathVerse Contribution |
|---|---|
| Subjects | Plane geometry, solid geometry, functions; 12 subfields |
| Problem versions | Six (different text/vision information allocations) |
| Samples | 2,612 problems × 6 = 15,672 evaluation instances |
| Evaluation | Two-stage chain-of-thought scoring, per-step error analysis |
| Quality control | Expert review, diversity curation, exclusion of ambiguous/unrealistic cases |
| Resource page | https://mathverse-cuhk.github.io |