
MathVerse: Benchmark for Visual Mathematical Reasoning

Last updated: June 11, 2025

MathVerse is a comprehensive benchmark for evaluating whether Multimodal Large Language Models (MLLMs) genuinely understand mathematical diagrams, rather than merely exploiting textual cues. Its design, evaluation methodology, and key insights give developers and researchers actionable tools and reference points for building, diagnosing, and improving AI systems for visual mathematical reasoning.


Benchmark Design

Structure and Coverage

  • 2,612 unique visual math problems annotated and reviewed by experts, covering plane geometry, solid geometry, and functions at the high-school level.
  • 15,672 test cases, created by systematically transforming each problem into six multimodal versions to vary the distribution of information between text and diagram.
  • Twelve subfields (e.g., analytic geometry, area, function properties) ensure topic diversity relevant to general mathematical AI applications.
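
For orientation, here is a minimal loading sketch. It assumes the dataset is published on the Hugging Face Hub; the dataset ID, config name, and field names below are assumptions rather than confirmed values, so check the project page for the authoritative layout:

```python
# Minimal sketch: load the benchmark for local evaluation. The dataset ID
# "AI4Math/MathVerse", the "testmini" config/split, and the
# "problem_version" field are assumptions -- verify against the project page.
from collections import defaultdict

from datasets import load_dataset

ds = load_dataset("AI4Math/MathVerse", "testmini", split="testmini")

# Group test cases by modality version (six per source problem).
by_version = defaultdict(list)
for sample in ds:
    by_version[sample["problem_version"]].append(sample)

for version, samples in sorted(by_version.items()):
    print(f"{version}: {len(samples)} samples")
```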

Multi-Modality Problem Versions

Problems are re-encoded as six variant types:

  1. Text-dominant: all relevant information appears in the text (including diagram descriptions and numeric conditions), with the diagram attached.
  2. Text-lite: descriptive information is removed from the text and is available only in the diagram.
  3. Text-only: the diagram is removed; all information is conveyed in text.
  4. Vision-intensive: implicit properties are shifted from the text into the diagram.
  5. Vision-dominant: essential numeric/algebraic conditions appear only in the diagram.
  6. Vision-only: the entire problem, question included, is rendered in the image.

Purpose of Transformations:

This process eliminates textual redundancy, requiring models to extract and reason with information directly from diagrams as one moves from text-dominant to vision-dominant versions.
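
To make the fan-out concrete, here is an illustrative sketch; the dataclass, field names, and exact information split are hypothetical simplifications, not MathVerse's actual schema:

```python
# Illustrative sketch of the six-version fan-out: progressively move
# content from the text channel into the diagram. Not the real schema.
from dataclasses import dataclass

@dataclass
class ProblemVersion:
    name: str
    question_text: str   # text handed to the model ("" if none)
    has_image: bool      # whether a diagram accompanies the text

def expand_versions(desc: str, conditions: str, question: str) -> list[ProblemVersion]:
    """Fan one source problem out into six MathVerse-style versions."""
    full = f"{desc} {conditions} {question}"
    return [
        ProblemVersion("text_dominant",    full,                       True),
        ProblemVersion("text_lite",        f"{conditions} {question}", True),   # description moved to diagram
        ProblemVersion("text_only",        full,                       False),  # no diagram at all
        ProblemVersion("vision_intensive", f"{conditions} {question}", True),   # implicit properties drawn, not stated
        ProblemVersion("vision_dominant",  question,                   True),   # numbers/conditions drawn only
        ProblemVersion("vision_only",      "",                         True),   # entire problem rendered as an image
    ]
```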


Evaluation Methodology

Chain-of-Thought (CoT) Evaluation

Traditional metrics judge only the final answer. MathVerse, instead:

  • Uses GPT-4 (text-only) to extract the key reasoning steps $[s_1, s_2, \ldots, s_N]$ from model outputs.
  • GPT-4V (vision-capable) then scores each step for logical, computational, and visual correctness against the provided diagram and ground truth.

The final evaluation aggregates per-step correctness and final-answer correctness with a weighted formula:

$$\text{Score}_{\text{final}} = \alpha \left( \frac{1}{N} \sum_{i=1}^{N} \text{Score}(s_i) \right) + (1 - \alpha)\, \text{Score}(s_A)$$

with $\alpha = 0.7$ and $s_A$ the final-answer step, yielding a step-wise reasoning score.
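
The formula translates directly into code; here is a minimal sketch (the zero-step fallback is our own assumption):

```python
# Direct translation of the weighted scoring formula above.
def mathverse_score(step_scores: list[float], answer_score: float,
                    alpha: float = 0.7) -> float:
    """Combine mean per-step correctness with the final-answer score."""
    if not step_scores:          # assumption: fall back to the answer score
        return answer_score
    mean_steps = sum(step_scores) / len(step_scores)
    return alpha * mean_steps + (1 - alpha) * answer_score

# Example: four steps judged [1, 1, 0, 1] with a wrong final answer.
print(mathverse_score([1.0, 1.0, 0.0, 1.0], 0.0))  # 0.7 * 0.75 + 0.3 * 0.0 = 0.525
```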

Qualitative Error Analysis:

Error types are labeled as visual perception, reasoning logic, calculation, or mathematical knowledge failures, allowing precise failure diagnosis during model debugging.
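
Once each failure carries one of these labels, locating a model's dominant failure mode is a simple tally; the `results` records below are hypothetical stand-ins for real evaluation output:

```python
# Tally labeled error categories to find the dominant failure mode.
from collections import Counter

results = [
    {"id": 1, "error_type": "visual_perception"},
    {"id": 2, "error_type": "reasoning_logic"},
    {"id": 3, "error_type": "visual_perception"},
]

counts = Counter(r["error_type"] for r in results)
for error_type, n in counts.most_common():
    print(f"{error_type}: {n}")
```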


Key Implementation Details

Technical Specifications

  • Diagram Construction:
    • Diagrams annotated in PowerPoint, with overlays and properties systematically encoded visually.
    • Function plots generated and annotated with matplotlib.
  • Prompt Engineering:
    • Tailored instructions for each modality (e.g., Vision-only: “According to the question shown in the image…”).
  • Evaluation:
    • Zero-shot settings; standard hardware (A100 GPUs); human and random baselines included.
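
A small sketch of modality-tailored prompt selection follows. Only the Vision-only fragment is quoted from above (it is truncated in the source); the other templates are illustrative placeholders, not the benchmark's actual prompts:

```python
# Modality-tailored prompt selection. Only the "vision_only" fragment is
# quoted from the benchmark description (truncated in the source); the
# other templates are illustrative placeholders.
PROMPT_TEMPLATES = {
    "text_dominant": "Answer the following question: {question}",
    "vision_dominant": "Using the values shown in the diagram, answer: {question}",
    "vision_only": "According to the question shown in the image…",
}

def build_prompt(version: str, question: str = "") -> str:
    """Pick the instruction template for a given modality version."""
    return PROMPT_TEMPLATES[version].format(question=question)
```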

Practical Insights for Developers

1. Diagnosing Model Weakness

  • Most MLLMs perform well in text-dominant settings, but accuracy drops sharply as more information moves into the diagram, highlighting weak diagram "vision."
  • When essential numeric/algebraic information is diagram-only, models’ symbol recognition and value-mapping abilities are exposed as the primary bottleneck.
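
A quick way to surface this pattern in your own runs is to bucket accuracy by modality version; the `records` data below is a hypothetical stand-in for real run output:

```python
# Bucket accuracy by modality version to expose the text-to-vision drop.
from collections import defaultdict

records = [  # (version, is_correct) pairs from a benchmark run
    ("text_dominant", True), ("text_dominant", True),
    ("vision_dominant", False), ("vision_only", False),
]

totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [correct, total]
for version, correct in records:
    totals[version][0] += int(correct)
    totals[version][1] += 1

for version, (c, t) in sorted(totals.items()):
    print(f"{version}: {c / t:.1%} accuracy over {t} samples")
```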

2. Training Data Impact

  • MLLMs trained with datasets that redundantly encode diagram info in text may “cheat” by ignoring the image. MathVerse’s design eliminates this shortcut, offering a true test of diagrammatic capabilities.

3. Evaluation and Tuning

  • Fine-grained CoT evaluation exposes cases where the answer is right for the wrong reasons (or vice versa), enabling targeted dataset augmentation, model retraining, or architectural tweaks for true visual-math integration.
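
One way to operationalize this check, as a sketch (the 0.5 threshold is an arbitrary illustrative choice, not a MathVerse parameter):

```python
# Flag "right answer, weak reasoning" and the reverse by comparing the
# mean step score with final-answer correctness.
def flag_mismatch(step_scores: list[float], answer_correct: bool,
                  threshold: float = 0.5) -> str | None:
    mean_steps = sum(step_scores) / len(step_scores) if step_scores else 0.0
    if answer_correct and mean_steps < threshold:
        return "right answer, weak reasoning"   # candidate for CoT supervision
    if not answer_correct and mean_steps >= threshold:
        return "sound reasoning, wrong answer"  # often a calculation slip
    return None
```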

Example: Applying MathVerse in Practice

Suppose you’re developing or evaluating a vision-language model for educational AI. Using MathVerse, you would:

  1. Select appropriate versions (e.g., start with Text-dominant for functionality check; progress to Vision-dominant for visual reasoning stress-testing).
  2. Run the benchmark, capturing not just final answer accuracy but also intermediate CoT steps.
  3. Analyze errors:
    • Is the model misrecognizing visual values or properties?
    • Does it skip steps or hallucinate logic not visible from the diagram?
  4. Iterate:
    • Adjust diagram pre-processing (OCR, symbol detection, etc.).
    • Retrain or fine-tune with datasets that stress explicit visual property extraction (see Math-PUMA, SVE-Math, or GeoDANO for inspiration).
  5. Deploy:
    • Use MathVerse-style evaluations in production workflows (e.g., automated grading, tutoring), with CoT logging for interpretability.
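
A skeleton for the run-and-log loop in step 2, with the CoT logging from step 5. `model`, `extract_steps`, `score_steps`, and the sample field names are placeholders for your own wrappers and data layout, not a fixed API:

```python
# Skeleton of a MathVerse-style run with CoT logging. The model wrapper
# and the GPT-4 / GPT-4V judging calls are placeholders.
import json

def run_benchmark(samples, model, extract_steps, score_steps, log_path):
    with open(log_path, "w") as log:
        for sample in samples:
            output = model.generate(sample["prompt"], sample.get("image"))
            steps = extract_steps(output)               # GPT-4: key-step extraction
            step_scores, answer_score = score_steps(    # GPT-4V: per-step judging
                steps, sample["answer"], sample.get("image"))
            mean_steps = sum(step_scores) / max(len(step_scores), 1)
            log.write(json.dumps({
                "id": sample["id"],
                "version": sample["version"],
                "steps": steps,
                "step_scores": step_scores,
                # the alpha = 0.7 weighting from the evaluation section
                "final": 0.7 * mean_steps + 0.3 * answer_score,
            }) + "\n")
```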

Scaling and Future Directions

Best Practices for Model Improvement

  • Integrate robust OCR and diagram parsing stages into vision pipelines (leveraging findings from recent models like SVE-Math or GeoDANO).
  • Use CoT scoring (rather than final-answer accuracy alone) to supervise or reinforce intermediate reasoning steps.
  • Experiment with curriculum learning approaches (e.g., progressively shifting from text-dominant to vision-dominant data during training).
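
For the curriculum idea, here is a minimal sampling-schedule sketch; the distribution shapes are illustrative, not tuned values:

```python
# Text-to-vision curriculum: start sampling mostly text-dominant
# problems, then shift probability mass toward vision-heavy versions.
import random

VERSIONS = ["text_dominant", "text_lite", "vision_intensive",
            "vision_dominant", "vision_only"]

def sample_version(progress: float) -> str:
    """progress in [0, 1]: 0 = start of training, 1 = end."""
    text_heavy   = [0.50, 0.30, 0.10, 0.07, 0.03]
    vision_heavy = [0.05, 0.10, 0.15, 0.30, 0.40]
    # Linearly interpolate between the two distributions.
    weights = [(1 - progress) * t + progress * v
               for t, v in zip(text_heavy, vision_heavy)]
    return random.choices(VERSIONS, weights=weights, k=1)[0]
```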

Summary Table: MathVerse Features

| Feature | Description |
|---|---|
| Visual math problems | 2,612 expert-annotated problems with diverse, mostly unique diagrams |
| Test samples | 15,672, spanning 6 multimodal versions |
| Subjects/subfields | Plane and solid geometry, functions; 12 fine-grained subfields |
| Evaluation | Chain-of-Thought (step-wise) scoring via GPT-4/GPT-4V |
| Human baseline | Yes |
| Error labelling | Visual perception / reasoning logic / calculation / knowledge |
| Open/closed model evaluation | Yes |

In conclusion:

MathVerse sets a practical gold standard for evaluating and improving multimodal LLMs’ ability to truly “see” and reason about mathematical diagrams, not just manipulate text. Its design and CoT-based analysis enable deep diagnostic and benchmarking work, making it valuable for developers and researchers building robust, interpretable AI systems for mathematical education, scientific computing, and beyond.

Project page and dataset: https://mathverse-cuhk.github.io