MathVerse: Benchmark for Visual Mathematical Reasoning
Last updated: June 11, 2025
MathVerse is a comprehensive benchmark for evaluating whether Multimodal Large Language Models (MLLMs) genuinely understand mathematical diagrams, rather than merely relying on textual cues. Its design, evaluation methodology, and key findings give developers and researchers actionable tools and reference points for building, diagnosing, and improving AI systems for visual mathematical reasoning.
Benchmark Design
Structure and Coverage
- 2,612 unique visual math problems annotated and reviewed by experts, covering plane geometry, solid geometry, and functions at the high-school level.
- 15,672 test cases, created by systematically transforming each problem into six multimodal versions to vary the distribution of information between text and diagram.
- Twelve subfields (e.g., analytic geometry, area, function properties) ensure topic diversity relevant to general mathematical AI applications.
Multi-Modality Problem Versions
Problems are re-encoded as six variant types:
- Text-dominant: All relevant information is in the text, including diagram descriptions and numeric values.
- Text-lite: Descriptive information is removed from the text, since it is visible in the diagram.
- Text-only: No image; all information is conveyed in text.
- Vision-intensive: Further implicit properties are moved from the text into the diagram.
- Vision-dominant: Numeric and algebraic conditions appear only in the diagram.
- Vision-only: The entire problem, question included, is rendered in the image.
Purpose of Transformations:
Moving from the text-dominant to the vision-only version progressively eliminates textual redundancy, requiring models to extract and reason over information taken directly from the diagram.
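To make the variant structure concrete, the sketch below shows one way to load and filter test cases by version. The field names (`problem_version`, `question`, `image`) are assumptions for illustration, not the official schema.

```python
import json

# The six information-distribution variants described above.
VERSIONS = [
    "Text-dominant", "Text-lite", "Text-only",
    "Vision-intensive", "Vision-dominant", "Vision-only",
]

def load_by_version(path: str, version: str) -> list[dict]:
    """Load MathVerse-style test cases and keep a single modality variant.

    Assumes each record has a 'problem_version' field naming its variant,
    plus 'question' text and an optional 'image' path (hypothetical schema).
    """
    with open(path, encoding="utf-8") as f:
        cases = json.load(f)
    return [c for c in cases if c.get("problem_version") == version]

# Example: isolate the hardest setting to stress-test pure visual reasoning.
# vision_only = load_by_version("mathverse_testmini.json", "Vision-only")
```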
Evaluation Methodology
Chain-of-Thought (CoT) Evaluation
Traditional metrics judge only the final answer. MathVerse instead:
- Uses GPT-4 (text-only) to extract the key reasoning steps from model outputs.
- Uses GPT-4V (vision-capable) to score each step for logical, computational, and visual correctness against the provided diagram and ground truth.
The final evaluation aggregates per-step correctness with final-answer correctness via a weighted formula of the form Score = α · (average step score) + (1 − α) · (final-answer score), with α ∈ [0, 1], yielding a step-wise reasoning score.
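As a concrete illustration of this weighting, here is a minimal scoring helper. The step/answer split and the α = 0.7 default are illustrative assumptions, not the paper's exact parameters.

```python
def mathverse_style_score(step_scores: list[int], answer_correct: bool,
                          alpha: float = 0.7) -> float:
    """Aggregate per-step correctness with the final answer.

    step_scores    -- 0/1 judgments for each extracted key step
                      (as produced by the GPT-4V step marking).
    answer_correct -- whether the final answer matched ground truth.
    alpha          -- weight on the reasoning steps; 0.7 is an
                      illustrative choice, not the paper's value.
    """
    step_avg = sum(step_scores) / len(step_scores) if step_scores else 0.0
    return alpha * step_avg + (1 - alpha) * float(answer_correct)

# Three of four key steps correct but a wrong final answer still earns
# partial credit, unlike the all-or-nothing 0 a final-answer metric gives.
print(mathverse_style_score([1, 1, 1, 0], answer_correct=False))  # 0.525
```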
Qualitative Error Analysis:
Error types are labeled as visual perception, reasoning logic, calculation, or mathematical knowledge failures, allowing precise failure diagnosis during model debugging.
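A simple way to exploit these labels during debugging is to tally them across judged steps; the record format below is hypothetical.

```python
from collections import Counter

# The four failure categories used in MathVerse's qualitative analysis.
ERROR_TYPES = {"visual_perception", "reasoning_logic",
               "calculation", "mathematical_knowledge"}

def tally_errors(judged_steps: list[dict]) -> Counter:
    """Count failure categories over judged reasoning steps.

    Assumes each judged step looks like
    {'correct': False, 'error_type': 'visual_perception'} --
    a hypothetical record format, not the official output schema.
    """
    return Counter(
        s["error_type"] for s in judged_steps
        if not s["correct"] and s.get("error_type") in ERROR_TYPES
    )

# A spike in 'visual_perception' points at the vision encoder;
# a spike in 'calculation' points at the language side.
```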
Key Implementation Details
Technical Specifications
- Diagram Construction:
- Diagrams are annotated in PowerPoint, with overlays and properties systematically encoded visually.
- Function plots are generated and annotated with matplotlib.
- Prompt Engineering:
- Tailored instructions for each modality (e.g., Vision-only: “According to the question shown in the image…”); a minimal template sketch follows this list.
- Evaluation:
- Zero-shot settings; standard hardware (A100 GPUs); human and random baselines included.
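For the prompt templates, only the Vision-only opening is quoted (and truncated) from the text above; the other entries are illustrative placeholders rather than official wording.

```python
# Modality-tailored instruction prefixes. Only the Vision-only opening is
# quoted (and truncated) from the benchmark; the others are illustrative.
PROMPTS = {
    "Text-dominant":   "Answer the question using the given text and diagram.",
    "Vision-dominant": "Some conditions appear only in the diagram; read them "
                       "from the image before answering.",
    "Vision-only":     "According to the question shown in the image…",
}

def build_prompt(version: str, question: str) -> str:
    """Prepend the per-modality instruction; in the Vision-only setting
    there is no textual question to append."""
    if version == "Vision-only":
        return PROMPTS[version]
    return f"{PROMPTS[version]}\n{question}"
```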
Practical Insights for Developers
1. Diagnosing Model Weakness
- Most MLLMs perform well in text-dominant settings, but accuracy drops sharply as information moves into the diagram, exposing weak diagram "vision" (quantified in the sketch after this list).
- When essential numeric or algebraic information appears only in the diagram, symbol recognition and value mapping emerge as the primary bottleneck.
2. Training Data Impact
- MLLMs trained with datasets that redundantly encode diagram info in text may “cheat” by ignoring the image. MathVerse’s design eliminates this shortcut, offering a true test of diagrammatic capabilities.
3. Evaluation and Tuning
- Fine-grained CoT evaluation exposes cases where the answer is right for the wrong reasons (or vice versa), enabling targeted dataset augmentation, model retraining, or architectural tweaks for true visual-math integration.
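The text-to-vision degradation described in point 1 can be quantified directly from benchmark results; the result-record shape below is an assumption for illustration.

```python
from collections import defaultdict

def accuracy_by_version(results: list[dict]) -> dict[str, float]:
    """Per-version accuracy from records shaped like
    {'problem_version': 'Text-dominant', 'correct': True}
    (an assumed format, not an official output)."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["problem_version"]] += 1
        hits[r["problem_version"]] += int(r["correct"])
    return {v: hits[v] / totals[v] for v in totals}

def vision_gap(acc: dict[str, float]) -> float:
    """Drop from Text-dominant to Vision-only accuracy: a large gap
    signals a model that leans on text instead of reading the diagram."""
    return acc["Text-dominant"] - acc["Vision-only"]
```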
Example: Applying MathVerse in Practice
Suppose you’re developing or evaluating a vision-LLM for educational AI. Using MathVerse, you would:
- Select appropriate versions (e.g., start with Text-dominant as a functionality check; progress to Vision-dominant to stress-test visual reasoning).
- Run the benchmark, capturing not just final-answer accuracy but also the intermediate CoT steps (a skeleton harness follows this list).
- Analyze errors:
- Is the model misrecognizing visual values or properties?
- Does it skip steps or hallucinate logic not visible from the diagram?
- Iterate:
- Adjust diagram pre-processing (OCR, symbol detection, etc.).
- Retrain or fine-tune with datasets that stress explicit visual property extraction (see Math-PUMA, SVE-Math, or GeoDANO for inspiration).
- Deploy:
- Use MathVerse-style evaluations in production workflows (e.g., automated grading, tutoring), with CoT logging for interpretability.
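Putting the workflow together, here is a skeleton harness. `query_model`, `extract_key_steps`, and `judge_steps` are hypothetical stand-ins for your model API and the two GPT-4/GPT-4V judging stages, and `mathverse_style_score` reuses the helper sketched earlier.

```python
# Hypothetical stand-ins to implement for your setup: the model under
# test, the GPT-4 step-extraction stage, and the GPT-4V step judge.
def query_model(question: str, image_path: str | None) -> str: ...
def extract_key_steps(cot: str) -> list[str]: ...
def judge_steps(steps: list[str], image_path: str | None,
                answer: str) -> tuple[list[int], bool]: ...

def evaluate(cases: list[dict]) -> list[dict]:
    """Run a MathVerse-style pass, logging the full CoT for each case
    so failures can be diagnosed step by step, not just by accuracy."""
    logs = []
    for case in cases:
        cot = query_model(case["question"], case.get("image"))
        steps = extract_key_steps(cot)
        step_flags, answer_ok = judge_steps(steps, case.get("image"),
                                            case["answer"])
        logs.append({
            "problem_version": case["problem_version"],
            "cot": cot,                      # keep for interpretability
            "step_flags": step_flags,
            "score": mathverse_style_score(step_flags, answer_ok),
        })
    return logs
```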
Scaling and Future Directions
Expanding Scope
- Planned future releases will include problem difficulty tiers, expansion to college-level and scientific topics, and multilingual support.
- Enhanced diagram annotation tools and a richer error taxonomy will enable finer-grained benchmarking and supervision, especially for advanced math and multidisciplinary AI applications.
Best Practices for Model Improvement
- Integrate robust OCR and diagram parsing stages into vision pipelines (leveraging findings from recent models like SVE-Math or GeoDANO).
- Use CoT scoring (rather than mere final-answer accuracy) to supervise or reinforce intermediate reasoning steps.
- Experiment with curriculum learning approaches (e.g., progressively shifting from text-dominant to vision-dominant data during training).
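As a concrete instance of the curriculum idea in the last bullet, the sampler below shifts training data from text-heavy to vision-heavy variants over the course of training; the linear schedule is an illustrative assumption.

```python
import random

TEXT_HEAVY = ["Text-dominant", "Text-lite"]
VISION_HEAVY = ["Vision-intensive", "Vision-dominant", "Vision-only"]

def sample_version(epoch: int, total_epochs: int) -> str:
    """Pick a problem variant for this training step, linearly shifting
    probability mass from text-heavy to vision-heavy versions so the
    model learns to read diagrams gradually (linear schedule assumed)."""
    p_vision = epoch / max(total_epochs - 1, 1)   # 0 at start, 1 at end
    pool = VISION_HEAVY if random.random() < p_vision else TEXT_HEAVY
    return random.choice(pool)
```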
Summary Table: MathVerse Features
Feature | Description
---|---
Visual math problems | 2,612 expert-annotated problems; diverse, mostly unique diagrams
Test samples | 15,672 across six multimodal versions
Subjects/Subfields | Plane and solid geometry, functions; 12 fine-grained subfields
Evaluation | Chain-of-Thought (step-wise) scoring via GPT-4/GPT-4V
Human baseline | Yes
Error labelling | Visual / logic / calculation / knowledge
Open/Closed model eval | Yes
In conclusion:
MathVerse sets a practical gold standard for evaluating and improving multimodal LLMs’ ability to truly “see” and reason about mathematical diagrams, not just manipulate text. Its design and CoT-based analysis enable deep, real-world diagnostic and benchmarking work, making it essential for developers and researchers building robust, interpretable AI systems for mathematical education, scientific computing, and beyond.
Project page and dataset: https://mathverse-cuhk.github.io