
We-Math Benchmark: Visual Math Reasoning

Updated 28 December 2025
  • The paper presents a novel suite of evaluation protocols that decompose visual mathematical problems into explicit knowledge components.
  • We-Math Benchmark integrates hierarchical taxonomies and multi-dimensional metrics to diagnose model performance beyond end-to-end accuracy.
  • MathBookEval, a core dataset, provides expert-verified visual problems with step-wise mapping to atomic mathematical concepts.

We-Math Benchmark

The We-Math benchmark is a suite of resources and evaluation protocols designed specifically to probe visual mathematical reasoning in large multimodal models (LMMs), with a focus on decomposing complex problems into explicit knowledge components and diagnosing model performance beyond end-to-end accuracy. It is centered on visual-mathematical tasks, integrating hierarchical concept taxonomies and multi-dimensional evaluation metrics to address core limitations of prior datasets, such as lack of fine-grained error analysis and insufficient grounding in mathematical pedagogy. The benchmark family includes the original We-Math dataset and diagnostic protocol (Qiao et al., 2024), and, in its most advanced incarnation, MathBookEval, a 1,000-problem, expert-verified benchmark and integral part of the We-Math 2.0 framework (Qiao et al., 14 Aug 2025).

1. Motivation and Conceptual Framework

The development of We-Math is motivated by fundamental limitations observed in generic visual-math benchmarks such as MathVista and MathVerse, which provide only end-to-end accuracy signals without diagnosing failure sources. Two central principles drive the We-Math design:

  • Atomic mastery requirement: Mathematical reasoning is decomposed into discrete knowledge concepts, echoing the didactic progression from simple to composite concepts.
  • Compositional integration: Multi-step problems are explicitly mapped to their constituent knowledge requirements, enabling scrutiny of how models compose and generalize atomic knowledge.

This approach reflects the cognitive trajectory of human learners and targets both knowledge acquisition and the integrative aspect of mathematical problem-solving (Qiao et al., 2024).
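
To make the decomposition concrete, the minimal Python sketch below represents a composite problem as an explicit list of the atomic knowledge concepts it requires. The class and field names are illustrative assumptions, not the released dataset's schema.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeConcept:
    """An atomic knowledge concept, e.g. "circle area" (illustrative name)."""
    name: str

@dataclass
class CompositeProblem:
    """A multi-step problem mapped to the atomic concepts it composes."""
    question: str
    required_concepts: list[KnowledgeConcept] = field(default_factory=list)

# A composite question in the spirit of We-Math's decomposition principle:
shaded_region = CompositeProblem(
    question="Find the area of the region between a square and its inscribed circle.",
    required_concepts=[
        KnowledgeConcept("square area"),
        KnowledgeConcept("circle area"),
    ],
)
```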

2. Dataset Construction and Knowledge System

Hierarchical Knowledge Design

We-Math 2.0 introduces a structured, five-level mathematical knowledge hierarchy, ranging from abstract definitions to theorems and applications covering primary through university mathematics. This framework encompasses 491 terminal knowledge points (e.g., “triangle area,” “quadratic formula,” “conditional probability”) and 1,819 fundamental principles, each validated and curated through a blend of expert-driven taxonomy and AI-assisted clustering:

| Source | Description | Scale |
|---|---|---|
| Human Experts | Textbooks, curricula, Wikipedia outline | Initial taxonomy |
| GPT-4o | Tagging and clustering of 30,000 problems | AI-driven concept tree |
| Expert Review | Tree merging and refinement | 491 points, 1,819 principles |

Final knowledge points form the backbone of both dataset construction and evaluation protocols (Qiao et al., 14 Aug 2025).
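
As a rough illustration of how such a five-level taxonomy can be represented, the sketch below builds a tiny, invented slice of a knowledge tree whose level-5 leaves play the role of terminal knowledge points. The node names and traversal helper are assumptions, not the benchmark's released data format.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeNode:
    """One node in a five-level knowledge hierarchy.

    Level 1 is the broadest domain; level-5 leaves stand in for the
    terminal knowledge points. All names below are invented examples.
    """
    name: str
    level: int
    children: list["KnowledgeNode"] = field(default_factory=list)

    def terminal_points(self) -> list[str]:
        """Collect every leaf (terminal knowledge point) under this node."""
        if not self.children:
            return [self.name]
        points: list[str] = []
        for child in self.children:
            points.extend(child.terminal_points())
        return points

# Tiny invented slice of such a hierarchy:
geometry = KnowledgeNode("Geometry", 1, [
    KnowledgeNode("Plane Geometry", 2, [
        KnowledgeNode("Triangles", 3, [
            KnowledgeNode("Area Formulas", 4, [
                KnowledgeNode("triangle area", 5),
                KnowledgeNode("Heron's formula", 5),
            ]),
        ]),
    ]),
])
print(geometry.terminal_points())  # ['triangle area', "Heron's formula"]
```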

Problem Design

MathBookEval, the flagship dataset, consists of 1,000 visual-mathematics problems (600 re-used from prior benchmarks and 400 newly written to ensure conceptual and procedural coverage). Each problem is linked to explicit solution steps, with each step mapped to a single knowledge point to facilitate step-aware evaluation. Problems are stratified by step count:

  • Level 1: 1–3 steps (62.0%)
  • Level 2: 4–6 steps (30.2%)
  • Level 3: 7–10 steps (7.8%)

Formats include multiple-choice and fill-in-the-blank items, frequently accompanied by GeoGebra-rendered diagrams. Problem design is informed by three orthogonal dimensions (step complexity, visual complexity, and contextual complexity), but MathBookEval explicitly annotates only reasoning depth (i.e., step count) and knowledge domain (Qiao et al., 14 Aug 2025).
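
A hypothetical record layout for a single MathBookEval item, reflecting the step-to-knowledge mapping and the step-count stratification described above, might look like the following; the field names are assumptions for illustration, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class SolutionStep:
    """One solution step, mapped to exactly one knowledge point."""
    description: str
    knowledge_point: str

@dataclass
class MathBookEvalItem:
    """Hypothetical record layout for a single MathBookEval problem."""
    question: str
    image_path: str            # e.g. a GeoGebra-rendered diagram
    answer: str
    answer_format: str         # "multiple_choice" or "fill_in_the_blank"
    domain: str                # e.g. "Geometry", "Algebra"
    steps: list[SolutionStep]  # one knowledge point per step

    @property
    def level(self) -> int:
        """Reasoning level from step count: L1 = 1-3, L2 = 4-6, L3 = 7-10 steps."""
        n = len(self.steps)
        return 1 if n <= 3 else 2 if n <= 6 else 3
```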

Sample Problem Illustrations

  1. Algebra (Level 1):

- Question: Solve for $x$: $2x^2 + 3x - 5 = 0$.
- Steps:
  1. Compute the discriminant: $\Delta = 3^2 - 4 \cdot 2 \cdot (-5) = 49$
  2. Apply the quadratic formula: $x = \frac{-3 \pm 7}{4}$
  3. Conclude: $x = 1$, $x = -2.5$

  2. Geometry (Level 3):

- Question: In a circle, chords $AB$ and $CD$ intersect at $P$. Given $\angle APC = 40^\circ$ and $\angle BPC = 70^\circ$, find $\angle BAD$.
- Steps: Use vertical angles, relate arcs to inscribed angles, sum the arcs, subtract from $360^\circ$, and divide to obtain the final inscribed angle.
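
As a quick sanity check of the algebra example's arithmetic (not part of the benchmark's tooling), the listed steps can be reproduced numerically:

```python
import math

# Check the Level 1 algebra example: 2x^2 + 3x - 5 = 0
a, b, c = 2, 3, -5
disc = b**2 - 4 * a * c                        # Step 1: 9 + 40 = 49
roots = [(-b + s * math.sqrt(disc)) / (2 * a)  # Step 2: quadratic formula
         for s in (1, -1)]
print(disc, roots)                             # Step 3: 49 [1.0, -2.5]
```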

3. Evaluation Protocols and Metrics

MathBookEval employs an “LLM-as-judge” protocol, in which GPT-4o evaluates candidate model responses (with or without chain-of-thought) and marks them correct if the final answer matches the ground truth. Key features include:

  • Overall accuracy (primary metric)
  • Breakdowns by reasoning level (L1/L2/L3) and knowledge domain (Geometry, Algebra, Fundamental Skills, Probability & Statistics)
  • Step-level correctness and chain-of-thought coherence, recorded but not used for leaderboard reporting

All 1,000 items constitute a zero-shot testbed: there is no held-out training data, and no few-shot prompts are used for evaluation. Each annotation is confirmed by two independent experts; only items with unanimous agreement on solution and step-to-knowledge mapping are included (Qiao et al., 14 Aug 2025).
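
A minimal sketch of how such an LLM-as-judge loop and its accuracy breakdowns could be scripted is shown below. The prompt wording, the `judge` callable standing in for a GPT-4o call, and the item fields (`question`, `reference`, `response`, `level`, `domain`) are all assumptions for illustration, not the released evaluation code.

```python
from typing import Callable

JUDGE_PROMPT = (
    "You are a grader.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate response: {response}\n"
    "Reply 'correct' only if the candidate's final answer matches the reference."
)

def evaluate(items: list[dict], judge: Callable[[str], str]) -> dict:
    """Score a set of items with an external judge model and report breakdowns."""
    totals: dict[str, int] = {}
    correct: dict[str, int] = {}
    n_correct = 0
    for item in items:
        verdict = judge(JUDGE_PROMPT.format(**item))
        ok = verdict.strip().lower().startswith("correct")
        n_correct += ok
        # Accumulate per-reasoning-level and per-domain tallies.
        for key in (f"L{item['level']}", item["domain"]):
            totals[key] = totals.get(key, 0) + 1
            correct[key] = correct.get(key, 0) + ok
    return {
        "overall": n_correct / len(items),
        "breakdowns": {k: correct[k] / totals[k] for k in totals},
    }
```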

4. Diagnostic Dimensions and Error Taxonomy

Earlier We-Math protocols introduced a four-dimensional error taxonomy (Qiao et al., 2024):

  • Insufficient Knowledge (IK): Sub-problems and composite questions both answered incorrectly
  • Inadequate Generalization (IG): Sub-problems answered correctly but composite fails
  • Complete Mastery (CM): Both sub-problems and composite answered correctly
  • Rote Memorization (RM): Composite correct but sub-problem(s) incorrect (indicative of shallow pattern recognition)

The benchmark reports the rate of samples in each category, along with an aggregated reasoning-confidence score that penalizes IK (weight zero), partially rewards IG (default $\beta = 0.5$), and fully rewards CM:

$$\mathrm{Score}_{\mathrm{average}} = \alpha\,S_{IK} + \beta\,S_{IG} + S_{CM}$$

with defaults $\alpha = 0$ and $\beta = 0.5$ (Qiao et al., 2024).
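
The four-way diagnosis and the weighted score can be expressed compactly; the sketch below assumes per-category sample counts as input, and the function names are illustrative rather than taken from the released evaluation code.

```python
def classify(sub_correct: list[bool], composite_correct: bool) -> str:
    """Four-way diagnosis from sub-problem and composite outcomes (IK/IG/CM/RM)."""
    if all(sub_correct):
        return "CM" if composite_correct else "IG"
    return "RM" if composite_correct else "IK"

def average_score(counts: dict[str, int], alpha: float = 0.0, beta: float = 0.5) -> float:
    """Score_average = alpha * S_IK + beta * S_IG + S_CM, from per-category counts.

    RM contributes nothing, matching the formula above; defaults are
    alpha = 0 and beta = 0.5.
    """
    total = sum(counts.values()) or 1
    rate = {k: v / total for k, v in counts.items()}
    return alpha * rate.get("IK", 0.0) + beta * rate.get("IG", 0.0) + rate.get("CM", 0.0)
```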

This taxonomy reveals qualitative differences in model learning stages: smaller models frequently remain mired in IK, while advanced LMMs (e.g., GPT-4o) demonstrate a shift from IK to IG as they generalize learned sub-concepts but sometimes fail to compose them successfully. High RM rates in open-source models expose vulnerabilities to superficial pattern matching.

5. Experimental Results and Model Comparisons

Extensive evaluation on MathBookEval demonstrates a consistent decline in accuracy with increased reasoning steps. Representative results from (Qiao et al., 14 Aug 2025):

| Model | Overall Acc. | L1 | L2 | L3 | Geo | Alg | FS | PS |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 46.7% | 50.1 | 43.0 | 33.3 | 38.8 | 60.2 | 44.1 | 58.5 |
| MathBook-7B | 50.4% (+3.7) | 52.0 | 48.2 | 45.8 | 40.5 | 63.3 | 57.4 | 67.9 |

  • The MathBook-7B model, trained with MathBook-RL, achieves a 3.7 percentage point improvement in overall accuracy and a 12.5 point gain on advanced problems (L3).
  • Domain stratification reveals that algebra and probability/statistics are relatively tractable (up to 67.9% accuracy), with geometry persistently the most challenging (40.5%).
  • For all models, performance diminishes with increasing step count.

Comparisons to classic text-only math reasoning datasets (e.g., GSM8K, MATH) position MathBookEval as uniquely demanding, given its inclusion of diagrammed problems, explicit step-level mapping, and visual reasoning components. Even state-of-the-art LMMs (e.g., GPT-4o) achieve only ~51% overall accuracy in the zero-shot protocol, with substantial accuracy drops as step-count increases (Qiao et al., 14 Aug 2025).

6. Comparative Benchmarks and Positioning

| Benchmark | Modality | #Problems | Step Depth | Visuals | Step Mapping | Evaluation Focus |
|---|---|---|---|---|---|---|
| GSM8K | Text | ~8,000 | ≈3–4 | No | No | Arithmetic accuracy |
| MATH | Text | ~12,000 | ≈5–7 | No | No | Competition math |
| We-Math | Visual | 6,524 | 1–3 | Yes | Yes | Error diagnosis |
| MathBookEval | Visual | 1,000 | 1–10 | Yes | Yes | Stepwise accuracy |

MathBookEval combines expert-verified, granular knowledge-point coverage, reasoning stratification up to 10 steps, and mandatory diagram interpretation. This design exposes weaknesses in both spatial reasoning and extended multi-step logical chaining that prior datasets could not reveal (Qiao et al., 2024; Qiao et al., 14 Aug 2025).

7. Implications, Limitations, and Future Directions

Empirical usage of We-Math and MathBookEval establishes that:

  • Step count is the dominant barrier: accuracy drops sharply as reasoning chains lengthen.
  • Many failure cases are attributable to missing elementary knowledge (high IK), which can be mitigated by knowledge augmentation (e.g., concept cards).
  • Current state-of-the-art LMMs (notably GPT-4o) are beginning to transition from pure knowledge deficits (IK) to more sophisticated integration challenges (IG).
  • No model demonstrates universal complete mastery (CM), and high RM rates underscore the inadequacy of end-to-end accuracy as a sole metric.

Potential future directions include concept-level pre-training and retrieval-augmented inference (formula sheets, “mini-lecture” retrieval), modular architectures that assign compositional reasoning and sub-problem solving to specialized modules, and curriculum learning by decomposition to reinforce atomic concept mastery before composition (Qiao et al., 2024, Qiao et al., 14 Aug 2025).


We-Math, and especially MathBookEval as realized in We-Math 2.0, establish a multi-faceted foundation for evaluating and diagnosing the visual mathematical reasoning abilities of multimodal LLMs, revealing both progress and enduring limitations in the field (Qiao et al., 2024, Qiao et al., 14 Aug 2025).
