We-Math 2.0: Unified Reasoning for MLLMs
- We-Math 2.0 is a unified mathematical reasoning system that maps every step of a solution to a precise knowledge point within its five-level MathBook Knowledge System.
- It leverages dual dataset strategies (MathBook-Standard and MathBook-Pro) and a two-stage reinforcement learning framework to address both structural and visual complexities in mathematical tasks.
- The integrated MathBookEval benchmark rigorously tests generalization across diverse mathematical domains, ensuring robustness and interpretability in multi-modal problem-solving.
We-Math 2.0 is a unified mathematical reasoning system for multimodal LLMs (MLLMs) that integrates a five-level hierarchical knowledge system, advanced dataset construction strategies, a two-stage reinforcement learning (RL) training paradigm, and a comprehensive benchmark. Its design addresses two critical deficiencies identified in prior approaches: the lack of a structured, knowledge-oriented framework and insufficient alignment between model training and the compositional data space of mathematical reasoning tasks. We-Math 2.0 aims to incentivize MLLMs to achieve interpretable, knowledge-driven, and generalizable reasoning that robustly spans a full spectrum of mathematical concepts, visual complexities, and problem-solving contexts (Qiao et al., 14 Aug 2025).
1. MathBook Knowledge System
At the core of We-Math 2.0 lies the MathBook Knowledge System, a hierarchical structure built to support fine-grained, knowledge-oriented reasoning supervision. This five-level system comprises 491 distinct knowledge points and 1,819 knowledge principles, systematically covering the "Definition–Theorem–Application" paradigm. These knowledge points were annotated and validated through a hybrid human–AI pipeline: first, domain experts assembled an initial structure referencing textbooks, Wikipedia, and curricular standards; concurrently, approximately 30,000 math problems were automatically tagged and then vector-clustered using GPT-4. Human experts then reconciled and refined the merged taxonomy.
Each step in a problem’s chain-of-thought solution is explicitly mapped back to a knowledge point in the hierarchy. This approach ensures that every model-generated reasoning step can be attributed to a precise mathematical concept, providing interpretability for both model assessment and learning supervision. The resulting system spans domains and subdomains, supporting both foundational and advanced topics and enabling diverse, interpretable data annotation.
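The step-to-knowledge-point mapping can be pictured as a small data model. The following Python sketch is purely illustrative: the field names, identifiers, and level encoding are assumptions for exposition, not the released annotation schema.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgePoint:
    """One node in the five-level MathBook hierarchy (illustrative fields)."""
    kp_id: str                 # hypothetical identifier, e.g. "geometry.triangle.median"
    name: str
    level: int                 # depth within the five-level hierarchy (1-5)
    principles: list[str] = field(default_factory=list)  # attached knowledge principles

@dataclass
class ReasoningStep:
    """A single chain-of-thought step attributed to exactly one knowledge point."""
    text: str
    kp_id: str

@dataclass
class AnnotatedSolution:
    problem_id: str
    steps: list[ReasoningStep]

    def knowledge_points_used(self) -> list[str]:
        """Ordered, de-duplicated knowledge points touched by this solution."""
        seen: list[str] = []
        for step in self.steps:
            if step.kp_id not in seen:
                seen.append(step.kp_id)
        return seen
```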
2. MathBook-Standard & MathBook-Pro Datasets
Dataset construction in We-Math 2.0 centers on the MathBook-Standard and MathBook-Pro corpora, each developed to maximize both conceptual breadth and structural flexibility. MathBook-Standard is seeded by a collection of problems, each with an associated high-fidelity diagram rendered in GeoGebra. Dual expansion is achieved via two strategies (sketched in code after this list):
- One-problem–multi-image: holding the question constant while generating multiple diagrammatic realizations (e.g., triangles of various types for a fixed geometric theorem).
- One-image–multi-problem: creating multiple distinct questions for a single diagram, each targeting different knowledge points or aspects.
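The two expansion strategies above can be viewed as simple generators over seed entries. The sketch below is a minimal illustration; the record fields and the `diagram_spec` handle are assumptions (the actual pipeline renders diagrams in GeoGebra), not the released tooling.

```python
from dataclasses import dataclass

@dataclass
class SeedProblem:
    question: str
    diagram_spec: str      # assumed handle for a GeoGebra-rendered diagram
    knowledge_point: str

def one_problem_multi_image(seed: SeedProblem, diagram_variants: list[str]) -> list[dict]:
    """Hold the question fixed; pair it with several diagrammatic realizations
    (e.g. acute, right, and obtuse triangles for one geometric theorem)."""
    return [
        {"question": seed.question, "diagram": d, "kp": seed.knowledge_point}
        for d in diagram_variants
    ]

def one_image_multi_problem(diagram: str, questions_with_kps: list[tuple[str, str]]) -> list[dict]:
    """Hold the diagram fixed; attach several questions, each targeting a
    different knowledge point or aspect of the figure."""
    return [
        {"question": q, "diagram": diagram, "kp": kp}
        for q, kp in questions_with_kps
    ]
```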
MathBook-Pro introduces systematic variation along three difficulty axes, resulting in seven progressive variants per problem:
- Step Complexity: increasing the number of reasoning steps required, often by introducing auxiliary conclusions.
- Visual Complexity: adding diagrammatic elements that, while not altering the core structure, raise the visual or spatial demand.
- Contextual Complexity: recasting the question into more elaborate or real-world scenarios, thereby requiring the model to generalize its reasoning capabilities.
This three-dimensional difficulty modeling enables the generation of curriculum-aligned datasets that foster robust learning and support rapid scaling for diverse training and evaluation regimes.
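One plausible way to read "seven progressive variants" is as the seven non-empty combinations of the three axes; this interpretation, and the toy difficulty operators below, are assumptions made for illustration rather than the paper's exact construction.

```python
from itertools import combinations

# Toy difficulty operators (assumptions for illustration only).
def add_step_complexity(p):        # e.g. require an auxiliary conclusion
    return {**p, "steps": p["steps"] + 1}

def add_visual_complexity(p):      # e.g. add a non-structural diagram element
    return {**p, "visual_elements": p["visual_elements"] + 1}

def add_contextual_complexity(p):  # e.g. recast into a real-world scenario
    return {**p, "contextualized": True}

OPERATORS = {
    "step": add_step_complexity,
    "visual": add_visual_complexity,
    "context": add_contextual_complexity,
}

def expand_to_variants(seed):
    """Enumerate the 2^3 - 1 = 7 non-empty axis combinations, applying the
    corresponding operators cumulatively to the seed problem."""
    variants = []
    axes = list(OPERATORS)
    for r in range(1, len(axes) + 1):
        for combo in combinations(axes, r):
            variant = dict(seed)
            for axis in combo:
                variant = OPERATORS[axis](variant)
            variant["axes"] = combo
            variants.append(variant)
    return variants

seed = {"steps": 2, "visual_elements": 3, "contextualized": False}
assert len(expand_to_variants(seed)) == 7
```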
3. MathBook-RL: Two-Stage Reinforcement Learning Framework
We-Math 2.0 introduces the MathBook-RL framework to align MLLMs with explicit, knowledge-driven chain-of-thought reasoning and to encourage progressive mastery. The two-stage RL paradigm comprises:
- Cold-Start Fine-tuning: Models are first supervised on MathBook-Standard data, where each sample is rewritten to contain an explicit natural-language chain-of-thought explanation referencing the appropriate knowledge points. The objective in this stage is the standard negative log-likelihood over the annotated solution $y$ given the multimodal input $x$:

  $$\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x,\,y)\sim\mathcal{D}}\Big[\sum_{t=1}^{|y|} \log \pi_\theta\big(y_t \mid x,\, y_{<t}\big)\Big]$$
- Progressive Alignment RL: The model is then advanced to the MathBook-Pro variants via Group Relative Policy Optimization (GRPO). Two dynamic data scheduling approaches modulate training (a scheduling sketch follows this list):
  - Knowledge Increment Scheduling: when the model fails at increased step complexity, it is re-routed to problems isolating the novel knowledge point.
  - Modality Increment Scheduling: failures caused by increased visual or contextual complexity are addressed by guiding the model through samples isolating those features.
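A minimal sketch of the scheduling logic described above, assuming a hypothetical `SampleBank` index over MathBook-Standard problems; the failure signal and routing helpers are illustrative placeholders, not the paper's implementation.

```python
class SampleBank:
    """Hypothetical index over MathBook-Standard problems, keyed by knowledge
    point and by isolated visual/contextual feature."""
    def __init__(self, problems):
        self.problems = problems   # dicts with "knowledge_points" and "features" lists

    def isolating_knowledge_point(self, kp):
        return [p for p in self.problems if p["knowledge_points"] == [kp]]

    def isolating_feature(self, feature):
        return [p for p in self.problems if p["features"] == [feature]]

def schedule_next_batch(variant, failed_axes, bank):
    """Route training after a failure on a MathBook-Pro variant;
    `failed_axes` is a subset of {"step", "visual", "context"}."""
    if "step" in failed_axes:
        # Knowledge Increment Scheduling: isolate the newly introduced knowledge point.
        return bank.isolating_knowledge_point(variant["incremental_knowledge_point"])
    if failed_axes & {"visual", "context"}:
        # Modality Increment Scheduling: isolate the added visual/contextual feature.
        return bank.isolating_feature(variant["incremental_feature"])
    # No increment-attributable failure: continue with the scheduled variant.
    return [variant]
```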
The RL objective uses a mean-based reward system across grouped variants. In simplified form:

$$\mathcal{J}(\theta) = \mathbb{E}_{\pi_\theta}\big[\bar{R}\big] - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)$$

where $\bar{R}$ denotes an average reward based on answer accuracy, $\pi_\theta$ the model policy (with $\pi_{\mathrm{ref}}$ its frozen reference), and $\beta$ the KL-penalty weight. This curriculum-aligned regime supports the transition from interpretable step-wise reasoning to robust compositional mastery.
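For intuition, the sketch below computes a GRPO-style group-relative advantage from binary accuracy rewards; the exact reward aggregation and normalization used in MathBook-RL may differ.

```python
import numpy as np

def accuracy_rewards(predicted_answers, gold_answer):
    """Binary accuracy reward per sampled response."""
    return [1.0 if pred == gold_answer else 0.0 for pred in predicted_answers]

def group_relative_advantages(rewards):
    """GRPO-style advantages: each response is scored relative to the mean
    (and spread) of rewards within its sampled group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-6)

# Example: four sampled responses to the same problem, two correct.
rewards = accuracy_rewards(["12", "15", "12", "13"], "12")
print(group_relative_advantages(rewards))   # positive for correct, negative otherwise
```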
4. MathBookEval Benchmark
MathBookEval is designed to rigorously quantify both the depth and breadth of mathematical reasoning generalization. It encompasses all 491 knowledge points, distributed across four domains and thirteen subdomains. Problems are stratified into three step-based levels:
- Level 1: 1–3 steps (basic reasoning)
- Level 2: 4–6 steps (intermediate reasoning)
- Level 3: 7–10 steps (complex reasoning)
Each problem is meticulously annotated with step-by-step solutions, mapping each step to a specific knowledge point. Model performance is evaluated using standardized prompts and “LLM-as-a-judge” methods (e.g., with GPT-4o), providing multidimensional insights into answer correctness and knowledge integration across various reasoning complexities.
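A minimal sketch of such an evaluation loop, with the judge abstracted as a callable (e.g., a wrapper around GPT-4o); the prompt wording and scoring rule are illustrative assumptions, not the benchmark's released harness.

```python
def judge_answer(question, reference_answer, model_answer, judge):
    """Score one response with an LLM judge; `judge` is any callable mapping a
    prompt string to the judge model's text output."""
    prompt = (
        "You are grading a math answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Model answer: {model_answer}\n"
        "Reply with exactly CORRECT or INCORRECT."
    )
    verdict = judge(prompt).strip().upper()
    return verdict.startswith("CORRECT")

def benchmark_accuracy(examples, judge):
    """Fraction of benchmark examples judged correct."""
    correct = sum(
        judge_answer(ex["question"], ex["answer"], ex["prediction"], judge)
        for ex in examples
    )
    return correct / len(examples)
```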
5. Empirical Results and Impact
MathBook-RL was benchmarked on MathVista, MathVision, MathVerse, We-Math, and MathBookEval, with instantiations such as MathBook-7B compared to both closed-source models (e.g., GPT-4o) and open-source baselines (Qwen2.5-VL, Math-PUMA). Key findings include:
- MathBook-RL delivers competitive or superior results on established multimodal math benchmarks.
- RL training on structurally varied samples (e.g., one-problem–multi-image) enhances robustness for visual compositional reasoning.
- Dynamic scheduling improves transfer from simple to complex problem variants, especially for multi-modal and multi-step settings.
- On MathBookEval, MathBook-RL achieves strong generalization, particularly in subdomains like geometry, as evidenced by higher accuracy on deep multi-step problems.
Ablation studies confirm that curriculum-aligned RL yields significant boosts beyond supervised fine-tuning alone, particularly in mastering step-complex and visually demanding problems. This approach directly addresses common model limitations, such as rote memorization and failure to generalize compositional knowledge.
6. Significance and Prospective Applications
We-Math 2.0 establishes a comprehensive framework incentivizing interpretable, incremental reasoning over broad mathematical domains and modalities. The explicit mapping of reasoning steps to annotated knowledge points enables fine-grained inspection and guides future improvements in curriculum design, difficulty modeling, and multi-modal data augmentation. The system’s integration of dynamic training schedules and richly structured datasets provides a paradigm for robust, generalizable mathematical reasoning that extends beyond prior dataset-centric optimization.
This framework also supplies a valuable resource for ongoing research into model interpretability, fine-grained curriculum learning, and compositional generalization—core challenges highlighted by analyses of step-level failures and memorization effects in recent LMMs. A plausible implication is that We-Math 2.0 will influence subsequent developments in multi-domain, knowledge-driven AI, providing a model-centric data space and training strategy that extends to other structured reasoning tasks.
7. Relationship to Prior Research
By integrating rigorous knowledge annotation, explicit hierarchical modeling, and curriculum-aligned reinforcement learning, We-Math 2.0 synthesizes advances loosely paralleled in prior mathematical reasoning resources, while extending them with systematic three-dimensional difficulty expansion, step-level solution mapping, and a unified, progressive training and benchmarking paradigm. This suggests a pathway toward lifting the standard from surface-level accuracy to interpretable, compositionally robust reasoning, particularly as visual and step complexity grow (Qiao et al., 14 Aug 2025).