MathBook-RL: Curriculum for Multi-Modal Math
- MathBook-RL is a reinforcement learning framework that systematically enhances mathematical reasoning in multimodal large language models using a hierarchical knowledge system.
- It employs a two-stage training paradigm with cold-start fine-tuning and dynamic, curriculum-based reinforcement learning to address visual and symbolic problem solving.
- The framework demonstrates robust improvements on benchmarks such as MathVista, We-Math, and MathBookEval by incrementally increasing step, visual, and contextual complexities.
MathBook-RL is a reinforcement learning (RL) framework designed to systematically improve mathematical reasoning capabilities in multimodal LLMs (MLLMs), particularly in the context of the We-Math 2.0 system (Qiao et al., 14 Aug 2025). Built upon a hierarchical mathematical knowledge system, MathBook-RL employs a curriculum-based, progressive alignment approach to training MLLMs for robust visual and symbolic mathematical problem solving. The framework is structured around two principal stages: cold-start supervised fine-tuning and knowledge-driven RL leveraging dynamic scheduling across difficulty dimensions, yielding strong generalization and competitive benchmark performance.
1. Two-Stage MathBook-RL Training Paradigm
MathBook-RL is characterized by a two-stage training protocol, supervised fine-tuning followed by RL, targeting mathematical reasoning over multimodal data:
- Cold-Start Fine-Tuning:
- The model is initially fine-tuned on MathBook-Standard, a dataset sourced from a knowledge system organizing 491 knowledge points and 1,819 principles in a five-level hierarchy.
- Dual expansion strategies—“one-question-multi-image” and “one-image-multi-question”—enable broad coverage and flexibility.
- Fine-tuning samples are rewritten to incorporate explicit, knowledge-oriented chain-of-thought (CoT) explanations referencing core mathematical concepts.
- The objective is standard cross-entropy minimization over the rewritten CoT targets:
  $$\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x,\,y)\sim\mathcal{D}}\sum_{t=1}^{|y|}\log \pi_\theta\!\left(y_t \mid x,\, y_{<t}\right)$$
- Progressive Alignment Reinforcement Learning:
- Following SFT, RL training is conducted in two phases:
- Pre-Aligned RL: Uses MathBook-Standard image variants (ImgVar), enforcing group consistency via an average-reward mechanism. For each variant $i$ in a group of $n$, the rule-based reward is $r_i = 1$ (correct), $0.1$ (format-correct), or $0$ (otherwise), and the group receives the mean $\bar{r} = \frac{1}{n}\sum_{i=1}^{n} r_i$, crediting the policy only when it answers consistently across variants (a minimal reward sketch appears after this list).
- Dynamic Scheduling RL: Constructs progressive curricula along three orthogonal dimensions: step complexity ($\phi_s$), visual complexity ($\phi_v$), and contextual complexity ($\phi_c$), writing $\phi_s$, $\phi_v$, $\phi_c$ for the corresponding problem transformations. Curriculum paths are defined by incremental transformation of a seed problem $x_0$:
- $x_{k+1} = \phi_d(x_k)$, $d \in \{s, v, c\}$, so each step adds difficulty along exactly one dimension.
- Errors trigger targeted knowledge or modality increment schedules.
- Group Relative Policy Optimization (GRPO) objective, with group-normalized advantages $\hat{A}_i = \frac{r_i - \mathrm{mean}(\{r_j\}_{j=1}^{G})}{\mathrm{std}(\{r_j\}_{j=1}^{G})}$ over $G$ sampled responses $o_i$ per question $q$:
  $$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}\hat{A}_i,\ \operatorname{clip}\!\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_i\right)\right] - \beta\,\mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)$$
  A sketch of the group-relative advantage computation also follows this list.
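As referenced above, the following is a minimal Python sketch of the variant-group average-reward mechanism. The `Variant` container and grading details are illustrative assumptions; only the reward values (1 for correct, 0.1 for format-correct, 0 otherwise) and the group averaging come from the description above.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    """One image/question variant of a MathBook-Standard seed problem (hypothetical container)."""
    answer: str        # gold answer
    prediction: str    # model output, assumed already extracted from the response
    format_ok: bool    # whether the response followed the required output format

def rule_based_reward(v: Variant) -> float:
    """Rule-based reward: 1 (correct), 0.1 (format-correct only), 0 (otherwise)."""
    if v.format_ok and v.prediction.strip() == v.answer.strip():
        return 1.0
    return 0.1 if v.format_ok else 0.0

def group_average_reward(variants: list[Variant]) -> float:
    """Average-reward mechanism: assign the group the mean reward over its variants,
    so credit accrues only when the policy answers consistently across variants."""
    return sum(rule_based_reward(v) for v in variants) / len(variants)
```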
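Similarly, here is a minimal sketch of the group-relative advantage computation in the GRPO objective; the function name and the stabilizing `eps` are assumptions, while the standardization mirrors the formula above.

```python
import statistics

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: A_i = (r_i - mean(r)) / std(r), computed
    within one group of G sampled responses to the same question."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Usage: G = 4 responses scored by the rule-based reward above.
print(grpo_advantages([1.0, 0.1, 0.0, 1.0]))
```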
2. Knowledge System and Dataset Structuring
MathBook-RL leverages a highly granular five-level hierarchical mathematical knowledge system to structure both training and evaluation. This system encompasses:
- 491 knowledge points spanning diverse mathematical domains.
- 1,819 fundamental principles, each mapped to specific chain-of-thought steps in model reasoning.
- MathBook-Standard: Standardized multimodal seed problems with dual expansion (“one-question-multi-image” and “one-image-multi-question”) for broad principle coverage (see the expansion sketch below).
- MathBook-Pro: Generated via progressive difficulty modeling along the step, visual, and contextual dimensions; each seed problem yields seven incrementally more challenging variants.
This organization enables explicit supervision and alignment at each reasoning step, supporting systematic curriculum learning and fine-grained error attribution.
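To make the dataset structuring concrete, here is a minimal sketch of one way such a hierarchy could be represented; the class names and fields are hypothetical, and only the five-level depth, the 491 knowledge points, and the 1,819 principles are taken from the text.

```python
from dataclasses import dataclass, field

@dataclass
class Principle:
    """One of the 1,819 fundamental principles, mapped to explicit CoT steps."""
    name: str
    cot_template: str  # knowledge-oriented chain-of-thought step citing this principle

@dataclass
class KnowledgeNode:
    """Node in the five-level hierarchy; leaves correspond to the 491 knowledge points."""
    name: str
    level: int  # 1 (top-level domain) through 5 (knowledge point)
    principles: list[Principle] = field(default_factory=list)
    children: list["KnowledgeNode"] = field(default_factory=list)

    def knowledge_points(self) -> list["KnowledgeNode"]:
        """Collect all leaf-level knowledge points beneath this node."""
        if not self.children:
            return [self]
        return [kp for child in self.children for kp in child.knowledge_points()]
```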
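The dual expansion mentioned above admits an equally simple sketch (the dictionary schema is an assumption for illustration):

```python
def one_question_multi_image(question: str, images: list[str]) -> list[dict]:
    """Dual expansion, direction 1: pair one question stem with several diagram variants."""
    return [{"question": question, "image": img} for img in images]

def one_image_multi_question(image: str, questions: list[str]) -> list[dict]:
    """Dual expansion, direction 2: pair one diagram with several question variants."""
    return [{"question": q, "image": image} for q in questions]
```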
3. Progressive Curriculum and Dynamic Scheduling
Problem difficulty is parametrized in a three-dimensional space:
- Step Complexity ($\phi_s$): Problems require more knowledge points or intermediate conclusions.
- Visual Complexity ($\phi_v$): Diagrams augmented with auxiliary elements elevate visual reasoning demands.
- Contextual Complexity ($\phi_c$): Statements are rewritten for rich real-world context embedding.
Curricula are constructed by transforming a seed problem through composition of $\phi_s$, $\phi_v$, and $\phi_c$. Errors on progressively transformed samples activate targeted incremental scheduling (see the scheduler sketch after this list):
- Knowledge Increment Scheduling introduces auxiliary problems testing the missing knowledge point.
- Modality Increment Scheduling provides samples isolating the visual or contextual modality that caused the error.
This incremental approach facilitates progressive alignment, bridging knowledge gaps and boosting transfer across modalities and difficulty gradients.
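The following minimal Python sketch illustrates this dynamic scheduling loop under stated assumptions: `solve`, `diagnose`, and the auxiliary-sample constructors are hypothetical stand-ins, while the one-dimension-per-step transformations and the two increment-scheduling responses come from the description above.

```python
from typing import Callable

Problem = dict                             # hypothetical problem record
Transform = Callable[[Problem], Problem]   # phi_s, phi_v, or phi_c

def dynamic_schedule(seed: Problem,
                     path: list[Transform],
                     solve: Callable[[Problem], bool],
                     diagnose: Callable[[Problem], str],
                     aux_knowledge: Callable[[Problem], Problem],
                     aux_modality: Callable[[Problem, str], Problem]) -> list[Problem]:
    """Walk a curriculum path x_{k+1} = phi_d(x_k); on a failure, splice in an
    auxiliary sample targeting the missing knowledge point or offending modality."""
    schedule, x = [], seed
    for phi in path:
        x = phi(x)                   # add difficulty along exactly one dimension
        schedule.append(x)
        if not solve(x):             # policy fails on the transformed sample
            cause = diagnose(x)      # e.g. "knowledge", "visual", or "contextual"
            if cause == "knowledge":
                schedule.append(aux_knowledge(x))        # Knowledge Increment Scheduling
            else:
                schedule.append(aux_modality(x, cause))  # Modality Increment Scheduling
    return schedule
```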
4. Experimental Evaluation and Benchmarking
MathBook-RL, instantiated in MathBook-7B, demonstrates superior or competitive performance on four standard mathematical reasoning benchmarks—MathVista, MathVision, We-Math, and MathVerse. Key results include:
- MathBook-7B outperforms its backbone (Qwen2.5-VL-7B) by more than 5% in average accuracy.
- On the We-Math benchmark, which evaluates both main problems and associated subproblems, MathBook-RL surpasses other RL-based baselines.
- On MathBookEval—a benchmark evaluating all 491 knowledge points and reasoning depths—MathBook-7B maintains robust performance, especially on problems requiring deep multi-step reasoning (7–10 steps), where other models exhibit notable degradation.
These results affirm that curriculum-driven progressive alignment and hierarchical knowledge supervision yield enhanced generalization and scalable performance across diverse mathematical domains and reasoning complexities.
5. Integration within the We-Math 2.0 System
MathBook-RL is a core component of We-Math 2.0, a comprehensive system for incentivizing and evaluating visual mathematical reasoning in MLLMs. Its interaction with key modules includes:
- The five-level MathBook Knowledge System supplies explicit chain-of-thought alignment and reasoning granularity for both SFT and RL.
- MathBook-Standard and MathBook-Pro datasets drive initial alignment and dynamic curriculum construction with full coverage of modalities and difficulty axes.
- MathBookEval provides exhaustive evaluation, ensuring improvements during training translate to tangible gains in generalization and problem coverage.
This unified workflow tightly links dataset generation, training protocols, and benchmarking, increasing transparency and interpretability.
6. Implications and Applications
MathBook-RL’s hierarchical, curriculum-aligned approach enables MLLMs to:
- Develop robust structured chain-of-thought reasoning for mathematical tasks, integrating explicit knowledge point supervision and dynamic error-bridging.
- Achieve domain generalization across step, visual, and contextual complexities, handling multi-step reasoning with resilience.
- Supply transparent and interpretable problem-solving strategies amenable to direct evaluation and extension.
A plausible implication is improved deployment in educational technology, intelligent tutoring systems, and other domains requiring systematic mathematical reasoning with clear stepwise explanations. The methods outlined, including progressive scheduling and group reward mechanisms, further suggest extensibility to other domains that feature structured reasoning tasks.
7. Future Directions
Future work could refine incremental scheduling granularity, develop more adaptive dynamic curricula, and further expand the knowledge hierarchy or integrate additional modalities. There is potential for analogous RL-based curriculum alignment strategies to be applied in complex reasoning tasks outside mathematics, broadening the impact of the MathBook-RL paradigm. The explicit, knowledge-driven structure and curriculum-based RL mechanisms position MathBook-RL as a model development strategy for scalable reasoning in both multimodal and symbolic AI systems.