MathForge: Enhancing Math Reasoning
- MathForge is a dual-framework approach that enhances LLM mathematical reasoning by prioritizing hard but solvable questions through integrated data augmentation and optimization.
- Its Difficulty-Aware Group Policy Optimization (DGPO) uses mean-absolute-deviation to balance gradient updates, ensuring consistent learning from challenging instances.
- Multi-Aspect Question Reformulation (MQR) systematically generates intrinsically harder question variants, boosting model generalization on rigorous mathematical benchmarks.
MathForge is a dual-framework approach for augmenting and optimizing mathematical reasoning in LLMs by targeting and prioritizing harder problems throughout both the data augmentation and learning processes. MathForge synergistically combines Difficulty-Aware Group Policy Optimization (DGPO), which addresses core algorithmic imbalances in reinforcement learning with verifiable rewards (RLVR), and Multi-Aspect Question Reformulation (MQR), which systematically expands the pool of challenging mathematical questions while preserving answer correctness. The methodology demonstrably improves generalization and performance on mathematical benchmarks by refocusing learning dynamics and dataset composition toward "hard but solvable" centers of capability (Dai et al., 28 Jan 2026).
1. Motivation and Theoretical Foundations
MathForge arises from the observation that prevailing RLVR pipelines, most notably those employing Group Relative Policy Optimization (GRPO), tend to underemphasize intrinsically hard questions in both data and optimization. GRPO, which organizes responses into groups per question and normalizes rewards by their standard deviation, induces maximal update strength at intermediate difficulty (reward accuracy $p = 0.5$) and vanishing update strength at the extremes ($p \to 0$ or $p \to 1$). This skews gradient flow away from the domains that probe model weaknesses. Concurrently, augmentation strategies have primarily diversified data via paraphrasing without systematically increasing intrinsic, rather than merely superficial, difficulty.
Mathematically, the uncapped update magnitude in GRPO scales as $2G\sqrt{p(1-p)}$, where $G$ is the group size and $p$ the question accuracy, so update mass is sharply attenuated on both trivially easy and genuinely challenging problems. MathForge postulates that learning is maximized on hard-but-solvable queries, those with at least one correct answer in the group, which are rare under both standard curricula and reward structures (Dai et al., 28 Jan 2026).
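This attenuation can be illustrated with a minimal numeric sketch, assuming binary rewards and the expected-value form of the per-question update magnitude (the group size $G = 8$ is an arbitrary choice):

```python
import numpy as np

def grpo_update_mass(p, G=8):
    """Expected total |advantage| mass per question under GRPO's std normalization.

    For binary rewards with accuracy p, the expected sum of absolute
    deviations is 2*G*p*(1-p) and the std is sqrt(p*(1-p)), so the
    normalized mass is 2*G*sqrt(p*(1-p)): maximal at p=0.5, zero at extremes.
    """
    return 2 * G * np.sqrt(p * (1 - p))

for p in [0.05, 0.5, 0.95]:
    print(f"p={p:.2f}  update mass={grpo_update_mass(p):.3f}")
```

At $p = 0.5$ the mass equals $G$; at $p = 0.05$ or $p = 0.95$ it falls to roughly $0.44\,G$, and it vanishes entirely at $p \in \{0, 1\}$.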
2. Difficulty-Aware Group Policy Optimization (DGPO)
DGPO is engineered to eliminate the gradient suppression endemic to GRPO and directly incentivize learning from hard instances. It comprises two core mechanisms:
Difficulty-Balanced Group Advantage Estimation (DGAE):
Replaces GRPO's standard deviation normalization with mean-absolute-deviation (MAD):
- Group advantage becomes $A_i = \dfrac{r_i - \bar r}{D}$, with $\bar r = \frac{1}{G}\sum_{j=1}^{G} r_j$ and $D = \frac{1}{G}\sum_{j=1}^{G} |r_j - \bar r|$.
- The resulting sum $\sum_{i=1}^{G} |A_i| = G$ is invariant under the question accuracy $p \in (0,1)$, guaranteeing equalized per-question update magnitude regardless of empirical difficulty.
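A minimal sketch of MAD-normalized advantages (illustrative group size $G = 8$; zeroing out degenerate all-correct or all-wrong groups is an assumed handling choice):

```python
import numpy as np

def dgae_advantages(rewards):
    """Difficulty-balanced advantages: MAD normalization instead of std."""
    r = np.asarray(rewards, dtype=float)
    mad = np.mean(np.abs(r - r.mean()))
    if mad == 0:  # all-correct or all-wrong group carries no signal
        return np.zeros_like(r)
    return (r - r.mean()) / mad

easy = dgae_advantages([1, 1, 1, 1, 1, 1, 1, 0])  # accuracy 7/8
hard = dgae_advantages([1, 0, 0, 0, 0, 0, 0, 0])  # accuracy 1/8
print(np.abs(easy).sum(), np.abs(hard).sum())     # both equal G = 8
```

Both groups contribute identical total update mass, unlike std normalization, under which the magnitudes would differ with accuracy.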
Difficulty-Aware Question-Level Weighting (DQW):
In a batch of valid questions $\{q_b\}_{b=1}^{B}$, each receives a difficulty score $d_b = 1 - \bar r_b$ and weight
- $w_b = \dfrac{\exp(d_b/\tau)}{\sum_{b'=1}^{B} \exp(d_{b'}/\tau)}$, with temperature $\tau$.
- Lower mean reward (i.e., higher intrinsic difficulty) results in exponentially higher weighting.
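The weighting can be sketched as a softmax over difficulty scores; the temperature value $\tau = 0.5$ here is an assumption, not necessarily the paper's setting:

```python
import numpy as np

def dqw_weights(mean_rewards, tau=0.5):
    """Softmax weights over difficulty d_b = 1 - mean reward.

    tau is an assumed temperature; smaller tau concentrates weight
    more aggressively on the hardest questions.
    """
    d = 1.0 - np.asarray(mean_rewards, dtype=float)
    z = np.exp(d / tau)
    return z / z.sum()

w = dqw_weights([0.9, 0.5, 0.1])  # easy, medium, hard question
print(w)  # the hardest question receives the largest weight
```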
Full DGPO Loss:
The batchwise objective is

$$\mathcal{J}_{\mathrm{DGPO}}(\theta) = \mathbb{E}\left[\sum_{b=1}^{B} w_b \cdot \frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \min\!\Big(\rho_{i,t} A_i,\; \mathrm{clip}(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon)\, A_i\Big)\right],$$

with $\rho_{i,t} = \dfrac{\pi_\theta(o_{i,t} \mid q_b, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q_b, o_{i,<t})}$ as the token-level importance weight.
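The objective can be sketched numerically, assuming DGAE advantages, DQW weights, and standard PPO-style clipping (the threshold $\epsilon = 0.2$ is an assumed default, and per-response token averaging is a simplification):

```python
import numpy as np

def dgpo_loss(ratios, advantages, weights, eps=0.2):
    """Sketch of the batchwise DGPO surrogate objective (to be maximized).

    ratios[b][i]     : list of token-level importance weights pi_theta/pi_old
    advantages[b][i] : DGAE advantage for response i of question b
    weights[b]       : DQW weight for question b
    """
    total = 0.0
    for b, w in enumerate(weights):
        per_response = []
        for i, A in enumerate(advantages[b]):
            rho = np.asarray(ratios[b][i])
            # clipped policy-gradient surrogate, elementwise over tokens
            surr = np.minimum(rho * A, np.clip(rho, 1 - eps, 1 + eps) * A)
            per_response.append(surr.mean())
        total += w * np.mean(per_response)
    return total
```

With all ratios at 1 and symmetric advantages, the surrogate is zero, matching the zero-mean property of group-relative advantages.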
Empirical Effects:
Ablations indicate DGAE and DQW yield additive improvements in pass@1 accuracy over baseline GRPO, specifically: DGAE +0.94%, DQW +1.14% on the MATH benchmark (Dai et al., 28 Jan 2026).
3. Multi-Aspect Question Reformulation (MQR)
MQR systematically expands the dataset into intrinsically harder variants, targeting three difficulty axes per original question:
| Aspect | Description | Example Modification |
|---|---|---|
| Background | Embed irrelevant context | Add unrelated story/background detail, increasing lexical noise |
| Term | Symbolic abstraction | Define new symbol/concept (e.g., "euro-gap ") to abstract |
| Sub-Problem | Nested reasoning | Prepend/extract a key sub-task, requiring multi-step reasoning |
The reformulation workflow uses a simple prompting pipeline with a writer LLM (e.g., OpenAI o3 or a smaller open-source model), applying three fixed instructions to generate each aspect while constraining the gold answer to remain invariant across all variants. Notably, no solution regeneration is required, minimizing computational overhead.
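A hypothetical sketch of this pipeline follows; the prompt wordings, dictionary keys, and the `llm` callable are illustrative assumptions, not the paper's actual instructions:

```python
# Illustrative MQR-style reformulation; prompts are assumptions, not the
# paper's fixed instructions.
ASPECT_PROMPTS = {
    "background": ("Rewrite the question, embedding a short unrelated "
                   "backstory that adds lexical noise."),
    "term": ("Rewrite the question, defining a new symbol or concept that "
             "abstracts one of its quantities."),
    "sub_problem": ("Rewrite the question so that a key intermediate "
                    "quantity must first be derived as a sub-problem."),
}

def reformulate(question: str, gold_answer: str, llm) -> dict:
    """Generate one harder variant per aspect; the gold answer stays fixed,
    so no solution regeneration is needed."""
    variants = {}
    for aspect, instruction in ASPECT_PROMPTS.items():
        prompt = (f"{instruction}\nThe final answer must remain exactly "
                  f"{gold_answer!r}.\n\nQuestion: {question}")
        variants[aspect] = llm(prompt)
    return variants
```

Any LLM client can be dropped in as the `llm` callable, since the sketch only assumes a string-to-string interface.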
4. Integration and Synergistic Optimization
MathForge constructs a training loop in which original and MQR-augmented questions are interleaved within RLVR batches. DGPO's update mechanics, which equalize per-question contributions and further up-weight harder, MQR-derived instances, ensure these challenging samples have a pronounced impact on policy improvement. Empirically, such integrated training produces lower training-set accuracy (reflecting the increased problem difficulty) while elevating generalization and test accuracy ("train harder, test better").
DGPO’s valid token-level gradient averaging and batchwise question weighting sustain stable optimization dynamics, as confirmed through convergence plots showing widening accuracy gaps relative to GRPO baselines on both original and enhanced datasets (Dai et al., 28 Jan 2026).
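The interleaving step can be sketched as a simple batch builder; the mixing scheme (all variants pooled with originals, then shuffled uniformly) is an assumption:

```python
import random

def build_rlvr_batch(originals, mqr_variants, batch_size=8, seed=0):
    """Mix original questions with their MQR variants into one RLVR batch.

    originals    : list of original question strings
    mqr_variants : dict mapping each original to its harder variants
    """
    pool = list(originals)
    for q in originals:
        pool.extend(mqr_variants.get(q, []))
    rng = random.Random(seed)  # seeded for reproducible batch composition
    rng.shuffle(pool)
    return pool[:batch_size]
```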
5. Empirical Evaluation and Benchmarking
Extensive experiments span both text-only and multimodal mathematical reasoning tasks:
| Setting | Baseline | +GRPO | +DGPO | +MQR (with GRPO) | MathForge (DGPO+MQR) |
|---|---|---|---|---|---|
| Qwen2.5-Math 7B (MATH) | 22.04% | 37.61% | 39.79% | 41.04% | 42.17% |
Key details:
- Datasets: MATH (training), AIME 24/25, AMC 23, MATH500, Minerva, Olympiad (zero-shot evaluation), GEOQA-8k (multimodal)
- Models: Qwen2.5-Math 7B (main), Qwen2.5-Math 1.5B, Qwen2.5-3B, DeepSeek-Math 7B, Qwen2.5-VL-3B-Instruct.
- Metrics: Pass@1 accuracy, averaged over up to 32 runs per task
- Multimodal: For GEOQA-8k, DGPO increased accuracy from 57.43% (GRPO) to 59.95%.
- DGPO and MQR effects are additive; combining both yields the highest performance (Dai et al., 28 Jan 2026).
6. Analysis, Limitations, and Future Extensions
Harder questions facilitate the exposure and repair of partial knowledge gaps, delivering richer policy gradients per correct instance and promoting concise logical chains. MathForge’s gains rely on the consistent, answer-invariant reformulation ability of the underlying LLM; quality may deteriorate if reformulator models underperform relative to the complexity of the original data. Excessively high temperature or aggressive difficulty weighting may under-serve medium-difficulty questions.
Further, the computational budget required for MQR (≈$184 to process 22.5k prompts at current commercial rates) and the dependence on prompt-engineered LLMs are pertinent resource considerations.
Noted future directions include:
- Alternative, model-agnostic difficulty metrics (e.g., proof length, solution tree depth)
- Dynamic adjustment (curriculum learning) of difficulty weights
- Application of MQR to other scientific domains and modalities
- Combination with self-play or variational question generation for novel, hard question synthesis
This suggests MathForge’s methodology is extensible across domains, contingent on the existence of verifiable reward signals and effective question reformulation pipelines (Dai et al., 28 Jan 2026).