MathForge: Enhancing Math Reasoning
- MathForge is a dual-framework approach that enhances LLM mathematical reasoning by prioritizing hard but solvable questions through integrated data augmentation and optimization.
- Its Difficulty-Aware Group Policy Optimization (DGPO) uses mean-absolute-deviation to balance gradient updates, ensuring consistent learning from challenging instances.
- Multi-Aspect Question Reformulation (MQR) systematically generates intrinsically harder question variants, boosting model generalization on rigorous mathematical benchmarks.
MathForge is a dual-framework approach for augmenting and optimizing mathematical reasoning in LLMs by targeting and prioritizing harder problems throughout both the data augmentation and learning processes. MathForge synergistically combines Difficulty-Aware Group Policy Optimization (DGPO), which addresses core algorithmic imbalances in reinforcement learning with verifiable rewards (RLVR), and Multi-Aspect Question Reformulation (MQR), which systematically expands the pool of challenging mathematical questions while preserving answer correctness. The methodology demonstrably improves generalization and performance on mathematical benchmarks by refocusing learning dynamics and dataset composition toward "hard but solvable" centers of capability (Dai et al., 28 Jan 2026).
1. Motivation and Theoretical Foundations
MathForge arises from the observation that prevailing RLVR pipelines, most notably those employing Group Relative Policy Optimization (GRPO), tend to underemphasize intrinsically hard questions in both data and optimization. GRPO, which organizes responses into groups per question and normalizes rewards by their standard deviation, induces maximal update strength at intermediate difficulty (reward accuracy $p = 0.5$) and vanishing update strength at the extremes ($p \to 0$ or $p \to 1$). This skews gradient flow away from the domains that probe model weaknesses. Concurrently, augmentation strategies have primarily diversified data via paraphrasing without systematically increasing intrinsic, rather than merely superficial, difficulty.
Mathematically, the uncapped update magnitude in GRPO scales as $2G\sqrt{p(1-p)}$, where $G$ is the group size and $p$ the question accuracy, so update mass is sharply attenuated on both trivially easy and genuinely challenging problems. MathForge postulates that learning is maximized on hard-but-solvable queries, those with at least one correct answer in the group, which are rare under both standard curricula and reward structures (Dai et al., 28 Jan 2026).
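This attenuation can be illustrated with a minimal numeric sketch, assuming binary rewards and the expected-value form of the per-question update magnitude (the group size $G = 8$ is an arbitrary choice):

```python
import numpy as np

def grpo_update_mass(p, G=8):
    """Expected total |advantage| mass per question under GRPO's std normalization.

    For binary rewards with accuracy p, the expected sum of absolute
    deviations is 2*G*p*(1-p) and the std is sqrt(p*(1-p)), so the
    normalized mass is 2*G*sqrt(p*(1-p)): maximal at p=0.5, zero at extremes.
    """
    return 2 * G * np.sqrt(p * (1 - p))

for p in [0.05, 0.5, 0.95]:
    print(f"p={p:.2f}  update mass={grpo_update_mass(p):.3f}")
```

At $p = 0.5$ the mass equals $G$; at $p = 0.05$ or $p = 0.95$ it falls to roughly $0.44\,G$, and it vanishes entirely at $p \in \{0, 1\}$.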
2. Difficulty-Aware Group Policy Optimization (DGPO)
DGPO is engineered to eliminate the gradient suppression endemic to GRPO and directly incentivize learning from hard instances. It comprises two core mechanisms:
Difficulty-Balanced Group Advantage Estimation (DGAE):
Replaces GRPO's standard deviation normalization with mean-absolute-deviation (MAD):
- Group advantage becomes $A_i = \dfrac{r_i - \bar r}{D}$, with $\bar r = \frac{1}{G}\sum_{j=1}^{G} r_j$ and $D = \frac{1}{G}\sum_{j=1}^{G} |r_j - \bar r|$.
- The resulting sum $\sum_{i=1}^{G} |A_i| = G$ is invariant under the question accuracy $p \in (0,1)$, guaranteeing equalized per-question update magnitude regardless of empirical difficulty.
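A minimal sketch of MAD-normalized advantages (illustrative group size $G = 8$; zeroing out degenerate all-correct or all-wrong groups is an assumed handling choice):

```python
import numpy as np

def dgae_advantages(rewards):
    """Difficulty-balanced advantages: MAD normalization instead of std."""
    r = np.asarray(rewards, dtype=float)
    mad = np.mean(np.abs(r - r.mean()))
    if mad == 0:  # all-correct or all-wrong group carries no signal
        return np.zeros_like(r)
    return (r - r.mean()) / mad

easy = dgae_advantages([1, 1, 1, 1, 1, 1, 1, 0])  # accuracy 7/8
hard = dgae_advantages([1, 0, 0, 0, 0, 0, 0, 0])  # accuracy 1/8
print(np.abs(easy).sum(), np.abs(hard).sum())     # both equal G = 8
```

Both groups contribute identical total update mass, unlike std normalization, under which the magnitudes would differ with accuracy.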
Difficulty-Aware Question-Level Weighting (DQW):
In a batch of valid questions $\{q_b\}_{b=1}^{B}$, each receives a difficulty score $d_b = 1 - \bar r_b$ and weight
- $w_b = \dfrac{\exp(d_b/\tau)}{\sum_{b'=1}^{B} \exp(d_{b'}/\tau)}$, with temperature $\tau$.
- Lower mean reward (i.e., higher intrinsic difficulty) results in exponentially higher weighting.
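The weighting can be sketched as a softmax over difficulty scores; the temperature value $\tau = 0.5$ here is an assumption, not necessarily the paper's setting:

```python
import numpy as np

def dqw_weights(mean_rewards, tau=0.5):
    """Softmax weights over difficulty d_b = 1 - mean reward.

    tau is an assumed temperature; smaller tau concentrates weight
    more aggressively on the hardest questions.
    """
    d = 1.0 - np.asarray(mean_rewards, dtype=float)
    z = np.exp(d / tau)
    return z / z.sum()

w = dqw_weights([0.9, 0.5, 0.1])  # easy, medium, hard question
print(w)  # the hardest question receives the largest weight
```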
Full DGPO Loss:
The batchwise objective is

$$\mathcal{J}_{\mathrm{DGPO}}(\theta) = \mathbb{E}\left[\sum_{b=1}^{B} w_b \cdot \frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \min\!\Big(\rho_{i,t} A_i,\; \mathrm{clip}(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon)\, A_i\Big)\right],$$

with $\rho_{i,t} = \dfrac{\pi_\theta(o_{i,t} \mid q_b, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q_b, o_{i,<t})}$ as the token-level importance weight.
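The objective can be sketched numerically, assuming DGAE advantages, DQW weights, and standard PPO-style clipping (the threshold $\epsilon = 0.2$ is an assumed default, and per-response token averaging is a simplification):

```python
import numpy as np

def dgpo_loss(ratios, advantages, weights, eps=0.2):
    """Sketch of the batchwise DGPO surrogate objective (to be maximized).

    ratios[b][i]     : list of token-level importance weights pi_theta/pi_old
    advantages[b][i] : DGAE advantage for response i of question b
    weights[b]       : DQW weight for question b
    """
    total = 0.0
    for b, w in enumerate(weights):
        per_response = []
        for i, A in enumerate(advantages[b]):
            rho = np.asarray(ratios[b][i])
            # clipped policy-gradient surrogate, elementwise over tokens
            surr = np.minimum(rho * A, np.clip(rho, 1 - eps, 1 + eps) * A)
            per_response.append(surr.mean())
        total += w * np.mean(per_response)
    return total
```

With all ratios at 1 and symmetric advantages, the surrogate is zero, matching the zero-mean property of group-relative advantages.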
Empirical Effects:
Ablations indicate DGAE and DQW yield additive improvements in pass@1 accuracy over baseline GRPO, specifically: DGAE +0.94%, DQW +1.14% on the MATH benchmark (Dai et al., 28 Jan 2026).
3. Multi-Aspect Question Reformulation (MQR)
MQR systematically expands the dataset into intrinsically harder variants, targeting three difficulty axes per original question:
| Aspect | Description | Example Modification |
|---|---|---|
| Background | Embed irrelevant context | Add unrelated story/background detail, increasing lexical noise |
| Term | Symbolic abstraction | Define new symbol/concept (e.g., "euro-gap ") to abstract |
| Sub-Problem | Nested reasoning | Prepend/extract a key sub-task, requiring multi-step reasoning |
The reformulation workflow uses a simple prompting pipeline with a writer LLM (e.g., OpenAI o3 or a smaller open-source model), applying three fixed instructions to generate each aspect while constraining the gold answer to remain invariant across all variants. Notably, no solution regeneration is required, minimizing computational overhead.
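A hypothetical sketch of this pipeline follows; the prompt wordings, dictionary keys, and the `llm` callable are illustrative assumptions, not the paper's actual instructions:

```python
# Illustrative MQR-style reformulation; prompts are assumptions, not the
# paper's fixed instructions.
ASPECT_PROMPTS = {
    "background": ("Rewrite the question, embedding a short unrelated "
                   "backstory that adds lexical noise."),
    "term": ("Rewrite the question, defining a new symbol or concept that "
             "abstracts one of its quantities."),
    "sub_problem": ("Rewrite the question so that a key intermediate "
                    "quantity must first be derived as a sub-problem."),
}

def reformulate(question: str, gold_answer: str, llm) -> dict:
    """Generate one harder variant per aspect; the gold answer stays fixed,
    so no solution regeneration is needed."""
    variants = {}
    for aspect, instruction in ASPECT_PROMPTS.items():
        prompt = (f"{instruction}\nThe final answer must remain exactly "
                  f"{gold_answer!r}.\n\nQuestion: {question}")
        variants[aspect] = llm(prompt)
    return variants
```

Any LLM client can be dropped in as the `llm` callable, since the sketch only assumes a string-to-string interface.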
4. Integration and Synergistic Optimization
MathForge constructs a training loop in which original and MQR-augmented questions are interleaved within RLVR batches. DGPO's update mechanics, which equalize per-question contributions and further up-weight harder, MQR-derived instances, ensure these challenging samples have a pronounced impact on policy improvement. Empirically, such integrated training produces lower training-set accuracy (reflecting the increased problem difficulty) while elevating generalization and test accuracy ("train harder, test better").
DGPO’s valid token-level gradient averaging and batchwise question weighting sustain stable optimization dynamics, as confirmed through convergence plots showing widening accuracy gaps relative to GRPO baselines on both original and enhanced datasets (Dai et al., 28 Jan 2026).
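The interleaving step can be sketched as a simple batch builder; the mixing scheme (all variants pooled with originals, then shuffled uniformly) is an assumption:

```python
import random

def build_rlvr_batch(originals, mqr_variants, batch_size=8, seed=0):
    """Mix original questions with their MQR variants into one RLVR batch.

    originals    : list of original question strings
    mqr_variants : dict mapping each original to its harder variants
    """
    pool = list(originals)
    for q in originals:
        pool.extend(mqr_variants.get(q, []))
    rng = random.Random(seed)  # seeded for reproducible batch composition
    rng.shuffle(pool)
    return pool[:batch_size]
```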
5. Empirical Evaluation and Benchmarking
Extensive experiments span both text-only and multimodal mathematical reasoning tasks:
| Setting | Baseline | +GRPO | +DGPO | +MQR (with GRPO) | MathForge (DGPO+MQR) |
|---|---|---|---|---|---|
| Qwen2.5-Math 7B (MATH) | 22.04% | 37.61% | 39.79% | 41.04% | 42.17% |
Key details:
- Datasets: MATH (training), AIME 24/25, AMC 23, MATH500, Minerva, Olympiad (zero-shot evaluation), GEOQA-8k (multimodal)
- Models: Qwen2.5-Math 7B (main), Qwen2.5-Math 1.5B, Qwen2.5-3B, DeepSeek-Math 7B, Qwen2.5-VL-3B-Instruct.
- Metrics: Pass@1 accuracy, averaged over up to 32 runs per task
- Multimodal: For GEOQA-8k, DGPO increased accuracy from 57.43% (GRPO) to 59.95%.
- DGPO and MQR effects are additive; combining both yields the highest performance (Dai et al., 28 Jan 2026).
6. Analysis, Limitations, and Future Extensions
Harder questions facilitate the exposure and repair of partial knowledge gaps, delivering richer policy gradients per correct instance and promoting concise logical chains. MathForge’s gains rely on the consistent, answer-invariant reformulation ability of the underlying LLM; quality may deteriorate if reformulator models underperform relative to the complexity of the original data. Excessively high temperature or aggressive difficulty weighting may under-serve medium-difficulty questions.
Further, the computational budget required for MQR (≈$184 to process 22.5k prompts at current commercial rates) and the dependence on prompt-engineered LLMs are pertinent resource considerations.
Noted future directions include:
- Alternative, model-agnostic difficulty metrics (e.g., proof length, solution tree depth)
- Dynamic adjustment (curriculum learning) of difficulty weights
- Application of MQR to other scientific domains and modalities
- Combination with self-play or variational question generation for novel, hard question synthesis
This suggests MathForge’s methodology is extensible across domains, contingent on the existence of verifiable reward signals and effective question reformulation pipelines (Dai et al., 28 Jan 2026).