
Qwen2.5-Math: Mathematical Reasoning LLMs

Updated 3 May 2026
  • Qwen2.5-Math is a suite of math-specialized large language models built on the Qwen2.5 architecture, using synthetic data and reward-guided fine-tuning to enhance mathematical reasoning.
  • They support advanced chain-of-thought and tool-integrated reasoning in both English and Chinese, achieving high accuracy on benchmarks like GSM8K, MATH, and GaoKao.
  • Iterative self-improvement with reinforcement learning and critique fine-tuning boosts robustness, while in-situ correction techniques enable efficient low-bit deployment.

Qwen2.5-Math denotes a series of math-specialized LLMs built on the Qwen2.5 backbone and developed through a pipeline of synthetic data generation, iterative reward-model-guided supervised fine-tuning, and reinforcement learning. Designed as mathematical reasoning experts, Qwen2.5-Math variants exist at multiple parameter scales (1.5B, 7B, and 72B, with 32B bases used in derivative work) and support advanced chain-of-thought and tool-integrated reasoning in both English and Chinese. These models have set strong benchmarks across GSM8K, MATH, MMLU-STEM, AIME, GaoKao, and other competition-grade math evaluation sets. Qwen2.5-Math development involved comprehensive analyses of data curation, quantization robustness, reward modeling, and the use of critique-based training. The series has also played a central role in exploring issues of data contamination and the limits of RL-induced reasoning gains in LLMs.

1. Model Architecture and Training Pipeline

All Qwen2.5-Math models inherit the generic Qwen2.5 architecture: a pre-norm transformer with RMSNorm, rotary positional embeddings (RoPE), and SwiGLU MLP activations. Core differences from the base Qwen2.5 models lie in data composition, choice of fine-tuning, and reward-guided learning. Parameter scales include 1.5B (28 layers, 1536 hidden), 7B (28 layers, 3584 hidden), and 72B (80 layers, 8192 hidden) (Yang et al., 2024).
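
For reference, the instruct variants expose the standard Hugging Face chat interface. A minimal CoT-mode invocation sketch follows; the hub ID and boxed-answer system prompt follow the public model card, but treat the details as assumptions rather than authoritative usage:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hub ID as published on the Hugging Face model card (assumed here).
model_id = "Qwen/Qwen2.5-Math-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# The model card's CoT-mode system prompt asks for a \boxed{} final answer.
messages = [
    {"role": "system", "content": "Please reason step by step, and put your final answer within \\boxed{}."},
    {"role": "user", "content": "Find the value of x that satisfies 2x + 3 = 11."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```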

The training protocol is an extended, multi-stage self-improvement loop:

  1. Synthetic Pretraining: Math questions and CoT solutions are extracted and evolved from high-quality sources via Qwen-72B-Instruct and MuggleMath. Corpora exceeded 1T tokens, with substantial bilingual and synthetic data (e.g., 580K English CoT, 500K Chinese CoT, 190K annotated tool-integrated problems).
  2. Reward Model (RM) Construction: For each query, multiple reasoning candidates are sampled. RMs are trained to score and rank outputs with a listwise loss:

$$\mathcal{L}_{\mathrm{RM}}(\phi) = -\frac{1}{k(6-k)}\,\mathbb{E}_{(x,\,y_{\mathrm{pos}},\,y_{\mathrm{neg}})\sim D}\Big[\log\sigma\big(r_\phi(x, y_{\mathrm{pos}}) - r_\phi(x, y_{\mathrm{neg}})\big)\Big]$$

where $k$ is the number of positive responses among the six samples.
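
A minimal PyTorch sketch of this objective, assuming six sampled responses per query with binary correctness labels (names and shapes are illustrative, not from the Qwen2.5-Math codebase):

```python
import torch

def rm_listwise_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Listwise RM loss over a group of sampled responses.

    scores: (G,) reward-model scalars r_phi(x, y_i) for G responses to one query.
    labels: (G,) 1 for correct (positive) responses, 0 for incorrect ones.
    Averages -log sigmoid(r_pos - r_neg) over all k * (G - k) pos/neg pairs,
    matching the 1 / (k(6-k)) normalization above for G = 6.
    """
    pos = scores[labels == 1]          # (k,)
    neg = scores[labels == 0]          # (G - k,)
    if pos.numel() == 0 or neg.numel() == 0:
        return scores.new_zeros(())    # no usable pairs for this query
    margins = pos.unsqueeze(1) - neg.unsqueeze(0)   # (k, G - k) pairwise margins
    return -torch.nn.functional.logsigmoid(margins).mean()

# Toy usage: six sampled responses, two of them correct (k = 2).
scores = torch.randn(6, requires_grad=True)
labels = torch.tensor([1, 0, 0, 1, 0, 0])
loss = rm_listwise_loss(scores, labels)
loss.backward()
```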

  3. Iterative Self-Improvement: SFT data is prioritized using RM scores, with the SFT model and RM retrained alternately across several rounds.
  4. Reinforcement Learning (RL) Fine-Tuning: Final SFT models undergo RL via Group Relative Policy Optimization (GRPO) (Yang et al., 2024).

After training, the final RM guides decoding at inference (e.g., best-of-N RM reranking of sampled solutions), further improving performance on math reasoning tasks.
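
A minimal sketch of best-of-N RM reranking, with `generate` and `reward_model` as hypothetical stand-ins for the sampler and the scalar reward model:

```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str, int], List[str]],
              reward_model: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidate solutions and return the one the RM scores highest."""
    candidates = generate(prompt, n)                       # n sampled CoT solutions
    scores = [reward_model(prompt, c) for c in candidates]
    return max(zip(scores, candidates), key=lambda t: t[0])[1]
```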

2. Mathematical Reasoning and Benchmark Performance

Qwen2.5-Math variants are evaluated using few-shot and zero-shot chain-of-thought prompting, as well as tool-integrated reasoning (code block execution). They have been benchmarked on GSM8K, MATH, Minerva, GaoKao, OlympiadBench, CollegeMath, AMC, AIME, and MMLU-STEM (Yang et al., 2024, Liu et al., 2024). Representative results (CoT greedy, RM@8, and TIR@8) for the instruct variants include:

| Model | GSM8K | MATH | AIME24 | OlympiadBench | MMLU-STEM | Avg. |
|---|---|---|---|---|---|---|
| Qwen2.5-Math-7B-Instruct (TIR) | 91.6% | 55.4% | 21/30 | 35/40 | ~69% | – |
| Qwen2.5-Math-72B-Instruct | 90.8% | 66.8% | 19/30 | 39/40 | 80.8% | 71.9% |
| AceMath-72B-Instruct | 96.4% | 86.1% | – | 48.4% | 85.4% | 71.8% |

(– indicates values not reported.)

On university-level evaluations such as U-MATH (1,100 problems across six core subjects), Qwen2.5-Math-72B achieved 59.0% accuracy on text problems and 10.5% on visual (image-embedded) problems, trailing multimodal models like Gemini-1.5-Pro (63.4% text, 45.0% visual) (Chernyshev et al., 2024).
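
The TIR mode interleaves natural-language reasoning with executable code blocks whose outputs are fed back into the context. A minimal sketch of the execution step (the block-extraction convention here is an assumption, and real deployments sandbox the interpreter):

```python
import re
import io
import contextlib

FENCE = "`" * 3  # markdown code-fence marker

def run_tool_block(model_output: str) -> str:
    """Execute the first fenced python block in a TIR-style completion and
    return its captured stdout, to be appended to the model's context."""
    pattern = FENCE + r"python\n(.*?)" + FENCE
    match = re.search(pattern, model_output, re.DOTALL)
    if match is None:
        return ""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(match.group(1), {})   # sandboxing omitted in this sketch
    return buffer.getvalue()

# Toy usage: the model offloads arithmetic to code instead of doing it in-text.
completion = f"Let me compute this.\n{FENCE}python\nprint(17 * 24 + 3)\n{FENCE}"
print(run_tool_block(completion))  # -> 411
```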

3. Data Attribution, Influence Functions, and Data Curation

Analysis of Qwen2.5-Math’s reasoning proficiency revealed high sensitivity to the attributes of SFT data. Influence-based attribution (Infra) quantified per-example, per-sequence, and per-token impact on downstream accuracy using:

$$I_f(z_m) = -\nabla_\theta f(\theta)^{\mathsf{T}}\, H^{-1}\, \nabla_\theta \mathcal{L}(z_m, \theta)$$

where $f$ is the target metric (e.g., validation loss), $H$ is the Hessian of the training loss at $\theta$, and $\mathcal{L}(z_m, \theta)$ is the loss on training example $z_m$.
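
As an illustration, a toy version with an explicit Hessian, which is feasible only for tiny models (Infra itself relies on scalable approximations; the linear-regression setup and damping term here are assumptions for the sketch):

```python
import torch
from torch.autograd.functional import hessian

# Toy setup: linear regression. f = validation loss, L = per-example train loss.
X_tr, y_tr = torch.randn(32, 4), torch.randn(32)
X_va, y_va = torch.randn(8, 4), torch.randn(8)
theta = torch.zeros(4, requires_grad=True)

train_loss = lambda th: ((X_tr @ th - y_tr) ** 2).mean()
val_loss = lambda th: ((X_va @ th - y_va) ** 2).mean()

def influence(z_x: torch.Tensor, z_y: torch.Tensor) -> float:
    """I_f(z) = -grad f(theta)^T H^{-1} grad L(z, theta), per the formula above.
    A positive value predicts that upweighting z reduces the target loss f."""
    H = hessian(train_loss, theta.detach())                          # (4, 4)
    g_f = torch.autograd.grad(val_loss(theta), theta)[0]             # grad of target metric
    g_z = torch.autograd.grad((z_x @ theta - z_y) ** 2, theta)[0]    # per-example grad
    damped = H + 1e-3 * torch.eye(4)                                 # damping for invertibility
    return -(g_f @ torch.linalg.solve(damped, g_z)).item()

print(influence(X_tr[0], y_tr[0]))
```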

This framework demonstrated:

  • High-difficulty math examples yield the largest positive influence on both mathematical and code reasoning.
  • Low-difficulty code tasks mainly benefit code performance.
  • Sequence-level “exploration” behaviors (e.g. generating alternative solutions) are the most beneficial reasoning steps for downstream robustness; ablating these drops accuracy.
  • At the token-level, influential tokens for math are natural language logic connectors (“Wait”, “However”, “Therefore”); for code, structural and syntax markers dominate.

Applying these insights, a “difficulty-flip” reweighting—replacing easy math and hard code with hard math and easy code—doubled AIME24 accuracy from 10% to 20% and improved LiveCodeBench from 33.8% to 35.3% (Kou et al., 26 May 2025).
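
A schematic of such a reweighting pass, modeled here as a hard filter over hypothetical `domain` and `difficulty` tags (the paper's actual scheme may reweight rather than drop examples):

```python
from typing import Dict, List

def difficulty_flip(dataset: List[Dict]) -> List[Dict]:
    """Keep hard math and easy code; drop easy math and hard code."""
    keep = []
    for ex in dataset:
        if ex["domain"] == "math" and ex["difficulty"] == "hard":
            keep.append(ex)
        elif ex["domain"] == "code" and ex["difficulty"] == "easy":
            keep.append(ex)
    return keep
```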

4. Reinforcement Learning, Critique Fine-Tuning, and Tool Use

Beyond SFT and RM-guided scoring, Qwen2.5-Math has been exposed to several advanced post-training paradigms:

  • Critique Fine-Tuning (CFT): Rather than imitating correct outputs, CFT teaches models to critique noisy solutions. Using 50K GPT-4o-generated critiques, Qwen2.5-Math-7B with CFT surpasses SFT by 4–10% on mathematical benchmarks, matching or exceeding Qwen2.5-Math-Instruct (2.5M SFT) and AceMath (2.3M SFT) despite using ≈2% as much data. Ablations confirm CFT’s robustness to noise and its positive transfer to STEM and instruction-following tasks (Wang et al., 29 Jan 2025).
  • RL Algorithms: Group Relative Policy Optimization (GRPO) is used for RL with reward shaping that combines the RM score with a verifier's judgement. Best-of-N sampling and RM reranking at inference are standard; chain-of-thought with tool-integrated reasoning (e.g., code block execution) is specifically supported to correct arithmetic or computational errors (Yang et al., 2024). A minimal sketch of GRPO's group-relative advantage computation appears after this list.
  • MCTS and Process Preference Models: rStar-Math, building on Qwen2.5-Math-7B as a policy SLM, employs Monte Carlo Tree Search guided by a process preference model. Verified code-augmented CoT rollouts and multiple rounds of self-evolution elevate MATH accuracy (pass@1) from 58.8% to 90%, surpassing o1-preview and matching much larger closed models (Guan et al., 8 Jan 2025).
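
As referenced above, GRPO replaces a learned value function with group-relative baselines. A minimal sketch of the advantage computation at its core (the full algorithm adds the clipped policy-ratio objective and a KL penalty; shapes are illustrative):

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: standardize rewards within each group of G
    responses sampled for the same query.

    rewards: (B, G) scalar rewards for G samples per each of B queries.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

# Toy usage: 2 queries, 4 sampled solutions each.
r = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                  [0.2, 0.9, 0.4, 0.6]])
print(grpo_advantages(r))
```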

5. Quantization Robustness and In-situ Correction

Low-bit quantization (AWQ, GPTQ, SmoothQuant; 4–8 bits) can degrade Qwen2.5-Math's complex math accuracy by as much as 70%, especially for smaller models. The largest losses occur in multi-step chains where procedural (execution/method) errors dominate. However, a lightweight in-situ Direct Preference Optimization (DPO) fine-tuning using a compact "Silver-Bullet" set (332 triples) suffices to restore or surpass full-precision MATH accuracy in under five minutes on a single A100-40GB GPU (Li et al., 16 May 2025). This shows the feasibility of practical low-bit deployment with targeted post-quantization correction.
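
The correction step uses the standard DPO objective. A minimal sketch over preference triples, with β and the summed sequence log-probabilities as the usual DPO quantities (the "Silver-Bullet" selection itself is not reproduced here):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over preference triples (prompt, chosen, rejected).

    Each tensor holds summed sequence log-probs under the trained (here,
    quantized) policy or the frozen full-precision reference model.
    """
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```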

6. Benchmark Contamination, RL Misconceptions, and Evaluation Protocols

Analysis has revealed substantial data contamination in benchmarks like MATH-500, AMC, and AIME for Qwen2.5-Math, with partial-prompt exact match rates indicating memorization rather than genuine reasoning:

  • For Qwen2.5-Math-7B, EM(0.6)=54.6%, PA(0.6)=53.6% (r=60% of prompt). Comparable rates for Llama3.1-8B stay below 5% (Wu et al., 14 Jul 2025). A minimal partial-prompt probe is sketched after this list.
  • RL fine-tuning under spurious reward signals artificially boosts accuracy on contaminated benchmarks via output memorization.
  • Clean synthetic benchmarks (e.g., RandomCalculation) reveal that only correct, verifiable rewards drive stable accuracy improvements; random or inverted reward signals fail to improve math reasoning.
  • Recommendations include routine leakage audits, favoring clean/synthetic benchmarks, and cross-family evaluation for RL research on LLM mathematical reasoning (Wu et al., 14 Jul 2025).
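
A minimal version of such a probe, assuming a hypothetical greedy-decoding wrapper `model_complete` and character-level truncation (the paper operates on tokens):

```python
def partial_prompt_em(model_complete, benchmark, r: float = 0.6) -> float:
    """Partial-prompt exact match: feed the first r fraction of each benchmark
    prompt and check whether greedy decoding reproduces the held-out remainder.
    High rates suggest memorization rather than reasoning."""
    hits = 0
    for prompt in benchmark:
        cut = int(len(prompt) * r)
        prefix, truth = prompt[:cut], prompt[cut:]
        completion = model_complete(prefix, max_new_tokens=len(truth))
        hits += completion.startswith(truth)
    return hits / len(benchmark)
```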

7. Comparative Models and Extensions

  • AceMath-72B-Instruct surpasses Qwen2.5-Math-72B-Instruct by 3.7 points on average across seven math benchmarks. Key differences include a two-stage general SFT before math-only SFT, curated synthetic prompts, and high-quality RM design (Liu et al., 2024).
  • PCL-Reasoner-V1.5 (Qwen2.5-32B base) utilizes an efficient offline RL procedure, reaching 90.9% on AIME-2024 and 85.6% on AIME-2025, and demonstrates improved stability relative to online RL (GRPO) (Lu et al., 21 Jan 2026).
  • Small-scale variants (e.g., Qwen2.5-0.5B) can achieve strong performance on arithmetic reasoning (Countdown), especially with DPO or RLOO (DeBERTa reward model) fine-tuning coupled with inference-time best-of-N sampling and external verification (Han et al., 11 Jun 2025).
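
For the Countdown task, external verification reduces to checking that a candidate arithmetic expression hits the target using exactly the given numbers. A minimal verifier sketch (the sampler is omitted; the candidate expressions are illustrative):

```python
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def countdown_verify(expr: str, numbers: list, target: int) -> bool:
    """Check a candidate Countdown solution: the expression must evaluate to
    the target and use exactly the given numbers, each once."""
    tree = ast.parse(expr, mode="eval")
    used = []

    def ev(node):
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant):
            used.append(node.value)
            return node.value
        raise ValueError("disallowed syntax")

    value = ev(tree.body)
    return value == target and sorted(used) == sorted(numbers)

# Best-of-N with external verification: keep only candidates the verifier accepts.
candidates = ["(25 - 5) * 4", "25 * 4 - 5"]
print([c for c in candidates if countdown_verify(c, [25, 5, 4], 80)])
```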

Qwen2.5-Math exemplifies a modern paradigm in domain-specialized LLM construction: iterative synthetic data curation, RM- and RL-driven self-improvement, tool-augmented reasoning, and fine-grained data and attribution analysis. While attaining state-of-the-art results on numerous benchmarks, it has also served as a crucial lens for examining training and evaluation pitfalls, highlighting both the promise and the complexity of LLMs as mathematical reasoners.
