Qwen-2.5-Math Models: Advanced Math Reasoning
- Qwen-2.5-Math Models are mathematics-specialized large language models designed for advanced, multilingual reasoning from grade-school to competition levels.
- They integrate domain-targeted pre-training, chain-of-thought supervised fine-tuning, and reinforcement learning via reward models to optimize long-chain mathematical reasoning.
- Innovative architectural enhancements, low-rank distillation, and tool-integrated reasoning techniques boost both accuracy and inference efficiency.
Qwen-2.5-Math models are a family of mathematics-specialized LLMs developed within the Qwen2.5 series that advance state-of-the-art performance on mathematical reasoning across English and Chinese benchmarks spanning grade-school to math competition level. These models implement domain-targeted pre-training, iterative reward model–supervised fine-tuning, reinforcement learning, and reward-guided inference strategies. Several architectural, training, and inference improvements underlie Qwen-2.5-Math’s strong results, as well as practical tools and low-rank distillation methods that address the unique computational and evaluation demands of long-chain mathematical reasoning.
1. Model Architecture and Technical Innovations
Qwen-2.5-Math is built on the decoder-only Transformer architecture family utilized in Qwen2.5 models, incorporating pre-layer normalization, rotary position embeddings (RoPE), GELU activations, and optimized attention mechanisms such as flash and paged attention for high-throughput inference.
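As background on the rotary position embeddings (RoPE) mentioned above, here is a minimal PyTorch sketch of the rotate-half formulation used by most open decoder models; the tensor shapes and base frequency are illustrative and not the exact Qwen2.5-Math values:

```python
import torch

def rotary_embed(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (seq_len, n_heads, head_dim)."""
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # One inverse frequency per rotated pair of dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) coordinate pair by its position- and frequency-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(16, 8, 64)      # (sequence length, heads, head dimension)
q_rotated = rotary_embed(q)     # same shape, with position information mixed into q
```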
Three principal model sizes are released:
- 1.5B parameters (≈24 layers, hidden size ≈2048, 32 attention heads)
- 7B parameters (≈32 layers, hidden size ≈4096, 32 heads)
- 72B parameters (≈64 layers, hidden size ≈8192, 64 heads)
Compared to non-specialist Qwen2.5 models, the Qwen2.5-Math series is distinguished by:
- Extended pre-training on a large, math-enriched corpus (see Section 2)
- Context length enlarged to 4K tokens (enabling longer reasoning chains)
- Integration of externally trained reward model (RM) signals during supervised fine-tuning (SFT), reinforcement learning (GRPO-based policy optimization), and inference-time reranking
In earlier work, math-specialist models based on Qwen-Chat and Code-Qwen architectures used untied input/output embeddings, RoPE with an FP32 inverse-frequency matrix, RMSNorm pre-normalization, SwiGLU activation, an expanded FFN dimension, and FlashAttention. Self-attention and FFN operations follow standard formulations.
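A minimal PyTorch sketch of such a SwiGLU feedforward block (layer names and sizes are illustrative, not taken from the Qwen codebase):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """SwiGLU feedforward block: FFN(x) = W_down( SiLU(W_gate x) * W_up x )."""
    def __init__(self, hidden_size: int, ffn_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, ffn_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, ffn_size, bias=False)
        self.down_proj = nn.Linear(ffn_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gated activation: SiLU-activated gate modulates the up projection.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```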
2. Data Generation, Pre-training, and Mathematical Supervision
The pre-training and data construction pipeline for Qwen2.5-Math leverages Qwen Math Corpus v2, a >1T token resource combining:
- High-quality, math-focused web data curated by FastText classification and LM-based filtering (see the sketch after this list)
- Encyclopedic resources, exam archives, and code samples
- Massive-scale synthetic Q&A pairs in both English and Chinese, generated by Qwen2-Math-Instruct to augment low-resource domains
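A minimal sketch of the FastText-based filtering referenced in the first bullet above, assuming a hypothetical pre-trained classifier file (math_classifier.bin) and label (__label__math) rather than the actual released pipeline:

```python
import fasttext

# Hypothetical classifier trained to distinguish math-rich text from general web text.
classifier = fasttext.load_model("math_classifier.bin")

def keep_document(text: str, threshold: float = 0.9) -> bool:
    """Keep a document if the classifier assigns high probability to the math label."""
    labels, probs = classifier.predict(text.replace("\n", " "))
    return labels[0] == "__label__math" and probs[0] >= threshold

corpus = ["Let x satisfy x^2 - 5x + 6 = 0. Then ...", "Top 10 travel destinations ..."]
math_docs = [doc for doc in corpus if keep_document(doc)]
```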
Pre-training is initialized from generalist Qwen2.5 checkpoints and proceeds with autoregressive next-token loss on mixed mathematical and general text. This phase dramatically improves generation of LaTeX-rich, multi-step mathematical discourse.
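The autoregressive objective referenced here is the standard next-token cross-entropy loss over token sequences $x_{1:T}$ drawn from the mixed corpus:

$$\mathcal{L}_{\text{pretrain}}(\theta) = -\,\mathbb{E}_{x \sim \mathcal{D}}\left[\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right)\right]$$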
Supervised fine-tuning (SFT) is performed on curated data using chain-of-thought and tool-integrated reasoning targets. The SFT optimizer is AdamW under a cosine-annealed learning-rate schedule, and the sequence length during SFT is set to 1024.
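A minimal PyTorch sketch of this optimizer configuration; the betas, epsilon, weight decay, and peak learning rate below are placeholders, not the values reported for Qwen2.5-Math:

```python
import torch

model = torch.nn.Linear(4096, 4096)  # stand-in for the SFT model
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-5,              # placeholder peak learning rate
    betas=(0.9, 0.95),    # placeholder AdamW betas
    eps=1e-8,             # placeholder epsilon
    weight_decay=0.1,     # placeholder weight decay
)
# Cosine annealing from the peak learning rate down toward a small floor.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000, eta_min=1e-6)
```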
For further post-training, Qwen2.5-Math-Instruct models are subjected to reinforcement learning via Group Relative Policy Optimization (GRPO), using a reward model trained with pairwise and listwise ranking losses. The reward model head replaces the LM head and is trained to rank correct solutions above incorrect ones.
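A minimal sketch of such a pairwise ranking objective in the standard Bradley–Terry form (the exact listwise variant used for the Qwen2.5-Math RM may differ), where $r_\theta(x, y)$ is the scalar RM score for solution $y$ to problem $x$, and $y^{+}$, $y^{-}$ are a correct and an incorrect solution:

$$\mathcal{L}_{\text{RM}}(\theta) = -\,\mathbb{E}_{(x,\,y^{+},\,y^{-})}\left[\log \sigma\!\left(r_\theta(x, y^{+}) - r_\theta(x, y^{-})\right)\right]$$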
Supervised and RL fine-tuning proceeds in an iterative, self-improving loop (see the sketch after this list):
- Generate candidate solutions using the current SFT model.
- Score candidates with the updated RM; filter or select the top-k reasoning traces.
- Fine-tune the next SFT iteration on accepted outputs.
- Re-train RM on freshly generated model responses.
- Repeat until model or RM convergence is achieved.
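A structural Python sketch of this loop; the helpers generate_candidates, rm_score, finetune, and retrain_rm are hypothetical placeholders for the corresponding pipeline stages, not actual Qwen APIs:

```python
def iterative_self_improvement(sft_model, reward_model, prompts, n_iters=3, k=4, n_samples=8):
    """Iterative SFT/RM loop: generate, score, filter, fine-tune, re-train the RM."""
    for _ in range(n_iters):
        accepted = []
        for prompt in prompts:
            # 1. Generate candidate solutions with the current SFT model.
            candidates = generate_candidates(sft_model, prompt, n_samples)    # placeholder helper
            # 2. Score with the current RM and keep the top-k reasoning traces.
            ranked = sorted(candidates, key=lambda y: rm_score(reward_model, prompt, y), reverse=True)
            accepted.extend((prompt, y) for y in ranked[:k])
        # 3. Fine-tune the next SFT iteration on accepted outputs.
        sft_model = finetune(sft_model, accepted)                             # placeholder helper
        # 4. Re-train the RM on freshly generated model responses.
        reward_model = retrain_rm(reward_model, sft_model, prompts)           # placeholder helper
    return sft_model, reward_model
```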
3. Reward Modeling, Noisy Feedback, and Robustness
Qwen-2.5-Math models implement advanced reward model pipelines to guide both training and inference. The RM is initially trained from SFT checkpoints, then iteratively refined as SFT performance increases. During RL, Group Relative Policy Optimization (GRPO) is used; for each question a batch of model responses are generated, rewarded, and the GRPO policy objective is optimized using normalized advantages and KL penalties to a reference policy.
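For reference, the group-relative advantage and the policy objective in the common GRPO formulation (PPO-style clipping omitted for brevity), where $G$ responses $y_1,\dots,y_G$ with rewards $r_1,\dots,r_G$ are sampled per question $x$; the exact Qwen2.5-Math instantiation may differ in details:

$$\hat A_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}, \qquad
\mathcal{J}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G} \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)}\,\hat A_i\right] - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]$$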
Reward shaping during RL combines the dense RM score with a sparse symbolic correctness verifier signal.
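One plausible form for such a shaped reward, squashing the RM score and adding the sparse verifier term (a sketch; the exact shaping function and coefficient used by Qwen2.5-Math are not reproduced here), with $r_m$ the RM score, $r_v \in \{0, 1\}$ the verifier outcome, and $\alpha$ a scaling hyperparameter:

$$r = \sigma(\alpha \cdot r_m) + (r_v - 1)$$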
In related research, the effect of noisy reward signals has been investigated. Experiments with Qwen-2.5-7B demonstrate that RL tolerates flipping a substantial fraction of rewards, converging to MATH-500 accuracy close to that of noiseless RL; only when the reward is flipped completely at random, and thus carries no signal, does training collapse. This suggests substantial robustness to reward noise, provided that a strong pre-trained reasoning backbone is present.
Moreover, “Reasoning Pattern Reward” (RPR), a reward based solely on the presence of chain-of-thought phrases, delivers MATH-500 accuracy comparable to RL with answer verification. RPR scores a response by the occurrence of key reasoning phrases rather than by answer correctness, and can be used as an auxiliary reward to calibrate RM outputs, especially where RM false negatives would penalize valid reasoning.
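A minimal sketch of a reasoning-pattern reward of this kind; the phrase list and normalization are illustrative assumptions, not the exact RPR definition from the cited work:

```python
# Hypothetical chain-of-thought marker phrases; the actual RPR phrase set may differ.
REASONING_PHRASES = ["let's think step by step", "first,", "therefore", "so the answer is"]

def reasoning_pattern_reward(response: str) -> float:
    """Reward in [0, 1] based only on the presence of reasoning phrases, not answer correctness."""
    text = response.lower()
    hits = sum(phrase in text for phrase in REASONING_PHRASES)
    return hits / len(REASONING_PHRASES)

print(reasoning_pattern_reward("Let's think step by step. First, factor the quadratic. So the answer is 3."))
```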
4. Inference Methodologies and Tool-Integrated Reasoning
During inference, Qwen2.5-Math supports reward-guided decoding via RM reranking (RM@N). For each prompt, N candidate samples are drawn (using nucleus or top-k sampling at various temperatures) and scored by the RM; the highest-scoring sample is returned. In pseudocode:
```python
def rm_best_of_n(model, rm, prompt, n=8):
    """RM@N reranking: sample n candidate solutions, return the highest-scoring one."""
    # `sample` denotes the stochastic decoding routine (nucleus / top-k sampling).
    candidates = [sample(model, prompt) for _ in range(n)]   # draw n candidates
    scores = [rm(prompt, y) for y in candidates]             # score each with the RM
    return candidates[scores.index(max(scores))]             # keep the best-scored sample
```
This reranking approach consistently improves pass@1 accuracy by 2–8 percentage points compared to majority voting (Maj@N).
Qwen2.5-Math-Instruct models additionally support both Chain-of-Thought (CoT) and Tool-Integrated Reasoning (TIR) during supervised, RL, and inference stages:
- CoT prompts elicit multistep natural language reasoning.
- TIR prompts interleave natural language reasoning with code execution (e.g., Python code using sympy for symbolic algebra), allowing the model to perform exact calculations and algebraic manipulations as part of its solution (see the sketch below).
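A minimal illustration of a TIR-style tool step using sympy; the prompt format and execution harness implied here are illustrative assumptions, not the actual Qwen2.5-Math TIR protocol:

```python
import sympy as sp

# The model emits a code block like this as an intermediate reasoning step;
# the harness executes it and feeds the printed result back into the context.
x = sp.symbols("x")
roots = sp.solve(sp.Eq(x**2 - 5*x + 6, 0), x)   # exact symbolic solution
print(roots)  # [2, 3] is appended to the model's context before it writes the final answer
```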
Bilingual support is explicitly engineered, with millions of English and Chinese math QA pairs and both hand-annotated and synthesized TIR examples.
5. Scaling Behavior, Efficiency, and Low-Rank Distillation
The Qwen-2.5-Math series demonstrates strong scaling benefits. Larger base models and math-specialist fine-tuning both yield substantial gains: moving from 1.5B to 7B to 72B parameters boosts scores across all benchmarks, with SFT/RL-enhanced models significantly outperforming base checkpoints.
Efficient inference remains a challenge for long-form math reasoning. The Caprese method addresses this by integrating low-rank distillation atop aggressive feedforward sparsification (methods such as GRIFFIN or CATS). Specifically, Caprese inserts low-rank adapters (adding only ≈1% extra parameters) into frozen FFN blocks and distills them in one of two ways, formalized after this list:
- Locally, with a layerwise MSE loss between the output of the sparsified block plus adapter and the output of the original dense block
- Globally, with an MSE loss between the final embeddings of the distilled model and those of the dense model
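A plausible formalization of these two objectives, writing $f_\ell$ for the dense FFN block at layer $\ell$, $\hat f_\ell$ for its sparsified counterpart with low-rank adapter, and $h$, $\hat h$ for the final hidden embeddings (notation assumed here for illustration, not taken verbatim from the Caprese paper):

$$\mathcal{L}_{\text{local}} = \sum_{\ell} \big\| \hat f_\ell(x_\ell) - f_\ell(x_\ell) \big\|_2^2, \qquad
\mathcal{L}_{\text{global}} = \big\| \hat h(x) - h(x) \big\|_2^2$$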
On Qwen2.5 14B, Caprese nearly fully recovers the math accuracy lost under sparsification, e.g., MATH-500: full model 92.80%, GRIFFIN 89.80%, Caprese+GRIFFIN 89.20%. Latency is reduced by 8–13% on 2K–8K token generations, and chain-of-thought outputs become shorter, which decreases memory pressure.
6. Performance on Benchmarks
The Qwen-2.5-Math models establish new state-of-the-art results among open-source math LLMs when evaluated on a suite of multilingual and competition-level datasets. Notable few-shot and best-of-N pass@1 results are:
| Model | GSM8K | MATH | CMATH | GaoKao Cloze | GaoKao QA |
|---|---|---|---|---|---|
| Qwen2.5-1.5B | 76.8 | 49.8 | 83.0 | 47.5 | 54.1 |
| Qwen2.5-7B | 91.6 | 55.4 | 85.0 | 57.6 | 69.5 |
| Qwen2.5-72B | 90.8 | 66.8 | 89.7 | 72.9 | 86.3 |
- Instruct models with reward-guided best-of-N sampling (RM@8) further improve CoT pass@1 scores: 68.9% for Qwen2.5-7B-Instruct and 70.8% for Qwen2.5-72B-Instruct.
- Tool-Integrated Reasoning (TIR) decoding provides additional accuracy gains on MATH and competition datasets, reaching 97.6% (Qwen2.5-7B, RM@8) and 93.3% (Qwen2.5-1.5B, RM@8).
- On Chinese-specific tasks, post-training tailored to Chinese content yields gains of up to +17 points over GPT-4o.
Competition evaluations:

| Model | AIME 24 | AMC 23 |
|---|---|---|
| Qwen2.5-1.5B (TIR, RM@64) | 18/30 | 36/40 |
| Qwen2.5-7B (TIR, RM@64) | 21/30 | 35/40 |
| Qwen2.5-72B (TIR, RM@64) | 18/30 | 37/40 |
7. Implications, Limitations, and Future Directions
Implementation of iterative supervised and reward model–guided training enables Qwen-2.5-Math to approach or surpass prior open-source records, especially for multilingual and code-augmented reasoning. Robustness to noisy reward signals and the ability to guide inference via RPR or RM reranking have significant practical benefits where reward modeling or answer verification is imperfect.
Fine-tuning and RL strategies that explicitly reward reasoning traces enable the models to generalize robustly to long, open-ended mathematical tasks. At the same time, the scaling behavior evidenced in results from 1.5B to 72B parameters highlights the importance of both parameter count and domain-specialization for mathematical expertise.
Efficient inference and low-rank distillation via Caprese methods substantially decrease latency and accelerate deployment without appreciable loss of mathematical reasoning power.
A plausible implication is that further advances will depend on more sophisticated RM architectures, better reward shaping for open-ended domains, improved multilingual and mathematical tool integration, and continued synthetic data generation. The demonstrated ability to recover accuracy under noisy or partial supervision suggests promising directions for scalable, robust LLMs for mathematical and technical reasoning (Bai et al., 2023, Lv et al., 28 May 2025, Yang et al., 18 Sep 2024, Dong et al., 8 May 2025).