Qwen2.5-Math-1.5B Model

Updated 6 February 2026

Qwen2.5-Math-1.5B is a math-specialized large language model built on a 1.5B parameter dense Transformer, supporting both English and Chinese.
It integrates a self-improvement pipeline comprising data-driven supervised fine-tuning, reward modeling, and reinforcement learning for enhanced accuracy.
Evaluations on major math benchmarks show state-of-the-art performance for its size, with best-of-N sampling and tool-integrated reasoning further improving accuracy.

The Qwen2.5-Math-1.5B model is a math-specialized, open-source LLM derived from the Qwen2.5 series, designed for advanced mathematical reasoning tasks in both English and Chinese. Built on a dense 1.5 billion parameter Transformer backbone, it integrates a self-improvement paradigm across data generation, supervised fine-tuning, reward modeling, and reinforcement learning. This architecture and methodology facilitate state-of-the-art performance for its size on a wide array of mathematical challenge benchmarks, bridging the gap between inference efficiency and mathematical expressiveness (Qwen et al., 2024, Yang et al., 2024, Chen et al., 23 May 2025).

1. Model Architecture and Parameterization

Qwen2.5-Math-1.5B adheres to the dense decoder-only Transformer blueprint established by the Qwen2.5 series, implemented without mixture-of-experts or adapter modules. Initialization is from the Qwen2.5-1.5B base, with slight variations in architectural hyperparameters reported across sources. The canonical configuration given in the Qwen2.5-Math technical report includes:

24 transformer decoder layers
Hidden dimension: 2048
Attention heads: 16
MLP inner dimension: 8192
Rotary positional embeddings (RoPE) for long-context scaling
Total parameters: ≈1.5 billion

The Qwen2.5 technical report describes the base as 28 layers with Grouped Query Attention (12 query, 2 key/value heads), 32k context length, and tied embeddings. In both accounts, there are no model-specific architectural changes tailored for math; instead, mathematical proficiency arises from data and training pipeline design (Qwen et al., 2024, Yang et al., 2024).
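To make the effect of Grouped Query Attention concrete, the following minimal sketch estimates the key/value cache size per token for the 28-layer, 12-query-head, 2-KV-head configuration described above, and compares it with a hypothetical variant in which every query head keeps its own key/value head. The head dimension of 128 and the 2-byte (16-bit) storage format are assumptions made for illustration; neither is stated in the text.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of key/value cache stored for one token across all layers."""
    # The leading factor 2 accounts for one key and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


HEAD_DIM = 128  # assumption: not given in the text

# Grouped Query Attention: 12 query heads share 2 key/value heads.
gqa_bytes = kv_cache_bytes_per_token(num_layers=28, num_kv_heads=2, head_dim=HEAD_DIM)

# Hypothetical multi-head baseline: one key/value head per query head.
mha_bytes = kv_cache_bytes_per_token(num_layers=28, num_kv_heads=12, head_dim=HEAD_DIM)

queries_per_kv_head = 12 // 2
print(gqa_bytes, mha_bytes, queries_per_kv_head)
```

Under these assumptions the cache shrinks by the same factor as the ratio of query heads to key/value heads, which is the main inference-time benefit of the grouped layout.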

2. Training Corpus, Data Generation, and Pre-training

The pre-training pipeline leverages curated and synthetic data emphasizing mathematical reasoning. Principal data sources and synthesis steps are:

Use of Qwen2-Math-Instruct-72B to generate a 700B-token "Qwen Math Corpus v1," combining web-scraped mathematics, code snippets, and exam questions, filtered with FastText classifiers and small LMs.
Expansion to "Qwen Math Corpus v2" (over 1T tokens): additional Chinese and English problem aggregation, synthetic CoT and TIR examples, and domain mixing (web, code, encyclopedias, competitions).
Rigorous decontamination: 13-gram and LCS filtering to avoid test set leakage.
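The decontamination step can be illustrated with a small sketch that flags a training sample when it shares at least one 13-gram with a test item and the two texts also have a long common subsequence. Whitespace tokenization, the 0.6 ratio threshold, and the way the two criteria are combined are assumptions for illustration; the text only names 13-gram and LCS filtering.

```python
def ngrams(tokens, n):
    """Return the set of all contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def is_contaminated(sample, test_item, n=13, lcs_ratio=0.6):
    """Flag a sample that overlaps a test item (thresholds are assumptions)."""
    s, t = sample.split(), test_item.split()
    if not ngrams(s, n) & ngrams(t, n):
        return False  # no shared n-gram, so the sample is kept
    shorter = min(len(s), len(t))
    return lcs_length(s, t) / shorter >= lcs_ratio


test_item = " ".join(f"w{i}" for i in range(20))
leaked = "Problem: " + test_item
clean = " ".join(f"v{i}" for i in range(20))
print(is_contaminated(leaked, test_item), is_contaminated(clean, test_item))
```

A production filter would normalize text and index test-set n-grams once rather than recomputing them per pair.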

Pre-training is accomplished using a standard autoregressive language modeling objective,

$L(\theta) = -\sum_{t} \log P_\theta(x_t | x_{<t})$

with a maximum context length of 4096 tokens, AdamW optimization, and phased learning rate decay (Yang et al., 2024).
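As a worked illustration of the objective above, the sketch below computes the summed negative log-likelihood of a target sequence from per-step logits in plain Python. The tiny vocabulary and logit values are made up for the example.

```python
import math


def log_prob(logits, target):
    """Log-softmax probability of the target index, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return logits[target] - log_z


def autoregressive_nll(logits_per_step, targets):
    """Sum over positions t of -log P(x_t | x_<t)."""
    return -sum(log_prob(l, t) for l, t in zip(logits_per_step, targets))


# Three prediction steps over a 4-token vocabulary with uniform logits:
# each step contributes log(4) to the loss.
uniform_logits = [[0.0, 0.0, 0.0, 0.0]] * 3
nll = autoregressive_nll(uniform_logits, targets=[1, 3, 0])
print(nll)
```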

3. Self-Improvement: Fine-Tuning, Reward Modeling, and Reinforcement Learning

Iterative Supervised Fine-Tuning (SFT)

The SFT phase deploys a large-scale, chain-of-thought-enriched corpus:

2,000K English and 500K Chinese CoT problems (annotated + MuggleMath synthesized), each with multiple candidate reasoning chains, selected by reward scoring.
Approximately 190K annotated + 205K synthesized TIR problems, several with executed code steps and verifiable outputs; 75K translated into Chinese.
SFT is performed for 3 epochs with sequence length 4096, batch size 128, and a learning rate that starts at 2×10⁻⁵ and decays to 7×10⁻⁷ (Yang et al., 2024).
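The SFT settings give a starting and a final learning rate but not the shape of the decay. The sketch below assumes a cosine schedule between 2×10⁻⁵ and 7×10⁻⁷ purely for illustration; the function name and the step counts are hypothetical.

```python
import math


def sft_learning_rate(step, total_steps, lr_start=2e-5, lr_end=7e-7):
    """Cosine decay from lr_start to lr_end (the cosine shape is an assumption)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return lr_end + (lr_start - lr_end) * cosine


total = 1000  # hypothetical number of optimizer steps
schedule = [sft_learning_rate(s, total) for s in (0, 250, 500, 750, 1000)]
print(schedule)
```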

Reward Model Construction

Training set: 361K English + 257K Chinese problems, each with 6 candidate solutions from intermediate SFT models, labeled as correct/incorrect by final answer match.
Architecture matches the SFT model backbone, adding a scalar head.
Listwise ranking loss, averaged over all positive/negative pairs among the 6 candidates:

$L_\mathrm{rm}(\theta) = -\frac{1}{k(6-k)} \mathbb{E}_{(x, y_\mathrm{pos}, y_\mathrm{neg})} \left[ \log \sigma (r_\theta(x, y_\mathrm{pos}) - r_\theta(x, y_\mathrm{neg})) \right]$

where $k$ is the number of positive candidates (Yang et al., 2024).
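For a single problem, the loss above can be computed directly from the scalar scores of its candidates. The following is a minimal sketch in plain Python; skipping problems whose candidates are all correct or all incorrect is an assumption, since the normalizer would otherwise be zero.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))


def listwise_rm_loss(scores, labels):
    """Average of -log sigmoid(r_pos - r_neg) over all positive/negative pairs.

    With 6 candidates and k positives there are k * (6 - k) pairs, which
    matches the normalizer in the formula.
    """
    pos = [s for s, ok in zip(scores, labels) if ok]
    neg = [s for s, ok in zip(scores, labels) if not ok]
    if not pos or not neg:
        return 0.0  # assumption: no pairs can be formed, so skip the problem
    total = sum(-log_sigmoid(sp - sn) for sp in pos for sn in neg)
    return total / (len(pos) * len(neg))


labels = [True, True, False, False, False, False]
untrained = listwise_rm_loss([0.0] * 6, labels)            # every pair gives log 2
separated = listwise_rm_loss([3.0, 2.5, -1.0, -2.0, 0.0, -0.5], labels)
print(untrained, separated)
```

The loss falls as the reward model assigns higher scores to correct solutions than to incorrect ones for the same problem.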

Reinforcement Learning with GRPO

GRPO (Group Relative Policy Optimization) is used, with query eligibility based on response correctness counts, and reward shaping combining the RM score and a binary sparse verifier.
Hyperparameters include a KL divergence coefficient of 1×10⁻³, sampling 32 responses per query, global batch size 512, and learning rate 1×10⁻⁵.
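The two ingredients named above can be sketched as follows: a shaped reward that combines a dense RM score with a binary verifier outcome, and group-relative advantages that standardize rewards within the responses sampled for one query. The specific shaping formula (a sigmoid of the scaled RM score plus a verifier term) and the value of `alpha` are assumptions, since the text does not give the exact form; the standardization is the usual GRPO formulation.

```python
import math
from statistics import mean, pstdev


def shaped_reward(rm_score, verifier_ok, alpha=0.5):
    """Combine a dense RM score with a sparse verifier (form is an assumption)."""
    dense = 1.0 / (1.0 + math.exp(-alpha * rm_score))
    sparse = 1.0 if verifier_ok else 0.0
    # Verified responses score in (0, 1); unverified ones score in (-1, 0).
    return dense + sparse - 1.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one query's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses for one query (the text reports 32 per query).
rewards = [shaped_reward(s, ok) for s, ok in [(2.0, True), (-1.0, False), (0.5, True), (0.0, False)]]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Because advantages are computed relative to the group mean, no separate value network is needed, and a query whose responses all receive the same reward contributes no gradient signal.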

4. Bilingual Reasoning and Tool Integration

Qwen2.5-Math-1.5B is natively bilingual, supporting both English and Chinese mathematical queries at all stages—data generation, SFT, RM training, and evaluation. Prompt templates are designed for each language. Tool-Integrated Reasoning (TIR) is enabled via structured prompts that elicit Python code, with an inference pipeline executing generated code and masking execution artifacts. This dual-mode (chain-of-thought and tool reasoning) setting allows advanced mathematical insight with automated verification or computation (Yang et al., 2024).
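A tool-integrated inference step of the kind described above can be sketched as: find the Python block in the model's output, execute it, and capture what it prints so the result can be appended to the context. The fence-based output format, the helper name `run_tool_step`, and the use of `exec` are assumptions for illustration; a real pipeline would run generated code in a sandbox with time and memory limits.

```python
import contextlib
import io
import re

FENCE = "`" * 3  # built programmatically so the marker does not end this example
CODE_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def run_tool_step(model_output):
    """Execute the first Python block in the output and return its stdout."""
    match = CODE_BLOCK.search(model_output)
    if match is None:
        return None  # pure chain-of-thought response, nothing to execute
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(match.group(1), {})  # unsafe for untrusted code; sketch only
    return buffer.getvalue().strip()


# A made-up model response containing one code step.
response = (
    "We sum the integers from 1 to 100.\n"
    + FENCE + "python\nprint(sum(range(1, 101)))\n" + FENCE + "\n"
)
tool_output = run_tool_step(response)
print(tool_output)
```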

5. Inference Methodologies: Sampling and Answer Selection

Inference incorporates advanced sampling and selection approaches to optimize correctness:

Best-of-N Sampling: Generate $N$ candidate solutions per prompt, then select one answer.
- Maj@N: Choose the most frequent final answer (the mode) among the $N$ candidates.
- RM@N: Choose the solution with the highest RM score.
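The two selection rules can be sketched in a few lines. The example assumes that final answers have already been extracted and normalized to strings and that a reward-model score is available for each candidate; the sample answers and scores are made up.

```python
from collections import Counter


def maj_at_n(answers):
    """Maj@N: return the most frequent final answer among the candidates."""
    return Counter(answers).most_common(1)[0][0]


def rm_at_n(answers, rm_scores):
    """RM@N: return the answer of the candidate with the highest RM score."""
    best = max(range(len(answers)), key=lambda i: rm_scores[i])
    return answers[best]


answers = ["42", "41", "42", "7"]
rm_scores = [0.1, 0.9, 0.2, 0.3]
print(maj_at_n(answers), rm_at_n(answers, rm_scores))
```

The two rules can disagree, as in this example: majority voting follows the answer the model produces most often, while RM selection trusts the single highest-scored solution.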

Chain-of-thought (CoT) prompting—typically with a zero-shot "Let’s think step by step" cue—substantially augments solution quality across datasets. TIR mode further boosts accuracy on code-executable problems (Yang et al., 2024).

6. Evaluation across Benchmarks

Qwen2.5-Math-1.5B demonstrates strong results on major English and Chinese math benchmarks, setting new open-source records at its scale for several tasks. Key results under zero-shot CoT (pass@1) or best-of-N inference:

Benchmark	Greedy	Maj@8	RM@8
GSM8K (English)	84.8%	89.5%	94.1%
MATH (English)	75.8%	80.3%	83.9%
GaoKao 2023 En (English)	65.5%	68.8%	73.0%
OlympiadBench	38.1%	43.9%	47.3%
CollegeMath	47.7%	48.9%	50.2%
MMLU-STEM	57.5%	59.5%	65.2%

Competition problems solved in TIR mode with best-of-256 selection:

AIME 2024: 19/30
AMC 2023: 36/40

For Chinese data:

GaoKao Math: 62.4% (greedy), 67.5% (RM@8)
CMATH: 89.7% (greedy), 94.0% (RM@8)
CN Middle School 24: 76.2% (greedy), 80.2% (RM@8)

The 1.5B model surpasses previous open-source LLMs below 7B parameters and approaches the performance of larger 70B models under best-of-N selection. RM scoring consistently outperforms simple majority voting, especially at higher $N$ (Yang et al., 2024).

7. Sample-Efficient Fine-Tuning: Re-distillation Approach

Recent analysis has demonstrated that with as few as ≈500 high-effect chain-of-thought solutions distilled from a well-trained RL policy (obtained via GRPO), Qwen2.5-1.5B can recover MATH pass@1 accuracy essentially matching models trained with full-scale RL. The re-distillation procedure involves:

Training RL policy on MATH via GRPO.
Generating ≈2000 RL rollouts; filtering and sampling 496 correct, formatted trajectories.
Fine-tuning the base model for 2 epochs with standard cross-entropy.
Zero-shot evaluation yields 54.4% pass@1 on MATH-500, comparable to Instruct/RL-tuned baselines.
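The filtering and sampling step of this procedure can be sketched as below. The rollout fields `correct` and `well_formatted`, the helper name, and the synthetic rollouts are hypothetical; only the approximate rollout count and the target of 496 trajectories come from the text.

```python
import random


def select_distillation_set(rollouts, target_size=496, seed=0):
    """Keep correct, well-formatted rollouts and sample a fixed-size subset."""
    kept = [r for r in rollouts if r["correct"] and r["well_formatted"]]
    if len(kept) <= target_size:
        return kept
    return random.Random(seed).sample(kept, target_size)


# Synthetic stand-ins for about 2000 RL rollouts.
rollouts = [
    {"id": i, "correct": i % 2 == 0, "well_formatted": i % 5 != 0}
    for i in range(2000)
]
sft_set = select_distillation_set(rollouts)
print(len(sft_set))
```

The selected trajectories would then serve as ordinary supervised targets for cross-entropy fine-tuning of the base model.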

Empirical findings confirm the "sample-effect" theoretical framework: high-effect samples (as per RL optimization) rapidly align test-accuracy gradients, making small-sample SFT approximately as effective as full RL for downstream accuracy. This suggests a practical and efficient route for adapting models to mathematical reasoning tasks, with provable gradient-alignment benefits (Chen et al., 23 May 2025).

References

Qwen2.5 Technical Report (Qwen et al., 2024)
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement (Yang et al., 2024)
Towards Revealing the Effectiveness of Small-Scale Fine-tuning in R1-style Reinforcement Learning (Chen et al., 23 May 2025)
