Qwen2.5-Math-1.5B Model
- Qwen2.5-Math-1.5B is a math-specialized large language model built on a 1.5B parameter dense Transformer, supporting both English and Chinese.
- It integrates a self-improvement pipeline comprising data-driven supervised fine-tuning, reward modeling, and reinforcement learning for enhanced accuracy.
- Evaluations on major math benchmarks show state-of-the-art performance for its size, with best-of-N sampling and tool-integrated reasoning further improving accuracy.
The Qwen2.5-Math-1.5B model is a math-specialized, open-source LLM derived from the Qwen2.5 series, designed for advanced mathematical reasoning tasks in both English and Chinese. Built on a dense 1.5 billion parameter Transformer backbone, it integrates a self-improvement paradigm across data generation, supervised fine-tuning, reward modeling, and reinforcement learning. This architecture and methodology facilitate state-of-the-art performance for its size on a wide array of mathematical challenge benchmarks, bridging the gap between inference efficiency and mathematical expressiveness (Qwen et al., 2024, Yang et al., 2024, Chen et al., 23 May 2025).
1. Model Architecture and Parameterization
Qwen2.5-Math-1.5B adheres to the dense decoder-only Transformer blueprint established by the Qwen2.5 series, implemented without mixture-of-experts or adapter modules. Initialization is from the Qwen2.5-1.5B base, with slight variations in architectural hyperparameters reported across sources. The canonical configuration given in the Qwen2.5-Math technical report includes:
- 24 transformer decoder layers
- Hidden dimension: 2048
- Attention heads: 16
- MLP inner dimension: 8192
- Rotary positional embeddings (RoPE) for long-context scaling
- Total parameters: ≈1.5 billion
The Qwen2.5 technical report describes the base as 28 layers with Grouped Query Attention (12 query, 2 key/value heads), 32k context length, and tied embeddings. In both accounts, there are no model-specific architectural changes tailored for math; instead, mathematical proficiency arises from data and training pipeline design (Qwen et al., 2024, Yang et al., 2024).
2. Training Corpus, Data Generation, and Pre-training
The pre-training pipeline leverages curated and synthetic data emphasizing mathematical reasoning. Principal data sources and synthesis steps are:
- Use of Qwen2-Math-Instruct-72B to generate a 700B-token "Qwen Math Corpus v1" that combines web-scraped mathematics, code snippets, and exam questions, filtered with FastText classifiers and small LMs.
- Expansion to "Qwen Math Corpus v2" (over 1T tokens): additional Chinese and English problem aggregation, synthetic CoT and TIR examples, and domain mixing (web, code, encyclopedias, competitions).
- Rigorous decontamination: 13-gram and LCS filtering to avoid test set leakage.
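A minimal sketch of how an n-gram plus LCS decontamination check could look. The normalization rule, the way the two criteria are combined (a shared 13-gram and a high LCS ratio), and the 0.6 threshold are assumptions made for illustration; the function names are hypothetical and not taken from the report.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase and keep only alphanumeric word tokens (assumed normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """All contiguous n-token windows; empty if the text is shorter than n."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def is_contaminated(sample: str, test_items: list[str],
                    n: int = 13, lcs_ratio: float = 0.6) -> bool:
    """Flag a training sample that shares an n-gram with a test item AND
    whose LCS covers at least `lcs_ratio` of the shorter text.
    The 0.6 threshold is an assumed value, not confirmed by the source."""
    s_tok = normalize(sample)
    s_grams = ngrams(s_tok, n)
    for item in test_items:
        t_tok = normalize(item)
        if not (s_grams & ngrams(t_tok, n)):
            continue  # no shared n-gram, so skip the more expensive LCS step
        shorter = min(len(s_tok), len(t_tok))
        if shorter and lcs_length(s_tok, t_tok) / shorter >= lcs_ratio:
            return True
    return False
```

Checking the cheap n-gram overlap first keeps the quadratic LCS computation limited to the few candidate pairs that already look suspicious.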
Pre-training is accomplished using a standard autoregressive language modeling objective,
$$\mathcal{L}_{\text{LM}}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}),$$
with a maximum context length of 4096 tokens, AdamW optimization, and phased learning rate decay (Yang et al., 2024).
3. Self-Improvement: Fine-Tuning, Reward Modeling, and Reinforcement Learning
Iterative Supervised Fine-Tuning (SFT)
The SFT phase deploys a large-scale, chain-of-thought-enriched corpus:
- 2,000K English and 500K Chinese CoT problems (annotated + MuggleMath synthesized), each with multiple candidate reasoning chains, selected by reward scoring.
- Approximately 190K annotated + 205K synthesized TIR problems, several with executed code steps and verifiable outputs; 75K translated into Chinese.
- SFT is performed for 3 epochs with sequence length 4096, batch size 128, and an initial learning rate of 2×10⁻⁵ that decays to 7×10⁻⁷ (Yang et al., 2024).
Reward Model Construction
- Training set: 361K English + 257K Chinese problems, each with 6 candidate solutions from intermediate SFT models, labeled as correct/incorrect by final answer match.
- Architecture matches the SFT model backbone, adding a scalar head.
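- Pairwise listwise ranking loss:
$$\mathcal{L}_{\text{rm}}(\theta) = -\frac{1}{k \times (6-k)} \, \mathbb{E}_{(x,\, y_{\text{pos}},\, y_{\text{neg}}) \sim D} \left[ \log \sigma\big( r_\theta(x, y_{\text{pos}}) - r_\theta(x, y_{\text{neg}}) \big) \right]$$
where $k$ is the number of positive candidates among the 6 sampled solutions, $r_\theta$ is the reward model's scalar score, and $\sigma$ is the sigmoid function (Yang et al., 2024).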
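As a rough illustration, the ranking loss for a single problem can be computed by averaging a logistic loss over every (correct, incorrect) pair of candidates. The sketch below assumes plain Python floats for the reward scores and booleans for the final-answer labels; `listwise_rm_loss` is a hypothetical name, not an API from the Qwen codebase.

```python
import math
from itertools import product


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def listwise_rm_loss(scores: list[float], labels: list[bool]) -> float:
    """Average of -log(sigmoid(s_pos - s_neg)) over all positive/negative pairs.

    With k positives among 6 candidates there are k * (6 - k) pairs, which is
    the normalizer used here. Problems whose candidates are all correct or all
    incorrect contribute no pairs and return 0.0 (an assumed convention).
    """
    pos = [s for s, ok in zip(scores, labels) if ok]
    neg = [s for s, ok in zip(scores, labels) if not ok]
    if not pos or not neg:
        return 0.0
    # -log(sigmoid(d)) == softplus(-d)
    total = sum(softplus(-(sp - sn)) for sp, sn in product(pos, neg))
    return total / (len(pos) * len(neg))


if __name__ == "__main__":
    labels = [True, True, False, False, False, False]
    well_ranked = [2.0, 1.5, -1.0, -0.5, -2.0, -1.2]
    print(listwise_rm_loss(well_ranked, labels))
```

The loss shrinks as correct solutions are scored further above incorrect ones, and equals log 2 when the model cannot separate them at all.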
Reinforcement Learning with GRPO
- GRPO (Group Relative Policy Optimization) is used, with query eligibility based on response correctness counts, and reward shaping combining the RM score and a binary sparse verifier.
- Hyperparameters include a KL divergence coefficient of 1×10⁻³, sampling 32 responses per query, global batch size 512, and learning rate 1×10⁻⁵.
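The two ingredients named above, a shaped reward and group-relative advantages, can be sketched as follows. The shaping formula (a sigmoid of the scaled RM score plus a verifier offset) and the value `alpha=0.5` are assumptions chosen for illustration; the advantage normalization follows the usual GRPO definition of standardizing rewards within the group of responses sampled for one query.

```python
import math
from statistics import mean, pstdev


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def shaped_reward(rm_score: float, verified: bool, alpha: float = 0.5) -> float:
    """Combine a dense RM score with a sparse binary verifier (assumed form).

    Verified-correct responses land in (0, 1) and incorrect ones in (-1, 0),
    so the verifier always dominates the sign of the reward.
    """
    return sigmoid(alpha * rm_score) + (1.0 if verified else 0.0) - 1.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one query's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    rm_scores = [1.2, -0.3, 0.8, -1.5]
    verified = [True, False, True, False]
    rewards = [shaped_reward(s, v) for s, v in zip(rm_scores, verified)]
    print(group_relative_advantages(rewards))
```

Because advantages are computed relative to the group mean, a query whose responses all receive the same reward yields zero advantage everywhere, which is one reason to filter queries by how many sampled responses are correct.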
4. Bilingual Reasoning and Tool Integration
Qwen2.5-Math-1.5B is natively bilingual, supporting both English and Chinese mathematical queries at all stages: data generation, SFT, RM training, and evaluation. Prompt templates are designed for each language. Tool-Integrated Reasoning (TIR) is enabled via structured prompts that elicit Python code, with an inference pipeline executing generated code and masking execution artifacts. This dual-mode setting (chain-of-thought and tool-integrated reasoning) lets the model reason in natural language while delegating exact computation and verification to a Python interpreter (Yang et al., 2024).
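One step of such a tool-integrated loop might look like the sketch below: find the first Python block in the model's draft, run it, and append the captured output so generation can continue from there. The fence format, the `output` block label, and the helper names are assumptions for illustration; the sketch does not cover masking of execution output during training. A real pipeline would also run the code in an isolated sandbox with timeouts rather than calling `exec` in-process.

```python
import contextlib
import io
import re

FENCE = "`" * 3  # built programmatically to keep this example self-contained
CODE_RE = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def run_python(code: str) -> str:
    """Execute code and return captured stdout, or the error message.

    WARNING: exec() on model output is unsafe; this is for illustration only.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {"__name__": "__tir__"})
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def tir_step(model_output: str) -> str:
    """Append the execution result of the first Python block, if any."""
    match = CODE_RE.search(model_output)
    if match is None:
        return model_output  # pure chain-of-thought, nothing to execute
    result = run_python(match.group(1))
    # Truncate anything after the code block and append the tool result.
    return model_output[: match.end()] + f"\n{FENCE}output\n{result}\n{FENCE}\n"


if __name__ == "__main__":
    draft = ("Sum the integers from 1 to 100.\n"
             + FENCE + "python\nprint(sum(range(1, 101)))\n" + FENCE)
    print(tir_step(draft))
```

Returning errors as text rather than raising lets the model see a failed execution and attempt a correction on the next turn.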
5. Inference Methodologies: Sampling and Answer Selection
Inference incorporates advanced sampling and selection approaches to optimize correctness:
- Best-of-N Sampling: Generate N candidate solutions per prompt.
- Maj@N: Choose the most frequent final answer among the N candidates (majority voting).
- RM@N: Choose the solution with the highest RM score.
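The two selection rules differ only in how they pick among the N candidates, as this minimal sketch shows. It assumes final answers have already been extracted as strings and that RM scores are plain floats; the function names are hypothetical.

```python
from collections import Counter


def maj_at_n(answers: list[str]) -> str:
    """Return the most frequent final answer (ties go to the earliest seen)."""
    return Counter(answers).most_common(1)[0][0]


def rm_at_n(answers: list[str], rm_scores: list[float]) -> str:
    """Return the final answer of the candidate with the highest RM score."""
    best = max(range(len(answers)), key=lambda i: rm_scores[i])
    return answers[best]


if __name__ == "__main__":
    answers = ["12", "15", "12", "7"]
    rm_scores = [0.2, 0.9, 0.4, 0.1]
    print(maj_at_n(answers))            # majority answer
    print(rm_at_n(answers, rm_scores))  # top-scored answer
```

In this toy input the two rules disagree: majority voting returns the answer that appears twice, while reward-model selection trusts a single highly scored candidate.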
Chain-of-thought (CoT) prompting—typically with a zero-shot "Let’s think step by step" cue—substantially augments solution quality across datasets. TIR mode further boosts accuracy on code-executable problems (Yang et al., 2024).
6. Evaluation across Benchmarks
Qwen2.5-Math-1.5B demonstrates strong results on major English and Chinese math benchmarks, setting new open-source records at its scale for several tasks. Key results under zero-shot CoT (pass@1) or best-of-N inference:
| Benchmark | Greedy | Maj@8 | RM@8 |
|---|---|---|---|
| GSM8K (English) | 84.8% | 89.5% | 94.1% |
| MATH (English) | 75.8% | 80.3% | 83.9% |
| GaoKao 2023 En (English) | 65.5% | 68.8% | 73.0% |
| OlympiadBench | 38.1% | 43.9% | 47.3% |
| CollegeMath | 47.7% | 48.9% | 50.2% |
| MMLU-STEM | 57.5% | 59.5% | 65.2% |

Competition results with best@256 sampling in TIR mode:
| Benchmark | Problems solved |
|---|---|
| AIME 2024 | 19/30 |
| AMC 2023 | 36/40 |
For Chinese data:
- GaoKao Math: 62.4% (greedy), 67.5% (RM@8)
- CMATH: 89.7% (greedy), 94.0% (RM@8)
- CN Middle School 24: 76.2% (greedy), 80.2% (RM@8)
The 1.5B model surpasses previous open-source LLMs below 7B parameters and approaches the performance of larger 70B models under best-of-N selection. RM scoring consistently outperforms simple majority voting, especially at higher N (Yang et al., 2024).
7. Sample-Efficient Fine-Tuning: Re-distillation Approach
Recent analysis has demonstrated that with as few as ≈500 high-effect chain-of-thought solutions distilled from a well-trained RL policy (obtained via GRPO), Qwen2.5-1.5B can recover MATH pass@1 accuracy essentially matching models trained with full-scale RL. The re-distillation procedure involves:
- Training RL policy on MATH via GRPO.
- Generating ≈2000 RL rollouts; filtering and sampling 496 correct, formatted trajectories.
- Fine-tuning the base model for 2 epochs with standard cross-entropy.
- Zero-shot evaluation yields 54.4% pass@1 on MATH-500, comparable to Instruct/RL-tuned baselines.
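The filtering and sampling step of this procedure can be sketched as below. The sketch assumes each rollout is a dict with `response` and `reference` fields, that "formatted" means the response contains a `\boxed{...}` final answer, and that correctness is exact string match; these conventions and the function names are illustrative assumptions rather than the authors' implementation.

```python
import random
import re

# Matches \boxed{...} without nested braces (a simplification for this sketch).
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_answer(text: str):
    """Return the last boxed answer in the text, or None if absent."""
    matches = BOXED_RE.findall(text)
    return matches[-1].strip() if matches else None


def select_trajectories(rollouts: list[dict], k: int = 496, seed: int = 0) -> list[dict]:
    """Keep correctly formatted, correct rollouts and sample up to k of them."""
    kept = [
        r for r in rollouts
        if extract_answer(r["response"]) == r["reference"].strip()
    ]
    rng = random.Random(seed)  # fixed seed for a reproducible subset
    return rng.sample(kept, min(k, len(kept)))


if __name__ == "__main__":
    rollouts = [
        {"response": "So the answer is \\boxed{42}.", "reference": "42"},
        {"response": "So the answer is \\boxed{41}.", "reference": "42"},
        {"response": "The answer is 42.", "reference": "42"},  # unformatted
    ]
    print(len(select_trajectories(rollouts)))
```

The selected trajectories would then serve as the supervised fine-tuning set for the base model, trained with ordinary cross-entropy.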
Empirical findings confirm the "sample-effect" theoretical framework: high-effect samples (as per RL optimization) rapidly align test-accuracy gradients, making small-sample SFT approximately as effective as full RL for downstream accuracy. This suggests a practical and efficient route for adapting models to mathematical reasoning tasks, with provable gradient-alignment benefits (Chen et al., 23 May 2025).
References
- Qwen2.5 Technical Report (Qwen et al., 2024)
- Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement (Yang et al., 2024)
- Towards Revealing the Effectiveness of Small-Scale Fine-tuning in R1-style Reinforcement Learning (Chen et al., 23 May 2025)