
Qwen2.5-Math-1.5B: Self-Improving Math LLM

Updated 9 August 2025
  • Qwen2.5-Math-1.5B is a math-specialized LLM with 1.5B parameters that uses an iterative self-improvement pipeline, advanced reward modeling, and dual reasoning modes.
  • It employs a three-stage training process including pre-training on a trillion-token corpus, post-training with rejection sampling, and reinforcement-guided inference optimization.
  • The model achieves competitive performance on diverse benchmarks in both English and Chinese by leveraging chain-of-thought and tool-integrated reasoning for robust math problem-solving.

Qwen2.5-Math-1.5B is a math-specialized LLM within the Qwen2.5-Math series, distinguished by its iterative self-improvement training pipeline, advanced reward modeling, and comprehensive evaluation on multilingual mathematical benchmarks. At only 1.5 billion parameters, it integrates stepwise chain-of-thought and tool-integrated reasoning, yielding competitive performance versus much larger models on diverse math tasks ranging from basic arithmetic to Olympiad-level competition problems (Yang et al., 18 Sep 2024).

1. Self-Improvement Pipeline and Training Architecture

Qwen2.5-Math-1.5B evolves through three interdependent stages: pre-training, post-training, and inference-time optimization.

  • Pre-training: The model is exposed to the Qwen Math Corpus v2, comprising over 1 trillion tokens. This corpus integrates recalled web and math documents with high-quality, synthetic mathematical questions and answers generated by the predecessor Qwen2-Math-Instruct model. This ensures broad coverage from elementary to competition-level problems.
  • Post-training: A reward model (RM) is iteratively trained via massive sampling from Qwen2-Math-Instruct. Candidate responses, including chain-of-thought (CoT) and tool-integrated reasoning (TIR) solutions, are ranked and selected using rejection sampling. For problems with known answers, candidates with correct final answers and high RM scores are chosen; for open-ended queries, a weighted majority vote selects the most plausible response (a selection sketch appears below). This process is repeated: a stronger SFT (supervised fine-tuning) model yields better data for the next RM generation, pushing quality further (Yang et al., 18 Sep 2024).
  • Reward Model: The RM employs a list-wise ranking loss:

\mathcal{L}_{rm}(\theta) = -\frac{1}{k \times (6-k)}\,\mathbb{E}_{(x,\,y_{pos},\,y_{neg}) \sim D}\left[\log\left(\sigma\left(r_{\theta}(x,y_{pos})-r_{\theta}(x,y_{neg})\right)\right)\right]

where $r_\theta(x, y)$ is the RM output for input $(x, y)$, $\sigma$ is the sigmoid function, and $k$ is the number of positive responses among the six candidates sampled per query.
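
A minimal PyTorch sketch of this list-wise loss, assuming the $k$ positive and $6-k$ negative responses for a query have already been scored by the RM (tensor shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def listwise_rm_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """List-wise ranking loss over all positive/negative pairs for one query.

    pos_scores: shape (k,)     -- RM scores r_theta(x, y_pos) for the k correct responses
    neg_scores: shape (6 - k,) -- RM scores r_theta(x, y_neg) for the remaining responses
    """
    k = pos_scores.shape[0]
    m = neg_scores.shape[0]  # 6 - k when six candidates are sampled per query
    # Pairwise differences r(x, y_pos) - r(x, y_neg) for every (pos, neg) pair.
    diffs = pos_scores.unsqueeze(1) - neg_scores.unsqueeze(0)  # shape (k, 6 - k)
    # -1/(k * (6-k)) * sum of log sigmoid(diff); logsigmoid is the numerically stable form.
    return -F.logsigmoid(diffs).sum() / (k * m)
```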

Significance: These iterations foster a closed-loop "self-improving" pipeline—the supervised fine-tuned model and its RM continuously co-evolve, amplifying reasoning skill.
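
The rejection-sampling selection described in the post-training bullet above can be sketched as follows; the candidate record format (the `final_answer` and `rm_score` fields) is an assumption for illustration, not the report's actual data schema:

```python
from collections import defaultdict

def select_sft_response(candidates, reference_answer=None):
    """Pick one response per problem from sampled candidates.

    candidates: list of dicts with keys 'text', 'final_answer', 'rm_score' (assumed format)
    reference_answer: ground-truth final answer if known, else None.
    """
    if reference_answer is not None:
        # Problems with known answers: keep candidates whose final answer is
        # correct, then take the one the reward model scores highest.
        correct = [c for c in candidates if c["final_answer"] == reference_answer]
        pool = correct if correct else candidates
        return max(pool, key=lambda c: c["rm_score"])

    # Open-ended problems: RM-weighted vote over final answers, then return the
    # best-scored candidate within the winning answer group.
    weight = defaultdict(float)
    for c in candidates:
        weight[c["final_answer"]] += c["rm_score"]
    winning_answer = max(weight, key=weight.get)
    group = [c for c in candidates if c["final_answer"] == winning_answer]
    return max(group, key=lambda c: c["rm_score"])
```

The selected responses then become the SFT data for the next iteration of the loop.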

2. Inference Optimization and Reinforcement Learning

At inference time, two optimization mechanisms further improve accuracy:

  • Best-of-N Sampling: For each query, multiple candidate answers are generated and the RM selects the one with the highest reward score, improving reliability on challenging math computations (a minimal sketch appears at the end of this section).
  • Reinforcement Learning with Group Relative Policy Optimization (GRPO): The RM guides policy updates such that responses with correct answers and intermediate steps achieve consistently higher overall reward. The total reward is shaped as:

r = \sigma(\alpha \cdot r_{m}) + (r_{v} - 1)

where $r_m$ is the RM score, $r_v$ is a sparse rule-based verifier reward, and $\alpha$ is a scaling factor (here, $0.5$).
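
A one-line sketch of this shaped reward; the binary convention for the verifier (1.0 for a correct final answer, 0.0 otherwise) is an assumption consistent with a sparse rule-based check:

```python
import math

def shaped_reward(rm_score: float, verifier_reward: float, alpha: float = 0.5) -> float:
    """Total reward r = sigmoid(alpha * r_m) + (r_v - 1).

    rm_score:        dense reward-model score r_m
    verifier_reward: sparse rule-based verifier reward r_v (assumed to be 1.0 if the
                     final answer matches the reference, else 0.0)
    """
    return 1.0 / (1.0 + math.exp(-alpha * rm_score)) + (verifier_reward - 1.0)
```

Under this shaping, any response that fails the verifier scores below zero while any response that passes scores above zero, so correctness dominates and the dense RM term only differentiates among responses within each group.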

Significance: These strategies reward solutions whose intermediate steps are correct at every stage, reinforcing logically sound reasoning during both training and inference (Yang et al., 18 Sep 2024).
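
As referenced above, a minimal sketch of best-of-N selection; the `generate` and `score` callables are assumed stand-ins for the sampler and the reward model:

```python
def best_of_n(generate, score, prompt: str, n: int = 8) -> str:
    """Best-of-N sampling: draw n candidate solutions and return the one the
    reward model scores highest.

    generate(prompt) -> str        : samples one candidate solution (assumed interface)
    score(prompt, text) -> float   : reward-model score for a candidate (assumed interface)
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda text: score(prompt, text))
```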

3. Mathematical Reasoning: Chain-of-Thought and Tool-Integrated Reasoning

Qwen2.5-Math-1.5B is trained to operate in two reasoning modes:

  • Chain-of-Thought (CoT): The model is explicitly trained to decompose complex math problems into manageable intermediate steps (e.g., sequential calculations and logical decisions).
  • Tool-Integrated Reasoning (TIR): For precise calculation or symbolic manipulation (e.g., solving equations, calculating eigenvalues), the model leverages external computational tools such as a Python interpreter. Even at the 1.5B parameter scale, TIR enables the model to match or exceed the scores of much larger open-source math-specialized LLMs.

Context: This dual-mode reasoning capability, especially when combined with tool calls, is responsible for strong benchmark scores, despite the model’s compact size.
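
To make the TIR mode concrete, here is a simplified single-round loop in which the model emits a Python snippet, the snippet is executed, and the captured output is fed back before the final answer is stated. The prompt wording, the fenced-block convention, and the `llm` callable are assumptions for illustration, not the report's exact protocol:

```python
import io
import contextlib

def run_tool_call(code: str) -> str:
    """Execute a model-emitted Python snippet and capture its stdout.
    (A real deployment would sandbox this instead of calling exec directly.)"""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})  # illustration only; unsafe outside a sandbox
    return buffer.getvalue().strip()

def tool_integrated_reasoning(llm, problem: str) -> str:
    """One round of tool-integrated reasoning.

    llm(prompt) -> str is an assumed interface returning the model's continuation.
    The model writes reasoning plus a fenced python block, the block is executed,
    and its output is appended so the model can state the final answer.
    """
    draft = llm(f"Solve step by step, using Python where helpful:\n{problem}")
    if "```python" in draft:
        code = draft.split("```python", 1)[1].split("```", 1)[0]
        result = run_tool_call(code)
        draft = llm(draft + f"\n```output\n{result}\n```\nTherefore, the final answer is")
    return draft
```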

4. Benchmarking and Evaluation Across Languages

Qwen2.5-Math-1.5B is evaluated on ten mathematics benchmarks in both Chinese and English, including GSM8K, MATH, GaoKao, AMC23, and AIME24.

| Dataset | Type | Observed Model Strength |
|---------|------|-------------------------|
| GSM8K   | Grade school / English  | Outperforms larger peers |
| MATH    | Competition / English   | Competitive in TIR mode  |
| GaoKao  | National exam / Chinese | Marked performance boost |
| AIME24  | Olympiad / English      | Robust at 1.5B scale     |

Performance is strong across datasets: even the smallest 1.5B model in the family delivers results competitive with models an order of magnitude larger, especially when inference is augmented with TIR (Yang et al., 18 Sep 2024).

Significance: The inclusion of Chinese math datasets marks a critical improvement over earlier Qwen iterations, which neglected Chinese-specific math data. The model demonstrates stability across both easy and hard problems, consistently exploiting reasoning traces for enhanced accuracy.

5. Architectural and Algorithmic Characteristics

Architecturally, Qwen2.5-Math-1.5B is a decoder-only Transformer, benefitting from optimized pre-training corpus selection and fine-tuning strategies.

  • Data Mixture Design: The model leverages a carefully curated corpus that up-samples mathematical content and tool problem-solving examples, down-sampling low-value web text.
  • Iterative RM and SFT Loop: The co-evolution of the RM and SFT minimizes exposure to low-quality answers, refining the model’s score-driven output.
  • Efficient Scaling: Notably, the 1.5B model leverages high-quality training and reward-based sampling to achieve performance traditionally attained only by models with significantly more parameters.

Significance: This approach demonstrates that improvement in reasoning is not solely a function of model size—quality and diversity of data, training loop architecture, and reward-driven feedback are essential.

6. Broader Implications and Research Context

Qwen2.5-Math-1.5B exemplifies a paradigm in which smaller, efficiently trained models can set performance standards in mathematical reasoning by leveraging self-improvement and reward modeling.

  • Comparative Strength: On benchmarks, Qwen2.5-Math-1.5B outperforms many much larger models, illustrating the value of iterative reward-guided evolution rather than brute-force scaling.
  • Multilingual Capabilities: Its support for both Chinese and English positions it as a genuinely bilingual math specialist, useful for academic, educational, and research contexts globally.
  • Generalization: Enhanced CoT and TIR enable transferability across a spectrum of mathematical domains—arithmetic, algebra, and competition mathematics—which is significant given the model's compact parameter count.

7. Concluding Summary

Qwen2.5-Math-1.5B is a self-improving, math-specialized LLM combining a reward-driven training process with chain-of-thought and tool-integrated reasoning. Through iterative RM-guided fine-tuning and inference-time optimization, it achieves robust, high-accuracy performance on diverse math benchmarks in both Chinese and English, demonstrating that domain-specific data and self-improvement strategies are as vital as model scale in realizing expert mathematical reasoning (Yang et al., 18 Sep 2024).

Qwen2.5-Math-1.5B: A self-improved, math-specialized model with enhanced CoT and TIR capabilities that achieves strong performance on diverse math benchmarks.

References (1)
Yang et al. (18 Sep 2024). Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement.