Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
The technical report "Qwen2.5-Math: Toward Mathematical Expert Model via Self-Improvement" from the Qwen Team at Alibaba Group presents a notable advancement in specialized LLMs for mathematical reasoning. The paper introduces two model families, Qwen2.5-Math and Qwen2.5-Math-Instruct, each released in 1.5B, 7B, and 72B parameter sizes. The work builds on its predecessor, Qwen2-Math, by applying self-improvement techniques throughout both training and inference.
Core Innovations
The primary innovation of the Qwen2.5-Math series lies in the integration of self-improvement throughout the entire pipeline:
- Pre-training Phase: Utilizing Qwen2-Math-Instruct to generate large-scale, high-quality mathematical data.
- Post-training Phase: Training a reward model (RM) on large-scale sampled responses, then using it to iteratively improve the supervised fine-tuning (SFT) data. The final RM also serves as the reward signal for reinforcement learning.
- Inference Stage: Using the RM to guide sampling and improve final accuracy (a minimal reranking sketch follows this list).
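To make the inference-stage use of the RM concrete, here is a minimal best-of-N reranking sketch: sample several candidate solutions, score each with the reward model, and keep the highest-scoring one. The `generate` and `score` callables are placeholders for the policy model and the RM, not APIs from the report.

```python
from typing import Callable, List

def best_of_n(
    problem: str,
    generate: Callable[[str], str],      # samples one candidate solution from the policy model
    score: Callable[[str, str], float],  # reward-model score for a (problem, solution) pair
    n: int = 8,
) -> str:
    """Sample n candidate solutions and return the one the reward model scores highest."""
    candidates: List[str] = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda c: score(problem, c))
```

The report's RM-guided evaluation (e.g., RM@8) follows this general pattern; the exact sampling settings are not reproduced here.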
Model Capabilities and Evaluation
The Qwen2.5-Math-Instruct models support both Chinese and English and offer two reasoning modes: Chain-of-Thought (CoT) and Tool-Integrated Reasoning (TIR). The models were evaluated on ten datasets, including GSM8K, MATH, GaoKao, AMC23, and AIME24, spanning difficulties from grade-school to competition-level problems.
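In TIR, the model writes short Python snippets as part of its solution and an external interpreter executes them, with the output fed back into the reasoning. The loop below is a simplified, hypothetical illustration of that interaction; `generate_step` is an assumed interface standing in for the model, and the real pipeline formats tool calls inside the model's text output rather than as a Python tuple.

```python
import contextlib
import io
from typing import Callable, Tuple

def tir_solve(
    problem: str,
    generate_step: Callable[[str], Tuple[str, str]],  # returns (reasoning text, python code) for the next step
    max_turns: int = 4,
) -> str:
    """Alternate between model reasoning and code execution, appending each
    snippet's stdout to the running context (simplified TIR loop)."""
    context = problem
    for _ in range(max_turns):
        reasoning, code = generate_step(context)
        context += "\n" + reasoning
        if not code:              # the model chose to stop using the tool
            break
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, {})        # run the model-written snippet in a fresh namespace
        context += "\nOutput: " + buffer.getvalue()
    return context
```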
Numerical Results
The flagship model, Qwen2.5-Math-72B-Instruct, outperformed existing open-source models and leading closed-source models (such as GPT-4o and Gemini Math-Specialized 1.5 Pro). In particular, it solved nearly all problems in the challenging AMC 2023 set when sampling was guided by the RM. Qwen2.5-Math-7B-Instruct outperformed Qwen2-Math-72B-Instruct, achieving CoT and TIR scores of 83.6 and 85.3, respectively. The smallest model, Qwen2.5-Math-1.5B-Instruct, attained a MATH score of around 80 when using a Python interpreter, surpassing many current models in this domain.
Self-Improvement Techniques
- Mathematical Pre-training:
  - Employed Qwen2-Math-Instruct to synthesize and enrich the training corpus.
  - Developed Qwen Math Corpus v1 and v2, growing the corpus from 700 billion to over 1 trillion tokens.
  - Initialized the models from the Qwen2.5 series base models for stronger starting capabilities.
- Post-training:
  - Implemented supervised fine-tuning (SFT) on CoT and TIR datasets.
  - Conducted iterative fine-tuning with a reward model trained on diverse mathematical problems and responses (see the first sketch after this list).
- Reward Model Training:
  - Synthesized preference data from sampled responses for supervised RM training.
  - Employed Group Relative Policy Optimization (GRPO) for reinforcement learning, which avoids a separate value (critic) model (see the second sketch after this list).
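As a rough illustration of the RM-guided iterative SFT data construction mentioned above, the sketch below keeps, for each problem, the best reward-scored solution among sampled candidates whose final answer is correct (a rejection-sampling scheme). All helper callables are hypothetical placeholders; the report's actual pipeline involves additional stages and filtering.

```python
from typing import Callable, List, Tuple

def build_sft_data(
    problems: List[Tuple[str, str]],          # (problem, reference final answer) pairs
    generate: Callable[[str], str],           # current model: samples one full solution
    extract_answer: Callable[[str], str],     # pulls the final answer out of a solution
    rm_score: Callable[[str, str], float],    # reward-model score for (problem, solution)
    samples_per_problem: int = 16,
) -> List[Tuple[str, str]]:
    """For each problem, sample several solutions, discard those with a wrong
    final answer, and keep the reward model's top pick (rejection sampling)."""
    sft_pairs: List[Tuple[str, str]] = []
    for problem, reference in problems:
        candidates = [generate(problem) for _ in range(samples_per_problem)]
        correct = [c for c in candidates if extract_answer(c) == reference]
        if correct:
            best = max(correct, key=lambda c: rm_score(problem, c))
            sft_pairs.append((problem, best))
    return sft_pairs
```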
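GRPO estimates advantages by comparing each sampled response to the other responses in its group, which removes the need for a learned value function. A minimal sketch of that group-relative advantage computation:

```python
import statistics
from typing import List

def grpo_advantages(group_rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each response's reward against its group's mean and standard
    deviation; these group-relative advantages replace a critic's value estimates."""
    mean = statistics.mean(group_rewards)
    std = statistics.stdev(group_rewards) if len(group_rewards) > 1 else 0.0
    return [(r - mean) / (std + eps) for r in group_rewards]

# Example: reward-model scores for 4 sampled responses to the same problem.
print(grpo_advantages([0.9, 0.2, 0.7, 0.1]))  # higher-reward responses get positive advantages
```

The full GRPO objective also includes a clipped policy-ratio term and a KL penalty against a reference model, which are omitted from this sketch.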
Practical and Theoretical Implications
The Qwen2.5-Math models significantly advance mathematical problem-solving with AI. Practically, they can be used in educational tools, automated theorem proving, and research where mathematical computation is essential. Theoretically, their success highlights the potential of reward-model-guided iterative fine-tuning to extend to other domains.
Future Developments
Anticipated directions include further refining the self-improvement loop with more sophisticated reward models and exploring its application beyond mathematics. Future work may focus on reducing model size while maintaining performance, improving cross-lingual transfer, and integrating additional external tools for more precise reasoning.
Conclusion
The Qwen2.5-Math series represents a significant enhancement in the capabilities of LLMs tailored for mathematical reasoning. By synthesizing data and iteratively improving models with robust reward-guided training, Qwen2.5-Math sets a new benchmark in this specialized domain, showing promise for broader applications in AI-powered mathematical problem-solving.
The models and evaluation scripts are made available on platforms such as Hugging Face and GitHub, so the research community can access and build upon these advancements. The Qwen Team's contributions mark a meaningful step toward refining LLMs for domain-specific tasks through systematic self-improvement.