Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
The technical report "Qwen2.5-Math: Toward Mathematical Expert Model via Self-Improvement" from the Qwen Team at Alibaba Group presents a notable advancement in specialized LLMs for mathematical reasoning. The paper introduces two model families, Qwen2.5-Math and Qwen2.5-Math-Instruct, each released in 1.5B, 7B, and 72B parameter sizes. The work builds on its predecessor, Qwen2-Math, by applying self-improvement techniques throughout both training and inference.
Core Innovations
The primary innovation of the Qwen2.5-Math series lies in the integration of self-improvement throughout the entire pipeline:
- Pre-training Phase: Utilizing Qwen2-Math-Instruct to generate large-scale, high-quality mathematical data.
- Post-training Phase: Training a reward model (RM) on large-scale sampled responses, then using it to iteratively improve the supervised fine-tuning (SFT) data. The final RM also serves as the reward signal for reinforcement learning.
- Inference Stage: Using the RM to guide sampling and improve final accuracy (a minimal reranking sketch follows this list).
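To make the inference-stage use of the RM concrete, here is a minimal best-of-N reranking sketch: sample several candidate solutions, score each with the reward model, and keep the highest-scoring one. The `generate` and `score` callables are placeholders for the policy model and the RM, not APIs from the report.

```python
from typing import Callable, List

def best_of_n(
    problem: str,
    generate: Callable[[str], str],      # samples one candidate solution from the policy model
    score: Callable[[str, str], float],  # reward-model score for a (problem, solution) pair
    n: int = 8,
) -> str:
    """Sample n candidate solutions and return the one the reward model scores highest."""
    candidates: List[str] = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda c: score(problem, c))
```

The report's RM-guided evaluation (e.g., RM@8) follows this general pattern; the exact sampling settings are not reproduced here.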
Model Capabilities and Evaluation
The Qwen2.5-Math-Instruct models support both Chinese and English and offer two reasoning modes: Chain-of-Thought (CoT) and Tool-Integrated Reasoning (TIR). The models were evaluated on ten datasets, including GSM8K, MATH, GaoKao, AMC23, and AIME24, spanning difficulties from grade-school to competition-level problems.
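In TIR, the model writes short Python snippets as part of its solution and an external interpreter executes them, with the output fed back into the reasoning. The loop below is a simplified, hypothetical illustration of that interaction; `generate_step` is an assumed interface standing in for the model, and the real pipeline formats tool calls inside the model's text output rather than as a Python tuple.

```python
import contextlib
import io
from typing import Callable, Tuple

def tir_solve(
    problem: str,
    generate_step: Callable[[str], Tuple[str, str]],  # returns (reasoning text, python code) for the next step
    max_turns: int = 4,
) -> str:
    """Alternate between model reasoning and code execution, appending each
    snippet's stdout to the running context (simplified TIR loop)."""
    context = problem
    for _ in range(max_turns):
        reasoning, code = generate_step(context)
        context += "\n" + reasoning
        if not code:              # the model chose to stop using the tool
            break
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, {})        # run the model-written snippet in a fresh namespace
        context += "\nOutput: " + buffer.getvalue()
    return context
```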
Numerical Results
The flagship model, Qwen2.5-Math-72B-Instruct, outperformed existing open-source models and leading closed-source models (such as GPT-4o and Gemini Math-Specialized 1.5 Pro). In particular, it solved nearly all problems in the challenging AMC 2023 set when sampling was guided by the RM. Qwen2.5-Math-7B-Instruct outperformed Qwen2-Math-72B-Instruct, achieving CoT and TIR scores of 83.6 and 85.3, respectively. The smallest model, Qwen2.5-Math-1.5B-Instruct, attained a MATH score of around 80 when using a Python interpreter, surpassing many current models in this domain.
Self-Improvement Techniques
- Mathematical Pre-training:
  - Employed Qwen2-Math-Instruct to synthesize and enrich the training corpus.
  - Developed Qwen Math Corpus v1 and v2, growing the corpus from 700 billion to over 1 trillion tokens.
  - Initialized the models from the Qwen2.5 series base models for stronger starting capabilities.
- Post-training:
  - Implemented supervised fine-tuning (SFT) on CoT and TIR datasets.
  - Conducted iterative fine-tuning with a reward model trained on diverse mathematical problems and responses (see the first sketch after this list).
- Reward Model Training:
  - Synthesized preference data from sampled responses for supervised RM training.
  - Employed Group Relative Policy Optimization (GRPO) for reinforcement learning, which avoids a separate value (critic) model (see the second sketch after this list).
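As a rough illustration of the RM-guided iterative SFT data construction mentioned above, the sketch below keeps, for each problem, the best reward-scored solution among sampled candidates whose final answer is correct (a rejection-sampling scheme). All helper callables are hypothetical placeholders; the report's actual pipeline involves additional stages and filtering.

```python
from typing import Callable, List, Tuple

def build_sft_data(
    problems: List[Tuple[str, str]],          # (problem, reference final answer) pairs
    generate: Callable[[str], str],           # current model: samples one full solution
    extract_answer: Callable[[str], str],     # pulls the final answer out of a solution
    rm_score: Callable[[str, str], float],    # reward-model score for (problem, solution)
    samples_per_problem: int = 16,
) -> List[Tuple[str, str]]:
    """For each problem, sample several solutions, discard those with a wrong
    final answer, and keep the reward model's top pick (rejection sampling)."""
    sft_pairs: List[Tuple[str, str]] = []
    for problem, reference in problems:
        candidates = [generate(problem) for _ in range(samples_per_problem)]
        correct = [c for c in candidates if extract_answer(c) == reference]
        if correct:
            best = max(correct, key=lambda c: rm_score(problem, c))
            sft_pairs.append((problem, best))
    return sft_pairs
```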
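GRPO estimates advantages by comparing each sampled response to the other responses in its group, which removes the need for a learned value function. A minimal sketch of that group-relative advantage computation:

```python
import statistics
from typing import List

def grpo_advantages(group_rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each response's reward against its group's mean and standard
    deviation; these group-relative advantages replace a critic's value estimates."""
    mean = statistics.mean(group_rewards)
    std = statistics.stdev(group_rewards) if len(group_rewards) > 1 else 0.0
    return [(r - mean) / (std + eps) for r in group_rewards]

# Example: reward-model scores for 4 sampled responses to the same problem.
print(grpo_advantages([0.9, 0.2, 0.7, 0.1]))  # higher-reward responses get positive advantages
```

The full GRPO objective also includes a clipped policy-ratio term and a KL penalty against a reference model, which are omitted from this sketch.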
Practical and Theoretical Implications
The Qwen2.5-Math models significantly advance mathematical problem-solving with AI. Practically, they can be used in educational tools, automated theorem proving, and research where mathematical computation is essential. Theoretically, their success highlights the potential of reward-model-guided iterative fine-tuning to extend to other domains.
Future Developments
Anticipated directions include further refining the self-improvement loop with more sophisticated reward models and exploring its application beyond mathematics. Future work may focus on reducing model size while maintaining performance, improving cross-lingual transfer, and integrating additional external tools for more precise reasoning.
Conclusion
The Qwen2.5-Math series represents a significant enhancement in the capabilities of LLMs tailored for mathematical reasoning. By synthesizing data and iteratively improving models with robust reward-guided training, Qwen2.5-Math sets a new benchmark in this specialized domain, showing promise for broader applications in AI-powered mathematical problem-solving.
The models and evaluation scripts are made available on platforms such as Hugging Face and GitHub, so the research community can access and build upon these advancements. The Qwen Team's contributions mark a meaningful step toward refining LLMs for domain-specific tasks through systematic self-improvement.