rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

Published 8 Jan 2025 in cs.CL | (2501.04519v1)

Abstract: We present rStar-Math to demonstrate that small LLMs (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data sythesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids na\"ive step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% the brightest high school math students. Code and data will be available at https://github.com/microsoft/rStar.

Abstract PDF Upgrade to Chat

Summary

The paper presents a novel self-evolution framework that uses Monte Carlo Tree Search and code-augmented chain-of-thought to improve math reasoning in small LLMs.
It introduces a Process Preference Model that refines reasoning steps through pairwise ranking, leading to higher solution accuracy on benchmarks.
Experimental results show significant performance boosts, with improvements up to 90% on the MATH benchmark and superior results on Olympiad-level tests.

rStar-Math: Small LLMs Excel in Math Reasoning Through Self-Evolved Deep Thinking

The paper "rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking" (2501.04519) introduces rStar-Math, a novel approach that enables small LLMs (SLMs) to achieve state-of-the-art performance in mathematical reasoning. This is accomplished through a self-evolutionary process leveraging Monte Carlo Tree Search (MCTS) for "deep thinking," guided by an SLM-based process reward model (PRM). The key innovations include code-augmented CoT data synthesis, a process preference model (PPM) training method, and a self-evolution recipe for iteratively improving reasoning capabilities.

Key Methodological Innovations

The rStar-Math framework hinges on three primary innovations designed to overcome the challenges of training SLMs for complex mathematical reasoning.

Code-Augmented CoT Data Synthesis

This method addresses the issue of unreliable intermediate steps in generated reasoning trajectories. It leverages MCTS to decompose problem-solving into multi-step generation. At each step, the policy SLM samples candidate nodes, generating both a one-step CoT and corresponding Python code (Figure 1). Only nodes with successful code execution are retained, ensuring the correctness of intermediate steps. Furthermore, extensive MCTS rollouts assign a Q-value to each step based on its contribution to trajectories leading to the correct answer, thereby identifying high-quality reasoning steps.

Figure 1: An illustration of how the policy model generates a one-step NL CoT alongside its corresponding Python code.

Process Preference Model (PPM) Training

Training a reliable PRM for math reasoning has been an open question. Instead of directly using noisy Q-values as reward labels, the PPM leverages Q-values to distinguish between positive (correct) and negative (incorrect) steps. This method constructs preference pairs for each step based on Q-values and uses a pairwise ranking loss to optimize PPM's score prediction for each reasoning step, achieving more reliable labeling.

Self-Evolution Recipe

rStar-Math employs a four-round self-evolution recipe to progressively build both a frontier policy model and PPM from scratch (Figure 2). In each round, the latest policy model and PPM are used to perform MCTS, generating increasingly high-quality training data. This iterative process refines the policy SLM, improves the PPM's reliability, generates better reasoning trajectories via PPM-augmented MCTS, and expands training data coverage to tackle more challenging problems.

Figure 2: A visual representation of the rStar-Math framework showing a system diagram.

Experimental Results and Analysis

Performance Benchmarks

Extensive experiments across four SLMs (1.5B-7B) and seven math reasoning tasks demonstrate the effectiveness of rStar-Math. Notably, rStar-Math improves all four SLMs, matching or even surpassing OpenAI o1 on challenging math benchmarks. On the MATH benchmark, rStar-Math boosts Qwen2.5-Math-7B from 58.8\% to 90.0\% and Phi3-mini-3.8B from 41.4\% to 86.4\%, outperforming o1-preview by +4.5\% and +0.9\%. On the Olympiad-level AIME 2024, rStar-Math solves an average of 53.3\% (8/15) of the problems, exceeding o1-preview by 8.7\% and all other open-sourced LLMs (Table 1).

Ablation Studies

Ablation studies validate the superiority of step-by-step verified reasoning trajectories over state-of-the-art data synthesis baselines, as well as the PPM's effectiveness compared to outcome reward models and Q-value-based PRMs. These studies confirm that the three proposed innovations contribute significantly to the overall performance of rStar-Math.

Intrinsic Self-Reflection

One unexpected finding is the emergence of intrinsic self-reflection capability within the MCTS-driven deep thinking process (Figure 3). The model can recognize errors and self-correct with a correct answer, even without explicit self-reflection training data or prompts. This suggests that advanced System 2 reasoning can foster intrinsic self-reflection.

Figure 3: An example illustrating rStar-Math's ability to recognize low-quality reasoning steps and backtrack to simpler approaches.

Scaling Test-Time Compute

Increasing test-time computation improves reasoning accuracy across all benchmarks, though with varying trends (Figure 4). On Math, AIME, and Olympiad Bench, rStar-Math shows saturation or slow improvement at 64 trajectories, while on College Math, performance continues to improve steadily.

Figure 4: Showing the trends in reasoning performance when scaling up the test-time compute.

PPM's Influence

Experiments indicate that the PPM is the key determinant of the upper performance limit in System 2 deep thinking. The PPM shapes the reasoning boundary in System 2 deep thinking (Figure 5), effectively identifying critical theorem-application intermediate steps within the policy model's deep thinking process. These steps are predicted with high reward scores, guiding the policy model to generate the correct solution.

Figure 5: A visual showing that reward models primarily determine the final performance.

Implications and Future Directions

The rStar-Math framework presents a compelling approach for enhancing the mathematical reasoning capabilities of SLMs, achieving performance comparable to or even surpassing larger models like OpenAI's o1. The innovations in data synthesis, reward modeling, and self-evolution offer valuable insights for training SLMs in other complex reasoning tasks. Future research could explore generalizing rStar-Math to other domains, such as code and commonsense reasoning, and further improving its ability to solve challenging math problems by incorporating external knowledge and tools.

Conclusion

rStar-Math demonstrates that small LLMs can achieve state-of-the-art performance in mathematical reasoning through a self-evolved deep thinking approach. The paper's key contributions include a novel code-augmented CoT data synthesis method, a process preference model training approach, and a self-evolution recipe for iteratively improving reasoning capabilities. The experimental results and analysis highlight the effectiveness of rStar-Math and provide valuable insights for training SLMs in complex reasoning tasks.

Markdown