
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking (2501.04519v1)

Published 8 Jan 2025 in cs.CL

Abstract: We present rStar-Math to demonstrate that small LLMs (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids naïve step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students. Code and data will be available at https://github.com/microsoft/rStar.

The paper introduces a self-evolutionary framework for mathematical reasoning in small LLMs (SLMs), demonstrating that with carefully designed search and reward techniques, models with as few as 1.5–7 billion parameters can achieve performance that rivals or even exceeds that of much larger systems. The methodology centers on combining Monte Carlo Tree Search (MCTS) with a code-augmented chain-of-thought (CoT) generation process and a novel process preference model (PPM) to iteratively improve the quality of generated reasoning trajectories.

Key Contributions and Methodology

  • Code-Augmented CoT Synthesis:

The approach augments standard natural language chain-of-thought reasoning with corresponding Python code snippets. For each single-step reasoning output, the SLM produces both a natural language explanation (embedded as a comment) and executable code. Successful execution serves as a verification step that filters out spurious or hallucinated intermediate steps, ensuring that only trajectories with correctly computed sub-results are used for training.
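To make this concrete, below is a hypothetical example of a single code-augmented step: the natural-language reasoning appears as a comment and the computation is executable Python, so the step is kept only if the code runs successfully. The problem and variable names are illustrative, not taken from the paper's training data.

```python
# Hypothetical code-augmented CoT step for the problem:
# "A train travels 120 km in 1.5 hours; how far does it travel in 4 hours at the same speed?"

# Step 2: Compute the train's speed, then scale it to 4 hours.
speed_km_per_h = 120 / 1.5          # 80 km/h
distance_4h = speed_km_per_h * 4    # 320 km

# The step is retained only if this code executes without error; the computed
# sub-result is carried forward into the next reasoning step.
assert distance_4h == 320
```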

  • Monte Carlo Tree Search for Stepwise Exploration:
    • Candidate steps are expanded and selected with a UCT-style criterion, $\mathrm{UCT}(s) = \frac{Q(s)}{N(s)} + c\,\sqrt{\frac{\ln N_{\mathrm{parent}}(s)}{N(s)}}$, where:
    • $Q(s)$ is formed from the cumulative process reward,
    • $N(s)$ denotes the visit count of node $s$,
    • $c$ is a tunable constant balancing exploration and exploitation.
    • This stepwise exploration not only guides the search towards promising reasoning paths but also generates rich annotated data for subsequent training rounds (a minimal selection sketch follows this list).
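As a concrete illustration of the selection rule, the following is a minimal Python sketch of UCT-based child selection over reasoning-step nodes. The `Node` fields, the default exploration constant, and the unvisited-child handling are illustrative assumptions, not the paper's implementation.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    step_text: str                       # one reasoning step (NL comment + code)
    q: float = 0.0                       # cumulative process reward Q(s)
    visits: int = 0                      # visit count N(s)
    children: list["Node"] = field(default_factory=list)


def uct_score(child: Node, parent_visits: int, c: float = 2.0) -> float:
    """Average reward (exploitation) plus an exploration bonus."""
    if child.visits == 0:
        return float("inf")              # expand unvisited children first
    exploit = child.q / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore


def select_child(parent: Node, c: float = 2.0) -> Node:
    # Pick the child maximizing the UCT score at this expansion step.
    return max(parent.children, key=lambda ch: uct_score(ch, parent.visits, c))
```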
  • Process Preference Model (PPM):
    • The PPM is trained with a pairwise ranking loss of the form $\mathcal{L}_{\mathrm{PPM}} = -\frac{1}{K}\sum_{(y^{+},\,y^{-})} \log \sigma\big(r_\theta(x, y^{+}) - r_\theta(x, y^{-})\big)$, where:
    • $r_\theta(x,y)$ represents the scalar reward prediction from the PPM for a trajectory,
    • $\sigma$ is the sigmoid function,
    • and $K$ is a normalization constant over the preference pairs.
    • This design circumvents the imprecision inherent in step-level scalar annotations and provides denser, pairwise supervision (a minimal loss sketch follows this list).
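The pairwise objective above can be sketched in a few lines of PyTorch; tensor shapes and names are illustrative assumptions rather than the paper's training code.

```python
import torch
import torch.nn.functional as F


def ppm_pairwise_loss(r_pos: torch.Tensor, r_neg: torch.Tensor) -> torch.Tensor:
    """r_pos / r_neg: PPM reward predictions r_theta(x, y) for preferred and
    dispreferred trajectories of the same problem, each of shape (K,)."""
    # -1/K * sum log sigmoid(r_pos - r_neg); mean() supplies the 1/K factor.
    return -F.logsigmoid(r_pos - r_neg).mean()


# Toy usage with dummy reward scores.
loss = ppm_pairwise_loss(torch.tensor([1.2, 0.7, 0.9]),
                         torch.tensor([0.1, 0.4, -0.3]))
print(loss)  # scalar loss tensor
```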
  • Self-Evolution Through Multi-Round Training:

The system is iteratively improved over four rounds, with the SLM and PPM refined in each round on increasingly challenging math problems. In the initial rounds, a terminal-guided annotation method is employed: terminal nodes receive a binary score (correct answer: 1, incorrect: −1), which is backpropagated through the earlier steps (a minimal sketch of this update follows below). In later rounds, the improved PPM is used to initialize the Q-value estimates, enabling more reliable updates during MCTS back-propagation. This iterative self-improvement both raises the quality and expands the diversity of the generated trajectories, allowing the SLM to progressively handle problems ranging from grade-school to Olympiad level.
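The terminal-guided annotation used in the early rounds can be pictured as a simple back-propagation of the terminal score along the rollout path; the update below reuses the `Node` class from the earlier UCT sketch and is an illustrative assumption, not the paper's exact bookkeeping.

```python
def backpropagate(path: list, final_answer_correct: bool) -> None:
    """Assign +1/-1 at the terminal node and propagate it to every step on the path."""
    reward = 1.0 if final_answer_correct else -1.0
    for node in path:        # root-to-terminal list of Node objects
        node.visits += 1     # N(s) <- N(s) + 1
        node.q += reward     # Q(s) accumulates the terminal reward; Q(s)/N(s) is the mean
```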

Experimental Results and Analysis

  • Performance Gains on Standard Benchmarks:

The paper evaluates the framework on diverse benchmarks including MATH, AIME 2024, AMC 2023, Olympiad Bench, College Math, GSM8K, and Gaokao En. For instance, one of the math-specialized base models improves its Pass@1 accuracy on the MATH benchmark from approximately 58.8% to 90.0% after applying rStar-Math. Comparable improvements are observed across the other tasks, and on competition-level problems the system solves around 53.3% of AIME problems, a score placing it roughly among the top 20% of high school competitors.

  • Impact of Test-Time Compute:

By scaling the number of MCTS rollouts (with experiments reported using up to 64 trajectories), the framework shows that increased test-time compute results in further accuracy gains. However, the improvements saturate on certain benchmarks (e.g., MATH and Olympiad Bench) while continuing to yield benefit on others (e.g., College Math). This indicates that the system effectively balances the tradeoff between deep sample exploration and computational cost.
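One simple way to picture this use of extra test-time compute is best-of-N selection: sample many complete trajectories and keep the one the PPM scores highest. The sketch below uses hypothetical `generate_trajectory` and `ppm_score` stand-ins for the policy SLM's rollout and the PPM; the paper's actual search is MCTS-guided rather than independent sampling.

```python
from typing import Callable, List


def best_of_n(problem: str,
              generate_trajectory: Callable[[str], str],
              ppm_score: Callable[[str, str], float],
              n_rollouts: int = 64) -> str:
    """Spend more compute by drawing more rollouts, then return the PPM-preferred one."""
    trajectories: List[str] = [generate_trajectory(problem) for _ in range(n_rollouts)]
    return max(trajectories, key=lambda t: ppm_score(problem, t))
```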

  • Ablation Studies and Comparative Analyses:
    • The use of code execution as a verification mechanism provides denser supervision and outperforms baselines that rely solely on random or rejection sampling of reasoning trajectories.
    • The PPM trained via pairwise ranking consistently outperforms alternatives that use either best-of-N selection with traditional outcome reward models or models trained directly on noisy Q-values using regression losses.
    • Iterative self-evolution is critical, with each round delivering significant incremental improvements, highlighting the efficacy of reinforcing early reasoning steps through retrospective evaluation.
  • Emergence of Self-Reflection:

A particularly interesting observation from qualitative analysis is that the system begins to exhibit intrinsic self-reflection. In some search trajectories, the policy model, without any explicit self-reflection training, appears to backtrack when encountering low-quality reasoning steps and subsequently generates an alternative solution path. This emergent behavior suggests that the integration of MCTS with step-level verification encourages internal error detection and dynamic planning.

Concluding Remarks

The paper provides a comprehensive study of how relatively small LLMs can be enhanced through a reinforcement-inspired, self-evolutionary framework. By leveraging MCTS for systematic exploration, integrating a robust code-augmented verification mechanism, and training a dedicated process preference model, the approach sets a new state of the art in mathematical reasoning across multiple benchmarks. The methodology underscores the importance of explicit stepwise verification and preference-based supervision in complex reasoning tasks, paving the way for future research in making systematic deep reasoning approaches more accessible and efficient in SLMs.

Authors (8)
  1. Xinyu Guan (10 papers)
  2. Li Lyna Zhang (20 papers)
  3. Yifei Liu (43 papers)
  4. Ning Shang (8 papers)
  5. Youran Sun (10 papers)
  6. Yi Zhu (233 papers)
  7. Fan Yang (878 papers)
  8. Mao Yang (62 papers)