The paper introduces rStar-Math, a self-evolutionary framework for mathematical reasoning in small language models (SLMs), demonstrating that with carefully designed search and reward techniques, models in the 1.5–7 billion parameter range can rival or even exceed much larger systems. The methodology combines Monte Carlo Tree Search (MCTS) with a code-augmented chain-of-thought (CoT) generation process and a novel process preference model (PPM) to iteratively improve the quality of generated reasoning trajectories.
Key Contributions and Methodology
- Code-Augmented CoT Synthesis:
The approach augments standard natural-language chain-of-thought reasoning with corresponding Python code snippets. For each reasoning step, the SLM produces both a natural-language explanation (embedded as a comment) and executable code. Successful execution serves as a verification step that filters out spurious or hallucinated intermediate steps, so that only trajectories whose sub-results are correctly computed are used for training (a minimal filtering sketch appears after this list).
- Monte Carlo Tree Search for Stepwise Exploration:
- At each reasoning step, candidate next steps are expanded as tree nodes and selected according to the standard UCT criterion
  $$\mathrm{UCT}(s) = \frac{Q(s)}{N(s)} + c\,\sqrt{\frac{\ln N_{\text{parent}}(s)}{N(s)}},$$
  where:
  - $Q(s)$ is formed from the cumulative process reward of node $s$,
  - $N(s)$ denotes the visit count of node $s$,
  - $c$ is a tunable constant balancing exploration and exploitation (a small selection sketch in Python follows this list).
- This stepwise exploration not only guides the search towards promising reasoning paths but also generates rich annotated data for subsequent training rounds.
- Process Preference Model (PPM):
- The PPM is trained with a pairwise ranking loss over preferred and dispreferred reasoning trajectories constructed from the MCTS statistics,
  $$\mathcal{L}_{\text{PPM}} = -\frac{1}{Z}\sum_{(x^{+},\,x^{-})}\log \sigma\big(r_\theta(x^{+}) - r_\theta(x^{-})\big),$$
  where:
  - $r_\theta(\cdot)$ represents the scalar reward prediction from the PPM for a trajectory,
  - $\sigma$ is the sigmoid function,
  - and $Z$ is a normalization constant (a minimal loss sketch in Python follows this list).
- This design circumvents the imprecision inherent in step-level scalar annotations and provides denser, pairwise supervision.
- Self-Evolution Through Multi-Round Training:
The system is iteratively improved over four rounds, in each of which the SLM and PPM are refined on increasingly challenging math problems. During the initial rounds, a terminal-guided annotation method is employed: terminal nodes receive a binary score (1 for a correct final answer, −1 otherwise), and this signal is backpropagated through the earlier steps of the trajectory (a short backpropagation sketch appears after this list). In later rounds, the improved PPM is used to initialize the Q-value estimates, enabling more reliable updates during MCTS backpropagation. This iterative self-improvement not only expands the quality and diversity of the generated trajectories but also allows the SLM to progressively handle problems ranging from grade-school to Olympiad level.
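To make the code-augmented CoT item above concrete, here is a minimal Python sketch of execution-based step filtering: a candidate step (natural-language comment plus code) is kept only if it runs without error. The helper name, the example step, and the shared-state handling are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of execution-based filtering for a code-augmented CoT step.
# The candidate step below is illustrative; in the paper's setup the SLM emits
# a natural-language explanation as a Python comment followed by executable code.

def step_executes_cleanly(code: str, state: dict) -> bool:
    """Run a candidate step in the accumulated execution state.
    Keep the step only if it executes without raising an exception."""
    scratch = dict(state)  # do not mutate the trajectory state on failure
    try:
        exec(code, scratch)
    except Exception:
        return False
    state.update({k: v for k, v in scratch.items() if k != "__builtins__"})
    return True

trajectory_state: dict = {}
candidate_step = """
# Step 1: compute the discriminant of x^2 - 5x + 6
a, b, c = 1, -5, 6
discriminant = b**2 - 4*a*c
"""

if step_executes_cleanly(candidate_step, trajectory_state):
    print("step kept, discriminant =", trajectory_state["discriminant"])
else:
    print("step discarded")
```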
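The next sketch illustrates the UCT-based child selection described in the MCTS item above. The `Node` fields and the exploration constant `c = 1.4` are assumptions for illustration; the paper's exact bookkeeping may differ.

```python
import math
from dataclasses import dataclass, field

# Illustrative sketch of UCT-based child selection; field names and the
# exploration constant are assumptions, not the paper's exact code.

@dataclass
class Node:
    q: float = 0.0          # cumulative process reward Q(s)
    visits: int = 0         # visit count N(s)
    children: list = field(default_factory=list)

def uct_score(child: Node, parent_visits: int, c: float = 1.4) -> float:
    if child.visits == 0:
        return float("inf")  # always try unvisited children first
    exploit = child.q / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore

def select_child(parent: Node) -> Node:
    """Pick the child with the highest UCT score."""
    return max(parent.children, key=lambda ch: uct_score(ch, parent.visits))
```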
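Below is a minimal PyTorch sketch of the pairwise ranking objective given in the PPM item. The toy reward tensors are placeholders; in practice the scores would come from the PPM head evaluating preferred versus dispreferred candidates for the same problem, and averaging over all pairs plays the role of the normalization constant.

```python
import torch
import torch.nn.functional as F

# Sketch of a Bradley-Terry style pairwise ranking loss, as described above.
# r_pos / r_neg are placeholder scalar rewards; in practice they would be
# produced by the PPM for preferred and dispreferred candidates.

def pairwise_ranking_loss(r_pos: torch.Tensor, r_neg: torch.Tensor) -> torch.Tensor:
    """-mean log sigmoid(r_pos - r_neg) over all preferred/dispreferred pairs."""
    return -F.logsigmoid(r_pos.unsqueeze(1) - r_neg.unsqueeze(0)).mean()

# Toy usage: 2 preferred and 3 dispreferred scores for one problem.
r_pos = torch.tensor([1.2, 0.8])
r_neg = torch.tensor([-0.5, 0.1, -1.0])
print(pairwise_ranking_loss(r_pos, r_neg).item())
```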
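Finally, a short sketch of the terminal-guided annotation used in the early self-evolution rounds: the terminal node's binary score is pushed back along the sampled trajectory, updating each step's cumulative reward and visit count. The `StepNode` class and its field names are illustrative.

```python
from dataclasses import dataclass

# Sketch of terminal-guided annotation: the terminal node receives a binary
# score (+1 correct, -1 incorrect) that is backpropagated along the trajectory.

@dataclass
class StepNode:
    q: float = 0.0      # cumulative process reward Q(s)
    visits: int = 0     # visit count N(s)

def backpropagate(path: list[StepNode], final_answer_correct: bool) -> None:
    reward = 1.0 if final_answer_correct else -1.0
    for node in reversed(path):   # from the terminal step back to the root
        node.visits += 1
        node.q += reward

# Toy usage: a three-step trajectory whose final answer was verified correct.
path = [StepNode(), StepNode(), StepNode()]
backpropagate(path, final_answer_correct=True)
print([node.q for node in path])   # [1.0, 1.0, 1.0]
```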
Experimental Results and Analysis
- Performance Gains on Standard Benchmarks:
The paper evaluates the framework on diverse benchmarks including MATH, AIME 2024, AMC 2023, Olympiad Bench, College Math, GSM8K, and Gaokao En. For instance, one of the math-specialized base models improved its Pass@1 accuracy on the MATH benchmark from approximately 58.8% to 90.0% after applying rStar-Math. Comparable improvements are observed across the other tasks, and the system ranks among the strongest reported results on competition-level benchmarks, solving around 53.3% of AIME problems, a level comparable to the top 20% of high-school math competitors.
- Impact of Test-Time Compute:
By scaling the number of MCTS rollouts (experiments report up to 64 sampled trajectories), the framework shows that increased test-time compute yields further accuracy gains. The improvements saturate on some benchmarks (e.g., MATH and Olympiad Bench) while continuing to pay off on others (e.g., College Math), indicating that the system effectively balances deeper sample exploration against computational cost (a minimal best-of-N selection sketch appears after this list).
- Ablation Studies and Comparative Analyses:
- The use of code execution as a verification mechanism provides denser supervision and outperforms baselines that rely solely on random or rejection sampling of reasoning trajectories.
- The PPM trained via pairwise ranking consistently outperforms alternatives that use either best-of-N selection with traditional outcome reward models or models trained directly on noisy Q-values using regression losses.
- Iterative self-evolution is critical, with each round delivering significant incremental improvements, highlighting the efficacy of reinforcing early reasoning steps through retrospective evaluation.
- Emergence of Self-Reflection:
A particularly interesting observation from qualitative analysis is that the system begins to exhibit intrinsic self-reflection. In some search trajectories, the policy model, without any explicit self-reflection training, appears to backtrack when encountering low-quality reasoning steps and subsequently generates an alternative solution path. This emergent behavior suggests that the integration of MCTS with step-level verification encourages internal error detection and dynamic planning.
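As referenced in the test-time compute item above, here is a minimal sketch of spending more inference compute by sampling several candidate trajectories and keeping the one the reward model prefers. `sample_trajectory` and `score_trajectory` are hypothetical stand-ins for an MCTS rollout and the PPM scorer; they are not the paper's API.

```python
import random
from typing import Callable, List

# Minimal sketch of trading test-time compute for accuracy: sample several
# candidate trajectories and keep the one the reward model scores highest.

def best_of_n(problem: str,
              sample_trajectory: Callable[[str], str],
              score_trajectory: Callable[[str], float],
              n: int = 64) -> str:
    candidates: List[str] = [sample_trajectory(problem) for _ in range(n)]
    return max(candidates, key=score_trajectory)

# Toy usage with stub samplers so the sketch runs end to end.
answer = best_of_n(
    "What is 2 + 2?",
    sample_trajectory=lambda p: f"answer draft {random.randint(0, 9)}",
    score_trajectory=lambda t: random.random(),
    n=8,
)
print(answer)
```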
Concluding Remarks
The paper provides a comprehensive account of how relatively small LLMs can be enhanced through a reinforcement-inspired, self-evolutionary framework. By leveraging MCTS for systematic exploration, integrating a robust code-augmented verification mechanism, and training a dedicated process preference model, the approach sets a new state of the art in mathematical reasoning across multiple benchmarks. The methodology underscores the importance of explicit stepwise verification and preference-based supervision in complex reasoning tasks, paving the way for future research into making systematic deep reasoning more accessible and efficient in SLMs.