rStar-Math: Self-Evolving SLMs
- rStar-Math is a self-evolving system that trains small language models to achieve advanced mathematical reasoning using step-by-step verified chains-of-thought and Monte Carlo Tree Search.
- The approach integrates code-augmented verification with a cyclic self-improvement loop in which a policy model and a process preference model co-evolve over millions of synthesized, step-verified solutions to a large corpus of math problems.
- Experimental results demonstrate significant performance improvements on benchmarks like MATH and AIME, outperforming larger closed-source models without distillation.
rStar-Math is a self-evolving system for training small language models (SLMs) to reach state-of-the-art mathematical reasoning performance, on par with or superior to larger closed-source models, without leveraging distillation from stronger LLMs. The approach centers on step-by-step verified chain-of-thought data generation using Monte Carlo Tree Search (MCTS) guided by a jointly evolved process reward model (the Process Preference Model, PPM), and a self-improvement loop in which both the policy model and the reward model co-evolve over multiple training rounds via millions of newly synthesized, step-verified mathematical solutions (Guan et al., 8 Jan 2025).
1. System Architecture and Training Pipeline
rStar-Math employs two core SLM components:
- Policy SLM: Generates multi-step reasoning trajectories that blend natural language explanations with verified code segments to solve math problems.
- Process Preference Model (PPM): Provides process-level scoring of reasoning steps based on their future promise and correctness, enabling high-quality step selection during trajectory rollout and as a source of reward signals.
The training protocol is an iterative self-evolution loop:
- Initial Bootstrapping: A large external model (e.g., DeepSeek-Coder-V2-Instruct, 236B) bootstraps the first round by generating step-verified solutions; subsequent rounds are bootstrapped by the strongest policy SLM trained so far.
- Code-augmented CoT Data Synthesis: Each problem is decomposed using MCTS, with candidate reasoning/code steps generated and only those verified by successful code execution and downstream correctness of the full trajectory retained.
- PPM Training: Rather than relying on noisy Q-value regression labels or absolute scores, the PPM is trained on pairwise preferences: "positive" steps are those frequently present in successful rollouts or correct solution paths, while "negative" steps are not. The training loss is a Bradley-Terry (logistic/expit) pairwise ranking objective (a minimal sketch follows this list):

$$\mathcal{L}_{\mathrm{PPM}}(\theta) = -\,\mathbb{E}_{(x,\,y^{\mathrm{pos}},\,y^{\mathrm{neg}})}\left[\log \sigma\!\left(r_\theta(x, y^{\mathrm{pos}}) - r_\theta(x, y^{\mathrm{neg}})\right)\right],$$

where $r_\theta(x, y)$ is the PPM's scalar output for input $(x, y)$, with $x$ the current reasoning context (problem statement plus preceding steps), $y$ a candidate next step, and $\sigma$ the logistic (expit) function.
- Self-Evolution: Over four rounds, the system uses the current SLM+PPM pair to generate increasingly high-quality step-verified solution data, retraining both components at each round. The top-2 trajectories per problem, ranked by Q-value, are selected for supervised fine-tuning (see the selection sketch below).
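A minimal PyTorch-style sketch of the pairwise preference objective above; the function name, batch layout, and the way scores are produced here are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def ppm_pairwise_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss over (positive, negative) step pairs.

    pos_scores: PPM outputs r_theta(x, y_pos) for preferred steps, shape (B,)
    neg_scores: PPM outputs r_theta(x, y_neg) for dispreferred steps, shape (B,)
    """
    # -log sigma(r_pos - r_neg), averaged over all sampled pairs
    return -F.logsigmoid(pos_scores - neg_scores).mean()

# Illustrative usage with random scores standing in for real PPM outputs
pos = torch.randn(8, requires_grad=True)
neg = torch.randn(8, requires_grad=True)
loss = ppm_pairwise_loss(pos, neg)
loss.backward()
```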
The outcome is a corpus of >3.6M stepwise-verified solutions covering over 90% of 747k diverse math problems, together with policy SLMs and PPMs that grow markedly more capable after each round.
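The top-2 Q-value trajectory selection used to assemble this fine-tuning corpus might look like the following sketch; the trajectory record layout and the ranking by average per-step Q-value are assumptions made for illustration.

```python
def select_sft_trajectories(trajectories: list[dict], top_k: int = 2) -> list[dict]:
    """Keep the top-k correct trajectories for one problem, ranked by the
    average Q-value of their steps, as supervised fine-tuning data.

    Each trajectory is assumed to carry `correct` (bool) and `step_q`
    (a list of per-step Q-values); this record layout is illustrative.
    """
    correct = [t for t in trajectories if t["correct"] and t["step_q"]]
    ranked = sorted(correct,
                    key=lambda t: sum(t["step_q"]) / len(t["step_q"]),
                    reverse=True)
    return ranked[:top_k]
```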
2. Monte Carlo Tree Search and Deep Reasoning
MCTS forms the backbone of rStar-Math’s "System 2" deep reasoning, enabling exploration of the reasoning search space far beyond standard greedy or stochastic decoding. At each expansion step:
- Node selection is governed by UCT (Upper Confidence bounds applied to Trees):

$$\mathrm{UCT}(s) = \frac{Q(s)}{N(s)} + c\,\sqrt{\frac{\ln N_{\mathrm{parent}}(s)}{N(s)}},$$

where $Q(s)$ is the aggregate reward for node $s$, $N(s)$ its visit count, and $c$ a tunable exploration constant.
- Candidate next steps are generated with code, scored by the PPM, and only high-reward candidates are further expanded.
- Rewards are back-propagated through the tree so that steps recurring in successful paths are favored.
Paired with the PPM, MCTS steers the solution search precisely, allocating inference compute efficiently toward the trajectories most likely to be correct.
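A self-contained sketch of UCT-based node selection as described above; the `Node` structure and the exploration constant are illustrative choices rather than the paper's exact implementation.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    """One candidate reasoning step in the search tree (illustrative layout)."""
    q: float = 0.0                       # aggregate reward Q(s) from back-propagation
    n: int = 0                           # visit count N(s)
    children: list["Node"] = field(default_factory=list)

def uct(child: Node, parent_visits: int, c: float = 2.0) -> float:
    """UCT(s) = Q(s)/N(s) + c * sqrt(ln N_parent(s) / N(s)); unvisited nodes go first."""
    if child.n == 0:
        return float("inf")
    return child.q / child.n + c * math.sqrt(math.log(parent_visits) / child.n)

def select_child(parent: Node) -> Node:
    """Descend to the child with the highest UCT score."""
    return max(parent.children, key=lambda ch: uct(ch, max(parent.n, 1)))
```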
3. Code-Augmented Stepwise Verification
Each reasoning step generated by the policy SLM is paired with an explicitly generated code segment. A solution advances only if code execution succeeds and yields an intermediate or final value consistent with the reasoning up to that point. This supervisory signal:
- Reduces propagation of hallucinated or logically inconsistent steps.
- Ensures that intermediate mathematical results are grounded in computational validation, not LLM confidence.
This dense, automated verification throughout the solution trajectory—unlike outcome-only correctness or random sampling—enables fine-grained supervision and curation of extremely clean training data.
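A deliberately simplified sketch of this code-execution check, assuming each step's code segment binds its result to a variable named `answer`; sandboxing, timeouts, and the exact consistency check against the natural-language reasoning are omitted.

```python
def verify_step(step_code: str) -> tuple[bool, object]:
    """Accept a candidate step only if its Python snippet executes without
    error; return the value it binds to `answer`, if any.

    The `answer` convention is an illustrative assumption; a production
    version would sandbox and time-limit the execution.
    """
    scratch: dict = {}
    try:
        exec(step_code, scratch)          # run the step's code segment
    except Exception:
        return False, None                # reject steps whose code fails
    return True, scratch.get("answer")

# Illustrative usage: an accepted step and a rejected one
ok, val = verify_step("import math\nanswer = math.comb(10, 3)")
bad, _ = verify_step("answer = 1 / 0")
print(ok, val, bad)                       # True 120 False
```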
4. Experimental Results and Performance
rStar-Math is evaluated on major math reasoning benchmarks, notably MATH, AIME, OlympiadBench, College Math, and GSM8K. The most salient results include:
| Model | MATH | AIME-2024 | OlympiadBench | College Math |
|---|---|---|---|---|
| Qwen2.5-Math-7B (base) | 58.8% | – | – | – |
| Qwen2.5-Math-7B + rStar-Math | 90.0% | 53.3% | 65.6% | 60.5% |
| Phi3-mini-3.8B + rStar-Math | 86.4% | 43.3% | 60.3% | 59.1% |
| o1-preview | 85.5% | 44.6% | – | – |
- Qwen2.5-Math-7B + rStar-Math achieves a 31.2-point improvement over its baseline on MATH and surpasses o1-preview by 4.5 points.
- On AIME-2024, rStar-Math's 53.3% surpasses o1-preview and is approached or exceeded only by the very strongest proprietary models (e.g., o1-mini); this level corresponds to roughly the top 20% of high school Math Olympiad contestants.
- Notably, Qwen2.5-Math-1.5B + rStar-Math achieves 88.6% on MATH, showing that the gains hold even at very small parameter counts.
Ablation results show that training on code-verified trajectories dramatically outperforms training on data distilled solely from GPT-4-produced chains-of-thought. The PPM is a necessary ingredient: outcome-only reward models and naive Q-value regression are consistently weaker, as is Best-of-N sampling at inference, particularly on the most challenging problems.
5. Mechanisms for Generalization and Robust Reasoning
Several design features enable robust generalization and deep error correction:
- The PPM, trained on dense preference signals, identifies key mathematical structure and correctly rewards application of relevant theorems.
- MCTS allows the policy SLM to backtrack and revise steps, promoting solutions that include error correction within the trajectory—behavior rare in open-weight LLMs prior to rStar-Math.
- The methodology transfers high reasoning accuracy to previously unseen math domains and related tasks (e.g., theorem proving, code reasoning, and country-specific exam benchmarks).
Scalability analyses indicate that performance saturates near 64 inference trajectories per problem, with rStar-Math consistently outperforming Best-of-N approaches even when compute budgets are matched.
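A hedged sketch of the inference-time aggregation implied here: run several MCTS trajectories per problem and keep the answer from the one the PPM scores highest. The `run_trajectory` callable and its (answer, score) return value are assumptions for illustration.

```python
from typing import Callable, Optional, Tuple

def select_answer(problem: str,
                  run_trajectory: Callable[[str], Tuple[str, float]],
                  num_trajectories: int = 64) -> Optional[str]:
    """Run up to `num_trajectories` rollouts for one problem and return the
    final answer of the trajectory with the highest PPM score; the default
    reflects the ~64-trajectory saturation point noted above.
    """
    best_answer, best_score = None, float("-inf")
    for _ in range(num_trajectories):
        answer, score = run_trajectory(problem)
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer
```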
6. Significance, Limitations, and Release
rStar-Math demonstrates that dense, code-executed supervision, preference-based process rewards, and stepwise-verifiable data synthesis allow SLMs to match or outperform much larger proprietary models in mathematical reasoning, without distillation or external solution labels. The approach is intended to be fully open, with code and data slated for release at https://github.com/microsoft/rStar.
A limitation is higher inference cost due to deep MCTS rollouts, but this is mitigated by parallelization and the approach’s efficiency in difficult regimes. Further potential exists in integrating symbolic tools, automated theorem checkers, or iterative self-improvement via newly generated high-quality training data.
7. Core Algorithmic Components (Summary Table)
| Component | Description |
|---|---|
| Policy SLM | Generates reasoning trajectories; trained on verified MCTS rollouts |
| Process Preference Model | Assigns pairwise-preference rewards to trajectory steps |
| MCTS | Tree search for high-quality solutions using UCT and Q-value propagation |
| Stepwise Verification | Filters trajectory steps by code execution and reward score |
| Self-Evolution | Four rounds of cyclically improving SLMs and reward models |
| Data Source | 747k diverse problems; >3.6M code-verified solutions synthesized |
rStar-Math establishes a new regime for mathematical LLMs, in which iterative self-improvement, fine-grained process supervision, and verified step-by-step reasoning allow small open models to attain and surpass the mathematical reasoning capabilities of previously leading LLMs without resorting to scale or distillation (Guan et al., 8 Jan 2025).