RMaxTS: Intrinsic Reward MCTS for Theorem Proving

Updated 4 November 2025
  • RMaxTS is an intrinsic-reward-driven variant of Monte Carlo Tree Search that incentivizes exploration through novel proof states in formal theorem proving.
  • The method integrates stepwise proof completion with whole-proof generation and uses a discounted UCB strategy to prioritize new, diverse proof paths.
  • Empirical results in DeepSeek-Prover-V1.5 demonstrate significant performance improvements over prior state-of-the-art methods on benchmarks like miniF2F and ProofNet.

RMaxTS is an intrinsic-reward-driven variant of Monte Carlo Tree Search (MCTS) designed for automated theorem proving with LLMs, notably implemented in DeepSeek-Prover-V1.5 for Lean 4. The method departs from classical proof search strategies by awarding intrinsic rewards for reaching novel proof states, thereby diversifying solution paths and mitigating the extreme reward sparsity endemic to formal theorem proving. RMaxTS unifies stepwise proof completion with whole-proof generation, and operates on top of an RL-trained model as a feedback-driven, incremental proof search protocol in the inference phase.

1. Concept and Motivation

RMaxTS addresses a core challenge in neural theorem proving: the sparsity of extrinsic rewards (proof correctness) during search. In classical RL or search paradigms applied to proof synthesis, positive reward is typically only received when a complete, theorem-proving sequence is generated and accepted by the proof assistant—an event that is rare and hinders efficient optimization or search. RMaxTS augments MCTS with a form of "optimistic" intrinsic reward: any expansion of a node leading to a previously unseen proof state yields a positive signal, regardless of whether the sequence ultimately constitutes a full correct proof. This encourages search trajectories that venture into unexplored regions of the proof space, counteracting the tendency to repeatedly attempt similar, nearby, or trivial variants.

2. Algorithmic Formalism

At a high level, the proof search tree is organized at the tactic level, with each node corresponding to an intermediate Lean state (the current open subgoal and proof context), and edges representing extension by a model-generated proof step (tactic).
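As a purely illustrative picture of this tree, a node might be represented as in the following minimal Python sketch; the `ProofSearchNode` class and its field names are assumptions for exposition, not the DeepSeek-Prover-V1.5 implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class ProofSearchNode:
    """Illustrative tactic-level search node (names are assumptions, not the paper's code)."""
    state: str                                    # pretty-printed Lean tactic state (open goals + context)
    proof_prefix: str                             # verified Lean proof text that reaches this state
    parent: Optional["ProofSearchNode"] = None
    children: Dict[str, "ProofSearchNode"] = field(default_factory=dict)  # tactic text -> child node
```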

The selection, expansion, and backpropagation steps are as follows:

  • Selection:

The search proceeds by traversing the tree from the root, at each state $s$ selecting the child action $a$ that maximizes a UCB-style objective:

$$\text{TreePolicy}(s) = \arg\max_{a \in \text{Children}(s) \cup \{\oslash\}} Q_{\text{UCB}}(s, a)$$

where

$$Q_{\text{UCB}}(s, a) = Q(s, a) + \text{UCB}(s, a),$$

$Q(s, a)$ is the mean value of action $a$ at node $s$, and $\text{UCB}(s, a)$ is an upper-confidence bound term.

  • Expansion:

Upon expansion, the model resumes proof generation from a verified proof prefix, progressing until the next verification error or proof completion.

  • Intrinsic Reward (RMax):

The intrinsic reward assigned to a transition $\tau$ is:

$$R_{\text{intrinsic}}(\tau) = \mathbb{I}[\text{at least one new node was added to the tree}]$$

i.e., reward 1 if a new proof state is created for the first time, and 0 otherwise. Here, $\mathbb{I}$ is the indicator function.

  • Backpropagation (DUCB):

To address the non-stationarity of intrinsic reward (as new states become harder to find over the search), RMaxTS uses a discounted UCB statistic:

$$Q_{\text{DUCB}}(s, a) = \frac{W_\gamma(s,a)}{N_\gamma(s,a)} + \sqrt{\frac{2\ln \sum_{a'} N_\gamma(s,a')}{N_\gamma(s,a)}}$$

where $W_\gamma$ and $N_\gamma$ are the exponentially discounted cumulative reward and visit count with discount factor $\gamma = 0.99$. This increases the recency sensitivity of the score estimate, favoring actions that have recently yielded intrinsic reward (a minimal code sketch of this bookkeeping appears after this list).

  • Parallelization:

RMaxTS is designed to scale: multiple independent search "runners" execute in parallel, sharing the search tree and using the "virtual loss trick" to avoid redundant expansions.
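To make the selection, intrinsic-reward, and backpropagation rules above concrete, here is a minimal, self-contained Python sketch of discounted-UCB bookkeeping with an RMax-style novelty reward. The per-edge dictionaries and the function names (`ducb_score`, `select`, `backpropagate`) are illustrative assumptions, not the DeepSeek-Prover-V1.5 implementation.

```python
import math
from collections import defaultdict

GAMMA = 0.99  # discount factor for the discounted UCB statistics (value reported in the paper)

# Per-edge discounted statistics, keyed by (state_id, tactic). Illustrative only.
W = defaultdict(float)  # discounted cumulative reward  W_gamma(s, a)
N = defaultdict(float)  # discounted visit count        N_gamma(s, a)

def ducb_score(state_id, tactic, candidates):
    """Discounted UCB score Q_DUCB(s, a) for one candidate action at state s."""
    n = N[(state_id, tactic)]
    if n == 0:
        return float("inf")  # always try unvisited actions first
    total = sum(N[(state_id, t)] for t in candidates)
    return W[(state_id, tactic)] / n + math.sqrt(2.0 * math.log(total) / n)

def select(state_id, candidates):
    """Tree policy: pick the candidate tactic maximizing the discounted UCB score."""
    return max(candidates, key=lambda t: ducb_score(state_id, t, candidates))

def backpropagate(path, reward):
    """Decay old statistics and add the new reward along the traversed path.

    `path` is a list of (state_id, tactic) edges from the root to the expanded node;
    `reward` is 1.0 if the expansion added at least one new proof state (RMax intrinsic
    reward) or solved the goal, and 0.0 otherwise.
    """
    for edge in path:
        W[edge] = GAMMA * W[edge] + reward
        N[edge] = GAMMA * N[edge] + 1.0
```

The recursive updates $W \leftarrow \gamma W + r$ and $N \leftarrow \gamma N + 1$ realize the exponentially discounted sums, so intrinsic rewards earned early in the search fade as newly reachable states become rarer.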

3. Integration and Workflow in DeepSeek-Prover-V1.5

RMaxTS operationalizes a key innovation in DeepSeek-Prover-V1.5: unifying proof-step search with whole-proof synthesis. Proof generation is performed incrementally—upon verification failure, output is truncated and search resumes at that point. This leverages error feedback from the Lean compiler during search; the tree encodes all feasible continuation points. The search can consider both chain-of-thought (CoT) and direct tactic-prediction modes, yielding strong complementarity.
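A rough sketch of this truncate-and-resume loop, under assumed interfaces for the model, the Lean 4 verifier, and the RMaxTS tree (none of which reflect the actual DeepSeek-Prover-V1.5 APIs), might look as follows.

```python
def prove_incrementally(theorem, model, lean, tree, budget=6400):
    """Illustrative truncate-and-resume whole-proof search (assumed interfaces, not the paper's code).

    Assumed (hypothetical) interfaces:
      model.generate(prompt) -> str          proof text continuing the prompt
      lean.verify(code) -> (bool, int)       success flag and offset of the first error in the attempt
      tree.select_prefix() -> str            RMaxTS tree policy: verified prefix to resume from
      tree.expand(prefix, verified) -> bool  adds nodes for new tactic states; True if any is novel
      tree.backpropagate(prefix, reward)     updates discounted statistics along the path
    """
    for _ in range(budget):
        prefix = tree.select_prefix()                      # selection via discounted UCB
        attempt = model.generate(theorem + "\n" + prefix)  # whole-proof generation from the prefix
        ok, err = lean.verify(theorem + "\n" + prefix + attempt)
        if ok:
            return prefix + attempt                        # complete proof accepted by Lean
        verified = attempt[:err]                           # truncate at the first verification error
        novel = tree.expand(prefix, verified)              # new nodes for newly reached proof states
        tree.backpropagate(prefix, 1.0 if novel else 0.0)  # RMax intrinsic reward
    return None
```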

RMaxTS is used exclusively at inference time. Model training can be standard SFT or RL, e.g., reinforcement learning from proof assistant feedback (RLPAF), which assigns reward 1 to complete, accepted proofs and 0 otherwise; the search strategy itself does not consume learned reward signals.

4. Empirical Performance and Diagnostic Outcomes

RMaxTS is pivotal in delivering state-of-the-art results for DeepSeek-Prover-V1.5:

| Benchmark | DeepSeek-Prover-V1.5 + RMaxTS | Prior SOTA (open) |
|---|---|---|
| miniF2F (test set) | 63.5% (Pass@32×6400, cumulative) | ReProver: 41.0%; GPT-4: 23.0% |
| ProofNet (test set) | 25.3% | ReProver: 13.8%; InternLM2-StepProver: 18.1% |

  • Exploration Efficiency:

RMaxTS yields far greater diversity of proof paths than naive sampling, vanilla MCTS, or guided approaches without intrinsic reward. The intrinsic signal prevents stagnation, which is critical on harder problems and in deep search regimes.

  • Complementarity of Modes:

The best aggregate results are obtained by mixing CoT (human-style, step-by-step) and non-CoT (concise tactic) modes within the tree, since different theorems respond preferentially to structured reasoning versus direct subgoal closure (a sampling sketch follows this list).

  • Scaling Generalization:

As the allowed search budget increases (search width, depth, or number of samples), RMaxTS's gains scale: additional search does not just fine-tune the same family of proof strategies; it discovers qualitatively novel and harder-to-reach solutions.
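As a simple illustration of how the two modes might be mixed at sampling time, the sketch below alternates prompt templates per rollout; the template strings and the 50/50 split are assumptions, not the actual DeepSeek-Prover-V1.5 prompts or schedule.

```python
import random

# Hypothetical prompt prefixes; the real CoT and non-CoT prompts differ.
COT_PREFIX = "/- Reason step by step about the goal before writing tactics. -/\n"
NON_COT_PREFIX = ""  # direct, concise tactic generation with no explicit reasoning comment

def choose_mode(cot_fraction=0.5):
    """Pick a generation mode for one rollout; mixing modes diversifies the shared search tree."""
    return COT_PREFIX if random.random() < cot_fraction else NON_COT_PREFIX
```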

5. Theoretical and Practical Significance

RMaxTS occupies an intermediate point between sparse-reward RL exploration (which is essentially intractable in deeply sparse environments like formal theorem proving) and brute-force or uniform MCTS, which rapidly become computationally infeasible. Its optimistic exploration ensures the search tree grows "wide" with respect to proof state variety, while still concentrating effort where new progress is observed. This diversity is fundamental to reliable theorem proving at scale.

A plausible implication is that intrinsic-reward-driven diversification, as exemplified by RMaxTS, is now a cornerstone for modern neural proof search—other competitive systems (e.g., Goedel-Prover-V2; see (Lin et al., 5 Aug 2025)) and hybrid methods use related ideas, but RMaxTS provides a concretely scalable and empirically validated methodology in Lean 4.

6. Limitations and Prospects

  • Compute Requirements:

Although parallelized, RMaxTS can be demanding at large budgets (e.g., 32×6400 samples per problem), though the marginal gains persist across much of this regime. For resource-constrained deployments, hybrid search methods (e.g., ProofCompass, see (Wischermann et al., 18 Jul 2025)) show promise in reducing attempts via LLM-guided lemma decomposition.

  • Reliance on Proof State Deduplication:

Detection of "novelty" depends on state hashing/equality determination in Lean; imperfect hashing could lead to overstated novelty or redundant search.

  • Generality:

RMaxTS is implemented in DeepSeek-Prover-V1.5 for Lean 4, but in principle is applicable to any proof assistant with verified state interactivity and fine-grained tactic-level progression.
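A minimal sketch of the proof-state deduplication mentioned above, assuming the Lean tactic state is available as a pretty-printed string; the normalization below is an illustrative assumption, and any canonicalization it misses shows up as the overstated novelty discussed in the limitation.

```python
import hashlib
import re

seen_states = set()  # fingerprints of tactic states already present in the search tree

def state_fingerprint(tactic_state):
    """Hash a pretty-printed Lean tactic state after light whitespace normalization (illustrative)."""
    normalized = re.sub(r"\s+", " ", tactic_state).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def is_novel(tactic_state):
    """Return True (and record the state) if this fingerprint has not been seen before."""
    fp = state_fingerprint(tactic_state)
    if fp in seen_states:
        return False
    seen_states.add(fp)
    return True
```

Under this scheme, syntactically different but logically equivalent goals hash to different fingerprints, which is exactly the kind of imperfect deduplication the limitation refers to.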

7. Summary Table: RMaxTS in the DeepSeek-Prover-V1.5 Workflow

| Stage | Description |
|---|---|
| Pretraining | DeepSeekMath-Base; code and natural-language formal mathematics |
| Supervised Fine-Tuning (SFT) | Proof datasets augmented with CoT and tactic-state prediction |
| RL from Proof Assistant Feedback | GRPO algorithm; reward = 1 (proof correct), else 0 |
| Inference Search | Monte Carlo Tree Search + RMaxTS intrinsic reward |
| Node Expansion | Continue from verified prefix; stop at error or proof closure |
| Selection | Discounted UCB incorporating intrinsic novelty count |
| Backpropagation | Propagate intrinsic (novelty) or extrinsic (solved) reward |
| Parallelism | Many runners, virtual loss for non-blocking search |
| Modes Supported | Chain-of-Thought and non-CoT, combined at sampling |

RMaxTS, as operationalized in DeepSeek-Prover-V1.5, constitutes an empirically grounded and reproducible advance in exploration-driven proof search for LLMs, establishing strong state-of-the-art results and robustifying the pipeline for symbolic mathematical reasoning (Xin et al., 15 Aug 2024).
