RMaxTS: Intrinsic Reward MCTS for Theorem Proving

Updated 4 November 2025
  • RMaxTS is an intrinsic-reward-driven variant of Monte Carlo Tree Search that incentivizes exploration through novel proof states in formal theorem proving.
  • The method integrates stepwise proof completion with whole-proof generation and uses a discounted UCB strategy to prioritize new, diverse proof paths.
  • Empirical results in DeepSeek-Prover-V1.5 demonstrate significant performance improvements over prior state-of-the-art methods on benchmarks like miniF2F and ProofNet.

RMaxTS is an intrinsic-reward-driven variant of Monte Carlo Tree Search (MCTS) designed for automated theorem proving with LLMs, notably implemented in DeepSeek-Prover-V1.5 for Lean 4. The method departs from classical proof search strategies by awarding intrinsic rewards for reaching novel proof states, thereby diversifying solution paths and mitigating the extreme reward sparsity endemic to formal theorem proving. RMaxTS unifies stepwise proof completion with whole-proof generation, and operates on top of an RL-trained model as a feedback-driven, incremental proof search protocol in the inference phase.

1. Concept and Motivation

RMaxTS addresses a core challenge in neural theorem proving: the sparsity of extrinsic rewards (proof correctness) during search. In classical RL or search paradigms applied to proof synthesis, positive reward is typically only received when a complete, theorem-proving sequence is generated and accepted by the proof assistant—an event that is rare and hinders efficient optimization or search. RMaxTS augments MCTS with a form of "optimistic" intrinsic reward: any expansion of a node leading to a previously unseen proof state yields a positive signal, regardless of whether the sequence ultimately constitutes a full correct proof. This encourages search trajectories that venture into unexplored regions of the proof space, counteracting the tendency to repeatedly attempt similar, nearby, or trivial variants.

2. Algorithmic Formalism

At a high level, the proof search tree is organized at the tactic level, with each node corresponding to an intermediate Lean state (the current open subgoal and proof context), and edges representing extension by a model-generated proof step (tactic).
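As a purely illustrative picture of this tree, a node might be represented as in the following minimal Python sketch; the `ProofSearchNode` class and its field names are assumptions for exposition, not the DeepSeek-Prover-V1.5 implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class ProofSearchNode:
    """Illustrative tactic-level search node (names are assumptions, not the paper's code)."""
    state: str                                    # pretty-printed Lean tactic state (open goals + context)
    proof_prefix: str                             # verified Lean proof text that reaches this state
    parent: Optional["ProofSearchNode"] = None
    children: Dict[str, "ProofSearchNode"] = field(default_factory=dict)  # tactic text -> child node
```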

The selection, expansion, and backpropagation steps are as follows:

  • Selection:

The search proceeds by traversing the tree from the root, at each state $s$ selecting the child action $a$ that maximizes a UCB-style objective:

$$\text{TreePolicy}(s) = \arg\max_{a \in \text{Children}(s) \cup \{\oslash\}} Q_{\text{UCB}}(s, a)$$

where

$$Q_{\text{UCB}}(s, a) = Q(s, a) + \text{UCB}(s, a),$$

$Q(s, a)$ is the mean value of action $a$ at node $s$, and $\text{UCB}(s, a)$ is an upper-confidence bound term.

  • Expansion:

Upon expansion, the model resumes proof generation from a verified proof prefix, progressing until the next verification error or proof completion.

  • Intrinsic Reward (RMax):

The intrinsic reward assigned to a transition $\tau$ is:

$$R_{\text{intrinsic}}(\tau) = \mathbb{I}[\text{at least one new node was added to the tree}]$$

i.e., reward 1 if a new proof state is created for the first time, and 0 otherwise. Here, $\mathbb{I}$ is the indicator function.

  • Backpropagation (DUCB):

To address the non-stationarity of intrinsic reward (as new states become harder to find over the search), RMaxTS uses a discounted UCB statistic:

$$Q_{\text{DUCB}}(s, a) = \frac{W_\gamma(s,a)}{N_\gamma(s,a)} + \sqrt{\frac{2\ln \sum_{a'} N_\gamma(s,a')}{N_\gamma(s,a)}}$$

where $W_\gamma$ and $N_\gamma$ are the exponentially discounted cumulative reward and visit count with discount factor $\gamma = 0.99$. This increases the recency sensitivity of the score estimate, favoring actions that have recently yielded intrinsic reward (a minimal code sketch of this bookkeeping appears after this list).

  • Parallelization:

RMaxTS is designed to scale: multiple independent search "runners" execute in parallel, sharing the search tree and using the "virtual loss trick" to avoid redundant expansions.
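To make the selection, intrinsic-reward, and backpropagation rules above concrete, here is a minimal, self-contained Python sketch of discounted-UCB bookkeeping with an RMax-style novelty reward. The per-edge dictionaries and the function names (`ducb_score`, `select`, `backpropagate`) are illustrative assumptions, not the DeepSeek-Prover-V1.5 implementation.

```python
import math
from collections import defaultdict

GAMMA = 0.99  # discount factor for the discounted UCB statistics (value reported in the paper)

# Per-edge discounted statistics, keyed by (state_id, tactic). Illustrative only.
W = defaultdict(float)  # discounted cumulative reward  W_gamma(s, a)
N = defaultdict(float)  # discounted visit count        N_gamma(s, a)

def ducb_score(state_id, tactic, candidates):
    """Discounted UCB score Q_DUCB(s, a) for one candidate action at state s."""
    n = N[(state_id, tactic)]
    if n == 0:
        return float("inf")  # always try unvisited actions first
    total = sum(N[(state_id, t)] for t in candidates)
    return W[(state_id, tactic)] / n + math.sqrt(2.0 * math.log(total) / n)

def select(state_id, candidates):
    """Tree policy: pick the candidate tactic maximizing the discounted UCB score."""
    return max(candidates, key=lambda t: ducb_score(state_id, t, candidates))

def backpropagate(path, reward):
    """Decay old statistics and add the new reward along the traversed path.

    `path` is a list of (state_id, tactic) edges from the root to the expanded node;
    `reward` is 1.0 if the expansion added at least one new proof state (RMax intrinsic
    reward) or solved the goal, and 0.0 otherwise.
    """
    for edge in path:
        W[edge] = GAMMA * W[edge] + reward
        N[edge] = GAMMA * N[edge] + 1.0
```

The recursive updates $W \leftarrow \gamma W + r$ and $N \leftarrow \gamma N + 1$ realize the exponentially discounted sums, so intrinsic rewards earned early in the search fade as newly reachable states become rarer.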

3. Integration and Workflow in DeepSeek-Prover-V1.5

RMaxTS operationalizes a key innovation in DeepSeek-Prover-V1.5: unifying proof-step search with whole-proof synthesis. Proof generation is performed incrementally—upon verification failure, output is truncated and search resumes at that point. This leverages error feedback from the Lean compiler during search; the tree encodes all feasible continuation points. The search can consider both chain-of-thought (CoT) and direct tactic-prediction modes, yielding strong complementarity.
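A rough sketch of this truncate-and-resume loop, under assumed interfaces for the model, the Lean 4 verifier, and the RMaxTS tree (none of which reflect the actual DeepSeek-Prover-V1.5 APIs), might look as follows.

```python
def prove_incrementally(theorem, model, lean, tree, budget=6400):
    """Illustrative truncate-and-resume whole-proof search (assumed interfaces, not the paper's code).

    Assumed (hypothetical) interfaces:
      model.generate(prompt) -> str          proof text continuing the prompt
      lean.verify(code) -> (bool, int)       success flag and offset of the first error in the attempt
      tree.select_prefix() -> str            RMaxTS tree policy: verified prefix to resume from
      tree.expand(prefix, verified) -> bool  adds nodes for new tactic states; True if any is novel
      tree.backpropagate(prefix, reward)     updates discounted statistics along the path
    """
    for _ in range(budget):
        prefix = tree.select_prefix()                      # selection via discounted UCB
        attempt = model.generate(theorem + "\n" + prefix)  # whole-proof generation from the prefix
        ok, err = lean.verify(theorem + "\n" + prefix + attempt)
        if ok:
            return prefix + attempt                        # complete proof accepted by Lean
        verified = attempt[:err]                           # truncate at the first verification error
        novel = tree.expand(prefix, verified)              # new nodes for newly reached proof states
        tree.backpropagate(prefix, 1.0 if novel else 0.0)  # RMax intrinsic reward
    return None
```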

RMaxTS is used exclusively at inference time. Model training can be standard SFT or RL, e.g., reinforcement learning from proof assistant feedback (RLPAF), which assigns reward 1 to complete, accepted proofs and 0 otherwise; the search strategy itself does not consume learned reward signals.

4. Empirical Performance and Diagnostic Outcomes

RMaxTS is pivotal in delivering state-of-the-art results for DeepSeek-Prover-V1.5:

| Benchmark | DeepSeek-Prover-V1.5 + RMaxTS | Prior SOTA (open) |
|---|---|---|
| miniF2F (test set) | 63.5% (Pass@32×6400, cumulative) | ReProver: 41.0%; GPT-4: 23.0% |
| ProofNet (test set) | 25.3% | ReProver: 13.8%; InternLM2-StepProver: 18.1% |

  • Exploration Efficiency:

RMaxTS yields far greater diversity of proof paths than naive sampling, vanilla MCTS, or guided approaches without intrinsic reward. The intrinsic signal prevents stagnation, which is critical on harder problems and in deep search regimes.

  • Complementarity of Modes:

The best aggregate results are obtained by mixing CoT (human-style, step-by-step) and non-CoT (concise tactic) modes within the tree, since different theorems respond preferentially to structured reasoning versus direct subgoal closure (a sampling sketch follows this list).

  • Scaling Generalization:

As the allowed search budget increases (search width, depth, or number of samples), RMaxTS's gains scale: additional search does not just fine-tune the same family of proof strategies; it discovers qualitatively novel and harder-to-reach solutions.
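As a simple illustration of how the two modes might be mixed at sampling time, the sketch below alternates prompt templates per rollout; the template strings and the 50/50 split are assumptions, not the actual DeepSeek-Prover-V1.5 prompts or schedule.

```python
import random

# Hypothetical prompt prefixes; the real CoT and non-CoT prompts differ.
COT_PREFIX = "/- Reason step by step about the goal before writing tactics. -/\n"
NON_COT_PREFIX = ""  # direct, concise tactic generation with no explicit reasoning comment

def choose_mode(cot_fraction=0.5):
    """Pick a generation mode for one rollout; mixing modes diversifies the shared search tree."""
    return COT_PREFIX if random.random() < cot_fraction else NON_COT_PREFIX
```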

5. Theoretical and Practical Significance

RMaxTS occupies an intermediate point between sparse-reward RL exploration (which is essentially intractable in deeply sparse environments like formal theorem proving) and brute-force or uniform MCTS, which rapidly become computationally infeasible. Its optimistic exploration ensures the search tree grows "wide" with respect to proof state variety, while still concentrating effort where new progress is observed. This diversity is fundamental to reliable theorem proving at scale.

A plausible implication is that intrinsic-reward-driven diversification, as exemplified by RMaxTS, is now a cornerstone for modern neural proof search—other competitive systems (e.g., Goedel-Prover-V2; see (Lin et al., 5 Aug 2025)) and hybrid methods use related ideas, but RMaxTS provides a concretely scalable and empirically validated methodology in Lean 4.

6. Limitations and Prospects

  • Compute Requirements:

Although parallelized, RMaxTS can be demanding at large budgets (e.g., 32×6400 samples per problem), though the marginal gains persist across much of this regime. For resource-constrained deployments, hybrid search methods (e.g., ProofCompass, see (Wischermann et al., 18 Jul 2025)) show promise in reducing attempts via LLM-guided lemma decomposition.

  • Reliance on Proof State Deduplication:

Detection of "novelty" depends on state hashing/equality determination in Lean; imperfect hashing could lead to overstated novelty or redundant search.

  • Generality:

RMaxTS is implemented in DeepSeek-Prover-V1.5 for Lean 4, but in principle is applicable to any proof assistant with verified state interactivity and fine-grained tactic-level progression.
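A minimal sketch of the proof-state deduplication mentioned above, assuming the Lean tactic state is available as a pretty-printed string; the normalization below is an illustrative assumption, and any canonicalization it misses shows up as the overstated novelty discussed in the limitation.

```python
import hashlib
import re

seen_states = set()  # fingerprints of tactic states already present in the search tree

def state_fingerprint(tactic_state):
    """Hash a pretty-printed Lean tactic state after light whitespace normalization (illustrative)."""
    normalized = re.sub(r"\s+", " ", tactic_state).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def is_novel(tactic_state):
    """Return True (and record the state) if this fingerprint has not been seen before."""
    fp = state_fingerprint(tactic_state)
    if fp in seen_states:
        return False
    seen_states.add(fp)
    return True
```

Under this scheme, syntactically different but logically equivalent goals hash to different fingerprints, which is exactly the kind of imperfect deduplication the limitation refers to.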

7. Summary Table: RMaxTS in the DeepSeek-Prover-V1.5 Workflow

| Stage | Description |
|---|---|
| Pretraining | DeepSeekMath-Base; code and natural-language formal mathematics |
| Supervised Fine-Tuning (SFT) | Proof datasets augmented with CoT and tactic-state prediction |
| RL from Proof Assistant Feedback | GRPO algorithm; reward = 1 (proof correct), else 0 |
| Inference Search | Monte Carlo Tree Search + RMaxTS intrinsic reward |
| Node Expansion | Continue from verified prefix; stop at error or proof closure |
| Selection | Discounted UCB incorporating intrinsic novelty count |
| Backpropagation | Propagate intrinsic (novelty) or extrinsic (solved) reward |
| Parallelism | Many runners, virtual loss for non-blocking search |
| Modes Supported | Chain-of-Thought and non-CoT, combined at sampling |

RMaxTS, as operationalized in DeepSeek-Prover-V1.5, constitutes an empirically grounded and reproducible advance in exploration-driven proof search for LLMs, establishing strong state-of-the-art results and robustifying the pipeline for symbolic mathematical reasoning (Xin et al., 15 Aug 2024).
