
DeepSeek-Prover V1.5 RL

Updated 30 June 2025
  • DeepSeek-Prover-V1.5-RL is a cutting-edge automated theorem prover for Lean 4 that combines a 7B-parameter transformer with reinforcement learning and Monte Carlo tree search.
  • It uses explicit Lean 4 tactic-state conditioning and a hybrid proof generation framework to support context-aware proof construction and increase solution diversity.
  • Empirical benchmarks on miniF2F and ProofNet confirm its record-high performance and sample efficiency, setting a new standard in neural-symbolic reasoning.

DeepSeek-Prover-V1.5-RL is a state-of-the-art open-source automated theorem proving (ATP) system designed for Lean 4, distinguished by its integration of reinforcement learning from proof assistant feedback and exploration-driven Monte-Carlo tree search. Developed as the successor to DeepSeek-Prover-V1, it exemplifies modern trends in neural-symbolic reasoning: combining large-scale language modeling, interactive proof validation, and sample-efficient search. DeepSeek-Prover-V1.5-RL achieves record-high formal proof generation accuracy on standard mathematical benchmarks and introduces architectural, training, and search innovations now influencing the broader ATP and LLM-for-reasoning ecosystem.

1. Model and System Architecture

DeepSeek-Prover-V1.5-RL employs a 7B-parameter transformer backbone built upon DeepSeekMath-Base, further pre-trained on mathematical corpora spanning code, natural-language mathematics, and formal languages (Lean, Isabelle, Metamath). The architecture supports both classic single-pass (whole-proof generation) and stepwise (truncate-and-resume) generation modes. A key innovation is conditioning the model on explicit Lean 4 tactic state annotations: each proof generation step includes the internal tactic state, facilitating context-aware next-step prediction and enabling seamless interaction with tree search algorithms.
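
To make the tactic-state conditioning concrete, the sketch below composes a truncate-and-resume prompt in which the current Lean 4 goal is spliced in as a comment before the next step is generated. The comment layout, function names, and the generation call are illustrative assumptions rather than the exact prompt template used by the model.

```python
def compose_resume_prompt(theorem_statement: str,
                          proof_so_far: str,
                          tactic_state: str) -> str:
    """Build a truncate-and-resume prompt: the verified proof prefix is kept,
    and the current Lean 4 tactic state is embedded as a comment so the model
    conditions on it when predicting the next tactic.

    NOTE: the comment layout below is an illustrative assumption.
    """
    return (
        f"{theorem_statement}\n"
        f"{proof_so_far}\n"
        f"  /- tactic state:\n"
        f"     {tactic_state}\n"
        f"  -/\n"
    )

# Hypothetical usage with any text-generation backend:
# prompt = compose_resume_prompt(thm, verified_prefix, state_from_lean)
# next_step = model.generate(prompt, stop=["/- tactic state:"])
```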

Compared to prior versions, V1.5 introduces a hybrid proof generation framework that integrates the model’s outputs with Monte-Carlo Tree Search (MCTS), thus supporting iterative deepening, performance scaling, and direct exploration of the Lean proof space.

2. Training Pipeline and Reinforcement Learning from Proof Assistant Feedback

The training regimen for DeepSeek-Prover-V1.5-RL incorporates three critical phases:

  • Supervised Fine-Tuning (SFT): The model is fine-tuned on a curated dataset (~9.6 million examples) of Lean proofs collected from multiple sources, augmented with chain-of-thought (CoT) comments and explicit tactic state annotations. During SFT, each training sample involves producing the next proof step contingent on the current Lean tactic state context.
  • Reinforcement Learning from Proof Assistant Feedback (RLPAF): A subset of theorems with intermediate model success rates is used for RL post-training. For each input, the system samples multiple full proof attempts; each attempt is then formally verified by Lean 4. Only complete, successfully verified proofs yield positive rewards, making the reward signal sparse.
  • Optimization Strategy: The system utilizes Group Relative Policy Optimization (GRPO), which eschews an explicit critic by normalizing the rewards within each batch/group and computes per-token advantages accordingly. The training objective includes a clipped surrogate policy gradient with a KL divergence penalty to the SFT reference model. GRPO is shown to be particularly effective for tasks with binary, sparse terminal rewards, as in theorem proving.

The reward function is defined as

$$r = \begin{cases} 1 & \text{if the proof passes the Lean verifier} \\ 0 & \text{otherwise} \end{cases}$$

with the group-normalized advantage applied per token.
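
A minimal sketch of the group-relative update under these definitions, assuming per-theorem groups of sampled proofs with binary verifier rewards; the clipping threshold, KL coefficient, and KL estimator below are illustrative choices rather than the system's exact hyperparameters.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: shape [G], binary verifier outcomes for the G proofs sampled
    for one theorem. GRPO replaces a learned critic with within-group
    normalization of the rewards."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def grpo_loss(logp_new: torch.Tensor,   # [G, T] per-token log-probs, current policy
              logp_old: torch.Tensor,   # [G, T] per-token log-probs, sampling policy
              logp_ref: torch.Tensor,   # [G, T] per-token log-probs, SFT reference
              rewards: torch.Tensor,    # [G] binary proof rewards
              clip_eps: float = 0.2,    # illustrative value
              kl_coef: float = 0.02) -> torch.Tensor:  # illustrative value
    """Clipped surrogate objective with a KL penalty toward the SFT reference;
    the sequence-level group advantage is broadcast uniformly over tokens."""
    adv = group_relative_advantages(rewards).unsqueeze(-1)        # [G, 1]
    ratio = (logp_new - logp_old).exp()                           # [G, T]
    surrogate = torch.minimum(ratio * adv,
                              ratio.clamp(1 - clip_eps, 1 + clip_eps) * adv)
    log_diff = logp_ref - logp_new
    kl = (log_diff.exp() - log_diff - 1).mean()                   # k3 estimator of KL(new || ref)
    return -surrogate.mean() + kl_coef * kl
```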

3. RMaxTS: Exploration-Driven Monte-Carlo Tree Search

Because proof rewards are extremely sparse, DeepSeek-Prover-V1.5-RL introduces RMaxTS, a variant of Monte-Carlo Tree Search designed for Lean ATP environments. RMaxTS augments standard MCTS with an intrinsic exploration reward: when a search trajectory visits a previously unseen Lean tactic state, it receives a positive intrinsic reward. The effective reward per trajectory is the maximum of the extrinsic (proof success) and intrinsic (novelty) rewards.
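
A minimal sketch of this reward combination, assuming tactic states can be serialized to strings for novelty detection; the bookkeeping is an assumption and only mirrors the description above.

```python
seen_states: set[str] = set()

def trajectory_reward(tactic_states: list[str], proof_verified: bool) -> float:
    """Combine the extrinsic reward (proof success) with the RMax-style
    intrinsic reward (reaching a previously unseen Lean tactic state).
    The trajectory's effective reward is the maximum of the two signals."""
    intrinsic = 0.0
    for state in tactic_states:
        if state not in seen_states:
            seen_states.add(state)
            intrinsic = 1.0
    extrinsic = 1.0 if proof_verified else 0.0
    return max(extrinsic, intrinsic)
```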

Tree search action selection uses a Discounted Upper Confidence Bound (DUCB), with rewards weighted by recency to address the non-stationarity of tactic discovery:

$$Q_{\text{DUCB}}(s, a) = \frac{W_\gamma(s, a)}{N_\gamma(s, a)} + \sqrt{\frac{2 \ln \sum_{a'} N_\gamma(s, a')}{N_\gamma(s, a)}}$$

where old explorations decay with the discount factor $\gamma$ (e.g., $\gamma = 0.99$).
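
A minimal sketch of DUCB selection at a single tree node, assuming discounted reward sums $W_\gamma$ and discounted visit counts $N_\gamma$ are maintained per candidate tactic during backpropagation; the data layout is an assumption.

```python
import math

def select_action(node_stats: dict[str, tuple[float, float]]) -> str:
    """node_stats maps each candidate tactic to (W_gamma, N_gamma): its
    discounted reward sum and discounted visit count at this node.
    Assumes every candidate has been visited at least once (N_gamma > 0)."""
    total_n = sum(n for _, n in node_stats.values())
    def ducb(action: str) -> float:
        w, n = node_stats[action]
        return w / n + math.sqrt(2.0 * math.log(total_n) / n)
    return max(node_stats, key=ducb)

def discount_stats(node_stats: dict[str, tuple[float, float]],
                   gamma: float = 0.99) -> dict[str, tuple[float, float]]:
    """Apply the recency discount before recording a new observation, so that
    older explorations are progressively down-weighted."""
    return {a: (gamma * w, gamma * n) for a, (w, n) in node_stats.items()}
```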

This approach enhances search diversity and enables the model to discover non-trivial proof strategies that would be missed by greedy sampling or classic tree search.

4. Empirical Performance and Benchmark Results

DeepSeek-Prover-V1.5-RL advances the state of the art on multiple formal mathematics benchmarks:

  • miniF2F (Lean 4): Achieves 63.5% pass@N (tree search, N=16×6400), a substantial improvement over DeepSeek-Prover-V1 (50.0%) and other open 7B models.
  • ProofNet: Achieves 25.3% pass@N, surpassing previous open-source baselines.
  • Sample Robustness: Gains are stable across a broad range of sample budgets and further improved by hybridizing CoT and non-CoT variants.
  • Diversity: RMaxTS and tactic-state guidance increase the diversity of generated proofs, improving the probability of solution discovery in multi-sample regimes.

These results establish DeepSeek-Prover-V1.5-RL as the leading open Lean ATP as of 2024–2025, with performance gains attributed to both RL fine-tuning and exploration-enhanced search.
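
Here pass@N is read as the fraction of benchmark theorems for which at least one of the N generated candidates is verified by Lean; a minimal bookkeeping sketch under that assumption:

```python
def pass_at_n(results: dict[str, list[bool]]) -> float:
    """results maps each theorem name to the verifier outcomes of its N proof
    attempts; pass@N is the fraction of theorems with at least one success."""
    solved = sum(1 for attempts in results.values() if any(attempts))
    return solved / len(results)
```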

5. Algorithmic Innovations and Technical Details

DeepSeek-Prover-V1.5-RL integrates several technical advances:

  • Tactic-State Conditioning: The model’s context window is regularly refreshed with Lean’s own proof state output, enabling more reliable continuation and interaction.
  • Hybrid Proof Generation: Efficient batching and prompt composition allow for both whole-proof generation and interactive step-based progression, suitable for both autonomous and tool-assisted ATP workflows.
  • Efficient RL Training: Large batch sizes (2048+), group-based advantage normalization, and conservative KL regularization enhance both learning stability and sample efficiency. The typical group size per batch during RL is 32.
  • Parallelized Proof Checking: RLPAF and MCTS training/inference are run with parallel Lean 4 proof checking, distributing the computational burden over large CPU clusters, while inference remains GPU-accelerated (see the sketch after this list).
  • Intrinsic Reward Algorithm: The combination of extrinsic (proof completion) and intrinsic (novel tactic state) rewards is implemented via binary indicators and recency-discounted UCB calculations.
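
A minimal sketch of the batched, parallel verification pattern mentioned above, assuming a hypothetical verify_with_lean helper that invokes the Lean 4 toolchain on one candidate proof; the helper and pool sizing are assumptions, not the system's actual verification harness.

```python
from concurrent.futures import ProcessPoolExecutor

def verify_with_lean(proof_source: str) -> bool:
    """Hypothetical helper: write proof_source into a Lean 4 project, invoke
    the Lean toolchain, and return True iff it compiles with no errors or
    unfinished goals. The implementation is environment-specific."""
    raise NotImplementedError

def verify_batch(candidate_proofs: list[str], workers: int = 64) -> list[bool]:
    """Check a batch of sampled proofs in parallel on CPU workers while the
    GPU continues generating the next batch of candidates."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(verify_with_lean, candidate_proofs))
```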

6. Limitations, Controversies, and Open Challenges

Recent reviews of RL post-training in LLMs highlight two structural assumptions underlying Group Relative Policy Optimization in DeepSeek-Prover-V1.5-RL: states are treated as full token histories, and reward is distributed uniformly across the trajectory. This setup leads to a degenerate MDP where RL with GRPO becomes mathematically similar to filtered supervised fine-tuning using external verification, hence potentially obviating the need for full RL machinery (Samineni et al., 19 May 2025). Additionally, standard GRPO is vulnerable to "rank bias," where already likely (and thus repeatedly generated) solutions are disproportionately reinforced, reducing output diversity for large N in pass@N evaluations.

Mitigation strategies such as the unlikeliness reward—downweighting highly probable correct outputs and upweighting rare ones—have been shown to restore sample diversity and improve multi-sample discovery rates (He et al., 3 Jun 2025). Increasing the number of PPO epochs per batch can also alleviate some bias but incurs extra computational cost.
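
One plausible instantiation of such a diversity-preserving reward is sketched below: correct proofs that the policy already assigns high likelihood are down-weighted, and rare correct proofs are up-weighted. The rank-based scaling is an illustration of the general idea, not the exact formulation of He et al. (3 Jun 2025).

```python
import torch

def unlikeliness_weighted_rewards(rewards: torch.Tensor,
                                  seq_logprobs: torch.Tensor) -> torch.Tensor:
    """rewards: [G] binary verifier outcomes for one group of sampled proofs.
    seq_logprobs: [G] total log-likelihood of each proof under the policy.
    Returns rewards rescaled so that less likely correct proofs earn more."""
    order = torch.argsort(seq_logprobs, descending=True)  # rank 0 = most likely
    ranks = torch.empty_like(order)
    ranks[order] = torch.arange(len(order))
    # Illustrative scaling: weight grows linearly with the likelihood rank.
    weights = 1.0 + ranks.float() / max(len(rewards) - 1, 1)
    return rewards * weights
```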

One implication is that future RL algorithms in formal theorem proving may evolve toward inference-aware, diversity-maximizing reward structures, improving coverage and practical usability, particularly for competitive ATP tasks requiring high pass@N.

7. Impact and Future Directions

DeepSeek-Prover-V1.5-RL’s architectural and algorithmic advances have broad implications:

  • Sample-Efficient Automated Reasoning: Its integration of RL from proof assistant feedback and MCTS yields systems that surpass previous accuracy ceilings at modest model scales.
  • Standardization of RL in Formal Reasoning: The pipeline—combining SFT, RLPAF, and exploration-guided tree search—provides a reproducible open-source framework now serving as the benchmark for subsequent work.
  • Open Research Directions: Areas of active investigation include training critic models for partial proofs; designing reward signals that explicitly encourage solution diversity; extension to richer proof environments (file-level, contextual proving); and full hybridization with stepwise and value-guided search approaches.

A plausible implication is that combining unlikeliness-based reward schemes, multi-perspective search, and broader planning architectures will further improve sample efficiency, output diversity, and performance on formal benchmarks. The continued integration of open access datasets, tactic-state feedback, and hybrid proof generation is anticipated to shape the next generation of neural-symbolic formal reasoning systems.
