DeepSeek-Prover-V1.5: Lean 4 Theorem Prover
- The paper introduces a two-stage training framework that combines supervised fine-tuning with reinforcement learning from proof assistant feedback (RLPAF) to optimize the model toward Lean-verified proofs.
- It implements a hybrid truncate-and-resume mechanism with reward-maximizing tree search to efficiently explore vast formal proof spaces.
- The system achieves state-of-the-art accuracy on benchmarks like miniF2F and ProofNet, outperforming previous models in formal mathematical reasoning.
DeepSeek-Prover-V1.5 is an open-source formal theorem proving LLM specialized for Lean 4, designed to advance both model-centric and search-centric paradigms for automated mathematical reasoning. Its development introduced new supervised and reward-based learning strategies, integration of explicit proof state feedback, and a scalable hybrid proof generation/search algorithm, establishing state-of-the-art performance on diverse formal mathematics benchmarks.
1. Model Architecture and Formalization Focus
DeepSeek-Prover-V1.5 adopts a 7B-parameter LLM backbone (initialized from DeepSeekMath-Base) and specializes it for formal mathematical languages. The model is extensively exposed to Lean 4, with auxiliary coverage of Isabelle and Metamath. Lean 4 tactic syntax and proof structure are central to its generative capabilities: the model handles both global proof plans (whole-proof generation) and local tactic transitions (interactive, state-annotated), with explicit support for chain-of-thought (CoT) reasoning and auxiliary tactic state annotation via LeanDojo.
The architecture supports both single-pass whole-proof generation—where an entire proof script is generated up-front—and interactive, stepwise generation, wherein proofs are constructed incrementally with intermediate verification and resumption informed by tactic state annotations.
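To make the state-annotation format concrete, below is a minimal, hypothetical Lean 4 snippet in the spirit of the state-annotated training data: a block comment records the tactic state ahead of a proof step, so a model can condition on it when resuming. The exact comment layout used in the actual corpus may differ.

```lean
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  /- tactic state (as would be recorded from the Lean REPL):
     a b : Nat
     ⊢ a + b = b + a -/
  rw [Nat.add_comm]
```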
2. Training Methodology: SFT and Reinforcement from Proof Assistant Feedback
Training employs a two-stage approach:
- Supervised Fine-Tuning (SFT):
  - The model is fine-tuned on an expanded corpus (~9.6M sequences) built via expert iteration over Mathlib4, Lean Workbook, synthetic DeepSeek-Prover-V1 data, and miniF2F/ProofNet problems.
  - Chain-of-thought comments generated by DeepSeek-Coder-V2 236B are interleaved within Lean proofs, enforcing explicit formal reasoning steps.
  - Tactic state annotations (from the Lean REPL) are appended as block comments and used as auxiliary supervision, enabling the truncate-and-resume protocol within tree search and stepwise inference.
- Reinforcement Learning from Proof Assistant Feedback (RLPAF):
  - Group Relative Policy Optimization (GRPO) is used, sampling 32 candidate proofs per prompt; the reward is 1 for a Lean-verified proof and 0 otherwise (see the sketch after this list).
  - Objective: optimize the group-relative binary reward signal, aligning the model policy toward search-space regions that reliably pass Lean 4 verification.
  - RL is applied across a curated set of ~4.5k theorems, leveraging CoT-enriched prompts and exploiting the model's tactic-state prediction capabilities.
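The core of RLPAF can be illustrated with a short sketch of the group-relative advantage computation. This is an illustrative approximation rather than the released training code: `sample_proofs` and `lean_verify` are hypothetical stand-ins for the model's sampler and the Lean 4 verifier, and the full GRPO objective additionally includes a clipped policy-ratio term and a KL penalty.

```python
import statistics

GROUP_SIZE = 32  # candidate proofs sampled per theorem prompt

def group_relative_advantages(rewards):
    """GRPO-style advantages: center each binary reward on the group mean
    and scale by the group standard deviation (no learned value critic)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

def rlpaf_step(theorem_prompt, sample_proofs, lean_verify):
    # 1. Sample a group of candidate whole proofs for the same theorem.
    proofs = sample_proofs(theorem_prompt, n=GROUP_SIZE)
    # 2. Binary reward from the proof assistant: 1 if Lean 4 accepts, else 0.
    rewards = [1.0 if lean_verify(theorem_prompt, p) else 0.0 for p in proofs]
    # 3. Group-relative advantages weight the policy-gradient update.
    advantages = group_relative_advantages(rewards)
    return list(zip(proofs, advantages))
```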
3. Truncate-and-Resume Mechanism and RMaxTS Search
The "truncate-and-resume" mechanism is a hybrid proof generation protocol, enabling correction and extension of partial proofs:
- The model generates a complete proof, which is submitted to the Lean verifier.
- On verification error, the proof is truncated at the first failed tactic; the successful prefix and the current tactic state are then used to prompt the model for a continuation.
- This iterative resumption allows error recovery and deeper search.
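A minimal sketch of this loop, under the assumption of a verifier that reports the position of the first failing tactic and the tactic state at that point (`verify_with_feedback` and its result fields are hypothetical names, not the paper's actual API):

```python
def truncate_and_resume(theorem_stmt, generate_proof, verify_with_feedback,
                        max_attempts=8):
    """Whole-proof generation with error-driven truncation and resumption."""
    prefix = ""  # verified proof prefix accumulated so far
    for _ in range(max_attempts):
        # Generate a complete proof continuation from the current prefix.
        candidate = prefix + generate_proof(theorem_stmt, prefix)
        result = verify_with_feedback(theorem_stmt, candidate)
        if result.ok:
            return candidate  # Lean 4 accepted the full proof
        # Truncate at the first failing tactic; keep the successful prefix
        # and append the reported tactic state as a block comment so the
        # model can condition on it when resuming.
        prefix = (candidate[:result.first_error_offset]
                  + f"\n  /- tactic state:\n     {result.tactic_state} -/\n  ")
    return None  # no verified proof found within the attempt budget
```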
RMaxTS (Reward-Maximizing Tree Search) is a variant of Monte Carlo Tree Search adapted to the sparse rewards typical of formal theorem proving (a selection and backpropagation sketch follows the list):
- Intrinsic reward: expanding a new node in the proof search tree yields an intrinsic reward that substitutes for the sparse extrinsic signal, driving non-trivial exploration.
- Discounted UCB: Non-stationary confidence bounds guide selection, weighting recent successes higher and mitigating local plateaus.
- Highly parallelized runners and asynchronous Lean calls support scalable search over vast proof spaces.
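The selection rule can be sketched as a discounted variant of UCB1 combined with the intrinsic "new node expanded" reward. This is a simplified single-node view under assumed data structures (an `Edge` record with discounted statistics); the paper's full algorithm additionally manages virtual losses and asynchronous Lean verification across parallel runners.

```python
import math
from dataclasses import dataclass

GAMMA = 0.99  # discount applied to past statistics (non-stationary bandit)

@dataclass
class Edge:
    visits: float = 0.0   # discounted visit count
    value: float = 0.0    # discounted sum of rewards
    child: object = None  # subtree reached through this edge

def select_child(edges, exploration=2.0):
    """Discounted-UCB selection over a node's outgoing edges."""
    total = sum(e.visits for e in edges) + 1.0
    def ucb(e):
        mean = e.value / (e.visits + 1e-8)
        bonus = exploration * math.sqrt(math.log(total) / (e.visits + 1e-8))
        return mean + bonus
    return max(edges, key=ucb)

def backpropagate(path, expanded_new_node):
    """Intrinsic reward: 1 if the rollout expanded a new tree node, else 0."""
    reward = 1.0 if expanded_new_node else 0.0
    for edge in path:
        # Discount old statistics before adding the new observation,
        # so recent outcomes dominate (mitigating local plateaus).
        edge.visits = GAMMA * edge.visits + 1.0
        edge.value = GAMMA * edge.value + reward
```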
4. Datasets, Formal Language Specialization, and Data Augmentations
- Pre-training utilizes DeepSeekMath-Base (mathematical code, Lean, Isabelle, Metamath).
- Fine-tuning leverages Mathlib4, synthetic theorems, expert-iteration data, and benchmark-driven problems.
- All proofs are enriched with chain-of-thought annotations and intermediate tactic state comments, supporting both training and inference workflows.
- Expert iteration cycles expand coverage and robustness by systematically retraining on newly Lean-verified proofs (a schematic loop is sketched below).
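A schematic of one expert-iteration cycle, under the assumption of generic `sample_proofs`, `lean_verify`, and `finetune` hooks (all hypothetical names standing in for the actual pipeline components):

```python
def expert_iteration(model, theorem_pool, sft_dataset,
                     sample_proofs, lean_verify, finetune, rounds=4):
    """Iteratively grow the SFT corpus with newly Lean-verified proofs."""
    for _ in range(rounds):
        new_examples = []
        for theorem in theorem_pool:
            for proof in sample_proofs(model, theorem):
                if lean_verify(theorem, proof):
                    # Only proofs accepted by the Lean 4 verifier are kept.
                    new_examples.append((theorem, proof))
                    break
        sft_dataset.extend(new_examples)
        model = finetune(model, sft_dataset)  # retrain on the expanded corpus
    return model, sft_dataset
```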
5. Benchmark Evaluation and Performance Gains
DeepSeek-Prover-V1.5 achieves state-of-the-art accuracy:
- miniF2F-test (high-school and Olympiad level):
  - Single-pass whole-proof generation: 60.2%
  - RMaxTS tree search: 63.5%
  - Previous SOTA: DeepSeek-Prover-V1 (DSP-V1) 50.0%, InternLM2-StepProver 54.5%
- ProofNet-test (undergraduate level):
  - Single-pass whole-proof generation: 23.7%
  - RMaxTS tree search: 25.3%
  - Previous SOTA: ReProver 13.8%, InternLM2-StepProver 18.1%
The integration of RLPAF and search-centric inference delivers orthogonal improvements: tree search increases sample efficiency and robustness to error, while RL-trained models align generated proofs to verifiable Lean standards, especially for distribution-shifted or "hard" proof problems.
6. Practical Implications, Model Selection, and Technical Significance
DeepSeek-Prover-V1.5’s design enables robust and versatile Lean formalization workflows:
- Unified Model & Search Interface: The same model supports both fast, single-pass generation (for easy theorems) and iterative, correction-rich search (for harder theorems or ambiguous spaces).
- Scalable Formal Reasoning: Tree search and state annotation strategies support efficient exploration of large proof spaces, essential for competition-level and university-level problems.
- Reproducible Formalization: Direct Lean 4 feedback loop guarantees solution correctness, offering a practical foundation for integrating LLMs with interactive theorem proving.
- Model Selection Guidance: Provides actionable insights for real-world deployment, with performance tiers across logical reasoning, planning, and text-centric tasks as detailed in (Zhao et al., 16 Feb 2025).
7. Context, Related Methodologies, and Limitations
DeepSeek-Prover-V1.5 advances over DeepSeek-Prover-V1 principally through the inclusion of RLPAF, thought-augmented data, and search-centric methodologies. However, subsequent models such as Goedel-Prover (Lin et al., 11 Feb 2025), Leanabell-Prover (Zhang et al., 8 Apr 2025), and MA-LoT (Wang et al., 5 Mar 2025) demonstrate further gains, exploiting even larger datasets, multi-agent chain-of-thought reasoning, and diverse formalization styles.
Limitations:
- Reasoning enhancements may reduce performance on text-centric tasks; users should consult benchmarking tables and model selection charts (Zhao et al., 16 Feb 2025).
- Sample efficiency improves with hybrid search and RL alignment but may be limited for certain problem distributions.
- For tasks requiring nuanced text understanding, entity extraction, or named entity recognition, alternative model families (Qwen, Llama) or instruction-tuned baselines may be preferable for cost-sensitive deployment.
In summary, DeepSeek-Prover-V1.5 embodies a modern, reward-aligned, and search-integrated approach for Lean 4 formal theorem proving, establishing strong new benchmarks in accuracy and proof space exploration efficiency. Its architecture and training regimen serve as a reference for subsequent advancements in automated formal mathematical reasoning.