
DeepSeek-Prover-V2: Neural Prover in Lean 4

Updated 21 August 2025
  • DeepSeek-Prover-V2 is a neural-symbolic LLM that integrates formal theorem proving in Lean 4 using recursive subgoal decomposition.
  • It employs dual generation modes—non–CoT for rapid inference and CoT for stepwise reasoning—achieving up to 88.9% pass rates on miniF2F-test.
  • The model is trained via a multi-stage process combining supervised fine-tuning and reinforcement learning with verifier-based consistency rewards.

DeepSeek-Prover-V2 is an LLM designed for formal theorem proving in Lean 4, notable for its recursive subgoal decomposition pipeline, reinforcement learning with consistency objectives, and explicit integration of natural-language and formal reasoning in both training and generation. Achieving up to an 88.9% pass rate on miniF2F-test and solving 49 of 658 PutnamBench problems at the 671B scale, it defines a modern paradigm for neural-symbolic automated reasoning.

1. System Overview and Problem Decomposition

DeepSeek-Prover-V2 is an open-source, multi-scale LLM-based neural theorem prover targeting high-efficiency formal proof synthesis in the Lean 4 environment (Ren et al., 30 Apr 2025). The model combines two generation strategies for proof construction:

  • Non–Chain-of-Thought (non–CoT) mode: Direct, minimal Lean proof output for rapid inference.
  • Chain-of-Thought (CoT) mode: Stepwise, interleaved informal reasoning followed by completed Lean proofs, optimized for sample efficiency and interpretability in challenging domains.

A central innovation is the recursive subgoal decomposition pipeline. Problem formulation begins by prompting a general-purpose baseline (DeepSeek-V3) to decompose complex mathematical statements into a sequence of formally stated subgoals. The pipeline proceeds as follows:

  1. DeepSeek-V3 produces an informal chain-of-thought with a Lean sketch (using sorry placeholders for unresolved subgoals).
  2. A specialized 7B prover (or a more powerful variant) recursively solves each subgoal, synthesizing Lean code fragments.
  3. The resulting complete proof is synthesized with the original informal reasoning, forming paired informal-formal data suitable for supervised and reinforcement learning.
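The sketch-then-solve structure above can be illustrated with a hypothetical Lean 4 fragment. The theorem, subgoal statements, and tactic names below are illustrative inventions, not drawn from the paper's benchmarks; they only show the shape of the pipeline:

```lean
-- Step 1: the generalist model (DeepSeek-V3) emits a proof sketch whose
-- subgoals are stated as `have` steps and left unproven with `sorry`.
theorem sum_sq_ge_two_mul (a b : ℝ) : a ^ 2 + b ^ 2 ≥ 2 * a * b := by
  have h1 : (a - b) ^ 2 ≥ 0 := by sorry
  have h2 : (a - b) ^ 2 = a ^ 2 + b ^ 2 - 2 * a * b := by sorry
  -- Step 2: the specialized 7B prover replaces each `sorry` with a
  -- verified tactic proof; Step 3 recombines the pieces with the
  -- original informal reasoning into paired training data.
  linarith [h1, h2]
```

Each filled-in subgoal is checked by the Lean verifier before the assembled proof enters the training set, which is what makes the synthetic data reliable as supervision.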

This hierarchical, compositional framework enables the model to handle problems with deep reasoning chains and promotes detailed structural alignment between natural and formal proof modalities.

2. Training Methodology and Reinforcement Learning

DeepSeek-Prover-V2’s training is a multi-stage process (Ren et al., 30 Apr 2025):

  1. Cold-Start Data Generation: Training data is bootstrapped by subgoal decomposition using DeepSeek-V3, followed by recursive subgoal solution. The resultant dataset contains pairs of informal chain-of-thought (CoT) traces and corresponding verified Lean 4 proofs, synthesized from both high-school/college-level competition problems (miniF2F, ProverBench) and undergraduate benchmarks (PutnamBench, ProofNet).
  2. Supervised Fine-Tuning: The model is first fine-tuned on both non–CoT and CoT data in a curriculum learning regime. Long-context windows (up to 32,768 tokens for the largest model) permit training on extended proof traces and substantial stepwise reasoning.
  3. Reinforcement Learning with Consistency Rewards:
    • Objective: Maximize the probability that the Lean 4 verifier accepts a sampled proof and that the proof’s structure aligns with subgoal decomposition.
    • Implementation: Training applies Group Relative Policy Optimization (GRPO), using a binary reward R(τ) equal to 1 for each proof τ accepted by the Lean compiler and 0 otherwise. An additional "consistency reward" encourages the formal proof to closely match the guidance of the initial chain-of-thought decomposition.
    • Iteration: Correct proofs generated by the model are iteratively incorporated back into the dataset, refining the policy via expert iteration.

This schema couples the benefits of imitation learning (dense, structure-rich supervision from synthetic expert data) with policy optimization under formal verification constraints.
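A minimal sketch of this reward scheme, assuming a simple additive consistency bonus and per-group reward normalization in the style of GRPO (the 0.1 bonus weight and the exact normalization are illustrative choices, not specified values from the paper):

```python
from statistics import mean, pstdev

def proof_reward(lean_accepts: bool, matches_decomposition: bool,
                 consistency_weight: float = 0.1) -> float:
    """Binary verifier reward plus a small consistency bonus.
    The 0.1 weight is an illustrative assumption, not from the paper."""
    base = 1.0 if lean_accepts else 0.0
    bonus = consistency_weight if (lean_accepts and matches_decomposition) else 0.0
    return base + bonus

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: each sampled proof's reward is normalized
    against the mean (and std) of its own sampling group, as in GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma if sigma > 0 else 1.0) for r in rewards]

# A group of 4 proofs sampled for one statement: two verify, one of those
# also follows the subgoal decomposition.
rewards = [proof_reward(True, True), proof_reward(True, False),
           proof_reward(False, False), proof_reward(False, False)]
advs = grpo_advantages(rewards)  # positive for accepted proofs, negative otherwise
```

Verified proofs with positive advantage are then recycled into the dataset, which is the expert-iteration loop described above.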

3. Architecture, Scalability, and Resource Considerations

DeepSeek-Prover-V2 is available in several scales, notably:

  • 7B-parameter variant: Computationally tractable, competitive on moderate benchmarks, forming the backbone of recursive subgoal-solving in the decomposition pipeline.
  • 671B-parameter variant: Extended context size (up to 32k tokens), critical for expressing and learning long proofs. This version yields steep improvements in sample efficiency and accuracy as sample budgets increase.

Training and inference employ large-scale mixture-of-experts transformer architectures, with detailed memory allocation and throughput optimizations (Zhang et al., 11 Feb 2025):

  • Activation recomputation and 3D parallelism strategies are applied to control device-level memory consumption.
  • ZeRO optimizations reduce per-device parameter storage from ~11.6GB to ~1.4GB in large-scale distributed settings.
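The reported reduction is consistent with ZeRO-style partitioning of parameter state across devices; the back-of-envelope arithmetic below assumes 8-way sharding purely to show that the cited figures are mutually consistent, not as a detail from the paper:

```python
def per_device_param_bytes(total_bytes: float, num_shards: int) -> float:
    """With ZeRO-style partitioning, each device holds roughly a
    1/num_shards slice of the replicated parameter state."""
    return total_bytes / num_shards

GB = 1024 ** 3
unsharded = 11.6 * GB                              # reported per-device storage
sharded = per_device_param_bytes(unsharded, 8)     # assumed 8-way sharding
print(round(sharded / GB, 2))                      # ≈ 1.45, near the reported ~1.4 GB
```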

4. Empirical Performance and Benchmarking

The primary empirical results demonstrate state-of-the-art performance across several formal mathematics benchmarks (Ren et al., 30 Apr 2025):

  • miniF2F-test: DeepSeek-Prover-V2–671B achieves up to 88.9% pass ratio in CoT mode at large (8192) sample budgets.
  • PutnamBench: Solves 49 out of 658 problems, surpassing prior state-of-the-art open models (e.g., Goedel-Prover-SFT, STP).
  • ProverBench: Newly released, challenging suite of 325 problems, including 15 hard AIME 2024–2025 instances (6 solved by DeepSeek-Prover-V2 versus 8 by DeepSeek-V3 in informal mode, showing a narrowing gap between formal and informal mathematical reasoning).

Sample efficiency varies by model scale and mode; the 7B model is competitive but scales sublinearly compared to the 671B variant. Rich context handling enables the model to outperform prior work on problems requiring extended multi-step reasoning.
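The pass rates above are reported at fixed sample budgets. A common way to estimate such pass@k figures is the unbiased estimator popularized by the Codex evaluation (Chen et al.); the paper may use a different protocol, so this is a general-purpose sketch rather than the authors' exact method:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k proofs drawn
    without replacement from n samples (c of which verify) succeeds."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-draw with all failures
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 8192 sampled proofs, 100 accepted by the verifier.
estimate = pass_at_k(8192, 100, 32)
```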

5. Integration of Informal and Formal Reasoning

A hallmark of DeepSeek-Prover-V2 is its explicit integration of natural-language reasoning into the formal synthesis loop:

  • Subgoal decomposition is performed using free-form chain-of-thought generated by a strong generalist LLM (DeepSeek-V3), which is then mapped to formal Lean subgoals.
  • The CoT data structures the informal explanation with the formal reasoning trace, explicitly connecting steps that would otherwise remain implicit.
  • Reinforcement learning’s consistency reward ensures that the produced Lean proofs closely track the structure outlined in the informal CoT, facilitating alignment between "human-style" strategy and formally verifiable code.

This methodology is designed to close the gap between models tuned for informal mathematical explanation and those trained for rigorous formalization.
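One crude way such decomposition-consistency could be checked is to test whether each subgoal statement from the sketch reappears in the final proof. The paper's actual consistency reward is not specified at this level of detail, so the substring check below is a deliberately simplified toy:

```python
def follows_decomposition(sketch_subgoals: list[str], final_proof: str) -> bool:
    """Toy structural check: every subgoal statement from the CoT sketch
    must appear verbatim in the completed Lean proof. The real consistency
    reward is richer; this only illustrates the alignment idea."""
    return all(subgoal in final_proof for subgoal in sketch_subgoals)

# Hypothetical sketch subgoals and a completed proof that preserves them:
sketch = ["(a - b) ^ 2 ≥ 0",
          "(a - b) ^ 2 = a ^ 2 + b ^ 2 - 2 * a * b"]
proof = """
theorem sum_sq_ge_two_mul (a b : ℝ) : a ^ 2 + b ^ 2 ≥ 2 * a * b := by
  have h1 : (a - b) ^ 2 ≥ 0 := by positivity
  have h2 : (a - b) ^ 2 = a ^ 2 + b ^ 2 - 2 * a * b := by ring
  linarith [h1, h2]
"""
ok = follows_decomposition(sketch, proof)  # True: the proof tracks the sketch
```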

6. Comparative Performance Against Contemporary Provers

When compared to contemporary and prior automated provers:

  • Versus Goedel-Prover-V2: DeepSeek-Prover-V2–671B achieves 82.4% pass@32 on miniF2F, while Goedel-Prover-V2–8B (∼80× smaller) attains 84.6%, and the Goedel-Prover-V2–32B surpasses it with 88.1% (up to 90.4% with self-correction) (Lin et al., 5 Aug 2025).
  • Versus Leanabell-Prover-V2: Integration of verifier feedback and iterative self-correction yields an absolute 2.0% improvement over DeepSeek’s 7B base at pass@128 on miniF2F (Ji et al., 11 Jul 2025). Both systems reveal that interactive feedback integration significantly boosts inference reliability.
  • Resource Efficiency: DeepSeek-Prover-V2’s largest variant outperforms previous SOTA on many metrics, but Goedel-Prover-V2’s smaller models deliver similar or better performance with far fewer parameters and compute.
  • Test-Time Scaling: Innovations such as value-guided search (VGS) (Wang et al., 23 May 2025) and hybrid LLM-guided frameworks (e.g., ProofCompass (Wischermann et al., 18 Jul 2025)) further increase efficiency and accuracy under constrained sampling budgets.

A plausible implication is that DeepSeek-Prover-V2’s main strengths lie in its robust CoT-formal integration and scalability, while efficiency and sample diversity may benefit from the advanced correction and scoring strategies seen in the latest competitive systems.

7. Impact and Prospects

The DeepSeek-Prover-V2 paradigm marks a significant advance in integrating recursive problem decomposition, formal reasoning, and reinforcement learning—all at scale and in open-source form (Ren et al., 30 Apr 2025). The unified handling of informal and formal reasoning not only boosts pass rates on challenging formal benchmarks but demonstrates diminishing separation between stepwise informal reasoning and fully formalized proof construction.

Future prospects include:

  • Further scaling, including optimization of RL objectives and curriculum learning driven by synthetic scaffolding.
  • Integration of more granular verifier interaction (as seen in Goedel-Prover-V2 and Leanabell-Prover-V2) to further narrow the gap in inference efficiency.
  • Expansion of training data to wider mathematical domains, leveraging diverse autoformalization and robust data synthesis approaches.
  • Adoption of block-wise value guidance and hybrid architectures for efficient test-time scaling under compute constraints.

The transparent release of code, models, and newly benchmarked datasets (e.g., ProverBench) is expected to accelerate community-wide progress toward general, robust, and interpretable neural-symbolic theorem proving.