DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition (2504.21801v1)

Published 30 Apr 2025 in cs.CL and cs.AI

Abstract: We introduce DeepSeek-Prover-V2, an open-source LLM designed for formal theorem proving in Lean 4, with initialization data collected through a recursive theorem proving pipeline powered by DeepSeek-V3. The cold-start training procedure begins by prompting DeepSeek-V3 to decompose complex problems into a series of subgoals. The proofs of resolved subgoals are synthesized into a chain-of-thought process, combined with DeepSeek-V3's step-by-step reasoning, to create an initial cold start for reinforcement learning. This process enables us to integrate both informal and formal mathematical reasoning into a unified model. The resulting model, DeepSeek-Prover-V2-671B, achieves state-of-the-art performance in neural theorem proving, reaching 88.9% pass ratio on the MiniF2F-test and solving 49 out of 658 problems from PutnamBench. In addition to standard benchmarks, we introduce ProverBench, a collection of 325 formalized problems, to enrich our evaluation, including 15 selected problems from the recent AIME competitions (years 24-25). Further evaluation on these 15 AIME problems shows that the model successfully solves 6 of them. In comparison, DeepSeek-V3 solves 8 of these problems using majority voting, highlighting that the gap between formal and informal mathematical reasoning in LLMs is substantially narrowing.

Summary

  • The paper introduces a recursive subgoal decomposition framework that synthesizes high-quality training data by linking natural language reasoning to formal Lean proofs.
  • It employs a dual-stage training strategy combining a large general-purpose model with a smaller prover model, enhanced by reinforcement and curriculum learning.
  • Empirical evaluations demonstrate state-of-the-art performance on benchmarks like MiniF2F-test and ProverBench, showcasing improved formal reasoning and generalization.

DeepSeek-Prover-V2 introduces an open-source LLM specifically designed for formal theorem proving within the Lean 4 proof assistant. The paper presents a novel training approach centered on generating high-quality synthetic data through a recursive theorem proving pipeline powered by a larger general-purpose model, DeepSeek-V3 (2412.19437).

The core methodology involves bridging the gap between informal mathematical reasoning and the strict requirements of formal verification. This is achieved through a process of subgoal decomposition guided by an LLM. For a given formal theorem statement, DeepSeek-V3 is first prompted to analyze the problem in natural language, producing a chain-of-thought reasoning process. It then decomposes the overall proof into a sequence of intermediate steps, expressed as formal Lean `have` statements with `sorry` placeholders standing in for the detailed proofs. This mirrors how human mathematicians break complex problems down into smaller lemmas.
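As a hypothetical illustration (the statement and subgoals below are invented for this summary, not taken from the paper's data), a decomposition produced at this stage might look like:

```lean
-- Hypothetical decomposition sketch. DeepSeek-V3 emits the proof
-- skeleton; each `sorry` marks an unresolved subgoal to be proven later.
theorem example_problem (a b : ℝ) (h₁ : a + b = 2) (h₂ : a * b = 1) :
    a = 1 ∧ b = 1 := by
  have s₁ : (a - b) ^ 2 = 0 := by sorry
  have s₂ : a = b := by sorry
  have s₃ : a = 1 := by sorry
  sorry
```

Each `have` introduces a named intermediate fact, so later steps can build on earlier ones within the same skeleton.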

To make this decomposition tractable, the authors employ a recursive resolution strategy. The intermediate Lean statements generated by DeepSeek-V3 define subgoals. These subgoals are then extracted and presented as new problems to a smaller, more computationally efficient 7B prover model (based on DeepSeek-Prover-V1.5-Base-7B). The 7B model is tasked with generating the specific Lean tactics or proofs required to fill in the `sorry` placeholders for each subgoal. The subgoals are structured such that results from previous steps can be used as premises for subsequent ones (as illustrated in Figure 3), enabling a modular and localized proof search. Once all decomposed subgoals are successfully proven by the 7B model, the individual proof segments are combined to construct a complete formal proof for the original, more complex theorem.
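Continuing the hypothetical decomposition sketched earlier, an extracted subgoal posed to the 7B prover might look like the following, with the conclusion of an earlier subgoal carried over as a premise:

```lean
-- Hypothetical extracted subgoal (illustrative, not from the paper).
-- The earlier subgoal's conclusion (s₁) becomes an explicit premise,
-- and the 7B prover must replace the `sorry` with a tactic proof.
theorem subgoal_two (a b : ℝ) (s₁ : (a - b) ^ 2 = 0) : a = b := by
  sorry
```

Because each subgoal is self-contained, the 7B model's proof search stays local and cheap, and a failure on one subgoal does not invalidate proofs already found for the others.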

This recursive proof search pipeline generates valuable synthetic training data. The successful step-by-step formal proofs, synthesized from the resolved subgoals, are paired with DeepSeek-V3's initial natural-language chain-of-thought reasoning process. This combined data forms a high-quality "cold-start" dataset that explicitly links informal reasoning steps to concrete formal proof structures. This dataset is used to train the DeepSeek-Prover-V2 model, unifying the capabilities of high-level reasoning and detailed formalization within a single model.
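A minimal sketch of what one such cold-start record could look like, assuming a simple key-value schema (the field names and contents here are illustrative, not the paper's actual data format):

```python
# Hypothetical cold-start training record. All field names and values
# are invented for illustration; the paper does not specify a schema.
record = {
    "statement": "theorem example_problem (a b : ℝ) ...",  # formal Lean statement
    "informal_cot": "First, note that (a - b)^2 = (a + b)^2 - 4ab ...",
    "decomposition": [  # have-statements emitted by DeepSeek-V3
        "have s₁ : (a - b) ^ 2 = 0 := by sorry",
    ],
    "formal_proof": "...",  # subgoal proofs from the 7B prover, recombined
}

# The supervised target pairs the informal reasoning with the complete
# formal proof, so the model learns to reason before formalizing.
target = record["informal_cot"] + "\n" + record["formal_proof"]
```

The key point is the pairing itself: every formal proof in the dataset arrives with the natural-language reasoning that motivated its structure.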

To further enhance the model's capabilities and address problems not solved during the initial data collection, the authors incorporate a curriculum learning framework and reinforcement learning. The decomposed subgoals are used to generate conjectural theorems, creating a range of training tasks with varying difficulty. These tasks are integrated into an expert iteration loop, where the model attempts to solve unsolved problems, and successful proofs are added back to the training data, progressively improving the model's performance on challenging theorems. Following supervised fine-tuning on the synthetic data and other datasets (2502.07640, 2502.00212, 2410.15700), a reinforcement learning stage is applied. The model is trained using Group Relative Policy Optimization (GRPO) (2402.03300), optimizing the policy based on binary correct-or-incorrect feedback from the Lean proof assistant. A consistency reward is also introduced in early training to encourage the generated proofs to align structurally with the decomposed `have` statements from the chain-of-thought, improving reasoning flow.
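The group-relative part of GRPO can be sketched as follows: for each theorem, several proofs are sampled, Lean assigns each a binary reward, and each sample's advantage is its reward normalized against the group. This is a simplified sketch of the advantage computation only (one common normalization, not DeepSeek's actual implementation, which also includes the clipped policy-ratio objective and the consistency reward):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sample's binary reward
    against the mean and standard deviation of its own group.
    Simplified sketch; omits the clipped surrogate objective."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All samples passed or all failed: no relative signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# 8 proof attempts for one theorem; Lean verifies 2 of them.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
adv = grpo_advantages(rewards)
```

Verified proofs receive positive advantage and failed ones negative, so the policy is pushed toward whatever distinguishes the successful attempts, without a separate learned value model.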

DeepSeek-Prover-V2 is trained in two main stages, resulting in two generation modes: a non-Chain-of-Thought (non-CoT) mode for efficient, concise proof generation and a Chain-of-Thought (CoT) mode that explicitly articulates intermediate reasoning steps before generating the formal proof (examples provided in Appendix A). The non-CoT mode is primarily trained via expert iteration on data from the recursive pipeline and other sources, while the CoT mode is fine-tuned on the synthetic CoT data and further enhanced by reinforcement learning. A smaller 7B model is also distilled from the larger 671B model and trained with reinforcement learning for more cost-efficient proving.

The empirical evaluation demonstrates that DeepSeek-Prover-V2-671B achieves state-of-the-art performance on several formal theorem proving benchmarks. On the MiniF2F-test dataset (2009.03393), the CoT mode achieves 88.9% Pass@8192, outperforming previous models (Table 1). The 7B version also shows competitive performance, surpassing other open-source provers. The CoT mode consistently outperforms the non-CoT mode, highlighting the benefit of explicitly modeling the reasoning process, albeit at the cost of generating significantly more tokens (Table 2). The model demonstrates strong generalization to undergraduate-level mathematics, achieving 37.1% Pass@1024 on ProofNet-test (2302.12433) and solving 49 out of 658 problems on PutnamBench (2407.10040) with the 671B CoT model (Table 3). Interestingly, the 7B non-CoT model discovered specific proof skills (e.g., using `Cardinal.toNat` and `Cardinal.natCast_inj` for finite cardinality problems) that enabled it to solve some problems the 671B model did not (examples in Appendix B).
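Pass@K here counts a problem as solved if at least one of K sampled proofs is verified by Lean. When n ≥ k samples are available per problem, the standard unbiased estimator popularized by the HumanEval evaluation can be used; the paper does not state here whether it uses this estimator or direct counting, so treat this as a general-purpose sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    proofs drawn (without replacement) from n samples, of which c are
    verified, is correct. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer failures than draws: success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 samples of which 1 verifies, `pass_at_k(2, 1, 1)` gives 0.5, matching the intuition that a single random draw succeeds half the time.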

The paper also introduces ProverBench, a new benchmark dataset comprising 325 formalized problems, including 15 problems from recent AIME competitions and 310 problems from textbooks covering diverse mathematical areas (Table 5). DeepSeek-Prover-V2-671B with CoT achieves 59.1% Pass@512 on ProverBench overall and solves 6 out of the 15 formalized AIME problems (Table 4), demonstrating capabilities on challenging competition problems. A comparison with DeepSeek-V3's performance on the same AIME problems (solving 8 out of 15 via natural language reasoning) suggests that the gap between informal and formal mathematical reasoning in LLMs is closing.

In summary, DeepSeek-Prover-V2 advances neural theorem proving by synthesizing high-quality training data through a recursive subgoal decomposition pipeline powered by a large general-purpose LLM, unifying informal and formal reasoning. The training incorporates curriculum learning and reinforcement learning with a focus on structural alignment to reasoning steps. The resulting models achieve state-of-the-art performance across various benchmarks, showcasing improved formal reasoning capabilities and generalization. The data synthesis approach and the combination of different model sizes and generation modes offer practical strategies for building powerful and efficient theorem provers. Future work aims to apply this paradigm to tackle even more challenging problems like those from the International Mathematical Olympiad.
