
DeepSeek-Prover-V2-671B: Formal Math Prover

Updated 30 June 2025
  • DeepSeek-Prover-V2-671B is an open-source large language model for formal theorem proving in Lean 4 that integrates informal chain-of-thought planning with verified proof synthesis.
  • Its architecture is a Mixture-of-Experts transformer with Multi-head Latent Attention, and it achieves benchmark-leading results on MiniF2F and PutnamBench.
  • The training pipeline combines synthetic data generation, curriculum learning, and reinforcement learning to effectively narrow the gap between human-like reasoning and formal verification.

DeepSeek-Prover-V2-671B is an open-source LLM designed for state-of-the-art formal mathematical theorem proving in Lean 4. It is grounded in the DeepSeek-V3 architecture (Mixture-of-Experts Transformer, 671B parameters, 37B active per token), and specifically engineered to unify informal chain-of-thought (CoT) reasoning with formally verifiable Lean code generation. DeepSeek-Prover-V2-671B achieves benchmark-leading results on MiniF2F and PutnamBench, and demonstrates that the longstanding gap between informal (natural language) and formal (proof assistant-compatible) mathematical reasoning in LLMs is narrowing considerably.

1. Model Architecture and Reasoning Modes

DeepSeek-Prover-V2-671B is built upon the DeepSeek-V3 Mixture-of-Experts architecture with two central innovations: Multi-head Latent Attention (MLA) for memory-efficient context handling, and DeepSeekMoE for scalable, communication-efficient sparse computation. The model’s formal reasoning specialization involves extensive SFT and RL on Lean 4 proof tasks, using both chain-of-thought and direct proof script synthesis.
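The sparse-activation idea behind DeepSeekMoE can be illustrated with a toy top-k gating routine. This is a generic sketch of sparse expert routing, not DeepSeekMoE's exact gating scheme; the function name and numbers below are ours:

```python
import math

def moe_route(gate_logits, k=2):
    """Toy top-k sparse routing: keep the k highest-scoring experts and
    renormalize their softmax weights, so only a small fraction of the
    network's parameters is active for each token."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    exps = [math.exp(gate_logits[i]) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

# Four hypothetical experts; only two are activated for this token.
gates = moe_route([0.1, 2.0, -1.0, 0.5], k=2)
```

The same principle, scaled up, is what lets a 671B-parameter model run with only 37B parameters active per token.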

  • Chain-of-Thought (CoT) Reasoning: The model supports step-by-step decompositions, mimicking human problem-solving with explicit intermediate steps that are subsequently formalized and verified in Lean 4.
  • Concise (non-CoT) Mode: For efficient proof script generation without overt stepwise explanations, suitable for production or direct Lean 4 ingestion.
  • Formal Target: Native output is Lean 4 code, ensuring compatibility with modern mathematical proof assistants.

This integration of informal and formal reasoning enables DeepSeek-Prover-V2-671B to plan, decompose, and verify mathematical arguments much like an experienced human mathematician, but at the scale and speed of a large LLM.
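As a concrete illustration of the formal target, here is a trivial hand-written Lean 4 proof of the kind the model is trained to emit (this example is ours, not model output):

```lean
-- Ordinary Lean 4 source: the proof assistant checks it end-to-end,
-- giving a binary verified/failed signal with no partial credit.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```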

2. Training Pipeline: Synthetic Data, Curriculum, and RL

The training procedure for DeepSeek-Prover-V2-671B blends synthetic data generation with advanced curriculum and reinforcement learning techniques:

  • Recursive Theorem Proving Pipeline with DeepSeek-V3:
    • Begin with DeepSeek-V3 generating high-level proof sketches and subgoal decompositions for complex mathematics problems.
    • Each subgoal is formalized to Lean 4 code (with sorry placeholders where needed).
    • These subgoals are recursively solved using a smaller LLM-based prover; all successful subproofs are compiled into a full, verifiable Lean script.
  • Curriculum Construction:
    • Subgoals and their formalizations are added to the data pipeline, building from easier (shorter, decomposed) problems up to the original complex theorems.
    • Both chain-of-thought (annotated) and direct script samples are used.
  • Reinforcement Learning (GRPO Algorithm):
    • After initial SFT, the model is improved via Group Relative Policy Optimization, using formal proof verification in Lean 4 as a binary reward signal.
    • An auxiliary structure consistency reward early in RL promotes outputs that respect the planned proof structure and avoid skipping necessary intermediate lemmas.
  • Cognitive Behaviors and Data Augmentation:
    • Training incorporates prompts and samples designed to promote reflection, error detection, and correction, echoing best practices found in human mathematical reasoning (Zhang et al., 8 Apr 2025).
    • The combined effect is a model both adept at producing correct proofs and robust at self-improvement through recursive bootstrapping.
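The decomposition-with-placeholders step can be sketched in Lean 4 itself. The theorem below is a hand-written toy, not pipeline output; each `sorry` marks a subgoal handed to the smaller prover before the full script is recompiled:

```lean
-- Toy sketch of the recursive pipeline: the sketch model emits the
-- proof skeleton, the smaller prover closes each `sorry`, and the
-- assembled script is verified as a whole.
theorem subgoal_demo (a b c : Nat) (h : a ≤ b) : a ≤ b + c := by
  have h1 : b ≤ b + c := sorry  -- subgoal left for the smaller prover
  exact Nat.le_trans h h1
```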

3. Performance Metrics and Benchmark Results

DeepSeek-Prover-V2-671B sets new performance records across a range of formal mathematics benchmarks:

| Benchmark | Pass@32 (%) | Pass@higher budget (%) | Problems Solved |
|---|---|---|---|
| MiniF2F-test | 82.4 | 88.9 (@8192) | 217/244 |
| ProofNet-test | – | 37.1 (@1024) | – |
| PutnamBench | – | – | 49/658 |
| ProverBench | 52.9 | 59.1 (@512) | – |
| ProverBench (AIME 2024/25) | – | – | 6/15 (Lean proofs) |
| CombiBench | – | – | 12/77 (domain transfer test) |
  • Comparison to Other Models:
    • Greatly surpasses prior 7B and stepwise models in pass rates (e.g., >10% higher on MiniF2F-test).
    • For AIME 2024–2025, DeepSeek-Prover-V2-671B (solves 6/15) approaches the performance of DeepSeek-V3 in informal answer-finding mode (8/15), illustrating the reduction of the informal–formal gap.
  • Proof Quality:
    • Proofs generated are formally verified in Lean, are concise, and often leverage competition-math-style strategies.
    • CoT mode substantially improves sample efficiency and coverage, validating the efficacy of explicit intermediate reasoning.
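Pass@k figures like those above are conventionally computed with the unbiased estimator introduced by Chen et al. (2021); whether this paper uses exactly that estimator is not stated here, but the standard formula is:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: the probability that at least one of k
    samples, drawn without replacement from n generated attempts of which
    c are correct, passes the verifier."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, 1 correct proof among 10 samples gives pass@1 = 0.1 under this estimator.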

4. Model Efficiency, Scaling, and Quantization

DeepSeek-Prover-V2-671B’s architecture is optimized for both computational efficiency and large-scale deployment:

  • Sparsity and Active Parameters: Only 37B parameters are activated per token (out of 671B total), reducing both training and inference cost compared to fully dense models.
  • Memory and Throughput: MLA significantly compresses attention KV memory (up to 93.3% reduction compared to standard MHA), enabling large-context operation (128K tokens).
  • Training Infrastructure: Distributed pipeline, tensor, and expert parallelism alongside advanced ZeRO optimization keeps device memory manageable even at largest scales (Zhang et al., 11 Feb 2025).
  • Quantization: 4-bit quantization (Q4_K_M) and dynamic 3-bit quantization (DQ3_K_M) reduce model memory to 377GB and 281GB respectively, with negligible accuracy drop (<1%). DQ3_K_M enables single-node, multi-GPU deployment for local/private inference (Zhao et al., 5 May 2025).
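The reported checkpoint sizes imply an average bit-width per weight; a quick back-of-envelope check (our arithmetic, not from the paper — real formats also store scales and keep some tensors at higher precision):

```python
def effective_bits_per_weight(size_gb, n_params):
    """Average bits per weight implied by a reported checkpoint size
    (rough arithmetic; ignores format overhead and mixed precision)."""
    return size_gb * 1e9 * 8 / n_params

N = 671e9  # total parameters
b_q4 = effective_bits_per_weight(377, N)   # Q4_K_M: roughly 4.5 bits/weight
b_dq3 = effective_bits_per_weight(281, N)  # DQ3_K_M: roughly 3.35 bits/weight
```

Both figures are consistent with K-quant-style formats, which average somewhat above their nominal bit-width because of per-block scale metadata.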

5. Impact on Formal and Informal Reasoning

DeepSeek-Prover-V2-671B’s success illustrates the convergence between informal mathematical problem solving and formal proof automation:

  • Empirical Narrowing of the Reasoning Gap: On challenging benchmarks such as AIME, the model’s rate of formal proof generation nearly matches that of informal answer-finding via LLMs like DeepSeek-V3.
  • Automated Curriculum: By decomposing and formalizing subgoals, DeepSeek-Prover-V2-671B enables scalable generation of training material, driving self-improvement analogous to “self-play” in game AI.
  • Open Source and Evaluation: With the release of ProverBench and trained models, the community gains new testbeds and resources for further research.

6. Methodological Innovations and Synergies

The DeepSeek-Prover-V2-671B approach demonstrates key methodological advances:

  • Recursive Data Generation: Synthetic, recursively verified proof data overcomes the scarcity of high-quality formal training resources, as formalization and proof search are jointly LLM-driven (Xin et al., 23 May 2024).
  • Unification of Reasoning Modalities: The use of chain-of-thought both for informal planning and as an input to RL enables transfer of strategies between natural language and formal code domains.
  • RL on Lean 4 Rewards: The binary verifier signal from Lean 4 is essential for aligning model behavior with strict formal correctness (Zhang et al., 8 Apr 2025, Ren et al., 30 Apr 2025).
  • Efficient Test-Time Computation: Search strategies such as block-wise value-guided search and temporal consistency methods demonstrated on DeepSeek-R1 and its distilled variants may plausibly offer further efficiency and reliability gains when integrated at 671B scale (Wang et al., 23 May 2025, Guo et al., 18 Mar 2025).
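The group-relative advantage at the heart of GRPO (used in the Section 2 RL stage) can be sketched in a few lines. This is a minimal sketch assuming the standard within-group mean/std normalization; the function name and sample group are ours:

```python
def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: each sampled proof is scored
    against the mean of its own group, normalized by the group's standard
    deviation, so no learned value function (critic) is needed."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

# Binary Lean 4 verifier reward: 1.0 if the proof compiles, else 0.0.
# Hypothetical group of 8 sampled proofs for one theorem statement:
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0])
```

Samples that compile receive positive advantage and failures negative, which pushes the policy toward verifier-accepted proofs without training a critic at 671B scale.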

7. Comparative Landscape and Future Directions

  • Relative to Human Mathematicians and Prior LLMs: DeepSeek-Prover-V2-671B, by matching informal solution rates with formal proof rates, moves LLMs closer to supporting both mathematician-style speculation and rigorous formal verification in a unified system.
  • Stepwise vs. Whole-Proof Paradigms: While DeepSeek-Prover-V2-671B adopts a “whole-proof by subgoal” approach, recent stepwise provers employing multi-perspective, critic-plus-heuristic search (e.g., MPS-Prover) achieve competitive or even superior efficiency in certain regimes with far fewer parameters (Liang et al., 16 May 2025).
  • Implications for Automated Science: The DeepSeek-Prover-V2-671B pipeline—particularly the recursive, RL-driven augmentation of proof data—suggests scalable pathways for LLMs to act as both automated mathematicians and formal verifiers in collaborative scientific workflows.

References Table

| Paper/Resource | Contribution to DeepSeek-Prover-V2-671B |
|---|---|
| DeepSeek-V3 Technical Report (DeepSeek-AI et al., 27 Dec 2024) | Base architecture, MLA/MoE, training and scaling strategy |
| DeepSeek-Prover-V2: RL for Subgoal Decomposition (Ren et al., 30 Apr 2025) | Main methodology, benchmark results, recursive pipeline |
| Leanabell-Prover: Posttraining & RL Scaling (Zhang et al., 8 Apr 2025) | RL pipeline, cognitive data, reflection in training |
| Quantitative Analysis of Performance Drop in Quantization (Zhao et al., 5 May 2025) | Efficient deployment, quantization strategy |
| Value-Guided Search for Efficient CoT Reasoning (Wang et al., 23 May 2025) | Efficient test-time search/selection methods for chain-of-thought |
| MPS-Prover: Multi-Perspective Stepwise Proving (Liang et al., 16 May 2025) | Efficient stepwise proof search as a comparison paradigm |
| DeepSeek-Prover: Synthetic Data, Bootstrapping (Xin et al., 23 May 2024) | Early pipeline for large-scale formal proof data generation |

Conclusion

DeepSeek-Prover-V2-671B exemplifies the successful unification of informal reasoning and formal mathematical proof synthesis within a large Mixture-of-Experts LLM. Leveraging recursive data generation, strong RL on formal reward signals, and architectural efficiencies originating from DeepSeek-V3, it raises state-of-the-art pass rates on verifiable Lean 4 proof benchmarks. The result is a narrowing divide between "thinking like a mathematician" and "proving like a machine," and a rapidly expanding frontier for open-source formal reasoning systems at scale.
