
Goedel-Prover: Open-Source ATP in Lean 4

Updated 24 August 2025
  • Goedel-Prover is an open-source series of large language models designed for synthesizing complete formal proofs in Lean 4 with state-of-the-art performance.
  • It integrates expert iteration, scaffolded data synthesis, and verifier-guided self-correction to automatically generate and refine formal proofs.
  • Benchmark results and model averaging techniques demonstrate its high sample efficiency and reproducible, state-of-the-art performance in automated theorem proving research.

Goedel-Prover denotes a series of open-source LLMs and supporting frameworks for automated theorem proving (ATP) in formal mathematics, with an explicit focus on generating complete formal proofs in Lean 4. The Goedel-Prover project addresses the longstanding challenge of insufficient large-scale, formally verified mathematical datasets by combining high-volume autoformalization, expert iteration guided by formal proof verification, scaffolded data synthesis, and reinforcement learning with verifier feedback. Successive versions of Goedel-Prover (notably Goedel-Prover-V2) have established new state-of-the-art (SOTA) results for benchmark mathematical reasoning tasks, combining performance, sample efficiency, and open-source accessibility.

1. Model Architecture and Iterative Training Pipeline

Goedel-Prover is fundamentally a sequence of LLMs optimized for whole-proof synthesis in Lean 4, focusing on the direct generation of complete formal proofs later checked for validity by Lean’s compiler. The architecture inherits from and refines prior lines such as DeepSeek-Prover, but with a number of distinguishing features:

  • Expert Iteration: At each generation stage, the model produces candidate proofs for a large batch of formal statements. Only proofs verified by the Lean system are incorporated into the next iteration’s training set, ensuring continual improvement and correction of prior errors (a schematic sketch of this loop appears after this list).
  • Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL): Initial model refinement is performed through SFT on verified proof samples, followed by RL that exploits Lean’s binary accept/reject signal as a reward. RL optimization employs policy-gradient objectives such as GRPO, where the reward is strictly based on Lean’s acceptance of the proof.
  • Scaffolded Data Synthesis: Especially in Goedel-Prover-V2 (Lin et al., 5 Aug 2025), the data is actively scaffolded for curriculum learning. Synthetic problems are generated and ranked by difficulty:
    • Formally, Lean’s extract_goal is used to mine failing subgoals from incorrect proofs. These are converted into easier, focused theorem-proving tasks, expanding the dataset with their respective correct or negated variants (see the Lean example after this list).
    • Informally, a pretrained LLM such as Qwen3-32B is used to generate graded-difficulty variants of the original problem, which are then formalized and filtered with human-in-the-loop or judge-prompts for faithfulness and correctness.
  • Verifier-Guided Self-Correction: When the model’s initial output fails Lean’s verification, error messages are parsed and provided as context for re-generation. Through iterative “compiler-in-the-loop” refinement, the model learns correction strategies, with each loop typically yielding consistent pass-rate improvement (2–3 percentage points on pass@32).
  • Model Averaging: To counteract the loss of output diversity often seen after successive SFT/RL cycles, model averaging is applied. Parameters from pre-fine-tuning and post-fine-tuning checkpoints are merged:

    $$\theta_{\text{avg}} = (1-\alpha)\cdot\theta_0 + \alpha\cdot\theta,$$

    with $\alpha$ tuned based on validation diversity metrics. This maintains sampling diversity for high pass@N metrics and robust proof strategies across benchmarks.
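
The following is a minimal, self-contained sketch of the expert-iteration loop with checkpoint averaging. All names here (`ProverModel`, `sample_proofs`, `lean_verifies`, `finetune`) are hypothetical stand-ins rather than the project’s actual API; the stubs exist only to make the control flow runnable.

```python
"""Sketch of expert iteration with model averaging (hypothetical API)."""
from dataclasses import dataclass, field
import random

@dataclass
class ProverModel:
    # A flat parameter vector stands in for the full LLM weights.
    params: list[float] = field(default_factory=lambda: [0.0] * 8)

def sample_proofs(model: ProverModel, stmt: str, n: int) -> list[str]:
    """Stand-in for whole-proof sampling from the LLM."""
    return [f"candidate proof {i} for {stmt}" for i in range(n)]

def lean_verifies(stmt: str, proof: str) -> bool:
    """Stand-in for the Lean 4 compiler's binary accept/reject check."""
    return random.random() < 0.1  # pretend ~10% of samples compile

def finetune(model: ProverModel, data: list[tuple[str, str]]) -> ProverModel:
    """Stand-in for SFT on the Lean-verified proofs."""
    return ProverModel([p + 0.1 for p in model.params])

def expert_iteration(model: ProverModel, statements: list[str],
                     rounds: int = 3, alpha: float = 0.7) -> ProverModel:
    theta_0 = list(model.params)  # pre-fine-tuning checkpoint
    for _ in range(rounds):
        # Keep only Lean-verified proofs for the next training set.
        verified = [(s, p) for s in statements
                    for p in sample_proofs(model, s, n=32)
                    if lean_verifies(s, p)]
        model = finetune(model, verified)
        # Checkpoint averaging: theta_avg = (1 - alpha)*theta_0 + alpha*theta.
        model.params = [(1 - alpha) * p0 + alpha * p
                        for p0, p in zip(theta_0, model.params)]
    return model

if __name__ == "__main__":
    final = expert_iteration(ProverModel(), ["thm_1", "thm_2"])
    print(final.params)
```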
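And a toy Lean 4 illustration of the extract_goal mechanism: the tactic prints the current (possibly failing) goal as a standalone theorem statement, which the scaffolding pipeline can then harvest as a new, easier training task. The specific example below is illustrative, not drawn from the Goedel-Pset data.

```lean
import Mathlib.Tactic

example (a b : ℕ) (h : a ≤ b) : a + 1 ≤ b + 1 := by
  -- `extract_goal` prints the goal as a standalone theorem, e.g.
  --   theorem extracted_1 (a b : ℕ) (h : a ≤ b) : a + 1 ≤ b + 1 := sorry
  -- In the scaffolding pipeline, goals mined this way from *failed*
  -- proof attempts become new, focused statements in the training set.
  extract_goal
  omega
```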

2. Dataset Construction and Statement Formalization

Goedel-Prover is distinguished by its automated construction of massive high-quality datasets for formal proof search:

  • Statement Autoformalization: Natural language problems are formalized into Lean statements via two trained “formalizers” (A and B), themselves LLMs fine-tuned on large corpora of informal/formal statement pairs (from Lean Workbook, Numina, AOPS, and Claude-annotated data).
    • Each problem is processed with both formalizers, generating multiple candidate formalizations. These are filtered with Lean’s compiler: candidates must compile (with `:= by sorry` proof stubs, the CC test) and pass a content-faithfulness check (the FC test). A code sketch of this filter appears at the end of this section.
  • Proof Generation and Mining: Candidate proofs for formalized statements are generated, Lean-verified, and merged into a continually expanding ground truth dataset (Goedel-Pset-v1, Goedel-Pset-v1-solved).
  • Scaffolded Hardness and Negation: In V2, tasks of increasing difficulty are synthesized by extracting new theorems or even negated forms from failed proof attempts, ensuring curriculum richness and coverage of edge-case logical phenomena.
| Dataset | Source | Size (~statements) | Role in Training |
| --- | --- | --- | --- |
| Goedel-Pset-v1 | Numina, AOPS, etc. | 1.64M | Statement formalization |
| Goedel-Pset-v1-solved | Iterative proof generation | ~800K | Prover SFT/RL |
| Lean Workbook | Competition, textbook | ~29.7K proved | Benchmark, validation |

This pipeline doubles the solved Lean Workbook problems over prior models and ensures both wide coverage and depth in training.
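
A minimal sketch of the candidate-and-filter pipeline described above. The helper names (`formalizer_a`, `formalizer_b`, `lean_compiles`, `passes_faithfulness`) are hypothetical stand-ins for the two formalizer LLMs and the CC/FC tests; the placeholder bodies only mark where the real Lean build and judge-prompt checks would run.

```python
"""Sketch of the dual-formalizer filtering pipeline (hypothetical API)."""

def formalizer_a(problem: str) -> str:
    """Stand-in for formalizer A (an LLM fine-tuned on informal/formal pairs)."""
    return f"theorem a_{hash(problem) % 1000} : True := by sorry"

def formalizer_b(problem: str) -> str:
    """Stand-in for formalizer B, trained on a different pair corpus."""
    return f"theorem b_{hash(problem) % 1000} : True := by sorry"

def lean_compiles(stmt: str) -> bool:
    """CC test: the statement, with a `:= by sorry` stub, must compile."""
    return ":= by sorry" in stmt  # placeholder for a real Lean build

def passes_faithfulness(problem: str, stmt: str) -> bool:
    """FC test: a judge check that the statement matches the informal problem."""
    return True  # placeholder for a judge-prompted LLM check

def formalize(problem: str) -> list[str]:
    # Both formalizers propose candidates; only those passing CC and FC survive.
    candidates = [f(problem) for f in (formalizer_a, formalizer_b)]
    return [s for s in candidates
            if lean_compiles(s) and passes_faithfulness(problem, s)]

print(formalize("Prove that the sum of two even integers is even."))
```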

3. Innovations in Training and Automation

Distinctive innovations of Goedel-Prover include:

  • Whole-Proof Generation: Instead of stepwise proof interaction, the model directly synthesizes full proofs. This reduces Lean interaction time and enables bulk sampling for pass@N metrics, though it also places the full burden of validity checking on the verifier.
  • Self-Refinement via Verifier Feedback: Goedel-Prover-V2 systematizes iterative self-correction: compiler errors are processed and fed back through long chain-of-thought prompts, with correction loops running until a valid proof is achieved or a time/memory budget is exhausted (a minimal sketch of this loop appears after this list).
  • Chain-of-Thought (CoT) and Cognitive Data: The Leanabell-Prover work (Zhang et al., 8 Apr 2025), built on Goedel-Prover, augments training with synthetic CoT data (e.g. Lean Completion and Rewriting strategies), simulating human-like debugging, self-reflection, and hypothesis adaptation by leveraging error-messages and explicit explanatory templates.
  • Model Averaging: The use of checkpoint blending throughout training stages not only improves output diversity (crucial for pass@N) but empirically yields SOTA sample-efficiency.
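
A minimal sketch of the compiler-in-the-loop self-correction cycle, assuming hypothetical stand-ins (`generate_proof`, `lean_check`) for the whole-proof LLM and the Lean verification call; the exact prompt format is illustrative, not the system’s actual template.

```python
"""Sketch of verifier-guided self-correction (hypothetical API)."""
from typing import Optional

def generate_proof(prompt: str) -> str:
    """Stand-in for whole-proof generation by the LLM."""
    return "by simp"

def lean_check(statement: str, proof: str) -> tuple[bool, str]:
    """Stand-in for Lean verification; returns (accepted, error message)."""
    return False, "simp made no progress"

def self_correct(statement: str, max_rounds: int = 3) -> Optional[str]:
    prompt = f"Prove in Lean 4: {statement}"
    for _ in range(max_rounds):          # bounded correction budget
        proof = generate_proof(prompt)
        ok, error = lean_check(statement, proof)
        if ok:
            return proof
        # Feed the compiler error back so the next attempt can repair it.
        prompt += f"\nPrevious attempt:\n{proof}\nLean error:\n{error}\nRevise:"
    return None  # budget exhausted without a verified proof

print(self_correct("a + 0 = a"))
```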

4. Benchmark Performance and Sample Efficiency

Goedel-Prover and its successors achieve leading results on standard ATP benchmarks:

| Model | Pass@32 (miniF2F) | PutnamBench (problems solved) | Size (parameters) |
| --- | --- | --- | --- |
| Goedel-Prover-SFT | 57.6% | 7 (pass@512) | ~13B? |
| Goedel-Prover-V2-8B | 84.6% | – | 8B |
| Goedel-Prover-V2-32B | 88.1% (90.4% with self-correction) | 86 (pass@184) | 32B |
| DeepSeek-Prover-V2-671B | <84% | 47 (pass@1024) | 671B |
| Leanabell-Prover-GD-RL | 59.8% | – | ? |

On miniF2F, Goedel-Prover-V2 achieves both higher pass@32 and dramatically higher sample-efficiency than all previous models, even surpassing much larger non-open systems under compute constraints (Lin et al., 5 Aug 2025).

Performance is realized under pure whole-proof generation, with Lean serving as the verification oracle. The models are fully open-source, including all code, datasets, and even the iterative training recipes.
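
For reference, pass@N figures such as these are conventionally computed with the unbiased estimator of Chen et al. (2021): sample n proofs per statement, count the c that Lean verifies, and estimate pass@k per statement as below. This is the standard estimator and may differ in detail from each paper’s exact evaluation protocol.

```python
"""Unbiased pass@k estimator: pass@k = 1 - C(n-c, k) / C(n, k)."""
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples drawn, c = Lean-verified samples, k = budget."""
    if n - c < k:  # fewer than k failures: every k-subset has a success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples per problem, 15 verified, reported at pass@32:
print(f"{pass_at_k(200, 15, 32):.3f}")
```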

5. Design Choices, Algorithms, and Meta-Learning

Several technical and design choices underpin Goedel-Prover’s effectiveness:

  • Formalizer Diversity: By employing dual formalizers trained on different formal–informal (F–I) statement pairs, the system gains robustness to stylistic and logical variation in formalization. Empirically, candidate diversity improves proof solvability when both are applied and their outputs filtered.
  • Curricular Data Synthesis: Synthetic tasks guide the model through an ascending hierarchy of difficulty, facilitating meta-learning and enabling mastery of incrementally more complex competition mathematics.
  • Reward-Mediated RL: The RL step uses the Lean verifier’s accept/reject decision as a binary reward, with normalized advantage objectives and, in Leanabell-Prover, clipped policy optimization (GRPO) with no KL penalty, to maximize the frequency of Lean-accepted outputs.

    • The RL objective can be written as:

    $$J_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t}\min\left(r_{i,t}(\theta)\,\hat{A}_{i,t},\ \operatorname{clip}\left(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_{i,t}\right)\right],$$

    where $r_{i,t}(\theta)$ is the token-level likelihood ratio and $\hat{A}_{i,t}$ the group-normalized advantage for sample $i$ at token $t$ (a toy computation of this objective appears after this list).

  • Integrated Use of External Tools: Experiments with symbolic computation support (e.g. SymPy for algebraic simplification) indicate that about 9.4% of miniF2F problems are made accessible after symbolic pre-processing, motivating future integration for transcendentals and algebraic domains.
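
A toy computation of the clipped GRPO objective above, with hypothetical values. `ratios[i][t]` stands for $r_{i,t}(\theta)$, and the group-relative advantages are the group-normalized binary Lean rewards; a real implementation would compute these from model log-probabilities, but the objective itself is as written.

```python
"""Toy evaluation of the clipped GRPO objective (no KL penalty)."""
from statistics import mean, pstdev

def grpo_objective(ratios: list[list[float]], rewards: list[float],
                   eps: float = 0.2) -> float:
    # Group-relative advantage: normalize binary Lean rewards over the group.
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against a zero-variance group
    advantages = [(r - mu) / sigma for r in rewards]
    per_sample = []
    for token_ratios, a_hat in zip(ratios, advantages):
        clipped_terms = [
            # min(r * A, clip(r, 1-eps, 1+eps) * A)
            min(r * a_hat, max(min(r, 1 + eps), 1 - eps) * a_hat)
            for r in token_ratios
        ]
        per_sample.append(mean(clipped_terms))  # (1/|o_i|) sum over tokens
    return mean(per_sample)                     # (1/G) sum over the group

# Group of G=4 sampled proofs: two Lean-accepted (reward 1), two rejected (0).
ratios = [[1.05, 0.98], [1.30, 0.90], [0.95], [1.10, 1.02, 0.99]]
rewards = [1.0, 0.0, 1.0, 0.0]
print(f"{grpo_objective(ratios, rewards):.4f}")
```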

6. Open Source Impact and Community Ecosystem

A foundational principle of Goedel-Prover is unrestricted open-source accessibility, in contrast to commercial ATP systems:

  • Full model weights, codebases, and benchmarks are released (e.g., at https://github.com/Goedel-LM/Goedel-Prover-V2).
  • All autoformalized statements, verified proofs, and the entire data production pipeline are public.
  • The ecosystem catalyzes community-driven ATP research—by providing SOTA baselines that can be independently validated, extended, and tested, the project significantly lowers the barrier to high-performance formal reasoning research.

This open approach nearly doubles the count of Lean Workbook problems with discovered formal proofs (from 15.7K to 29.7K), achieves first-place performance among open-source ATP systems on the PutnamBench and miniF2F leaderboards, and establishes a scalable template for future LLM-based formal reasoning systems in mathematics and beyond.


Goedel-Prover thus exemplifies a data- and feedback-driven approach to formal proof synthesis, combining advances in LLM training, self-correction, and formal verification to push the boundaries of automated mathematics. Its approach—curricular data synthesis, verifier-guided revision, policy averaging, and commitment to reproducibility—marks it as a reference point in the continuing evolution of open AI-driven formal theorem proving (Lin et al., 11 Feb 2025, Zhang et al., 8 Apr 2025, Lin et al., 5 Aug 2025).
