Goedel-Prover-V2: Advances in Formal Theorem Proving

Updated 6 August 2025
  • Goedel-Prover-V2 is an advanced series of large language models for automated formal theorem proving in Lean 4, combining scalable data synthesis with verifier-guided self-correction.
  • It introduces scaffolded data synthesis that constructs subproblems from failed proof attempts, enhancing training with both positive and negative examples.
  • The system leverages a verifier-check loop and model averaging to boost proof accuracy, outperforming larger models while ensuring reproducible, open-source research.

Goedel-Prover-V2 is a series of LLMs and an open-source infrastructure that advance the state of the art in automated formal theorem proving, particularly in Lean 4, by introducing scalable data synthesis, verifier-guided self-correction, and model averaging. The models, released at 8B and 32B parameters, achieve leading performance on established mathematical proof benchmarks, including MiniF2F and PutnamBench, with stronger sample efficiency and greater robustness on difficult problems than prior open- and closed-source formal provers (Lin et al., 5 Aug 2025). The Goedel-Prover-V2 pipeline and data assets are publicly released, enabling broad, reproducible research in formal reasoning.

1. Innovations in Training and Data Synthesis

Goedel-Prover-V2 incorporates three primary technical innovations:

Scaffolded Data Synthesis:

To address data scarcity, Goedel-Prover-V2 generates new tasks via explicit scaffolding of problem difficulty. For example, it constructs subproblems by analyzing failed proof attempts (using mechanisms such as the extract_goal tactic) and augments the training dataset with both positive and negative instances, including those obtained by negating unsolved goals. Additionally, a companion LLM (e.g., Qwen3-32B) paraphrases informal versions of mathematical statements, which the system then formalizes automatically via a dedicated formalizer. This dual strategy exposes the model to a dense curriculum of increasingly complex and diverse formal problems (Lin et al., 5 Aug 2025).
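As a concrete, hypothetical illustration (not drawn from the released data), scaffolding turns an intermediate goal left open by a failed attempt into a standalone training item, and negating such a goal yields a negative instance:

```lean
import Mathlib

-- Hypothetical scaffolding example, not taken from the paper's dataset.
-- Original target; suppose the model's whole-proof attempt fails on it:
theorem hard_target (a b : ℝ) : (a + b) ^ 2 ≤ 2 * (a ^ 2 + b ^ 2) := by
  nlinarith [sq_nonneg (a - b)]    -- a correct proof, shown here for completeness

-- The intermediate goal where the attempt stalled can be recovered
-- (e.g. via `extract_goal`) and added as an easier, standalone subproblem:
theorem scaffold_subgoal (a b : ℝ) : 0 ≤ (a - b) ^ 2 :=
  sq_nonneg (a - b)

-- Negating an unsolved goal, e.g. `¬ (0 ≤ (a - b) ^ 2)`, yields a negative
-- instance that marks unprovable directions in the training data.
```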

Verifier-Guided Self-Correction:

A verifier-in-the-loop approach is employed after generation: once a candidate proof is produced, the Lean compiler checks it, and any resulting error messages are parsed and fed back as corrective feedback. The model then locally revises the segments corresponding to the failing subgoals, iterating until verification succeeds or the sampling budget is exhausted. This error-driven repair not only increases pass rates by ≈2 percentage points but also regularizes the model toward robust failure recovery, interleaving chain-of-thought reasoning with compiler feedback and tightly coupling proof search with external validation.
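A minimal sketch of this verifier-in-the-loop control flow is given below; `generate_proof`, `revise_proof`, and `lean_verify` are hypothetical stand-ins for the model and the Lean compiler interface, not the released implementation:

```python
def prove_with_self_correction(statement, generate_proof, revise_proof,
                               lean_verify, max_revisions=2):
    """Sketch of verifier-guided self-correction: generate a candidate proof,
    then repair it from Lean compiler errors until it verifies or the
    revision budget is exhausted (hypothetical helper functions)."""
    proof = generate_proof(statement)
    for round_idx in range(max_revisions + 1):
        ok, errors = lean_verify(proof)        # compile the candidate in Lean
        if ok:
            return proof                       # verified proof found
        if round_idx == max_revisions:
            break                              # repair budget exhausted
        # Parsed error messages are handed back to the model, which locally
        # revises the segments corresponding to the failing subgoals.
        proof = revise_proof(statement, proof, errors)
    return None
```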

Model Averaging:

To preserve performance across a range of sampling budgets and counteract output homogenization after deep fine-tuning or RL (as evidenced by pass@N flattening), Goedel-Prover-V2 periodically merges model weights. Formally, for parameters θ₀ (base) and θ (fine-tuned), the final checkpoint is θ_merged = (1 − α)·θ₀ + α·θ for α ∈ (0, 1). This increases proof diversity without sacrificing peak accuracy and is especially effective for large N in top-k inference (Lin et al., 5 Aug 2025).
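A minimal sketch of the merging step for PyTorch-style state dicts follows; the value of α shown is an illustrative assumption, not the paper's setting:

```python
import torch

def average_checkpoints(theta_0, theta_ft, alpha=0.5):
    """Sketch of model averaging: theta_merged = (1 - alpha) * theta_0 + alpha * theta_ft.
    Both arguments are state dicts of the same architecture (base vs. fine-tuned);
    alpha = 0.5 is an assumed example value."""
    assert theta_0.keys() == theta_ft.keys(), "checkpoints must share parameter names"
    return {
        name: (1.0 - alpha) * theta_0[name].float() + alpha * theta_ft[name].float()
        for name in theta_0
    }

# Usage sketch: merged = average_checkpoints(base.state_dict(), tuned.state_dict())
# followed by model.load_state_dict(merged) to obtain the merged checkpoint.
```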

2. Model Architecture and Scaling

Goedel-Prover-V2 leverages LLMs pretrained for chain-of-thought mathematical reasoning, then fine-tunes them first under a supervised objective on a curated, iteratively-expanded proof dataset. Architecturally:

  • Variants: A small model at 8B parameters and a flagship at 32B parameters, both using long sequence handling (30–40K tokens) to capture whole proofs and their self-corrections.
  • Training Pipeline: The iterative loop comprises SFT (on whole-proof data and scaffold-generated problems), RL (using a hybrid GRPO-based objective and verifier feedback; a simplified sketch follows this list), and periodic model averaging.
  • Data Composition: The system is trained on a mix of (1) synthetic scaffolded tasks, (2) formalized natural language math (via formalizer LLMs), (3) expert iteration proofs, and (4) Lean compiler-driven self-correction trajectories.
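As a rough illustration of the group-relative reward signal at the heart of GRPO (a simplified sketch, not the paper's exact hybrid objective), each statement gets a group of sampled proofs, the Lean verifier assigns binary rewards, and every sample's advantage is its reward normalized against the group statistics:

```python
import statistics

def group_relative_advantages(rewards):
    """Simplified GRPO-style advantages: normalize each sampled proof's binary
    verifier reward (1 = compiles, 0 = fails) against its group's mean and
    standard deviation. Illustrative only; the paper uses a hybrid objective."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:                     # all samples passed or all failed: no signal
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Example: 8 sampled proofs for one statement, 3 of them verified by Lean.
print(group_relative_advantages([1, 0, 0, 1, 0, 1, 0, 0]))
```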

This design achieves high sample efficiency—successfully generating long, structured formal proofs "at once" for a large fraction of evaluation problems.

3. Benchmark Performance and Empirical Results

Goedel-Prover-V2 exhibits leading performance on established Lean 4 benchmarks:

| Model | MiniF2F (pass@32) | MiniF2F (pass@N, N>1000) | PutnamBench (pass@32) | PutnamBench (pass@184) |
|---|---|---|---|---|
| Goedel-Prover-V2-8B | 84.6% | – | – | – |
| Goedel-Prover-V2-32B (base) | 88.1% | – | 43 | 86 |
| Goedel-Prover-V2-32B (self-correction) | 90.4% | – | 57 | 86 |
| DeepSeek-Prover-V2-671B | 82.4% | – | – | 47 (at pass@1024) |

MiniF2F entries are pass rates; PutnamBench entries are numbers of problems solved.
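Here pass@N denotes the fraction of benchmark problems for which at least one of N sampled proofs is accepted by the Lean compiler, computed as in the following sketch (hypothetical data layout):

```python
def pass_at_n(results):
    """results: one list of booleans per problem, one entry per sampled proof
    (True = the Lean compiler accepted that proof). Returns the pass@N rate."""
    solved = sum(1 for attempts in results if any(attempts))
    return solved / len(results)

# Example: 3 problems with N = 4 samples each; 2 problems have a verified proof.
print(pass_at_n([[False, True, False, False],
                 [False, False, False, False],
                 [True, True, False, False]]))   # ≈ 0.667
```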

Compared to DeepSeek-Prover-V2-671B (671B parameters), the 32B Goedel-Prover-V2 achieves higher accuracy with roughly 1/20th of the parameter count and a greatly reduced compute budget. For reference, the self-correction mode boosts the pass@32 metric by ≈2 percentage points.

On PutnamBench, the Goedel-Prover-V2-32B solves 86 problems at pass@184; the closest competitor, DeepSeek-Prover-V2-671B, solves 47 at pass@1024 (Lin et al., 5 Aug 2025).

4. Data and Open-Source Release

Goedel-Prover-V2 is fully open-sourced, including:

  • Model Weights: 8B and 32B parameter checkpoints [https://github.com/Goedel-LM/Goedel-Prover-V2].
  • Training Codebase: Scaffolded data synthesis, RL, and self-correction routines.
  • Curated Datasets: Synthesized mathematical problems, formalized statements, and verification trajectories.

This level of transparency supports reproducibility and further experimentation, with clear documentation for scaling to additional benchmarks or integrating domain-specific axioms.
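As a usage sketch, the released checkpoints can presumably be loaded with standard Hugging Face tooling; the model identifier, prompt, and decoding settings below are assumptions, so consult the repository for the actual values:

```python
# Hypothetical loading sketch; see https://github.com/Goedel-LM/Goedel-Prover-V2
# for the actual checkpoint names, prompt template, and decoding settings.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Goedel-LM/Goedel-Prover-V2-32B"   # assumed identifier, check the repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Complete the following Lean 4 proof:\n\ntheorem n_add_zero (n : Nat) : n + 0 = n := by\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```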

5. Comparison to Other Formal Provers

Goedel-Prover-V2 outperforms both prior open-source and several reported closed-source systems (across pass@N metrics and compute budgets). Key technical differences:

  • The verifier-guided self-correction loop is uniquely effective, as most prior systems ignore systematic proof repair.
  • Scaffolded data synthesis outpaces expert iteration approaches reliant exclusively on existing human-verified corpora or uninformed sampling.
  • Model averaging across checkpoints preserves output diversity, sustaining gains at large sampling budgets where strictly RL-based pipelines often plateau.

Under restricted sampling (e.g., pass@32), Goedel-Prover-V2 achieves state-of-the-art status on formal Lean mathematics across high-school and undergraduate domains, and continues to scale efficiently with model size.

6. Limitations and Open Challenges

Despite its strong sample efficiency and overall accuracy, Goedel-Prover-V2 shows weaknesses on benchmarks such as Ineq-Comp (Zhao et al., 19 May 2025) that evaluate compositional reasoning, e.g., combining and reusing known facts about inequalities in nontrivial ways. While it succeeds at solving basic instances, accuracy on transformed or composed variants drops precipitously (often from >50% to <5%). This highlights a gap between robust formalization of elementary tactics and automation of human-intuitive reuse of mathematical structure.
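To make the failure mode concrete, Ineq-Comp-style variants compose a seed inequality with itself or with simple substitutions; the following Lean sketch is a hypothetical illustration in that spirit, not an item from the benchmark:

```lean
import Mathlib

-- Seed problem (the kind of basic instance provers usually solve):
theorem seed (a b : ℝ) : a * b ≤ (a ^ 2 + b ^ 2) / 2 := by
  nlinarith [sq_nonneg (a - b)]

-- Composed variant in the spirit of Ineq-Comp (hypothetical): the same fact
-- must be instantiated twice and combined, a step provers handle far less reliably.
theorem composed (a b c d : ℝ) :
    a * b + c * d ≤ (a ^ 2 + b ^ 2) / 2 + (c ^ 2 + d ^ 2) / 2 := by
  have h₁ := seed a b
  have h₂ := seed c d
  linarith
```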

Ongoing research aims to address these limitations by:

  • Incorporating tasks into the training set that explicitly demand compositional and high-level strategic reasoning.
  • Developing proof planning and decomposition modules to break down complex problems.
  • Further improving the interface between error-driven self-correction and higher-level symbolic search.

7. Implications and Future Directions

Goedel-Prover-V2 sets a new standard for open-source, scalable formal theorem proving. Its methodological advances (scaffolded data, verifier feedback, model averaging) enable unprecedented efficiency and accuracy, while detailed code and dataset releases fuel continued progress in the community. As a platform, it supports scalable research not only in mathematics but also in formal methods, logic, and AI-driven reasoning systems. Remaining challenges—such as compositional generalization—define the next frontiers for automated reasoning and LLM application in mathematical domains.