Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction (2508.03613v1)

Published 5 Aug 2025 in cs.LG and cs.AI

Abstract: We introduce Goedel-Prover-V2, a series of open-source LLMs that set a new state-of-the-art in automated theorem proving. Built on the standard expert iteration and reinforcement learning pipeline, our approach incorporates three key innovations: (1) Scaffolded data synthesis: We generate synthetic tasks of increasing difficulty to train the model to master increasingly complex theorems; (2) Verifier-guided self-correction: We enable the model to iteratively revise its proofs by leveraging feedback from the Lean compiler; (3) Model averaging: We merge model checkpoints to mitigate the decrease in model output diversity in later stages of training. Our small model, Goedel-Prover-V2-8B, reaches 84.6% pass@32 on MiniF2F and outperforms DeepSeek-Prover-V2-671B under the same metric, despite being 80X smaller. Our flagship model, Goedel-Prover-V2-32B, achieves 88.1% on MiniF2F at pass@32 in standard mode and 90.4% in self-correction mode, outperforming prior SOTA by a large margin. Additionally, our flagship model solves 86 problems on PutnamBench at pass@184, securing the first place among open-source models on the leaderboard, surpassing DeepSeek-Prover-V2-671B's record of solving 47 problems by pass@1024 with a significantly smaller model size and compute budget. At the time of its release (July-August 2025), Goedel-Prover-V2 achieves the strongest overall performance among all open-source theorem provers. It also ranks among the top-performing models--including closed-source systems with publicly reported performance--under a constrained test-time compute budget. Our models, code, and data are released at https://github.com/Goedel-LM/Goedel-Prover-V2.

Summary

The paper introduces an open-source theorem proving system that integrates verifier-guided self-correction and scaffolded data synthesis.
It demonstrates that even smaller models (8B) can outperform larger counterparts through iterative self-correction and strategic model averaging.
Benchmark results on MiniF2F and PutnamBench highlight significant performance gains, with self-correction consistently boosting proof accuracy.


Introduction

Goedel-Prover-V2 is a series of open-source LLMs for automated theorem proving in Lean that achieves state-of-the-art performance with much smaller models and lower compute than prior systems. The models generate complete formal proofs and iteratively refine them using verifier feedback, combining innovations in data synthesis, self-correction, and model averaging. The flagship 32B model achieves 88.1% pass@32 on MiniF2F in standard mode and 90.4% with self-correction, outperforming previous models such as DeepSeek-Prover-V2-671B and Kimina-Prover-72B, while the 8B variant surpasses DeepSeek-Prover-V2-671B despite being about 80 times smaller.

Figure 1: Performance of Goedel-Prover-V2 on different benchmarks under pass@32.

Framework Innovations

Verifier-Guided Self-Correction

Goedel-Prover-V2 formalizes the integration of Lean compiler feedback into the proof generation loop. After an initial proof attempt, verification failures are parsed and fed back to the model, which then generates targeted repairs. This iterative self-correction process leverages error messages and tactic outcomes, enabling the model to diagnose and fix errors in long chain-of-thought (CoT) reasoning. Ablation studies confirm that compiler feedback is essential for effective revision, with removal of error messages resulting in significant performance degradation.
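The revision loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the released implementation: `generate`, `verify`, `VerifyResult`, the prompt wording, and the default of two revision rounds are all hypothetical stand-ins for the model call and the Lean compiler check.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VerifyResult:
    """Outcome of one (hypothetical) verifier call."""
    ok: bool          # True if the proof compiled without errors
    errors: str = ""  # compiler error text; empty on success


def self_correct(
    statement: str,
    generate: Callable[[str], str],
    verify: Callable[[str], VerifyResult],
    max_rounds: int = 2,
) -> Optional[str]:
    """Return the first proof that verifies, or None if every round fails."""
    prompt = statement
    # One initial attempt plus up to `max_rounds` revisions.
    for _ in range(max_rounds + 1):
        proof = generate(prompt)
        result = verify(proof)
        if result.ok:
            return proof
        # Feed the failed attempt and the verifier's errors back to the model.
        prompt = (
            f"{statement}\n\nPrevious attempt:\n{proof}\n\n"
            f"Compiler errors:\n{result.errors}\n\nRevise the proof."
        )
    return None
```

Passing the error text back in the prompt is the part the ablation singles out: a loop that only reported "failed" would give the model nothing specific to repair.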

Scaffolded Data Synthesis

Formal-Based Synthesis

When the prover fails on a challenging problem, the extract_goal tactic in Lean is used to capture unsolved subgoals, which are then formalized as new, simpler statements. Both the extracted statements and their negations are added to the training set, enhancing the model's ability to distinguish true and false propositions.
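As a small illustration of the idea, the Lean 4 sketch below (assuming Mathlib is available) shows a stuck proof where extract_goal reports the open subgoal, followed by that subgoal as a standalone statement and an example of a false statement whose negation is provable. The theorems and their names are invented for illustration and do not come from the paper's data.

```lean
import Mathlib

-- Hypothetical stuck attempt: `extract_goal` reports the goal that is open
-- at this point as a standalone statement that can be copied out.
example (x y : ℝ) (hx : 0 ≤ x) (hy : 0 ≤ y) : 0 ≤ x * y + x := by
  have h : 0 ≤ x * y := by
    extract_goal
    sorry
  linarith

-- The extracted subgoal as its own, simpler training statement.
theorem extracted_subgoal (x y : ℝ) (hx : 0 ≤ x) (hy : 0 ≤ y) : 0 ≤ x * y :=
  mul_nonneg hx hy

-- An extracted statement can also be false (here `x ^ 2 > 0` fails at 0);
-- in that case the negated version is the one that has a proof.
theorem false_goal_negated : ¬ ∀ x : ℝ, x ^ 2 > 0 := by
  intro h
  have h0 := h 0
  norm_num at h0
```

Including both polarities means a true statement and a false one each contribute exactly one provable target, which is one plausible reading of why negations help the model tell them apart.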

Informal-Based Synthesis

LLMs are prompted to generate simpler subproblems or harder variants in natural language, which are then formalized into Lean statements. Quality control is enforced via LLM-based filters for correctness and difficulty, discarding trivial or incorrect statements and adding negations where appropriate. This pipeline accelerates data augmentation and ensures a diverse, high-quality training set.

Figure 2: Our informal-based scaffolded data synthesis pipeline with three parts: (1) informal statement generation; (2) formalization and quality checking; and (3) negation and difficulty filtering.

Training Pipeline

The training process follows expert iteration, alternating between large-scale inference, supervised fine-tuning (SFT), and reinforcement learning (RL). Model averaging is applied after SFT and RL to mitigate diversity collapse, using a convex combination of base and fine-tuned model parameters. RL is implemented in a multi-task setup, optimizing both whole-proof generation and first-round self-correction, with dynamic sampling focused on problems of intermediate difficulty.
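The convex combination used for model averaging can be written as θ = (1 − α)·θ_base + α·θ_tuned with 0 ≤ α ≤ 1. The sketch below applies it to toy parameter dictionaries; the function name, the convention that α weights the fine-tuned model, and the use of plain Python lists instead of framework tensors are assumptions made for illustration.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def average_checkpoints(base: Params, tuned: Params, alpha: float) -> Params:
    """Return (1 - alpha) * base + alpha * tuned, parameter by parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] for a convex combination")
    if base.keys() != tuned.keys():
        raise ValueError("checkpoints must have the same parameter names")

    merged: Params = {}
    for name, base_w in base.items():
        tuned_w = tuned[name]
        if len(base_w) != len(tuned_w):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * t for b, t in zip(base_w, tuned_w)
        ]
    return merged
```

With α = 0 the result is the base model and with α = 1 it is the fine-tuned model, so intermediate values trade the fine-tuned model's accuracy against the base model's output diversity.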

Figure 3: The overall workflow of model training. "+AVG" denotes that the trained model is averaged with the base model after training. "RL-AVG" is the final output model.

Evaluation and Results

Benchmarks

Goedel-Prover-V2 is evaluated on MiniF2F, PutnamBench, and MathOlympiadBench, covering high-school and college-level competition problems across diverse mathematical domains.

Figure 4: Distribution of problems in MathOlympiadBench by category.

Main Results

MiniF2F: Goedel-Prover-V2-32B achieves 88.1% pass@32, rising to 90.4% with self-correction. The 8B model attains 84.6%, outperforming DeepSeek-Prover-V2-671B.
PutnamBench: The 32B model solves 43 problems at pass@32, 57 with self-correction, and 86 at pass@184, surpassing DeepSeek-Prover-V2-671B's record of 47 at pass@1024.
Sample Efficiency: High pass@N is achieved with minimal inference overhead, indicating strong internalization of reasoning strategies.
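For readers unfamiliar with the metric, pass@k is the probability that at least one of k sampled proofs for a problem verifies. The sketch below uses the widely used unbiased estimator 1 − C(n − c, k) / C(n, k), where n proofs are sampled and c of them verify. Whether the paper computes its numbers this way or simply draws exactly k samples per problem is not stated in this summary, so treat this as background rather than the authors' evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples with c successes."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

When n equals k, the estimate reduces to 1 if any sample succeeded and 0 otherwise; a benchmark-level score is then the mean of this value over all problems.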

Scaling Analysis

Goedel-Prover-V2 demonstrates robust scaling behavior, maintaining superior accuracy across inference budgets. Self-correction consistently provides a 2-point gain in pass@32 and pass@64, with extended context and more revision iterations further improving sample efficiency.

RL and Model Averaging

Model averaging enhances diversity and pass@N, with optimal ratios maximizing performance. RL steps increase pass@1, while correction settings benefit more from RL due to the scarcity of high-quality self-correction data in SFT.

Implications and Future Directions

Goedel-Prover-V2 establishes that state-of-the-art formal theorem proving is achievable without massive models or proprietary infrastructure. The integration of verifier-guided self-correction with long CoT reasoning sets a new paradigm for efficient, scalable ATP. The open-source release of models, code, and data provides a foundation for further research in formal reasoning, proof repair, and inference-time scaling strategies.

The approach suggests several avenues for future work:

Enhanced proof repair strategies leveraging subgoal extraction and targeted correction.
Exploration of multi-turn RL and tool-use protocols for more complex revision loops.
Extension to other formal systems and domains beyond Lean.

Conclusion

Goedel-Prover-V2 advances the frontier of automated theorem proving by combining scaffolded data synthesis, verifier-guided self-correction, and model averaging within a rigorous training pipeline. The models achieve state-of-the-art results on major benchmarks with modest computational resources, demonstrating that efficient, high-performance formal reasoning is attainable in open-source settings. The release of Goedel-Prover-V2 is poised to accelerate progress in AI-driven formal mathematics and provide a robust platform for future innovations.
