
Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving (2502.07640v3)

Published 11 Feb 2025 in cs.LG and cs.AI

Abstract: We introduce Goedel-Prover, an open-source LLM that achieves state-of-the-art (as of April 5, 2025) performance in automated formal proof generation for mathematical problems. A key challenge in this field is the scarcity of formalized mathematical statements and proofs, which we address through the following approaches. First, we train LLMs to convert natural language math problems from the Numina dataset into equivalent formal statements in Lean 4. This process creates the dataset Goedel-Pset-v1, which includes 1.64 million formal statements. Next, we develop a large dataset of formal proofs by training a series of provers. Each new prover can prove many statements that previous ones could not, and these new proofs are added to the training set for the next prover. Finally, we obtain the dataset Goedel-Pset-v1-solved, which contains proofs for over 800K statements from Goedel-Pset-v1. Supervised fine-tuning (SFT) of DeepSeek-Prover-V1.5-Base on Goedel-Pset-v1-solved (i.e., no RL) yields Goedel-Prover-SFT, which achieves a success rate of 57.6% (Pass@32) on miniF2F, surpassing the previous leader DeepSeek-Prover-V1.5-RL (trained using SFT + RL on a proprietary dataset) by 7.6%. On PutnamBench, Goedel-Prover-SFT solves 7 problems (Pass@512), ranking first on the leaderboard. Further RL training (including DPO) improves Goedel-Prover-SFT's success rate to over 60% (Pass@32) on miniF2F. To aid future research, we provide extensive discussion of our training methodology and design choices, and we fully open-source our codes, models, and datasets. Additionally, we open-source formal proofs for 29.7K problems in Lean Workbook, nearly doubling the 15.7K solved by prior provers.

Authors (11)
  1. Yong Lin (77 papers)
  2. Shange Tang (11 papers)
  3. Bohan Lyu (12 papers)
  4. Jiayun Wu (16 papers)
  5. Hongzhou Lin (16 papers)
  6. Kaiyu Yang (24 papers)
  7. Jia Li (380 papers)
  8. Mengzhou Xia (34 papers)
  9. Danqi Chen (84 papers)
  10. Sanjeev Arora (93 papers)
  11. Chi Jin (90 papers)

Summary

Overview of the Goedel-Prover Model for Automated Theorem Proving

The paper presents Goedel-Prover, an open-source LLM designed to advance the state of the art (SOTA) in automated formal proof generation. The model targets mathematical problems expressed in formal languages, focusing on whole-proof generation rather than step-by-step proof search. By achieving a 57.6% success rate (Pass@32) on the miniF2F benchmark, Goedel-Prover demonstrates a significant improvement, surpassing the previous best open-source model, DeepSeek-Prover-V1.5-RL, by 7.6 percentage points.
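To make the target task concrete, the sketch below shows the kind of Lean 4 (Mathlib-style) statement a whole-proof prover receives and must complete in one generation. The theorem and its one-line proof are a hypothetical illustration, not an item from the paper's benchmarks:

```lean
import Mathlib

-- Hypothetical formalized statement: the square of any real number is
-- nonnegative. A whole-proof prover must emit the complete proof in a
-- single pass; here the Mathlib lemma `sq_nonneg` closes the goal.
theorem example_sq_nonneg (a : ℝ) : 0 ≤ a ^ 2 :=
  sq_nonneg a
```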

Challenges and Solutions

A central challenge in training LLMs for theorem proving is the scarcity of formalized mathematical statements and proofs. To address this, the authors implement two complementary strategies:

  1. Data Formalization: The authors train statement formalizers that translate natural language math problems into Lean 4, producing Goedel-Pset-v1, a dataset of 1.64 million formal statements. This translation bridges the gap between informal reasoning and formal statements that a machine can verify.
  2. Iterative Proof Generation: Starting from limited data, the authors iteratively build a large dataset of formal proofs. They train a series of provers, each tasked with proving statements the previous one could not; every newly verified proof is added to the training set for the next prover. This process ultimately yields verified proofs for over 800K statements, and the final model exhibits strong whole-proof generation capabilities.
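The iterative scheme above can be sketched as an expert-iteration loop. The toy below is illustrative only: a "prover" is reduced to a skill score that grows with the number of verified proofs in its training set, standing in for fine-tuning on newly solved statements and for Lean verification.

```python
def expert_iteration(statements, rounds=4):
    """Toy expert-iteration loop: each round trains a new prover on all
    proofs verified so far (modeled as a growing 'skill' score), letting
    it crack statements the previous prover could not."""
    solved = {}                          # statement name -> round it was solved
    for rnd in range(rounds):
        skill = len(solved)              # stand-in for fine-tuning on new proofs
        for name, difficulty in statements.items():
            if name not in solved and difficulty <= skill:
                solved[name] = rnd       # stand-in for a Lean-verified proof
    return solved

# Each statement becomes provable only after easier ones enter the training
# set; "s4" stays out of reach within four rounds.
stmts = {"s0": 0, "s1": 1, "s2": 2, "s3": 3, "s4": 10}
print(expert_iteration(stmts))   # → {'s0': 0, 's1': 1, 's2': 2, 's3': 3}
```

The key property mirrored here is that each round's prover solves statements the previous round's could not, so the proof dataset strictly grows across iterations.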

Numerical Results

Goedel-Prover delivers strong quantitative results:

  • On the miniF2F benchmark, it achieves a 57.6% success rate (Pass@32), showcasing a marked improvement over previous models.
  • The model ranks first on the PutnamBench leaderboard, solving 7 problems (Pass@512), demonstrating its efficacy on competition-level benchmarks.
  • For Lean Workbook problems, Goedel-Prover contributes 29.7K new formal proofs, nearly doubling the 15.7K solved by prior provers.
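Pass@k figures like those above are typically computed with the standard unbiased estimator from the HumanEval/Codex evaluation methodology: sample n proofs per statement, count the c that verify, and estimate the probability that at least one of k draws succeeds. A minimal sketch (not the paper's evaluation code; the n = 64, c = 4 figures below are made up):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k
    samples (drawn without replacement from n attempts, c of them
    correct) verifies: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:          # fewer than k incorrect samples -> certain success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 64 sampled proofs for one statement, 4 of which verify, at k = 32:
print(round(pass_at_k(64, 4, 32), 4))   # → 0.9434
```

The per-statement estimates are then averaged over the benchmark to give the reported Pass@32 or Pass@512 score.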

Implications and Future Directions

The Goedel-Prover model sets a new standard in AI-driven formal reasoning and theorem proving by combining large-scale data synthesis with iterative model training. Because it generates whole proofs in a single pass rather than searching the proof tree step by step, it avoids repeated interaction with the Lean compiler during generation, yielding lower latency and lower compute cost at inference time.

Nevertheless, the proof style adopted by Goedel-Prover merits further investigation. The model tends to rely on high-level tactics that bundle complex reasoning, which may not capture the fine-grained steps some problems require. Future work could explore finer proof granularity, search algorithms, and online interaction with the compiler to further improve performance. Additionally, integrating symbolic computation tools such as SymPy could extend the range of solvable problems, especially those involving heavy algebraic simplification.
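As a concrete illustration of that last point, a symbolic engine such as SymPy can discharge algebraic identities that a purely sampling-based prover might spend many attempts on. This is only a sketch of the idea, not an integration the paper implements:

```python
import sympy as sp

# An identity a symbolic tool verifies instantly; a neural prover might
# otherwise need many sampled proof attempts to close such a goal.
x = sp.symbols('x', real=True)
lhs = sp.sin(x)**2 + sp.cos(x)**2
print(sp.simplify(lhs - 1))   # → 0, i.e. the identity holds
```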

Overall, Goedel-Prover represents a substantial step forward in automated theorem proving, combining innovative data strategies with effective use of LLMs. The fully open-sourced code, models, and datasets provide a valuable resource for continued research and development in this rapidly developing field.