
STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving (2502.00212v4)

Published 31 Jan 2025 in cs.LG, cs.AI, and cs.LO

Abstract: A fundamental challenge in formal theorem proving by LLMs is the lack of high-quality training data. Although reinforcement learning or expert iteration partially mitigates this issue by alternating between LLM generating proofs and finetuning them on correctly generated ones, performance quickly plateaus due to the scarcity of correct proofs (sparse rewards). To keep improving the models with limited data, we draw inspiration from mathematicians, who continuously develop new results, partly by proposing novel conjectures or exercises (which are often variants of known results) and attempting to solve them. We design the Self-play Theorem Prover (STP) that simultaneously takes on two roles, conjecturer and prover, each providing training signals to the other. The conjecturer is trained iteratively on previously generated conjectures that are barely provable by the current prover, which incentivizes it to generate increasingly challenging conjectures over time. The prover attempts to prove the conjectures with standard expert iteration. We evaluate STP with both Lean and Isabelle formal verifiers. With 51.3 billion tokens generated during the training in Lean, STP proves 28.5% of the statements in the LeanWorkbook dataset, doubling the previous best result of 13.2% achieved through expert iteration. The final model achieves state-of-the-art performance among whole-proof generation methods on miniF2F-test (65.0%, pass@3200), ProofNet-test (23.9%, pass@3200) and PutnamBench (8/644, pass@3200). We release our code, model, and dataset at this URL: https://github.com/kfdong/STP.

Summary

  • The paper presents a dual-role LLM system that functions as both conjecturer and prover to mitigate training data scarcity in theorem proving.
  • It introduces a three-stage methodology—supervised finetuning, iterative self-play, and final retraining—that enhances proof generation capabilities.
  • Experiments using Lean and Isabelle demonstrate significant improvements, with the LeanWorkbook success rate rising from the previous best of 13.2% to 28.5%.

Overview of Self-play LLM Theorem Provers with Iterative Conjecturing and Proving

The paper "STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving" addresses a significant challenge in formal theorem proving with LLMs: the scarcity of high-quality training data. Traditional methods, such as reinforcement learning (RL) or expert iteration, attempt to mitigate data scarcity by alternating between generating proofs and fine-tuning on the correct ones; however, they often plateau due to the limited availability of correct proofs.

The authors present an approach inspired by human mathematicians, who not only solve existing problems but also generate new conjectures. The proposed Self-play Theorem Prover (STP) is a dual-role system in which a single LLM operates as both conjecturer and prover. This setup turns conjectures and their proofs into a self-generated training dataset, enabling continuous learning without additional external data.

Methodology

The STP model is constructed to perform in three stages:

  1. Model Initialization via Supervised Finetuning (SFT): The LLM is fine-tuned to play dual roles. The prover is trained on existing theorem-proof pairs to learn proof generation, while the conjecturer is exposed to a subset of known results to encourage the generation of novel conjectures.
  2. Self-play Training: This stage involves iterative and interactive learning between the conjecturer and prover. The conjecturer generates new, related conjectures based on a seed theorem and its proof, while the prover attempts to prove both existing statements and newly generated conjectures. Successful proofs provide feedback to improve both roles, with an emphasis on conjectures that are neither too easy nor impossible to prove (i.e., having a low but positive pass rate).
  3. Final Re-training: To ensure stability and effectiveness, the final model is retrained from the base model using proofs collected throughout the STP iterations, consolidating learning advancements.
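The self-play stage can be sketched as a simple loop. The sketch below is illustrative rather than the authors' implementation: `conjecturer`, `prover`, and `verifier` are hypothetical callables standing in for LLM sampling and formal verification, and the pass-rate window is a stand-in for the paper's "barely provable" criterion (low but positive pass rate).

```python
def self_play_round(conjecturer, prover, seed_theorems, verifier,
                    n_samples=8, min_rate=0.0, max_rate=0.5):
    """One illustrative round of STP-style self-play.

    Hypothetical interfaces:
      conjecturer(seed) -> a new conjecture statement (str)
      prover(statement) -> a candidate proof (str)
      verifier(statement, proof) -> True iff the proof checks
    """
    conjecturer_data, prover_data = [], []
    for seed in seed_theorems:
        conjecture = conjecturer(seed)
        proofs = [prover(conjecture) for _ in range(n_samples)]
        correct = [p for p in proofs if verifier(conjecture, p)]
        pass_rate = len(correct) / n_samples
        # Verified proofs always become prover training data
        # (standard expert iteration).
        prover_data += [(conjecture, p) for p in correct]
        # Only conjectures that are provable but hard (low, nonzero
        # pass rate) are kept as conjecturer training targets.
        if min_rate < pass_rate <= max_rate:
            conjecturer_data.append((seed, conjecture))
    return conjecturer_data, prover_data
```

In each round both datasets would then be used to fine-tune their respective role before the next round, so the conjectures grow harder as the prover improves.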

Results and Implications

Experimentation with Lean and Isabelle formal verifiers yielded notable results. With Lean, STP proved 28.5% of the LeanWorkbook statements, roughly doubling the previous best result of 13.2%. On benchmarks such as miniF2F-test and ProofNet-test, STP reached state-of-the-art performance among whole-proof generation methods across a range of sampling budgets (up to pass@3200).
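The pass@k figures (e.g. 65.0% pass@3200 on miniF2F-test) report the chance that at least one of k sampled proofs verifies. Such numbers are conventionally computed with the unbiased estimator of Chen et al.; the paper's exact evaluation protocol is not restated here, so the following is a minimal sketch assuming n samples per statement, of which c verify:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples with c successes.

    Computes 1 - C(n-c, k) / C(n, k): one minus the probability that
    a random size-k subset of the n samples contains no success.
    """
    if n - c < k:
        # Fewer than k failures exist, so every subset has a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With this estimator, a benchmark score is simply the mean of `pass_at_k` over all statements in the test set.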

These findings suggest that STP can efficiently scale and enhance theorem proving capabilities by maintaining diverse and dynamically challenging datasets. This model opens up new potential for AI development in reasoning tasks beyond current dataset limitations, potentially impacting AI's role in formal mathematics and broader applications requiring complex reasoning.

Observations and Future Directions

The approach of integrating self-generated conjectures as a continuous source of challenging tasks marks a significant shift in how LLMs are trained for theorem proving. The adaptability of STP suggests a promising way to handle sparse-reward, data-limited settings, which is fundamental for applications seeking to advance beyond current knowledge boundaries.

Future research could explore refining the conjecturer's ability to generate more balanced and topic-diverse conjectures, optimizing for long-term improvements in prover capabilities. Further investigation into STP's application to natural language theorem translation and broader AI-driven conjecturing might also yield advancements in AGI research. Expanding this methodology into other formal proof systems or integrating with step-based prover models could provide additional enhancement avenues.

In conclusion, the work highlights strategic innovation in AI for mathematical reasoning, underscoring the significance of self-generated tasks in data-limited contexts and paving the way for autonomous reasoning systems in unexplored realms of mathematics and beyond.