Lean-STaR: Learning to Interleave Thinking and Proving (2407.10040v3)

Published 14 Jul 2024 in cs.AI

Abstract: Traditional LLM-based theorem proving assumes that by training on a sufficient amount of formal proof data, a model will learn to prove theorems. Our key observation is that a wealth of informal information that is not present in formal proofs can be useful for learning to prove theorems. For instance, humans think through steps of a proof, but this thought process is not visible in the resulting code. We present Lean-STaR, a framework for training LLMs to produce informal thoughts prior to each step of a proof, thereby boosting the model's theorem-proving capabilities. Lean-STaR uses retrospective ground-truth tactics to generate synthetic thoughts for training the LLM. At inference time, the trained model directly generates the thoughts prior to the prediction of the tactics in each proof step. Building on the self-taught reasoner framework, we then apply expert iteration to further fine-tune the model on the correct proofs it samples and verifies using the Lean solver. Lean-STaR achieves state-of-the-art results on the miniF2F-test benchmark within the Lean theorem proving environment, significantly outperforming base models ($\boldsymbol{43.4\% \rightarrow 46.3\%,}$ Pass@64). We also analyze the impact of the augmented thoughts on various aspects of the theorem proving process, providing insights into their effectiveness.

Lean-STaR: Learning to Interleave Thinking and Proving

Lean-STaR presents a novel framework for enhancing the theorem-proving capabilities of LLMs by generating informal "thoughts" before each step of a proof. Traditional methods in language-model-based theorem proving train models exclusively on formal proof data. Lean-STaR deviates from this norm by incorporating natural-language rationales to bridge the gap between formal and informal mathematics.
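As a concrete illustration of the interleaved format, the following is a minimal sketch of a thought-augmented proof step, with the informal thought written as a comment immediately before the tactic. The theorem, wording, and comment placement are illustrative assumptions, not an example drawn from the paper's dataset.

```lean
-- Illustrative sketch of a thought-augmented proof step (hypothetical example,
-- not taken from the paper's training data).
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  -- Thought: the goal is commutativity of natural-number addition, which is
  -- exactly the statement of `Nat.add_comm`, so a single `exact` closes the goal.
  exact Nat.add_comm a b
```

At inference time the model first generates the natural-language thought and then the tactic for each proof step; only the tactic is checked by the Lean prover.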

Key Contributions

  1. Informal Thought Integration: Lean-STaR generates synthetic thoughts that act as intermediate steps before each formal tactic is applied. This extends the Self-Taught Reasoner (STaR) framework to train LLMs not only on tactics but also on the rationale behind each logical step.
  2. Expert Iteration: The framework fine-tunes the thought-augmented model through multiple rounds of expert iteration. The model samples candidate proofs, keeps those verified by the Lean theorem prover, and adds the verified proofs to its training data, iteratively improving its performance (see the schematic sketch after this list).
  3. Synthetic Data Generation: Approximately 50,000 thought-augmented examples were created using retrospective ground-truth tactics from human-written proofs in Lean's Mathlib. Further data were synthesized via expert iteration to improve the model continually.
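The expert iteration loop described in item 2 can be summarized with the following schematic. This is a minimal sketch under assumed interfaces: `sample_proof`, `lean_verify`, and `fine_tune` are hypothetical placeholders, not functions from the paper's code.

```python
# Schematic of the Lean-STaR expert iteration loop (illustrative sketch only;
# function names and signatures are hypothetical, not from the paper's code).
# The starting model is assumed to be fine-tuned on the ~50,000 retrospectively
# generated thought-augmented examples described in item 3.

def expert_iteration(model, theorems, num_iterations=2, samples_per_theorem=32):
    for _ in range(num_iterations):
        new_training_data = []
        for theorem in theorems:
            # Sample interleaved (thought, tactic) proof attempts from the model.
            for _ in range(samples_per_theorem):
                proof = sample_proof(model, theorem)  # thoughts + tactics
                # Keep only proofs that the Lean checker accepts.
                if lean_verify(theorem, proof):
                    new_training_data.append((theorem, proof))
                    break  # one verified proof per theorem suffices in this sketch
        # Fine-tune on the newly verified thought-augmented proofs.
        model = fine_tune(model, new_training_data)
    return model
```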

Numerical Results

Lean-STaR achieves state-of-the-art results on the miniF2F-test benchmark for Lean theorem proving. The Pass@64 metric improved from 43.4% to 46.3%, showcasing the efficacy of interleaving informal thoughts with formal proof steps.
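For context, Pass@64 is read here under the standard sampling-based convention (an assumption about the exact evaluation protocol): a theorem counts as solved if at least one of 64 sampled proof attempts is accepted by the Lean checker,

$$\text{Pass@}k \;=\; \frac{1}{|\mathcal{T}|}\sum_{t\in\mathcal{T}} \mathbf{1}\!\left[\exists\, i \le k:\ \text{the } i\text{-th sampled proof of } t \text{ is verified}\right],$$

where $\mathcal{T}$ is the set of miniF2F-test problems and $k = 64$.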

Implications and Future Directions

Practical Implications:

  1. Automated Theorem Proving: Lean-STaR advances the field by demonstrating that informal intermediary steps can significantly improve theorem-proving models, making them more reliable for formal verification tasks across mathematics and software engineering.
  2. Error Detection: By making informal reasoning explicit alongside formal proofs, systems like Lean-STaR can help surface errors in existing arguments, as exemplified by Terence Tao's discovery of a flaw in one of his own proofs while formalizing it in Lean.

Theoretical Implications:

  1. Cognitive Emulation: The approach mimics human cognitive processes where informal reasoning aids in complex problem solving. This demonstrates the potential to enhance machine understanding of formal systems through informal context.
  2. Data Augmentation: This research highlights the potential for synthetic data, generated through an intelligent combination of formal and informal elements, to improve model accuracy without extensive manual annotation.

Future Developments in AI:

  1. Extended Frameworks: Future work could extend this framework to other formal systems beyond Lean, such as Coq and Isabelle, by incorporating various sources of informal mathematical knowledge.
  2. Scalability: Increasing the scale of thought-augmented datasets and iterations may further boost performance, potentially reaching human-level proficiency in theorem proving.
  3. Interdisciplinary Applications: The methodologies developed could be applied to other fields requiring logical reasoning, such as legal document analysis or complex planning tasks in robotics and AI.

Conclusion

Lean-STaR sets a new precedent in automated theorem proving by interleaving informal and formal methods. By capturing the inherent reasoning behind each proof step, Lean-STaR significantly advances the capabilities of LLMs in formal mathematics. This integrated approach not only paves the way for more robust automated proof systems but also bridges a critical gap between human and machine reasoning.

Authors (4)
  1. Haohan Lin (3 papers)
  2. Zhiqing Sun (35 papers)
  3. Yiming Yang (151 papers)
  4. Sean Welleck (54 papers)