
TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts (2407.03203v2)

Published 3 Jul 2024 in cs.FL and cs.AI

Abstract: Proving mathematical theorems using computer-verifiable formal languages like Lean significantly impacts mathematical reasoning. One approach to formal theorem proving involves generating complete proofs using LLMs based on Natural Language (NL) proofs. However, due to the scarcity of aligned NL and Formal Language (FL) theorem-proving data, most modern LLMs exhibit suboptimal performance. This scarcity results in a paucity of methodologies for training LLMs and techniques to fully utilize their capabilities in composing formal proofs. To address these challenges, this paper proposes TheoremLlama, an end-to-end framework that trains a general-purpose LLM to be a Lean4 expert. TheoremLlama includes an NL-FL dataset generation and bootstrapping method to obtain an aligned dataset, curriculum learning and block training techniques to train the model, and an iterative proof-writing method to write Lean4 proofs, all working together synergistically. Using the dataset generation method in TheoremLlama, we provide Open Bootstrapped Theorems (OBT), an NL-FL aligned and bootstrapped dataset. Our novel NL-FL bootstrapping method, where NL proofs are integrated into Lean4 code for training datasets, leverages the NL reasoning ability of LLMs for formal reasoning. The TheoremLlama framework achieves cumulative accuracies of 36.48% and 33.61% on the MiniF2F-Valid and Test datasets respectively, surpassing the GPT-4 baselines of 22.95% and 25.41%. Our code, model checkpoints, and the generated dataset are published on GitHub.

Insights into TheoremLlama: Training LLMs for Lean4 Theorem Proving

Recent advancements in the domain of formal mathematics have prompted the exploration of automated theorem proving using Formal Languages (FL) such as Lean. In this context, the paper titled “TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts” proposes a novel framework for harnessing LLMs to improve theorem proving in Lean4, thereby addressing the inherent challenges in existing practices.

Framework and Methodologies

TheoremLlama introduces an end-to-end framework aimed at enhancing the abilities of general-purpose LLMs to proficiently write and reason in Lean4. The framework is built upon three principal components:

  1. NL-FL Aligned Data Generation: Recognizing the scarcity of aligned data to train LLMs for Lean4 tasks, the authors devise a comprehensive data generation method. They utilize Mathlib4, amassing a dataset of 100k theorems, and apply informalization techniques using the Gemini-1.5 model. This approach centers on writing natural language proofs from formal proofs, leveraging in-context learning and bootstrapping to construct the Open Bootstrapped Theorems (OBT) dataset. Here, natural language proofs are integrated within Lean4 code through comments, fostering a bidirectional understanding between natural and formal language reasoning.
  2. Lean4 Prover Training: The paper innovatively employs block training to enhance in-context learning and introduces curriculum data sorting to enable models to learn progressively from simple to complex tasks. This approach helps mitigate the disruptive effects of more intricate examples during early training phases and aligns the learning trajectory more closely with the model’s capacity.
  3. Iterative Proof Writing: Emphasizing the iterative nature of learning, the framework harnesses previously successful theorem proofs as additional in-context examples, thereby refining the model's ability to generalize when proving new theorems. This iterative strategy keeps training and inference aligned with the incremental, feedback-driven nature of formal theorem proving.
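To illustrate the NL-FL bootstrapping of item 1, the informalized proof is embedded into the Lean4 source as comments so the model sees NL reasoning and FL tactics side by side. A minimal hypothetical sketch (this theorem and its commentary are invented for illustration, not taken from the OBT dataset):

```lean
-- NL proof (informalized): addition on the natural numbers is
-- commutative, so a + b and b + a denote the same value.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  -- The library lemma `Nat.add_comm` closes the goal directly.
  exact Nat.add_comm a b
```

Interleaving the NL proof as comments lets a single training sequence carry both the informal argument and the formal tactic script.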
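The curriculum data sorting of item 2 can be sketched in a few lines. The difficulty proxy used here (number of tactics in the proof) is an assumption for illustration; the paper's actual difficulty estimate may differ:

```python
# Sketch of curriculum data sorting: order training examples from
# easy to hard before fine-tuning. Proof length in tactics is an
# assumed difficulty proxy, not necessarily the paper's metric.

def curriculum_sort(examples):
    """Sort (theorem, proof) records by ascending proof length."""
    return sorted(examples, key=lambda ex: len(ex["proof_tactics"]))

dataset = [
    {"theorem": "t1", "proof_tactics": ["ring", "simp", "linarith"]},
    {"theorem": "t2", "proof_tactics": ["rfl"]},
    {"theorem": "t3", "proof_tactics": ["intro h", "exact h"]},
]

ordered = curriculum_sort(dataset)
print([ex["theorem"] for ex in ordered])  # → ['t2', 't3', 't1']
```

Training then consumes `ordered` front to back, so early gradient updates come from short, simple proofs.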
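The loop behind item 3 can be sketched as below. `toy_generate` and `toy_verifies` are hypothetical stand-ins for the LLM proof generator and the Lean4 checker; the point is only the mechanism of feeding verified proofs back in as examples:

```python
# Sketch of iterative proof writing: proofs that Lean verifies are
# added to the in-context example pool, helping later attempts.

def iterative_proving(theorems, generate, verifies, rounds=3):
    examples = []   # verified (theorem, proof) pairs accumulated so far
    proved = {}
    for _ in range(rounds):
        for thm in theorems:
            if thm in proved:
                continue
            proof = generate(thm, examples)
            if verifies(thm, proof):
                proved[thm] = proof
                examples.append((thm, proof))
    return proved

# Toy stand-ins: "hard" theorems only succeed once the example
# pool is non-empty, so they are proved in a later round.
def toy_generate(thm, examples):
    return "by simp" if examples or thm == "base" else "sorry"

def toy_verifies(thm, proof):
    return proof == "by simp"

result = iterative_proving(["hard1", "base", "hard2"],
                           toy_generate, toy_verifies)
print(sorted(result))  # → ['base', 'hard1', 'hard2']
```

Here `hard1` fails in round one but succeeds in round two, once `base`'s verified proof is available as an example.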

Numerical Results and Implications

TheoremLlama demonstrates substantial improvements over established baselines. Achieving cumulative accuracies of 36.48% and 33.61% on MiniF2F-Valid and Test datasets, respectively, it outperforms other methods significantly, including tree-search methods like Expert Iteration and few-shot approaches with state-of-the-art LLMs like GPT-4. This striking result underscores the efficacy of structured training regimens tailored to Lean4's peculiarities and highlights the potential of strategically augmented natural language guidance.

Broader Implications and Future Directions

The strong performance of TheoremLlama on formal proof generation tasks underlines several key insights. First, it demonstrates the enduring value of natural language and formal reasoning integration, providing a template for advancing efforts in complex reasoning tasks. Additionally, the framework's success hints at a broader applicability to other formal languages and even broader domains necessitating structured reasoning, such as AI-driven code synthesis and legal document processing.

The paper raises stimulating questions on the potential for enhancing LLMs' domain-specific reasoning capabilities further. It propels future exploration into more nuanced interactions between Lean4 constructs and natural language, possibly leveraging Reinforcement Learning for real-time proof refinement through direct feedback from Lean4 environments. Moreover, addressing the nuanced challenges of interpreting complex natural language proofs into formal languages could significantly augment theorem proving's reach and application.

In conclusion, TheoremLlama's ambitious framework, which bridges the gap between natural language expressions and formal theorem proving, represents a significant advance in automated theorem proving methodologies. It exemplifies how thoughtful merging of large-scale LLMs with domain-specific training paradigms can propel capabilities into new frontiers of formal reasoning.

Authors (7)
  1. Ruida Wang (7 papers)
  2. Jipeng Zhang (46 papers)
  3. Yizhen Jia (5 papers)
  4. Rui Pan (67 papers)
  5. Shizhe Diao (47 papers)
  6. Renjie Pi (37 papers)
  7. Tong Zhang (569 papers)