
LEAN-GitHub: Compiling GitHub LEAN repositories for a versatile LEAN prover (2407.17227v1)

Published 24 Jul 2024 in cs.AI and cs.CL

Abstract: Recently, LLMs have presented promising results in aiding formal mathematical reasoning. However, their performance is restricted due to the scarcity of formal theorem-proving data, which requires additional effort to be extracted from raw formal language corpora. Meanwhile, a significant amount of human-written formal language corpora remains underutilized. To address this issue, we propose LEAN-GitHub, a dataset consisting of large-scale formal data extracted from almost all Lean 4 repositories on GitHub. After fine-tuning InternLM-math-plus on this dataset, our model achieved accuracies of 48.8% with a single pass and 54.5% with 64 passes on the Lean 4 miniF2F test, surpassing the state-of-the-art method at 52%. It also achieves state of the art on two other Lean 4 benchmarks (ProofNet and Putnam) targeting different fields/levels of math. These results demonstrate that our proposed dataset is beneficial for formal reasoning on a wide range of math topics. We open-source our model at https://github.com/InternLM/InternLM-Math and our data at https://huggingface.co/datasets/InternLM/Lean-GitHub

Compiling GitHub LEAN Repositories for a Versatile LEAN Prover

The paper "\dataset{}: Compiling GitHub LEAN repositories for a versatile LEAN prover" addresses the pressing issue of data scarcity in formal theorem proving, particularly in automated theorem proving (ATP) using LLMs. Authored by Zijian Wu, Jiayu Wang, Dahua Lin, and Kai Chen, this work emphasizes the potential of underutilized human-written formal language corpora in enhancing the performance of theorem provers.

Overview and Contributions

The primary contribution of this paper is the LEAN-GitHub dataset, a substantial collection of formal reasoning data extracted from Lean 4 repositories on GitHub, comprising 28,597 theorems and 218,866 tactics. The dataset significantly augments existing resources such as Mathlib and facilitates advances in automated theorem proving. The authors also address the state duplication problem common in tree proof search methods, improving computational efficiency during proof search.
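
To make the data format concrete, the toy example below (our illustration, not drawn from the dataset) shows the kind of information such extraction yields: for each tactic invocation, the goal state before the tactic and the tactic source text form one training pair.

```lean
-- Toy Lean 4 theorem (illustrative; not from LEAN-GitHub). Extraction tools
-- record, for every tactic step, the goal state before the step and the
-- tactic text itself, yielding (state, tactic) training pairs.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  -- state before: a b : Nat ⊢ a + b = b + a ; recorded tactic: "rw [Nat.add_comm]"
  rw [Nat.add_comm]
```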

Additionally, the authors present InternLM2-StepProver, a 7B-parameter model obtained by fine-tuning InternLM-math-plus on the LEAN-GitHub dataset. The model achieves state-of-the-art performance across multiple Lean 4 benchmarks, demonstrating its efficacy in formal reasoning tasks.
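
As a rough sketch of how such a tactic generator can be queried at inference time, one feeds the current proof state to the fine-tuned language model and samples a candidate tactic. The prompt template below is our own illustrative guess, not the paper's actual fine-tuning format, and the checkpoint name refers to the base model named in the paper.

```python
# Illustrative sketch only: querying a decoder-only LM fine-tuned on
# (state, tactic) pairs for a candidate next tactic. The prompt template is
# a hypothetical placeholder; the paper's actual format may differ.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Base model named in the paper; substitute the released prover checkpoint.
MODEL = "internlm/internlm2-math-plus-7b"

tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL, trust_remote_code=True)

state = "a b : Nat\n⊢ a + b = b + a"               # current Lean goal state
prompt = f"PROOF STATE:\n{state}\nNEXT TACTIC:\n"  # hypothetical template

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64,
                        do_sample=True, temperature=0.7)
tactic = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                          skip_special_tokens=True)
print(tactic)  # e.g. "rw [Nat.add_comm]"
```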

Methodology

Data Extraction

The authors highlight the inadequacies of existing data extraction tools such as LeanDojo and LeanStep, which are primarily designed for single projects. They instead construct a scalable pipeline and extract data from 147 Lean 4 repositories. To handle isolated files and repositories that are not well-formed Lean projects, the pipeline compiles source files directly with the leanc compiler, which also increases extraction efficiency and allows the work to be parallelized; a schematic sketch of this idea appears below.
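
The sketch below is our illustration of the pipeline idea, not the authors' code: compile each .lean file directly, so that files outside a well-formed project can still be processed, and fan the per-file jobs out across processes. The `lean` invocation and its flags are placeholders for the paper's leanc-based setup.

```python
# Schematic extraction pipeline (illustrative, not the authors' code).
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def compile_file(path: Path) -> tuple[Path, bool]:
    """Elaborate one Lean file; success means its proof data can be traced."""
    result = subprocess.run(["lean", str(path)],
                            capture_output=True, text=True, timeout=600)
    return path, result.returncode == 0

def extract_repo(repo_root: str, workers: int = 8) -> None:
    # Compile every .lean file independently and in parallel, so isolated
    # files and non-standard project layouts are still covered.
    files = list(Path(repo_root).rglob("*.lean"))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for path, ok in pool.map(compile_file, files):
            print(f"{'ok' if ok else 'failed'}: {path}")
```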

Dataset Statistics

The paper provides a comprehensive comparison of the LEAN-GitHub dataset with existing datasets such as Lean-Workbook, Deepseek-Prover, and LeanDojo-Mathlib. The comparison underscores the diversity of mathematical topics and the substantial volume of data captured in LEAN-GitHub.

Experimental Results

MiniF2F and ProofNet

InternLM2-StepProver was rigorously evaluated on the miniF2F and ProofNet benchmarks. On the miniF2F test set it reaches 48.8% accuracy with a single pass and 54.5% with 64 passes, surpassing the previous state-of-the-art accuracy of 52%. On ProofNet, it achieves a Pass@1 rate of 18.1%, again outperforming prior models.
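
For context on the Pass@k metric reported above: a standard way to estimate it is the unbiased estimator of Chen et al. (2021), which computes the probability that at least one of k samples drawn from n attempts succeeds. Whether the authors use this estimator or simply run k independent attempts per theorem is not stated in this summary, so treat the snippet as background.

```python
# Unbiased pass@k estimator (Chen et al., 2021): probability that at least
# one of k samples drawn from n attempts (c of which succeeded) is valid.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # fewer failures than samples: success is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy example: 64 attempts on one theorem, 10 valid proofs found.
print(pass_at_k(n=64, c=10, k=1))  # 10/64 = 0.15625
```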

PutnamBench

InternLM2-StepProver also achieved notable results on PutnamBench, solving 5 of the 640 problems in a single pass. This is particularly impressive given that previous models such as DeepSeek-Prover solved only 4 problems even under a pass@10 setting.

Implications

Practical Implications

The practical implications of this work are substantial. By providing a large-scale, carefully extracted, human-written dataset, the authors make a significant contribution to formal reasoning and automated theorem proving. The dataset and model open new avenues for research, supporting the development of more capable and accurate theorem provers.

Theoretical Implications

Theoretically, this work underscores the value of leveraging human-written data to train formal reasoning models, and it highlights the efficacy of combining human-written data with synthetic data to improve performance across fields and difficulty levels. The treatment of the state duplication problem in tree proof search methods is a further contribution; a sketch of the deduplication idea follows.
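
The sketch below is our minimal illustration of state deduplication in best-first proof search; `generate_tactics` and `apply_tactic` are hypothetical stand-ins for the tactic generator and the Lean interaction layer. Keeping a set of already-seen goal states prevents the search from re-expanding the same state reached along different tactic paths.

```python
# Best-first proof search with state deduplication (illustrative sketch).
import heapq

def best_first_search(root_state: str, generate_tactics, apply_tactic,
                      budget: int = 600) -> list[str] | None:
    seen = {root_state}                 # normalized states already enqueued
    frontier = [(0.0, root_state, [])]  # (cost, state, tactic path)
    while frontier and budget > 0:
        cost, state, path = heapq.heappop(frontier)
        for tactic, score in generate_tactics(state):
            budget -= 1
            new_state = apply_tactic(state, tactic)
            if new_state is None:       # tactic failed to apply
                continue
            if new_state == "":         # no goals left: proof found
                return path + [tactic]
            if new_state in seen:       # duplicate state: skip re-expansion
                continue
            seen.add(new_state)
            heapq.heappush(frontier, (cost - score, new_state, path + [tactic]))
    return None                         # budget exhausted without a proof
```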

Future Directions

Given the substantial improvements demonstrated by InternLM2-StepProver, future research could expand the dataset by incorporating more repositories and possibly other formal languages such as Coq and Isabelle. Exploring the use of informal proofs to guide formal proving is another promising direction. Further optimization of the data extraction and compilation processes could also improve extraction efficiency and model training.

Conclusion

The paper successfully demonstrates that the LEAN-GitHub dataset is a valuable resource for enhancing formal reasoning models. By fine-tuning InternLM-math-plus on this dataset to obtain InternLM2-StepProver, the authors achieve state-of-the-art performance across multiple benchmarks, pushing the boundaries of automated theorem proving. The release of both the dataset and the model promises to foster further advances in the field, enabling more robust and versatile formal reasoning systems.
