Enhancing Formal Theorem Proving: A Comprehensive Dataset for Training AI Models on Coq Code (2403.12627v2)
Abstract: In formal theorem proving, the Coq proof assistant stands out for its rigorous approach to verifying mathematical assertions and software correctness. Despite advances in artificial intelligence and machine learning, the specialized syntax and semantics of Coq pose unique challenges for large language models (LLMs). To address this gap, we present a comprehensive dataset designed to improve LLMs' proficiency in interpreting and generating Coq code. The dataset, derived from a collection of over 10,000 Coq source files, covers a wide range of propositions, proofs, and definitions, and is enriched with metadata including source references and licensing information. Our primary aim is to facilitate the development of LLMs capable of generating syntactically correct and semantically meaningful Coq constructs, thereby advancing automated theorem proving. In initial experiments, models trained on this data showed improved accuracy in Coq code generation. Notably, in one experiment a fine-tuned LLM generated 141 valid proofs for a basic lemma, highlighting the dataset's utility for discovering diverse and valid proof strategies. This paper discusses the dataset's composition, the methodology behind its creation, and the implications of our findings for the future of machine learning in formal verification. The dataset is available for further research and exploration: https://huggingface.co/datasets/florath/coq-facts-props-proofs-gen0-v1