Baldur: Whole-Proof Generation and Repair with Large Language Models (2303.04910v2)

Published 8 Mar 2023 in cs.LG, cs.LO, and cs.SE

Abstract: Formally verifying software properties is a highly desirable but labor-intensive task. Recent work has developed methods to automate formal verification using proof assistants, such as Coq and Isabelle/HOL, e.g., by training a model to predict one proof step at a time, and using that model to search through the space of possible proofs. This paper introduces a new method to automate formal verification: We use LLMs, trained on natural language text and code and fine-tuned on proofs, to generate whole proofs for theorems at once, rather than one step at a time. We combine this proof generation model with a fine-tuned repair model to repair generated proofs, further increasing proving power. As its main contributions, this paper demonstrates for the first time that: (1) Whole-proof generation using transformers is possible and is as effective as search-based techniques without requiring costly search. (2) Giving the learned model additional context, such as a prior failed proof attempt and the ensuing error message, results in proof repair and further improves automated proof generation. (3) We establish a new state of the art for fully automated proof synthesis. We reify our method in a prototype, Baldur, and evaluate it on a benchmark of 6,336 Isabelle/HOL theorems and their proofs. In addition to empirically showing the effectiveness of whole-proof generation, repair, and added context, we show that Baldur improves on the state-of-the-art tool, Thor, by automatically generating proofs for an additional 8.7% of the theorems. Together, Baldur and Thor can prove 65.7% of the theorems fully automatically. This paper paves the way for new research into using LLMs for automating formal verification.

Summary

The paper introduces an LLM-based approach that generates entire proofs at once rather than searching one step at a time, and it proves an additional 8.7% of the benchmark theorems beyond the prior state-of-the-art tool, Thor.
The study presents a repair model that corrects failed proofs, using the previous failed attempt and the ensuing error message as context.
Evaluated on a benchmark of 6,336 Isabelle/HOL theorems, Baldur and Thor together prove 65.7% of the theorems fully automatically.

Insightful Overview of the "Baldur: Whole-Proof Generation and Repair with Large Language Models" Paper

The paper "Baldur: Whole-Proof Generation and Repair with Large Language Models" presents a method to automate formal verification by using LLMs to generate and repair entire proofs. It introduces Baldur, a prototype system that demonstrates the effectiveness of LLMs for both proof synthesis and proof repair, a significant departure from traditional search-based proof generation, which predicts one proof step at a time.

Main Contributions and Numerical Results

The paper's contributions are multifold:

Whole-Proof Generation: The paper demonstrates that LLMs fine-tuned on proofs can generate entire proofs for theorems at once, without incremental, search-based proof generation. According to the abstract, this approach is as effective as search-based techniques while avoiding the cost of the search itself.
Proof Repair: The paper pairs the generation model with a fine-tuned repair model that corrects faulty proof attempts. Given a prior failed attempt and the ensuing error message as additional context, the repair model produces a new proof, which further increases the number of theorems proved.
Empirical Evaluation: Baldur is evaluated on a benchmark of 6,336 Isabelle/HOL theorems and their proofs. It improves on the state-of-the-art tool, Thor, by automatically generating proofs for an additional 8.7% of the theorems. Together, Baldur and Thor prove 65.7% of the theorems fully automatically.
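
The headline percentages above can be related with a little arithmetic. The sketch below is only illustrative: it assumes both percentages are fractions of the full 6,336-theorem benchmark, and the "Thor alone" figure and the theorem counts are derived from the quoted numbers rather than quoted from the paper.

```python
# Figures quoted in the abstract.
TOTAL_THEOREMS = 6336        # size of the Isabelle/HOL benchmark
ADDITIONAL_BY_BALDUR = 8.7   # % of theorems Baldur proves that Thor does not
COMBINED = 65.7              # % of theorems proved by Baldur and Thor together

# Derived, not quoted: if Baldur adds 8.7 points on top of Thor and the
# union is 65.7%, Thor on its own accounts for the difference.
thor_alone = round(COMBINED - ADDITIONAL_BY_BALDUR, 1)


def approx_count(percent: float) -> int:
    """Approximate theorem count, assuming the percentage is of the full benchmark."""
    return round(TOTAL_THEOREMS * percent / 100)


print(f"Thor alone (implied): {thor_alone}%")
print(f"Added by Baldur: about {approx_count(ADDITIONAL_BY_BALDUR)} theorems")
print(f"Baldur and Thor combined: about {approx_count(COMBINED)} theorems")
```

The counts are rounded estimates, since the published percentages are themselves rounded to one decimal place.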

These results underscore the potential of LLMs for formal verification tasks, especially for proof generation and error correction, illustrating that the learned models can handle complex proof synthesis and repair scenarios effectively.

Theoretical Implications

From a theoretical perspective, the paper suggests a shift in how formal verification can use LLMs. Generating entire proofs without fine-grained orchestration by a search algorithm removes the step-by-step interaction between model and proof assistant that search-based methods require. This simplifies the proof synthesis pipeline, which may make such systems cheaper to run and easier to scale, because they rely primarily on the LLM's generative abilities.
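
To make the simplified pipeline concrete, here is a minimal sketch of a generate, check, and repair loop. It is not Baldur's implementation: the function names, the callable signatures, the sampling budget, and the choice of one repair attempt per failed sample are all assumptions made for illustration, and the toy stand-ins at the bottom replace the fine-tuned models and the Isabelle/HOL proof checker.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces (assumptions of this sketch, not Baldur's API).
Generator = Callable[[str], str]                   # theorem -> whole proof
Checker = Callable[[str, str], Tuple[bool, str]]   # (theorem, proof) -> (ok, error)
Repairer = Callable[[str, str, str], str]          # (theorem, failed proof, error) -> proof


def prove(theorem: str, generate: Generator, check: Checker,
          repair: Repairer, samples: int = 4) -> Optional[str]:
    """Return a proof accepted by the checker, or None if the budget runs out."""
    for _ in range(samples):
        # Whole-proof generation: one model call yields a complete proof.
        proof = generate(theorem)
        ok, error = check(theorem, proof)
        if ok:
            return proof
        # Repair: the failed attempt and its error message become extra context.
        repaired = repair(theorem, proof, error)
        ok, _ = check(theorem, repaired)
        if ok:
            return repaired
    return None


# Toy stand-ins so the sketch runs without a model or a proof assistant.
def toy_generate(theorem: str) -> str:
    return "by simp"


def toy_check(theorem: str, proof: str) -> Tuple[bool, str]:
    if proof == "by auto":
        return True, ""
    return False, "simp failed to finish the proof"


def toy_repair(theorem: str, failed_proof: str, error: str) -> str:
    # Pretend the repair model learned to switch tactics after a simp failure.
    return "by auto" if "simp" in error else failed_proof


result = prove("lemma example: P", toy_generate, toy_check, toy_repair)
print(result)
```

Note that the proof assistant is consulted only to accept or reject a complete proof, whereas a search-based prover would query it after every predicted step.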

Practical Implications

Practically, Baldur's approach can reduce the burden of writing proofs by hand, which could lower the cost of producing formally verified, more reliable software. Because the repair model takes the failed attempt and its error message as input, the system can recover from some failed attempts instead of discarding them, which makes automated verification more robust.

Future Developments

Looking ahead, the paper anticipates the continued refinement of LLMs in formal verification, particularly:

Exploring more intricate search strategies that integrate proof repair techniques seamlessly.
Developing cross-disciplinary applications wherein similar generative capabilities can be adapted to other domains within automated reasoning and software engineering.
Extending this methodology to other proof assistants, thereby broadening the scope and applicability of LLMs across different formal systems.

In conclusion, the paper presents a significant advance in the use of LLMs for formal verification, showing that whole-proof generation and repair can match search-based techniques and, combined with Thor, set a new state of the art for fully automated proof synthesis.