
CoqPilot, a plugin for LLM-based generation of proofs

Published 25 Oct 2024 in cs.SE, cs.AI, and cs.LO | arXiv:2410.19605v1

Abstract: We present CoqPilot, a VS Code extension designed to help automate writing of Coq proofs. The plugin collects the parts of proofs marked with the admit tactic in a Coq file, i.e., proof holes, and combines LLMs along with non-machine-learning methods to generate proof candidates for the holes. Then, CoqPilot checks if each proof candidate solves the given subgoal and, if successful, replaces the hole with it. The focus of CoqPilot is twofold. Firstly, we want to allow users to seamlessly combine multiple Coq generation approaches and provide a zero-setup experience for our tool. Secondly, we want to deliver a platform for LLM-based experiments on Coq proof generation. We developed a benchmarking system for Coq generation methods, available in the plugin, and conducted an experiment using it, showcasing the framework's possibilities. Demo of CoqPilot is available at: https://youtu.be/oB1Lx-So9Lo. Code at: https://github.com/JetBrains-Research/coqpilot

References (20)
  1. Proofster: Automated Formal Verification (ICSE ’23). IEEE Press, 26–30. https://doi.org/10.1109/ICSE-Companion58688.2023.00018
  2. Yves Bertot and Pierre Castéran. 2013. Interactive theorem proving and program development: Coq’Art: the calculus of inductive constructions. Springer Science & Business Media. https://doi.org/10.1007/978-3-662-07964-5
  3. The tactician: A seamless, interactive tactic learner and prover for coq. In International Conference on Intelligent Computer Mathematics. Springer, 271–277. https://doi.org/10.1007/978-3-030-53518-6_17
  4. Łukasz Czajka and Cezary Kaliszyk. 2018. Hammer for Coq: Automation for dependent type theory. Journal of Automated Reasoning 61 (2018), 423–453. https://doi.org/10.1007/s10817-018-9458-4
  5. The Lean theorem prover (system description). In Automated Deduction-CADE-25: 25th International Conference on Automated Deduction, Berlin, Germany, August 1-7, 2015, Proceedings 25. Springer, 378–388. https://doi.org/10.1007/978-3-319-21401-6_26
  6. Henrico Dolfing. 2019. The $440 Million Software Error at Knight Capital. Retrieved June 3, 2024 from https://www.henricodolfing.com/2019/06/project-failure-case-study-knight-capital.html
  7. Visual Studio Code Extension and Language Server Protocol for Coq. https://github.com/ejgallego/coq-lsp
  8. TacTok: Semantics-aware proof synthesis. Proceedings of the ACM on Programming Languages 4, OOPSLA (2020), 1–31. https://doi.org/10.1145/3428299
  9. Baldur: Whole-proof generation and repair with large language models. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1229–1241. https://doi.org/10.1145/3611643.3616243
  10. A Survey on Large Language Models for Code Generation. arXiv preprint arXiv:2406.00515 (2024). https://doi.org/10.48550/arXiv.2406.00515
  11. CompCert-a formally verified optimizing compiler. In ERTS 2016: Embedded Real Time Software and Systems, 8th European Congress.
  12. Jessica MacNeil. 2019. Mariner 1 destroyed due to code error, July 22, 1962. Retrieved June 3, 2024 from https://www.edn.com/mariner-1-destroyed-due-to-code-error-july-22-1962/
  13. Isabelle/HOL: a proof assistant for higher-order logic. Springer. https://doi.org/10.1007/3-540-45949-9_5
  14. Logical Foundations. Software Foundations, Vol. 1. Electronic textbook.
  15. QED at large: A survey of engineering of formally verified software. Foundations and Trends® in Programming Languages 5, 2-3 (2019), 102–281. https://doi.org/10.1561/2500000045
  16. Graph2Tac: Learning Hierarchical Representations of Math Concepts in Theorem proving. arXiv preprint arXiv:2401.02949 (2024). https://doi.org/10.48550/arXiv.2401.02949
  17. Generating correctness proofs with neural networks. In Proceedings of the 4th ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. 1–10. https://doi.org/10.1145/3394450.3397466
  18. A language-agent approach to formal theorem-proving. arXiv preprint arXiv:2310.04353 (2023). https://doi.org/10.48550/arXiv.2310.04353
  19. Kaiyu Yang and Jia Deng. 2019. Learning to prove theorems via interacting with proof assistants. In International Conference on Machine Learning. PMLR, 6984–6994. https://doi.org/10.48550/arXiv.1905.09381
  20. Finding and understanding bugs in C compilers. In Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation. 283–294. https://doi.org/10.1145/1993498.1993532

Summary

  • The paper introduces CoqPilot, a plugin that integrates LLMs with traditional methods to automate and enhance Coq proof generation.
  • The methodology combines premise selection, multiple LLMs, and the Coq-LSP to achieve notable proof success, reaching up to 51% on a 300-theorem dataset.
  • The results demonstrate significant time savings in proof writing and pave the way for advanced, machine-assisted formal verification practices.

Evaluation of CoqPilot: A Plugin for LLM-based Proof Generation

The paper introduces CoqPilot, a VS Code plugin developed to help automate the writing of Coq proofs. Coq is an interactive theorem prover: formal proofs are constructed through user-supplied tactics, which demands substantial manual effort. The paper details CoqPilot's design, focusing on proof generation that leverages LLMs alongside non-machine-learning methods.

Overview and Methodology

CoqPilot is engineered to address the complexity and time demands of writing formal proofs in Coq. It acts as a versatile tool that combines LLMs with established Coq automation such as Tactician and CoqHammer. The integration of LLMs like GPT-4o and GPT-3.5 is particularly significant: CoqPilot improves their performance through premise selection, supplying relevant context theorems in the prompt. Structuring the contextual input in this way guides the models toward usable proofs.
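The paper does not spell out the ranking used for premise selection, but the idea can be sketched as a similarity ranking over previously proven theorems: score each candidate against the target goal and keep the top few as prompt context. The token-overlap (Jaccard) scoring below is an illustrative assumption, not CoqPilot's actual algorithm.

```typescript
// Hypothetical premise-selection sketch: rank candidate context theorems
// by token overlap with the goal statement. CoqPilot's real ranking may differ.

function tokenize(statement: string): Set<string> {
  // Lowercase identifiers and keywords; ignore punctuation and operators.
  return new Set(statement.toLowerCase().match(/[a-z_][a-z0-9_']*/g) ?? []);
}

function jaccard(a: Set<string>, b: Set<string>): number {
  const inter = [...a].filter((t) => b.has(t)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : inter / union;
}

/** Return the k theorems most similar to the goal, for use as prompt context. */
function selectPremises(goal: string, theorems: string[], k: number): string[] {
  const goalTokens = tokenize(goal);
  return theorems
    .map((t) => ({ t, score: jaccard(goalTokens, tokenize(t)) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(({ t }) => t);
}
```

For a goal about list reversal, this ranking would surface list lemmas ahead of unrelated arithmetic facts.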

The plugin focuses on two primary goals: first, to deliver a zero-setup experience for users by letting them seamlessly combine multiple proof generation approaches; second, to provide a platform for LLM-based experimentation on Coq proof generation. To this end, CoqPilot also implements automatic proof validation through the Coq Language Server Protocol (Coq-LSP), enabling the framework to detect failed proof candidates and discard them.
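The validation step can be sketched as a check-and-substitute loop: each candidate proof for a hole is type-checked (in CoqPilot, via Coq-LSP), and the first one that closes the subgoal replaces the `admit`. The checker below is an injected callback; the real Coq-LSP interaction is considerably more involved.

```typescript
// Minimal sketch of the generate-and-validate loop, assuming a `Checker`
// callback standing in for the Coq-LSP type-checking round trip.

type Checker = (goal: string, proof: string) => boolean;

/** Try candidates in order; return the first proof the checker accepts. */
function fillHole(
  goal: string,
  candidates: string[],
  check: Checker
): string | null {
  for (const candidate of candidates) {
    if (check(goal, candidate)) {
      return candidate; // this proof replaces the `admit` hole
    }
  }
  return null; // all candidates failed; the hole stays as `admit`
}
```

This design makes unreliable generators safe to combine: an LLM may emit many wrong candidates, but only a checked proof ever lands in the file.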

Empirical Results and Performance

The evaluation uses the IMM dataset, focusing on a subset of 300 theorems with varying proof lengths. The experiments show clear gains from LLM-aided proof generation: GPT-4o alone, optimized within CoqPilot, proves 34% of the sample; combining all LLMs raises this to 39%, illustrating the benefit of mixing models; and adding the traditional tools pushes the success rate to 51%.
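The combined rates reflect a set union: a theorem counts as solved if any method proves it, which is why mixing methods beats the best single method. A minimal sketch of that bookkeeping (the numbers in the test are illustrative, not the paper's raw data):

```typescript
// Combined success rate = |union of per-method solved sets| / total theorems.
function combinedSuccessRate(solvedSets: Set<number>[], total: number): number {
  const union = new Set<number>();
  for (const solved of solvedSets) {
    for (const id of solved) union.add(id);
  }
  return union.size / total;
}
```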

These empirical results underscore CoqPilot's efficacy. The ability to improve proof generation success using a blend of techniques highlights the platform's potential to advance formal verification practices significantly.

Implications and Future Prospects

The implications of CoqPilot's development are multifaceted. Practically, it streamlines the proof-writing process, reducing the time and expertise required while maintaining the rigorous standards of formal verification. Theoretically, the integration of LLMs into theorem proving opens new avenues for machine-assisted logical reasoning.

Future research could explore further optimization of LLM integration and expand support to other interactive theorem provers. Broadening the dataset and refining the premise selection process could enhance the generalizability and robustness of CoqPilot.

Conclusion

CoqPilot represents a significant advancement in automating the Coq proof-writing process. By effectively merging LLMs with established methodologies, it provides a powerful tool that promises to expand the potential of formal software verification. The plugin's modularity and extensibility set a foundation for further innovation and integration of AI in formal logic environments.
