
CoqPilot, a plugin for LLM-based generation of proofs

Published 25 Oct 2024 in cs.SE, cs.AI, and cs.LO | arXiv:2410.19605v1

Abstract: We present CoqPilot, a VS Code extension designed to help automate writing of Coq proofs. The plugin collects the parts of proofs marked with the admit tactic in a Coq file, i.e., proof holes, and combines LLMs along with non-machine-learning methods to generate proof candidates for the holes. Then, CoqPilot checks if each proof candidate solves the given subgoal and, if successful, replaces the hole with it. The focus of CoqPilot is twofold. Firstly, we want to allow users to seamlessly combine multiple Coq generation approaches and provide a zero-setup experience for our tool. Secondly, we want to deliver a platform for LLM-based experiments on Coq proof generation. We developed a benchmarking system for Coq generation methods, available in the plugin, and conducted an experiment using it, showcasing the framework's possibilities. Demo of CoqPilot is available at: https://youtu.be/oB1Lx-So9Lo. Code at: https://github.com/JetBrains-Research/coqpilot

References (20)
  1. Proofster: Automated Formal Verification (ICSE ’23). IEEE Press, 26–30. https://doi.org/10.1109/ICSE-Companion58688.2023.00018
  2. Yves Bertot and Pierre Castéran. 2013. Interactive theorem proving and program development: Coq’Art: the calculus of inductive constructions. Springer Science & Business Media. https://doi.org/10.1007/978-3-662-07964-5
  3. The tactician: A seamless, interactive tactic learner and prover for coq. In International Conference on Intelligent Computer Mathematics. Springer, 271–277. https://doi.org/10.1007/978-3-030-53518-6_17
  4. Łukasz Czajka and Cezary Kaliszyk. 2018. Hammer for Coq: Automation for dependent type theory. Journal of Automated Reasoning 61 (2018), 423–453. https://doi.org/10.1007/s10817-018-9458-4
  5. The Lean theorem prover (system description). In Automated Deduction-CADE-25: 25th International Conference on Automated Deduction, Berlin, Germany, August 1-7, 2015, Proceedings 25. Springer, 378–388. https://doi.org/10.1007/978-3-319-21401-6_26
  6. Henrico Dolfing. 2019. The $440 Million Software Error at Knight Capital. Retrieved June 3, 2024 from https://www.henricodolfing.com/2019/06/project-failure-case-study-knight-capital.html
  7. Visual Studio Code Extension and Language Server Protocol for Coq. https://github.com/ejgallego/coq-lsp
  8. TacTok: Semantics-aware proof synthesis. Proceedings of the ACM on Programming Languages 4, OOPSLA (2020), 1–31. https://doi.org/10.1145/3428299
  9. Baldur: Whole-proof generation and repair with large language models. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1229–1241. https://doi.org/10.1145/3611643.3616243
  10. A Survey on Large Language Models for Code Generation. arXiv preprint arXiv:2406.00515 (2024). https://doi.org/10.48550/arXiv.2406.00515
  11. CompCert-a formally verified optimizing compiler. In ERTS 2016: Embedded Real Time Software and Systems, 8th European Congress.
  12. Jessica MacNeil. 2019. Mariner 1 destroyed due to code error, July 22, 1962. Retrieved June 3, 2024 from https://www.edn.com/mariner-1-destroyed-due-to-code-error-july-22-1962/
  13. Isabelle/HOL: a proof assistant for higher-order logic. Springer. https://doi.org/10.1007/3-540-45949-9_5
  14. Logical Foundations. Software Foundations, Vol. 1. Electronic textbook.
  15. QED at large: A survey of engineering of formally verified software. Foundations and Trends® in Programming Languages 5, 2-3 (2019), 102–281. https://doi.org/10.1561/2500000045
  16. Graph2Tac: Learning Hierarchical Representations of Math Concepts in Theorem proving. arXiv preprint arXiv:2401.02949 (2024). https://doi.org/10.48550/arXiv.2401.02949
  17. Generating correctness proofs with neural networks. In Proceedings of the 4th ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. 1–10. https://doi.org/10.1145/3394450.3397466
  18. A language-agent approach to formal theorem-proving. arXiv preprint arXiv:2310.04353 (2023). https://doi.org/10.48550/arXiv.2310.04353
  19. Kaiyu Yang and Jia Deng. 2019. Learning to prove theorems via interacting with proof assistants. In International Conference on Machine Learning. PMLR, 6984–6994. https://doi.org/10.48550/arXiv.1905.09381
  20. Finding and understanding bugs in C compilers. In Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation. 283–294. https://doi.org/10.1145/1993498.1993532

Summary

  • The paper introduces CoqPilot, a plugin that integrates LLMs with traditional methods to automate and enhance Coq proof generation.
  • The methodology combines premise selection, multiple LLMs, and the Coq-LSP to achieve notable proof success, reaching up to 51% on a 300-theorem dataset.
  • The results demonstrate significant time savings in proof writing and pave the way for advanced, machine-assisted formal verification practices.

Evaluation of CoqPilot: A Plugin for LLM-based Proof Generation

The paper introduces CoqPilot, a VS Code plugin developed to help automate the writing of Coq proofs. Coq is an interactive theorem prover: formal proofs are constructed through user-supplied tactics, which demands substantial manual effort. The paper details CoqPilot's design, focusing on proof generation that leverages LLMs alongside non-machine-learning methods.

Overview and Methodology

CoqPilot is engineered to address the complexity and time demands of writing formal proofs in Coq. It acts as a versatile tool that combines LLMs with established Coq automation such as Tactician and CoqHammer. The integration of LLMs like GPT-4o and GPT-3.5 is particularly significant: CoqPilot improves their performance through premise selection, supplying relevant context theorems in the prompt. Structuring the contextual input in this way guides the models toward usable proofs.
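The paper does not spell out the ranking used for premise selection, but the idea can be sketched as a similarity ranking over previously proven theorems: score each candidate against the target goal and keep the top few as prompt context. The token-overlap (Jaccard) scoring below is an illustrative assumption, not CoqPilot's actual algorithm.

```typescript
// Hypothetical premise-selection sketch: rank candidate context theorems
// by token overlap with the goal statement. CoqPilot's real ranking may differ.

function tokenize(statement: string): Set<string> {
  // Lowercase identifiers and keywords; ignore punctuation and operators.
  return new Set(statement.toLowerCase().match(/[a-z_][a-z0-9_']*/g) ?? []);
}

function jaccard(a: Set<string>, b: Set<string>): number {
  const inter = [...a].filter((t) => b.has(t)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : inter / union;
}

/** Return the k theorems most similar to the goal, for use as prompt context. */
function selectPremises(goal: string, theorems: string[], k: number): string[] {
  const goalTokens = tokenize(goal);
  return theorems
    .map((t) => ({ t, score: jaccard(goalTokens, tokenize(t)) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(({ t }) => t);
}
```

For a goal about list reversal, this ranking would surface list lemmas ahead of unrelated arithmetic facts.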

The plugin focuses on two primary goals: first, to deliver a zero-setup experience for users by letting them seamlessly combine multiple proof generation approaches; second, to provide a platform for LLM-based experimentation on Coq proof generation. To this end, CoqPilot also implements automatic proof validation through the Coq Language Server Protocol (Coq-LSP), enabling the framework to detect failed proof candidates and discard them.
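The validation step can be sketched as a check-and-substitute loop: each candidate proof for a hole is type-checked (in CoqPilot, via Coq-LSP), and the first one that closes the subgoal replaces the `admit`. The checker below is an injected callback; the real Coq-LSP interaction is considerably more involved.

```typescript
// Minimal sketch of the generate-and-validate loop, assuming a `Checker`
// callback standing in for the Coq-LSP type-checking round trip.

type Checker = (goal: string, proof: string) => boolean;

/** Try candidates in order; return the first proof the checker accepts. */
function fillHole(
  goal: string,
  candidates: string[],
  check: Checker
): string | null {
  for (const candidate of candidates) {
    if (check(goal, candidate)) {
      return candidate; // this proof replaces the `admit` hole
    }
  }
  return null; // all candidates failed; the hole stays as `admit`
}
```

This design makes unreliable generators safe to combine: an LLM may emit many wrong candidates, but only a checked proof ever lands in the file.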

Empirical Results and Performance

The evaluation uses the IMM dataset, focusing on a subset of 300 theorems with varying proof lengths. The experiments show clear gains from LLM-aided proof generation: GPT-4o alone, optimized within CoqPilot, proves 34% of the sample; combining all LLMs raises this to 39%, illustrating the benefit of mixing models; and adding the traditional tools pushes the success rate to 51%.
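The combined rates reflect a set union: a theorem counts as solved if any method proves it, which is why mixing methods beats the best single method. A minimal sketch of that bookkeeping (the numbers in the test are illustrative, not the paper's raw data):

```typescript
// Combined success rate = |union of per-method solved sets| / total theorems.
function combinedSuccessRate(solvedSets: Set<number>[], total: number): number {
  const union = new Set<number>();
  for (const solved of solvedSets) {
    for (const id of solved) union.add(id);
  }
  return union.size / total;
}
```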

These empirical results underscore CoqPilot's efficacy. The ability to improve proof generation success using a blend of techniques highlights the platform's potential to advance formal verification practices significantly.

Implications and Future Prospects

The implications of CoqPilot's development are multifaceted. Practically, it streamlines the proof-writing process, reducing the time and expertise required while maintaining the rigorous standards of formal verification. Theoretically, the integration of LLMs into theorem proving opens new avenues for machine-assisted logical reasoning.

Future research could explore further optimization of LLM integration and expand support to other interactive theorem provers. Broadening the dataset and refining the premise selection process could enhance the generalizability and robustness of CoqPilot.

Conclusion

CoqPilot represents a significant advancement in automating the Coq proof-writing process. By effectively merging LLMs with established methodologies, it provides a powerful tool that promises to expand the potential of formal software verification. The plugin's modularity and extensibility set a foundation for further innovation and integration of AI in formal logic environments.
