Cobblestone: Iterative Automation for Formal Verification
Abstract: Formal verification using proof assistants, such as Coq, is an effective way of improving software quality, but it is expensive. Writing proofs manually requires both significant effort and expertise. Recent research has used machine learning to automatically synthesize proofs, reducing verification effort, but these tools are able to prove only a fraction of the desired software properties. We introduce Cobblestone, a new proof-synthesis approach that improves on the state of the art by taking advantage of partial progress in proof synthesis attempts. Unlike prior tools, Cobblestone can produce multiple unsuccessful proofs using an LLM, identify the working portions of those proofs, and combine them into a single, successful proof. We evaluate Cobblestone on two benchmarks of open-source Coq projects, controlling for training data leakage in LLM datasets. Fully automatically, Cobblestone can prove 48% of the theorems, while Proverbot9001, the previous state-of-the-art learning-based proof-synthesis tool, can prove 17%. Cobblestone establishes a new state of the art for fully automated proof synthesis tools for Coq. We also evaluate Cobblestone in a setting where it is given external partial proof progress from oracles, serving as proxies for a human proof engineer or another tool. When the theorem is broken down into a set of subgoals and Cobblestone is given a set of relevant lemmas already proven in the project, it can prove up to 58% of the theorems. We qualitatively study the theorems Cobblestone is and is not able to prove to outline potential future research directions to further improve proof synthesis, including developing interactive, semi-automated tools. Our research shows that tools can make better use of partial progress made during proof synthesis to more effectively automate formal verification.
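The core idea of combining the working portions of several unsuccessful proof attempts can be illustrated with a minimal sketch. This is not the paper's implementation: the `closes` oracle below is a hypothetical stand-in for the Coq proof checker, and the greedy combination loop is an assumed simplification of the approach the abstract describes.

```python
def closes(candidate, goal):
    # Hypothetical stand-in for the Coq checker: here a candidate
    # proof (a list of tactic strings) "closes" a subgoal if the
    # subgoal's name appears in one of its tactics. The real tool
    # would instead run the tactics in the proof assistant.
    return any(goal in tactic for tactic in candidate)

def combine(candidates, subgoals):
    """Assemble one proof from the working portions of several
    unsuccessful candidate proofs: keep each candidate only if it
    closes a subgoal no earlier candidate has closed."""
    combined, closed = [], set()
    for cand in candidates:
        newly = {g for g in subgoals if g not in closed and closes(cand, g)}
        if newly:
            combined.append(cand)  # keep this working portion
            closed |= newly
        if closed == set(subgoals):
            return combined        # every subgoal is closed
    return None                    # combination failed

# Two candidates that each prove only part of the theorem:
subgoals = ["base", "step"]
candidates = [["prove base"], ["prove step"]]
print(combine(candidates, subgoals))  # -> [['prove base'], ['prove step']]
```

The point of the sketch is the design choice, not the toy checker: rather than discarding failed attempts, each attempt is inspected for the subgoals it does close, and those fragments are stitched into one proof.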