
Cobblestone: Iterative Automation for Formal Verification

Published 25 Oct 2024 in cs.LO, cs.AI, and cs.PL (arXiv:2410.19940v1)

Abstract: Formal verification using proof assistants, such as Coq, is an effective way of improving software quality, but it is expensive. Writing proofs manually requires both significant effort and expertise. Recent research has used machine learning to automatically synthesize proofs, reducing verification effort, but these tools are able to prove only a fraction of the desired software properties. We introduce Cobblestone, a new proof-synthesis approach that improves on the state of the art by taking advantage of partial progress in proof synthesis attempts. Unlike prior tools, Cobblestone can produce multiple unsuccessful proofs using an LLM, identify the working portions of those proofs, and combine them into a single, successful proof, taking advantage of internal partial progress. We evaluate Cobblestone on two benchmarks of open-source Coq projects, controlling for training data leakage in LLM datasets. Fully automatically, Cobblestone can prove 48% of the theorems, while Proverbot9001, the previous state-of-the-art, learning-based, proof-synthesis tool, can prove 17%. Cobblestone establishes a new state of the art for fully automated proof synthesis tools for Coq. We also evaluate Cobblestone in a setting where it is given external partial proof progress from oracles, serving as proxies for a human proof engineer or another tool. When the theorem is broken down into a set of subgoals and Cobblestone is given a set of relevant lemmas already proven in the project, it can prove up to 58% of the theorems. We qualitatively study the theorems Cobblestone is and is not able to prove to outline potential future research directions to further improve proof synthesis, including developing interactive, semi-automated tools. Our research shows that tools can make better use of partial progress made during proof synthesis to more effectively automate formal verification.


Summary

  • The paper introduces Cobblestone, which iteratively synthesizes verified proofs by combining partial successes from multiple LLM-generated attempts.
  • It achieves state-of-the-art results, proving 48% of theorems on the CoqGym benchmark and 38% on coq-wigderson, outperforming tools like Proverbot9001 and CoqHammer.
  • The methodology incorporates external data, such as proven lemmas and subgoal decompositions, paving the way for interactive and efficient formal verification.

Overview of "Cobblestone: Iterative Automation for Formal Verification"

The paper "Cobblestone: Iterative Automation for Formal Verification" introduces Cobblestone, a novel approach for automating formal proof synthesis, specifically targeting the Coq proof assistant. In formal verification, manually constructing proofs is a labor-intensive process that requires considerable expertise. Machine learning has begun to automate proof synthesis for proof assistants, but existing tools can still prove only a fraction of the desired properties.

Key Contributions

  1. Partial Proof Synthesis with LLMs: Cobblestone uses an LLM to generate multiple proof attempts for a theorem. Its distinctiveness is that even when every attempt fails, it identifies the portions of each attempt that the proof assistant accepts and combines those working fragments into a single, successful proof.
  2. Evaluation against Benchmarks: Cobblestone was evaluated on two benchmarks: a subset of the CoqGym test set and the coq-wigderson project. CoqGym is widely used for evaluating proof-synthesis tools but risks training-data leakage because it likely appears in LLM pre-training datasets; coq-wigderson was created after those datasets were collected, mitigating that risk.
  3. State-of-the-Art Results: The evaluation shows that Cobblestone outperforms existing proof-synthesis tools and baselines, including Proverbot9001 and CoqHammer. It achieved a 48% success rate on CoqGym100, substantially exceeding prior methods, and a 38% success rate on Wigderson100.
  4. Incorporation of External Information: Cobblestone can also harness external partial progress, such as proven lemmas or subgoal decompositions supplied by a human proof engineer or another tool. With both kinds of oracle information, it proves up to 58% of theorems, a promising direction for interactive, semi-automated proof synthesis.
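The combination step described above can be illustrated with a minimal sketch. This is not the paper's actual implementation; all names (`combine_attempts`, `check`, the subgoal ids, and the tactic strings) are hypothetical, and the proof-assistant check is stubbed out with a lookup table:

```python
def combine_attempts(attempts, check):
    """Greedily assemble one combined proof from multiple failed attempts.

    attempts: list of dicts mapping subgoal id -> candidate proof script
    check:    callable(subgoal_id, script) -> bool, a stand-in for the
              proof assistant accepting the script for that subgoal
    """
    combined = {}
    for attempt in attempts:
        for subgoal, script in attempt.items():
            # Keep the first fragment the checker accepts for each subgoal.
            if subgoal not in combined and check(subgoal, script):
                combined[subgoal] = script
    return combined


# Toy example: two attempts that each fail overall, but together
# cover both subgoals of a theorem.
attempts = [
    {"goal_1": "auto.", "goal_2": "bad_tactic."},
    {"goal_1": "bad_tactic.", "goal_2": "lia."},
]
accepted = {("goal_1", "auto."), ("goal_2", "lia.")}
check = lambda goal, script: (goal, script) in accepted

print(combine_attempts(attempts, check))
# → {'goal_1': 'auto.', 'goal_2': 'lia.'}
```

The point of the sketch is the core insight: neither attempt succeeds on its own, yet a working proof for every subgoal exists across the set, so stitching the accepted fragments together closes the whole theorem.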

Implications and Future Directions

This research has substantial implications for both the practice and theory of automated verification:

  • Practical Implications: Tools like Cobblestone can reduce the time and expertise needed to formally verify software. Better automation may encourage broader industrial adoption of formal methods, improving software reliability and reducing the cost of software faults.
  • Theoretical Implications: Cobblestone applies LLMs beyond their traditional natural-language uses, demonstrating their effectiveness in formal verification settings. This interdisciplinary approach highlights opportunities for LLMs to solve complex, structured problems across software engineering.
  • Speculative Future Developments: Since Cobblestone demonstrates synergy between LLMs and formal verification, future work might integrate more diverse data sources and training paradigms to improve proof understanding and synthesis. Such advances could yield interactive verification environments in which software engineers and AI tools collaborate.

Conclusion

In conclusion, Cobblestone is a robust step forward in automating formal verification. Its methodology not only surpasses existing automated proof-synthesis tools in effectiveness but also paves the way for more interactive, user-friendly verification systems. By exploiting partial progress from state-of-the-art language models, Cobblestone could make formal verification more accessible and usable in the development of critical software systems.
