Evaluating the Ability of Large Language Models to Generate Verifiable Specifications in VeriFast (2411.02318v3)
Abstract: Static verification is a powerful method for enhancing software quality, but it demands significant human labor and resources. This is particularly true of static verifiers that reason about heap-manipulating programs using an ownership logic. Large language models (LLMs) have shown promise in a number of software engineering activities, including code generation, test generation, proof generation for theorem provers, and specification generation for static verifiers. However, prior work has not explored how well LLMs can generate specifications based on an ownership logic, such as separation logic. To address this gap, this paper evaluates the effectiveness of OpenAI's GPT-4o model at generating specifications for C programs that are verifiable with VeriFast, a separation-logic-based static verifier. Our experiment employs three different types of user input as well as basic and Chain-of-Thought (CoT) prompting to assess GPT-4o's capabilities. Our results indicate that the specifications generated by GPT-4o preserve functional behavior but often fail to verify; those that do verify tend to contain redundancies. We discuss future directions for improving performance.
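For readers unfamiliar with VeriFast, a specification takes the form of ghost-comment annotations (`//@ requires` / `//@ ensures`) attached to a C function, with heap ownership expressed through points-to predicates joined by the separating conjunction `&*&`. The following is a minimal illustrative sketch, not an example from the paper's benchmark; `swap` is a hypothetical function chosen for brevity:

```c
/* A VeriFast-annotated swap. The precondition claims ownership of the
 * two integer cells (with ghost variables ?x and ?y binding their
 * values); the postcondition returns ownership with the values
 * exchanged. The //@ annotations are ordinary comments to a C
 * compiler, so the file still compiles and runs normally. */
void swap(int *a, int *b)
//@ requires integer(a, ?x) &*& integer(b, ?y);
//@ ensures integer(a, y) &*& integer(b, x);
{
    int tmp = *a;
    *a = *b;
    *b = tmp;
}
```

Because the annotations live in comments, the same source file can be fed both to a C compiler for testing and to VeriFast for symbolic verification; a model that generates such annotations must get both the functional contract and the ownership footprint right for the file to verify.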