Can Large Language Models Transform Natural Language Intent into Formal Method Postconditions? (2310.01831v2)
Abstract: Informal natural language that describes code functionality, such as code comments or function documentation, may contain substantial information about a program's intent. However, there is typically no guarantee that a program's implementation and its natural language documentation are aligned. In the case of a conflict, leveraging information in code-adjacent natural language has the potential to enhance fault localization, debugging, and code trustworthiness. In practice, however, this information is often underutilized due to the inherent ambiguity of natural language, which makes natural language intent challenging to check programmatically. The emergent abilities of Large Language Models (LLMs) have the potential to facilitate the translation of natural language intent into programmatically checkable assertions. However, it is unclear whether LLMs can correctly translate informal natural language specifications into formal specifications that match programmer intent, and whether such translation could be useful in practice. In this paper, we describe nl2postcond, the problem of leveraging LLMs to transform informal natural language into formal method postconditions, expressed as program assertions. We introduce and validate metrics to measure and compare different nl2postcond approaches, using the correctness and discriminative power of generated postconditions. We then use qualitative and quantitative methods to assess the quality of nl2postcond postconditions, finding that they are generally correct and able to discriminate incorrect code. Finally, we find that nl2postcond via LLMs has the potential to be helpful in practice; nl2postcond-generated postconditions caught 64 real-world historical bugs from Defects4J.
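To make the nl2postcond problem concrete, here is a minimal hypothetical sketch (not an example from the paper): a function's docstring states the intent in natural language, and a postcondition of the kind nl2postcond targets is expressed as runnable assertions over the input and return value. The function `remove_duplicates` and the checker `check_postcondition` are illustrative names, not artifacts from the paper.

```python
def remove_duplicates(xs):
    """Return a new list with the elements of xs, duplicates removed,
    preserving first-occurrence order."""
    seen = set()
    out = []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


def check_postcondition(xs, result):
    """A postcondition an nl2postcond-style approach might derive from
    the docstring above, written as program assertions."""
    # Every element of the result came from the input.
    assert all(r in xs for r in result)
    # The result contains no duplicates.
    assert len(result) == len(set(result))
    # Every input element is represented in the result.
    assert all(x in result for x in xs)


result = remove_duplicates([3, 1, 3, 2, 1])
check_postcondition([3, 1, 3, 2, 1], result)
print(result)  # → [3, 1, 2]
```

The paper's two quality criteria map directly onto this sketch: correctness means the assertions hold for every valid implementation of the documented intent, and discriminative power means a buggy implementation (say, one that drops elements entirely) would trip at least one assertion.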
- Andrea Arcuri. 2008. On the automation of fixing software bugs. In Companion of the 30th international conference on Software engineering. 1003–1006.
- Amazon AWS. 2023. Amazon CodeWhisperer. Accessed September 27, 2023. https://aws.amazon.com/codewhisperer/.
- Translating code comments to procedure specifications. In Proceedings of the 27th ACM SIGSOFT international symposium on software testing and analysis. 242–253.
- Formal specifications in software maintenance: From code to Z++ and back again. Information and Software Technology 35, 11-12 (1993), 679–690.
- Beyond assertions: Advanced specification and verification with JML and ESC/Java2. In Formal Methods for Components and Objects: 4th International Symposium, FMCO 2005, Amsterdam, The Netherlands, November 1-4, 2005, Revised Lectures 4. Springer, 342–363.
- Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
- An abstract interpretation framework for refactoring with application to extract methods with contracts. In Proceedings of the ACM international conference on Object oriented programming systems languages and applications. 213–232.
- Edsger W Dijkstra and Carel S Scholten. 1990. The strongest postcondition. Predicate Calculus and Program Semantics (1990), 209–215.
- Toga: A neural method for test oracle generation. In Proceedings of the 44th International Conference on Software Engineering. 2130–2141.
- Dynamically discovering likely program invariants to support program evolution. In Proceedings of the 21st international conference on Software engineering. 213–224.
- Towards Generating Functionally Correct Code Edits from Natural Language Issue Descriptions. arXiv preprint arXiv:2304.03816 (2023).
- Gordon Fraser and Andrea Arcuri. 2011. EvoSuite: automatic test suite generation for object-oriented software. In Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering. 416–419.
- Learning invariants using decision trees and implication counterexamples. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, St. Petersburg, FL, USA, January 20 - 22, 2016, Rastislav Bodík and Rupak Majumdar (Eds.). ACM, 499–512. https://doi.org/10.1145/2837614.2837664
- GitHub. 2023. GitHub Copilot. Accessed September 27, 2023. https://github.com/features/copilot/.
- Automatic generation of oracles for exceptional behaviors. In Proceedings of the 25th international symposium on software testing and analysis. 213–224.
- Hao He. 2019. Understanding source code comments at large-scale. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1217–1219.
- Henry Hsu and Peter A Lachenbruch. 2014. Paired t test. Wiley StatsRef: statistics reference online (2014).
- Daniel Jackson. 1992. Aspect, a formal specification language for detecting bugs. Ph. D. Dissertation. Citeseer.
- Yue Jia and Mark Harman. 2010. An analysis and survey of the development of mutation testing. IEEE transactions on software engineering 37, 5 (2010), 649–678.
- Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 2014 international symposium on software testing and analysis. 437–440.
- I speak, you verify: Toward trustworthy neural program synthesis. arXiv preprint arXiv:2210.00848 (2022).
- The Stack: 3 TB of permissively licensed source code. arXiv preprint arXiv:2211.15533 (2022).
- Interactive code generation via test-driven user-intent formalization. arXiv preprint arXiv:2208.05950 (2022).
- Can language models learn from explanations in context? arXiv:2204.02329 [cs.CL]
- K Rustan M Leino. 2010. Dafny: An automatic program verifier for functional correctness. In Logic for Programming, Artificial Intelligence, and Reasoning: 16th International Conference, LPAR-16, Dakar, Senegal, April 25–May 1, 2010, Revised Selected Papers 16. Springer, 348–370.
- Contract driven development = test driven development - writing test cases. In Proceedings of the 6th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering. 425–434.
- CODAMOSA: Escaping coverage plateaus in test generation with pre-trained large language models. In International conference on software engineering (ICSE).
- StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023).
- Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In 37th Conference on Neural Information Processing Systems (NeurIPS 2023). https://arxiv.org/abs/2305.01210
- Using transfer learning for code-related tasks. IEEE Transactions on Software Engineering 49, 4 (2022), 1580–1598.
- The Spec# programming system: An overview. In Construction and Analysis of Safe, Secure, and Interoperable Smart Devices (CASSIS), volume 3362 of Lecture Notes in Computer Science.
- EvoSpex: An evolutionary algorithm for learning postconditions. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 1223–1235.
- Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474 (2022).
- Demystifying GPT Self-Repair for Code Generation. arXiv preprint arXiv:2306.09896 (2023).
- Agile specification-driven development. In International Conference on Extreme Programming and Agile Processes in Software Engineering. Springer, 104–112.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
- Inferring method specifications from natural language API descriptions. In 2012 34th international conference on software engineering (ICSE). IEEE, 815–825.
- Can Large Language Models Reason about Program Invariants? (2023).
- Rolf-Helge Pfeiffer. 2020. What constitutes software? An empirical, descriptive study of artifacts. In Proceedings of the 17th International Conference on Mining Software Repositories. 481–491.
- Formal specification-driven development. In Proceedings of the International Conference on Software Engineering Research and Practice (SERP). The Steering Committee of The World Congress in Computer Science, Computer …, 1.
- CLN2INV: Learning Loop Invariants with Continuous Logic Networks. In International Conference on Learning Representations.
- Rahul Sharma and Alex Aiken. 2016. From invariant checking to invariant inference using randomized search. Formal Methods in System Design 48 (2016), 235–256.
- Static specification mining using automata-based abstractions. In Proceedings of the 2007 International Symposium on Software Testing and Analysis. 174–184.
- Secure Distributed Programming with Value-Dependent Types. In Proceedings of the 16th ACM SIGPLAN International Conference on Functional Programming (Tokyo, Japan) (ICFP ’11). Association for Computing Machinery, New York, NY, USA, 266–278. https://doi.org/10.1145/2034773.2034811
- Tabnine. 2023. Tabnine Code Completion. Accessed September 27, 2023. https://www.tabnine.com/.
- /* iComment: Bugs or bad comments? */. In Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles. 145–158.
- aComment: mining annotations from comments and code to detect interrupt related concurrency bugs. In Proceedings of the 33rd international conference on software engineering. 11–20.
- @tComment: Testing Javadoc comments to detect comment-code inconsistencies. In 2012 IEEE Fifth International Conference on Software Testing, Verification and Validation. IEEE, 260–269.
- Unit Test Case Generation with Transformers and Focal Context. arXiv:2009.05617 [cs.SE]
- Generating Accurate Assert Statements for Unit Test Cases using Pretrained Transformers. In IEEE/ACM International Conference on Automation of Software Test, AST@ICSE 2022, Pittsburgh, PA, USA, May 21-22, 2022. ACM/IEEE, 54–64. https://doi.org/10.1145/3524481.3527220
- Can Large Language Models Write Good Property-Based Tests? arXiv preprint arXiv:2307.04346 (2023).
- Emergent Abilities of Large Language Models. Transactions on Machine Learning Research (2022).
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903 [cs.CL]
- Learning nonlinear loop invariants with gated continuous logic networks. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation. 106–120.
- Inferring resource specifications from natural language API documentation. In 2009 IEEE/ACM International Conference on Automated Software Engineering. IEEE, 307–318.
- Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. arXiv:2205.10625 [cs.AI]
- Analyzing APIs documentation and code to detect directive defects. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE). IEEE, 27–37.