Impact of Large Language Models on Generating Software Specifications (arXiv:2306.03324v2)
Abstract: Software specifications are essential for ensuring the reliability of software systems. Existing specification extraction approaches, however, suffer from limited generalizability and require manual effort. The recent emergence of large language models (LLMs), which have been successfully applied to numerous software engineering tasks, offers a promising avenue for automating this process. In this paper, we conduct the first empirical study to evaluate the capabilities of LLMs for generating software specifications from software comments or documentation. We evaluate LLMs' performance with Few-Shot Learning (FSL), which enables LLMs to generalize from a small number of examples, as well as different prompt construction strategies, and compare LLMs against traditional approaches. Additionally, we conduct a comparative diagnosis of the failure cases of both LLMs and traditional methods, identifying their unique strengths and weaknesses. Lastly, we conduct extensive experiments on 15 state-of-the-art LLMs, evaluating their performance and cost-effectiveness for generating software specifications. Our results show that with FSL, LLMs outperform traditional methods (by 5.6%), and more sophisticated prompt construction strategies can further widen this performance gap (by 5.1% to 10.0%). Yet, LLMs face unique challenges, such as ineffective prompts and a lack of domain knowledge, which together account for 53% to 60% of LLM-unique failures. The strong performance of open-source models (e.g., StarCoder) makes closed-source models (e.g., GPT-3 Davinci) less desirable given their size and cost. Our study offers valuable insights for future research on improving specification generation.
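To make the few-shot setup the abstract describes more concrete, here is a minimal sketch in Python of how a few-shot prompt pairing code comments with formal specifications might be assembled. The demonstration pairs, the specification syntax, and the helper name `build_fsl_prompt` are all illustrative assumptions, not the paper's actual prompts, dataset, or method.

```python
# Hypothetical sketch of few-shot prompt construction for
# comment-to-specification generation. The example pairs and the
# JML-like specification syntax are invented for illustration.

# A handful of (code comment, formal specification) demonstration pairs.
FEW_SHOT_EXAMPLES = [
    (
        "@throws IllegalArgumentException if capacity is negative",
        "capacity < 0 ==> throws(IllegalArgumentException)",
    ),
    (
        "@return true if the list contains no elements",
        "returns(size() == 0)",
    ),
]


def build_fsl_prompt(target_comment: str) -> str:
    """Assemble a few-shot prompt: demonstrations first, then the query."""
    parts = ["Translate each code comment into a formal specification.\n"]
    for comment, spec in FEW_SHOT_EXAMPLES:
        parts.append(f"Comment: {comment}\nSpecification: {spec}\n")
    # The target comment goes last, leaving the specification
    # for the model to complete.
    parts.append(f"Comment: {target_comment}\nSpecification:")
    return "\n".join(parts)


if __name__ == "__main__":
    prompt = build_fsl_prompt(
        "@throws NullPointerException if the specified element is null"
    )
    print(prompt)  # this string would be sent to an LLM completion endpoint
```

Prompt construction strategies of the kind the study compares could vary, for example, how the demonstration pairs are selected (e.g., by semantic similarity to the target comment) or how the instruction line is phrased.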
Authors: Danning Xie, Byungwoo Yoo, Nan Jiang, Mijung Kim, Lin Tan, Xiangyu Zhang, Judy S. Lee