Do LLMs generate test oracles that capture the actual or the expected program behaviour? (2410.21136v1)
Abstract: Software testing is an essential part of the software development cycle for improving code quality. Typically, a unit test consists of a test prefix and a test oracle that captures the developer's intended behaviour. A known limitation of traditional test generation techniques (e.g., Randoop and Evosuite) is that they produce test oracles that capture the actual program behaviour rather than the expected one. Recent approaches leverage LLMs, trained on enormous amounts of data, to generate developer-like code and test cases. We investigate whether LLM-generated test oracles capture the actual or the expected software behaviour. To answer this question, we conduct a controlled experiment that studies LLM performance on two tasks, namely test oracle classification and generation. The study includes developer-written and automatically generated test cases and oracles for 24 open-source Java repositories, and several well-tested prompts. Our findings show that LLM-based test generation approaches are also prone to generating oracles that capture the actual program behaviour rather than the expected one. Moreover, LLMs are better at generating test oracles than at classifying correct ones, and they generate better test oracles when the code contains meaningful test or variable names. Finally, the LLM-generated test oracles have higher fault-detection potential than the Evosuite-generated ones.
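To make the central distinction concrete, the sketch below contrasts the two kinds of oracles on a hypothetical buggy method (all class, method, and test names are invented for illustration and are not taken from the studied repositories; JUnit 5 on the classpath is assumed). A tool that derives assertions by executing the current code produces the first, always-passing oracle; an oracle reflecting developer intent, like the second, fails on the buggy code and exposes the fault.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

// Hypothetical example: the method is intended to clamp negative
// prices to zero, but a bug returns them unchanged.
class PriceUtils {
    static int clampToZero(int value) {
        return value; // BUG: should be Math.max(0, value)
    }
}

class PriceUtilsTest {

    // Oracle capturing the ACTUAL behaviour: an assertion derived from
    // executing the current (buggy) code passes, so the fault is missed.
    @Test
    void oracleOfActualBehaviour() {
        int result = PriceUtils.clampToZero(-5); // test prefix
        assertEquals(-5, result);                // encodes the buggy output
    }

    // Oracle capturing the EXPECTED behaviour: it encodes the developer's
    // intent, fails on the buggy code, and thereby reveals the fault.
    @Test
    void oracleOfExpectedBehaviour() {
        int result = PriceUtils.clampToZero(-5); // test prefix
        assertEquals(0, result);                 // encodes the intended output
    }
}
```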