ChatGPT vs SBST: A Comparative Assessment of Unit Test Suite Generation (2307.00588v1)
Abstract: Recent advancements in large language models (LLMs) have demonstrated exceptional success across a wide range of general-domain tasks, such as question answering and instruction following. LLMs have also shown potential in various software engineering applications. In this study, we present a systematic comparison of test suites generated by the ChatGPT LLM and by EvoSuite, a state-of-the-art search-based software testing (SBST) tool. Our comparison is based on several critical factors, including correctness, readability, code coverage, and bug-detection capability. By highlighting the strengths and weaknesses of LLMs (specifically ChatGPT) in generating unit test cases relative to EvoSuite, this work provides valuable insights into the performance of LLMs on software engineering problems. Overall, our findings underscore the potential of LLMs in software engineering and pave the way for further research in this area.
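To make the readability dimension of the comparison concrete, the following is a minimal illustrative sketch (not taken from the paper): a typical SBST-generated test uses opaque names like `test0` and variables like `stack0`, while an LLM-generated test tends to read as a behavior description. The `ReadabilitySketch` class, its method names, and the use of `ArrayDeque` as the class under test are all assumptions for illustration only.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ReadabilitySketch {

    // Style typical of tool-generated (e.g., SBST) tests: the name
    // carries no intent, and variables are numbered mechanically.
    static void test0() {
        Deque<Integer> stack0 = new ArrayDeque<>();
        stack0.push(42);
        if (stack0.peek() != 42) {
            throw new AssertionError("unexpected top element");
        }
    }

    // Style typical of LLM-generated tests: the name states the
    // expected behavior, and variables are named for their role.
    static void pushThenPeekReturnsLastPushedElement() {
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(42);
        if (stack.peek() != 42) {
            throw new AssertionError("peek should return the last pushed element");
        }
    }

    public static void main(String[] args) {
        test0();
        pushThenPeekReturnsLastPushedElement();
        System.out.println("both tests passed");
    }
}
```

Both tests exercise identical behavior; the contrast is purely in how much intent a maintainer can recover from the test name and variable names, which is one of the qualities the paper's readability analysis examines.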
- H. Zhu, P. A. V. Hall, and J. H. R. May, “Software unit test coverage and adequacy,” ACM Comput. Surv., vol. 29, no. 4, p. 366–427, 1997.
- M. Harman, S. A. Mansouri, and Y. Zhang, “Search-based software engineering: Trends, techniques and applications,” ACM Computing Surveys (CSUR), vol. 45, no. 1, pp. 1–61, 2012.
- G. Fraser and A. Arcuri, “Whole test suite generation,” IEEE Transactions on Software Engineering, vol. 39, no. 2, pp. 276–291, 2013.
- Z. Zhou, Y. Zhou, C. Fang, Z. Chen, and Y. Tang, “Selectively combining multiple coverage goals in search-based unit test generation,” in 37th IEEE/ACM International Conference on Automated Software Engineering, 2022, pp. 1–12.
- N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. B. Brown, D. Song, U. Erlingsson et al., “Extracting training data from large language models.” in USENIX Security Symposium, vol. 6, 2021.
- T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean, “Large language models in machine translation,” 2007.
- A. Svyatkovskiy, S. K. Deng, S. Fu, and N. Sundaresan, “Intellicode compose: Code generation using transformer,” in Proc. of ESEC/FSE, 2020, p. 1433–1443.
- U. Alon, R. Sadaka, O. Levy, and E. Yahav, “Structural language models for any-code generation,” 2019.
- G. Poesia, A. Polozov, V. Le, A. Tiwari, G. Soares, C. Meek, and S. Gulwani, “Synchromesh: Reliable code generation from pre-trained language models,” in Proc. of ICLR, 2022.
- P. W. McBurney and C. McMillan, “Automatic source code summarization of context for java methods,” IEEE Transactions on Software Engineering, vol. 42, no. 2, pp. 103–119, 2016.
- S. Haiduc, J. Aponte, and A. Marcus, “Supporting program comprehension with source code summarization,” in Proc. of ICSE, 2010, p. 223–226.
- J. Zhang, X. Wang, H. Zhang, H. Sun, and X. Liu, “Retrieval-based neural source code summarization,” in Proc. of ICSE, 2020, p. 1385–1397.
- P. W. McBurney and C. McMillan, “Automatic documentation generation via source code summarization of method context,” in Proc. of ICPC, 2014, p. 279–290.
- X. Hu, G. Li, X. Xia, D. Lo, and Z. Jin, “Deep code comment generation,” in Proc. of ICPC, 2018, p. 200–210.
- T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 1877–1901.
- OpenAI, “Chatgpt: Optimizing language models for dialogue,” 2023, https://openai.com/blog/chatgpt/.
- P. Tonella, “Evolutionary testing of classes,” in Proc. of ISSTA, 2004, p. 119–128.
- A. Panichella, F. M. Kifetew, and P. Tonella, “Reformulating branch coverage as a many-objective optimization problem,” in Proc. of ICST, 2015, pp. 1–10.
- ——, “Automated test case generation as a many-objective optimisation problem with dynamic selection of the targets,” IEEE Transactions on Software Engineering, vol. 44, no. 2, pp. 122–158, 2018.
- G. Fraser and A. Arcuri, “Evosuite: Automatic test suite generation for object-oriented software,” in Proc. of ESEC/FSE, 2011, p. 416–419.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
- C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020.
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” arXiv preprint arXiv:2203.02155, 2022.
- M. Artetxe, J. Du, N. Goyal, L. Zettlemoyer, and V. Stoyanov, “On the role of bidirectionality in language model pre-training,” arXiv preprint arXiv:2205.11726, 2022.
- A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
- A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” 2018.
- Y. Lin, Y. S. Ong, J. Sun, G. Fraser, and J. S. Dong, “Graph-based seed object synthesis for search-based unit testing,” in Proc. of ESEC/FSE, 2021, p. 1068–1080.
- Defects4J, “Defects4j: A database of real faults and an experimental infrastructure to enable controlled experiments in software engineering research,” 2023, https://github.com/rjust/defects4j.
- Evosuite, “Evosuite: Automatic test suite generation for java,” 2023, https://www.evosuite.org/.
- SpotBugs, “Spotbugs,” 2023, https://spotbugs.github.io/index.html.
- B. Pugh and D. Hovemeyer, “Findbugs,” 2023, https://findbugs.sourceforge.net/.
- N. Ayewah, W. Pugh, D. Hovemeyer, J. D. Morgenthaler, and J. Penix, “Using static analysis to find bugs,” IEEE Software, vol. 25, no. 5, pp. 22–29, 2008.
- JetBrains, “Intellij idea – the leading java and kotlin ide,” 2023, https://www.jetbrains.com/idea/.
- SpotBugs, “Spotbugs bug descriptions,” 2023, https://spotbugs.readthedocs.io/en/stable/bugDescriptions.html.
- CheckStyle, “Checkstyle,” 2023, https://checkstyle.sourceforge.io/.
- Oracle, “Code conventions for the java programming language,” 1999, https://www.oracle.com/java/technologies/javase/codeconventions-contents.html.
- Google, “Google java style guide,” 2023, https://google.github.io/styleguide/javaguide.html.
- C. E. C. Dantas and M. A. Maia, “Readability and understandability scores for snippet assessment: an exploratory study,” arXiv preprint arXiv:2108.09181, 2021.
- PMD, “Pmd source code analyzer,” 2023, https://pmd.github.io/.
- SonarSource, “Cognitive complexity: A new way of measuring understandability,” 2021, https://www.sonarsource.com/docs/CognitiveComplexity.pdf.
- Mountainminds GmbH &amp; Co. KG, “Jacoco java code coverage library,” 2023, https://www.jacoco.org/jacoco/.
- A. Vargha and H. D. Delaney, “A critique and improvement of the cl common language effect size statistics of mcgraw and wong,” Journal of Educational and Behavioral Statistics, vol. 25, no. 2, pp. 101–132, 2000.
- J. M. Rojas, J. Campos, M. Vivanti, G. Fraser, and A. Arcuri, “Combining multiple coverage criteria in search-based unit test generation,” in Search-Based Software Engineering: 7th International Symposium, 2015, pp. 93–108.
- K. Shrestha and M. J. Rutherford, “An empirical evaluation of assertions as oracles,” in 2011 Fourth IEEE International Conference on Software Testing, Verification and Validation, 2011, pp. 110–119.
- G. Jahangirova, D. Clark, M. Harman, and P. Tonella, “An empirical validation of oracle improvement,” IEEE Transactions on Software Engineering, vol. 47, no. 8, pp. 1708–1728, 2021.
- V. Terragni, G. Jahangirova, P. Tonella, and M. Pezzè, “Gassert: A fully automated tool to improve assertion oracles,” in 2021 IEEE/ACM 43rd International Conference on Software Engineering: Companion Proceedings (ICSE-Companion), 2021, pp. 85–88.
- Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations,” arXiv preprint arXiv:1909.11942, 2019.
- Y. Zhang, S. Sun, M. Galley, Y.-C. Chen, C. Brockett, X. Gao, J. Gao, J. Liu, and B. Dolan, “DIALOGPT : Large-scale generative pre-training for conversational response generation,” in Proc. of ACL, 2020.
- J. Pilault, R. Li, S. Subramanian, and C. Pal, “On extractive and abstractive neural document summarization with transformer language models,” in Proc. of EMNLP, 2020, pp. 9308–9319.
- X. Cai, S. Liu, J. Han, L. Yang, Z. Liu, and T. Liu, “Chestxraybert: A pretrained language model for chest radiology report summarization,” IEEE Transactions on Multimedia, pp. 845–855, 2021.
- D. Khashabi, S. Min, T. Khot, A. Sabharwal, O. Tafjord, P. Clark, and H. Hajishirzi, “Unifiedqa: Crossing format boundaries with a single qa system,” arXiv preprint arXiv:2005.00700, 2020.
- K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
- M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.
- N. D. Bui, Y. Yu, and L. Jiang, “Infercode: Self-supervised learning of code representations by predicting subtrees,” in Proc. of ICSE. IEEE, 2021, pp. 1186–1197.
- M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” arXiv preprint arXiv:1802.05365, 2018.
- GitHub, “Github copilot: Your ai pair programmer,” 2023, https://github.com/features/copilot/.
- W. Miller and D. L. Spooner, “Automatic generation of floating-point test data,” IEEE Transactions on Software Engineering, no. 3, pp. 223–226, 1976.
- Z. Li, M. Harman, and R. M. Hierons, “Search algorithms for regression test case prioritization,” IEEE Transactions on Software Engineering, vol. 33, no. 4, pp. 225–237, 2007.
- R. A. Silva, S. d. R. S. de Souza, and P. S. L. de Souza, “A systematic review on search based mutation testing,” Information and Software Technology, vol. 81, pp. 19–35, 2017.
- K. R. Walcott, M. L. Soffa, G. M. Kapfhammer, and R. S. Roos, “Time-aware test suite prioritization,” in Proc. of ISSTA, 2006, pp. 1–12.
- G. Grano, C. Laaber, A. Panichella, and S. Panichella, “Testing with fewer resources: An adaptive approach to performance-aware test case generation,” IEEE Transactions on Software Engineering, vol. 47, no. 11, pp. 2332–2347, 2019.
- A. Arcuri and J. P. Galeotti, “Enhancing search-based testing with testability transformations for existing apis,” ACM Transactions on Software Engineering and Methodology, vol. 31, no. 1, pp. 1–34, 2021.
- Y. Lin, J. Sun, G. Fraser, Z. Xiu, T. Liu, and J. S. Dong, “Recovering fitness gradients for interprocedural boolean flags in search-based testing,” in Proc. of ISSTA, 2020, pp. 440–451.
- P. Braione, G. Denaro, A. Mattavelli, and M. Pezzè, “Combining symbolic execution and search-based testing for programs with complex heap inputs,” in Proc. of ISSTA, 2017, pp. 90–101.
- X. Xu, Z. Zhu, and L. Jiao, “An adaptive fitness function based on branch hardness for search based testing,” in Proc. of GECCO, 2017, pp. 1335–1342.
- G. Gay, “Generating effective test suites by combining coverage criteria,” in Search Based Software Engineering, 2017, pp. 65–82.
- E. Daka, J. M. Rojas, and G. Fraser, “Generating unit tests with descriptive names or: Would you name your children thing1 and thing2?” in Proc. of ISSTA, 2017, pp. 57–67.
- D. Roy, Z. Zhang, M. Ma, V. Arnaoudova, A. Panichella, S. Panichella, D. Gonzalez, and M. Mirakhorli, “Deeptc-enhancer: Improving the readability of automatically generated tests,” in Proc. of ASE, 2020, pp. 287–298.
- S. Wang, N. Shrestha, A. K. Subburaman, J. Wang, M. Wei, and N. Nagappan, “Automatic unit test generation for machine learning libraries: How far are we?” in Proc. of ICSE, 2021, pp. 1548–1560.
- Z. Dong, M. Böhme, L. Cojocaru, and A. Roychoudhury, “Time-travel testing of android apps,” in Proc. of ICSE, 2020, pp. 481–492.
- A. Martin-Lopez, S. Segura, and A. Ruiz-Cortés, “Restest: automated black-box testing of restful web apis,” in Proc. of ISSTA, 2021, pp. 682–685.
- F. U. Haq, D. Shin, L. C. Briand, T. Stifter, and J. Wang, “Automatic test suite generation for key-points detection dnns using many-objective search (experience paper),” in Proc. of ISSTA, 2021, pp. 91–102.
Authors: Yutian Tang, Zhijie Liu, Zhichao Zhou, Xiapu Luo