Automatic Generation of Test Cases based on Bug Reports: a Feasibility Study with Large Language Models (2310.06320v1)
Abstract: Software testing is a core discipline in software engineering where a large array of research results has been produced, notably in the area of automatic test generation. Because existing approaches produce test cases that are either simple (e.g., unit tests) or require precise specifications, most testing procedures still rely on test cases written by humans to form test suites. Such test suites, however, are incomplete: they only cover parts of the project, or they are produced after the bug is fixed. Yet, several research challenges (such as automatic program repair) and practitioner processes build on the assumption that available test suites are sufficient. There is thus a need to break existing barriers in automatic test case generation. While prior work largely focused on generating random unit-test inputs, we propose to generate test cases that realistically represent the complex user execution scenarios that reveal buggy behaviour. Such scenarios are informally described in bug reports, which should therefore be considered natural inputs for specifying bug-triggering test cases. In this work, we investigate the feasibility of performing this generation by leveraging large language models (LLMs) with bug reports as inputs. Our experiments use ChatGPT, as an online service, as well as CodeGPT, a code-related pre-trained LLM fine-tuned for our task. Overall, we experimentally show that bug reports associated with up to 50% of Defects4J bugs can prompt ChatGPT to generate an executable test case. We also show that even newly submitted bug reports can serve as input for generating executable test cases. Finally, we report experimental results confirming that LLM-generated test cases are immediately useful in software engineering tasks such as fault localization and patch validation in automated program repair.
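The pipeline the abstract describes (feed the natural-language bug report to an LLM and ask it for a bug-triggering test) can be sketched as follows. This is a minimal illustration assuming the OpenAI chat API; the prompt wording, model choice, and function names are assumptions for clarity, not the authors' exact setup.

```python
# A minimal sketch of the bug-report-to-test-case idea from the abstract.
# The prompt text, model name, and helper names are illustrative
# assumptions; the paper's exact prompts and models may differ.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_test_from_bug_report(bug_report: str, buggy_class: str) -> str:
    """Ask a chat LLM to turn an informal bug report into a JUnit test."""
    prompt = (
        "The following bug report describes a failure in the Java class "
        f"{buggy_class}:\n\n{bug_report}\n\n"
        "Write a JUnit test method that reproduces this bug, i.e. a test "
        "that fails when run against the buggy version of the class."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # "ChatGPT" in the paper; exact model is an assumption
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Usage: the returned test is then compiled and executed against the buggy
# program; only tests that compile, run, and fail on the buggy version
# count as executable, bug-triggering test cases.
```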
- Laura Plein
- Wendkûuni C. Ouédraogo
- Jacques Klein
- Tegawendé F. Bissyandé