
LLM-Powered Test Case Generation for Detecting Tricky Bugs (2404.10304v1)

Published 16 Apr 2024 in cs.SE and cs.LG

Abstract: Conventional automated test generation tools struggle to generate test oracles and tricky bug-revealing test inputs. LLMs can be prompted to produce test inputs and oracles for a program directly, but the precision of the tests can be very low for complex scenarios (only 6.3% based on our experiments). To fill this gap, this paper proposes AID, which combines LLMs with differential testing to generate fault-revealing test inputs and oracles targeting plausibly correct programs (i.e., programs that have passed all the existing tests). In particular, AID selects test inputs that yield diverse outputs on a set of program variants generated by LLMs, then constructs the test oracle based on the outputs. We evaluate AID on two large-scale datasets with tricky bugs: TrickyBugs and EvalPlus, and compare it with three state-of-the-art baselines. The evaluation results show that the recall, precision, and F1 score of AID outperform the state-of-the-art by up to 1.80x, 2.65x, and 1.66x, respectively.
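At a high level, the abstract describes a differential-testing loop: prompt an LLM for program variants and candidate test inputs, keep the inputs on which the variants disagree, and derive the oracle from the observed outputs. The sketch below only illustrates that idea; the llm_generate_variants and llm_generate_inputs helpers are hypothetical placeholders for the LLM prompting steps, and the majority-vote oracle is one plausible reading of "constructs the test oracle based on the outputs," not necessarily AID's exact procedure.

```python
from collections import Counter

def run(program, test_input):
    """Execute one program variant on one input and capture its output.
    Each 'program' is assumed here to be a plain Python callable."""
    try:
        return program(test_input)
    except Exception as exc:  # crashes also count as distinguishable outputs
        return f"<exception: {type(exc).__name__}>"

def select_tests(variants, candidate_inputs, min_distinct=2):
    """Keep inputs on which the LLM-generated variants disagree and pair each
    with a majority-vote oracle built from the variants' outputs."""
    selected = []
    for test_input in candidate_inputs:
        outputs = [run(v, test_input) for v in variants]
        counts = Counter(repr(o) for o in outputs)
        if len(counts) >= min_distinct:           # diverse outputs -> likely fault-revealing
            oracle, _ = counts.most_common(1)[0]  # majority output as the expected value
            selected.append((test_input, oracle))
    return selected

# Usage sketch; llm_generate_variants and llm_generate_inputs are hypothetical
# placeholders for the LLM prompting steps:
# variants = llm_generate_variants(plausibly_correct_program, n=5)
# inputs = llm_generate_inputs(plausibly_correct_program, n=50)
# tests = select_tests(variants, inputs)
```

In this reading, inputs on which all variants agree are discarded, since they are unlikely to expose a fault in a program that already passes the existing tests.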

References (39)
  1. [n. d.]. EvalPlus Pre-Generated LLM Code Samples. https://github.com/evalplus/evalplus/releases/tag/v0.1.0
  2. [n. d.]. TrickyBugs. https://github.com/RinCloud/TrickyBugs
  3. An orchestrated survey of methodologies for automated software test case generation. Journal of Systems and Software 86, 8 (2013), 1978–2001.
  4. The oracle problem in software testing: A survey. IEEE Transactions on Software Engineering 41, 5 (2014), 507–525.
  5. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
  6. TOGA: A neural method for test oracle generation. In Proceedings of the 44th International Conference on Software Engineering. 2130–2141.
  7. Jon Edvardsson. 1999. A survey on automatic test data generation. In Proceedings of the 2nd Conference on Computer Science and Engineering. 21–28.
  8. Robert B Evans and Alberto Savoia. 2007. Differential testing: a new approach to change detection. In The 6th Joint Meeting on European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering: Companion Papers. 549–552.
  9. Large Language Models for Software Engineering: Survey and Open Problems. arXiv:2310.03533 [cs.SE]
  10. Gordon Fraser and Andrea Arcuri. 2011. EvoSuite: automatic test suite generation for object-oriented software. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering. 416–419.
  11. PAL: Program-aided language models. In International Conference on Machine Learning. PMLR, 10764–10799.
  12. Automatic generation of oracles for exceptional behaviors. In Proceedings of the 25th international symposium on software testing and analysis. 213–224.
  13. Investigating and Detecting Silent Bugs in PyTorch Programs. ([n. d.]).
  14. An empirical study on fine-tuning large language models of code for automated program repair. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 1162–1174.
  15. Nuances are the Key: Unlocking ChatGPT to Find Failure-Inducing Tests with Differential Prompting. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE Computer Society, 14–26.
  16. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=1qvx610Cu7
  17. TrickyBugs: A Dataset of Corner-case Bugs in Plausible Programs. In Proceedings of the 21st International Conference on Mining Software Repositories (MSR 2024). https://doi.org/10.1145/3643991.3644870
  18. Who Judges the Judge: An Empirical Study on Online Judge Tests. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2023). Association for Computing Machinery, New York, NY, USA, 334–346. https://doi.org/10.1145/3597926.3598060
  19. Towards More Realistic Evaluation for Neural Test Oracle Generation. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2023). Association for Computing Machinery, New York, NY, USA, 589–600. https://doi.org/10.1145/3597926.3598080
  20. Stephan Lukasczyk and Gordon Fraser. 2022. Pynguin: Automated unit test generation for Python. In Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings. 168–172.
  21. Phil McMinn. 2004. Search-based software test data generation: a survey. Software Testing, Verification and Reliability 14, 2 (2004), 105–156.
  22. Phil McMinn. 2011. Search-based software testing: Past, present and future. In 2011 IEEE Fourth International Conference on Software Testing, Verification and Validation Workshops. IEEE, 153–163.
  23. What do we know about defect detection methods? [Software Testing]. IEEE Software 23, 3 (2006), 82–90.
  24. Code-Aware Prompting: A study of Coverage Guided Test Generation in Regression Setting using LLM. arXiv preprint arXiv:2402.00097 (2024).
  25. An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation. IEEE Transactions on Software Engineering 50, 1 (2024), 85–105. https://doi.org/10.1109/TSE.2023.3334955
  26. Silent bugs in deep learning frameworks: An empirical study of Keras and TensorFlow. Empirical Software Engineering 29, 1 (2024), 10.
  27. @tComment: Testing Javadoc comments to detect comment-code inconsistencies. In 2012 IEEE Fifth International Conference on Software Testing, Verification and Validation. IEEE, 260–269.
  28. Unit test case generation with transformers and focal context. arXiv preprint arXiv:2009.05617 (2020).
  29. Generating accurate assert statements for unit test cases using pretrained transformers. In Proceedings of the 3rd ACM/IEEE International Conference on Automation of Software Test. 54–64.
  30. Large language models still can’t plan (a benchmark for LLMs on planning and reasoning about change). arXiv preprint arXiv:2206.10498 (2022).
  31. Software testing with large language model: Survey, landscape, and vision. IEEE Transactions on Software Engineering (2024).
  32. On learning meaningful assert statements for unit test cases. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 1398–1409.
  33. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824–24837.
  34. Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1482–1494.
  35. ChatUniTest: a ChatGPT-based automated unit test generation tool. arXiv preprint arXiv:2305.04764 (2023).
  36. Automated conformance testing for JavaScript engines via deep compiler fuzzing. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation. 435–450.
  37. arXiv:2305.04207 [cs.SE]
  38. Michal Zalewski. 2015. American Fuzzy Lop (AFL). lcamtuf.coredump.cx/afl/
  39. C2S: translating natural language comments to formal program specifications. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 25–37.
Authors (8)
  1. Kaibo Liu (17 papers)
  2. Yiyang Liu (12 papers)
  3. Zhenpeng Chen (39 papers)
  4. Jie M. Zhang (39 papers)
  5. Yudong Han (8 papers)
  6. Yun Ma (38 papers)
  7. Ge Li (213 papers)
  8. Gang Huang (86 papers)
Citations (14)

