Code-Aware Prompting: A study of Coverage Guided Test Generation in Regression Setting using LLM (2402.00097v2)

Published 31 Jan 2024 in cs.SE and cs.LG

Abstract: Testing plays a pivotal role in ensuring software quality, yet conventional Search Based Software Testing (SBST) methods often struggle with complex software units, achieving suboptimal test coverage. Recent works using LLMs for test generation have focused on improving generation quality through optimizing the test generation context and correcting errors in model outputs, but use fixed prompting strategies that prompt the model to generate tests without additional guidance. As a result, LLM-generated test suites still suffer from low coverage. In this paper, we present SymPrompt, a code-aware prompting strategy for LLMs in test generation. SymPrompt's approach is based on recent work demonstrating that LLMs can solve more complex logical problems when prompted to reason about the problem in a multi-step fashion. We apply this methodology to test generation by deconstructing the test suite generation process into a multi-stage sequence, each stage driven by a specific prompt aligned with the execution paths of the method under test and exposing relevant type and dependency focal context to the model. Our approach enables pretrained LLMs to generate more complete test cases without any additional training. We implement SymPrompt using the TreeSitter parsing framework and evaluate it on a benchmark of challenging methods from open source Python projects. SymPrompt enhances correct test generations by a factor of 5 and bolsters relative coverage by 26% for CodeGen2. Notably, when applied to GPT-4, SymPrompt improves coverage by over 2x compared to baseline prompting strategies.
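
The path-aligned, multi-stage prompting the abstract describes can be sketched in a few lines. The sketch below is an illustration only, not the paper's implementation: SymPrompt builds path constraints with the TreeSitter parsing framework, whereas this version swaps in Python's standard-library `ast` module to stay dependency-free, and the helper names (`collect_branches`, `build_path_prompts`) and the prompt wording are hypothetical. It enumerates truth assignments over a focal method's branch conditions and emits one test-generation prompt per approximated execution path.

```python
# Minimal sketch of path-constraint prompting in the spirit of SymPrompt.
# Assumption: the paper uses TreeSitter to derive path constraints; here we
# substitute Python's built-in `ast` module (Python 3.9+ for ast.unparse).
import ast
import itertools


def collect_branches(func_src: str) -> list[str]:
    """Return the source text of each `if` condition in the focal method."""
    tree = ast.parse(func_src)
    return [ast.unparse(node.test)
            for node in ast.walk(tree)
            if isinstance(node, ast.If)]


def build_path_prompts(func_src: str) -> list[str]:
    """Build one prompt per truth assignment over the branch conditions,
    approximating the execution paths of the method under test."""
    conds = collect_branches(func_src)
    prompts = []
    # NOTE: a raw Cartesian product over branch outcomes over-approximates
    # the feasible paths; a faithful implementation would enumerate paths
    # over the method's control-flow graph instead.
    for outcomes in itertools.product([True, False], repeat=len(conds)):
        constraints = ", ".join(
            f"`{c}` is {o}" for c, o in zip(conds, outcomes))
        prompts.append(
            "Write a pytest unit test for the following method that "
            f"drives execution down the path where {constraints}.\n\n"
            f"{func_src}")
    return prompts


if __name__ == "__main__":
    focal = (
        "def clamp(x, lo, hi):\n"
        "    if x < lo:\n"
        "        return lo\n"
        "    if x > hi:\n"
        "        return hi\n"
        "    return x\n")
    for prompt in build_path_prompts(focal):
        print(prompt, end="\n---\n")
```

For `clamp`, this yields four prompts, one per combination of branch outcomes; the combination where `x < lo` and `x > hi` both hold is infeasible, which is exactly the kind of case that path enumeration over the control flow, rather than a product of branch outcomes, would avoid.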

References (43)
  1. [n. d.]. Am I in the stack? https://huggingface.co/spaces/bigcode/in-the-stack.
  2. A Transformer-based Approach for Source Code Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 4998–5007. https://doi.org/10.18653/v1/2020.acl-main.449
  3. A3Test: Assertion-Augmented Automated Test Case Generation. arXiv preprint arXiv:2302.10352 (2023).
  4. Suggesting accurate method and class names. In Proceedings of the 2015 10th joint meeting on foundations of software engineering. 38–49.
  5. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021).
  6. A survey of symbolic execution techniques. ACM Computing Surveys (CSUR) 51, 3 (2018), 1–39.
  7. Code generation tools (almost) for free? a study of few-shot, pre-trained language models on code. arXiv preprint arXiv:2206.01335 (2022).
  8. Korat: Automated testing based on Java predicates. ACM SIGSOFT Software Engineering Notes 27, 4 (2002), 123–133.
  9. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
  10. Ermira Daka and Gordon Fraser. 2014. A survey on unit testing practices and problems. In 2014 IEEE 25th International Symposium on Software Reliability Engineering. IEEE, 201–211.
  11. Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing. arXiv preprint arXiv:2308.16557 (2023).
  12. Toga: A neural method for test oracle generation. In Proceedings of the 44th International Conference on Software Engineering. 2130–2141.
  13. Gordon Fraser and Andrea Arcuri. 2011. Evosuite: automatic test suite generation for object-oriented software. In Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering. 416–419.
  14. Gordon Fraser and Andrea Arcuri. 2013. Evosuite: On the challenges of test case generation in the real world. In 2013 IEEE sixth international conference on software testing, verification and validation. IEEE, 362–369.
  15. Gordon Fraser and Andrea Arcuri. 2014. A large-scale evaluation of automated unit test generation using evosuite. ACM Transactions on Software Engineering and Methodology (TOSEM) 24, 2 (2014), 1–42.
  16. Gordon Fraser and Andrea Arcuri. 2015. 1600 faults in 100 projects: automatically finding faults while achieving high coverage with evosuite. Empirical software engineering 20 (2015), 611–639.
  17. Patrice Godefroid. 2007. Compositional dynamic test generation. In Proceedings of the 34th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages. 47–54.
  18. DART: Directed automated random testing. In Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation. 213–223.
  19. Automated Test Case Generation Using Code Models and Domain Adaptation. arXiv preprint arXiv:2308.08033 (2023).
  20. James C King. 1976. Symbolic execution and program testing. Commun. ACM 19, 7 (1976), 385–394.
  21. The Stack: 3 TB of permissively licensed source code. arXiv preprint arXiv:2211.15533 (2022).
  22. CODAMOSA: Escaping coverage plateaus in test generation with pre-trained large language models. In International conference on software engineering (ICSE).
  23. Finding Failure-Inducing Test Cases with ChatGPT. arXiv preprint arXiv:2304.11686 (2023).
  24. Stephan Lukasczyk and Gordon Fraser. 2022. Pynguin: Automated unit test generation for Python. In Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings. 168–172.
  25. Automated unit test generation for Python. In Search-Based Software Engineering: 12th International Symposium, SSBSE 2020, Bari, Italy, October 7–8, 2020, Proceedings 12. Springer, 9–24.
  26. Thomas J. McCabe. 1976. A Complexity Measure. IEEE Transactions on Software Engineering SE-2 (1976), 308–320. https://api.semanticscholar.org/CorpusID:9116234
  27. CodeGen2: Lessons for training LLMs on programming and natural languages. arXiv preprint arXiv:2305.02309 (2023).
  28. OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
  29. Carlos Pacheco and Michael D Ernst. 2007. Randoop: feedback-directed random testing for Java. In Companion to the 22nd ACM SIGPLAN conference on Object-oriented programming systems and applications companion. 815–816.
  30. Feedback-directed random test generation. In 29th International Conference on Software Engineering (ICSE’07). IEEE, 75–84.
  31. Strategic Planning. 2002. The economic impacts of inadequate infrastructure for software testing. National Institute of Standards and Technology 1 (2002).
  32. Adaptive test generation using a large language model. arXiv preprint arXiv:2302.06527 (2023).
  33. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering (2023).
  34. CUTE: A concolic unit testing engine for C. ACM SIGSOFT Software Engineering Notes 30, 5 (2005), 263–272.
  35. Nikolai Tillmann and Jonathan De Halleux. 2008. Pex–white box test generation for .NET. In International conference on tests and proofs. Springer, 134–153.
  36. Unit test case generation with transformers and focal context. arXiv preprint arXiv:2009.05617 (2020).
  37. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022).
  38. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.
  39. BugsInPy: A Database of Existing Bugs in Python Programs to Enable Controlled Testing and Debugging Studies. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Virtual Event, USA) (ESEC/FSE 2020). Association for Computing Machinery, New York, NY, USA, 1556–1560. https://doi.org/10.1145/3368089.3417943
  40. ChatUniTest: a ChatGPT-based automated unit test generation tool. arXiv preprint arXiv:2305.04764 (2023).
  41. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023).
  42. Shin Yoo and Mark Harman. 2012. Regression testing minimization, selection and prioritization: a survey. Software testing, verification and reliability 22, 2 (2012), 67–120.
  43. No More Manual Tests? Evaluating and Improving ChatGPT for Unit Test Generation. arXiv preprint arXiv:2305.04207 (2023).
Authors (7)
  1. Gabriel Ryan (6 papers)
  2. Siddhartha Jain (21 papers)
  3. Mingyue Shang (13 papers)
  4. Shiqi Wang (162 papers)
  5. Xiaofei Ma (31 papers)
  6. Murali Krishna Ramanathan (13 papers)
  7. Baishakhi Ray (88 papers)
Citations (23)