ASTER: Natural and Multi-language Unit Test Generation with LLMs (2409.03093v3)

Published 4 Sep 2024 in cs.SE

Abstract: Implementing automated unit tests is an important but time-consuming activity in software development. To assist developers in this task, many techniques for automating unit test generation have been developed. However, despite this effort, usable tools exist for very few programming languages. Moreover, studies have found that automatically generated tests suffer poor readability and do not resemble developer-written tests. In this work, we present a rigorous investigation of how LLMs can help bridge the gap. We describe a generic pipeline that incorporates static analysis to guide LLMs in generating compilable and high-coverage test cases. We illustrate how the pipeline can be applied to different programming languages, specifically Java and Python, and to complex software requiring environment mocking. We conducted an empirical study to assess the quality of the generated tests in terms of code coverage and test naturalness -- evaluating them on standard as well as enterprise Java applications and a large Python benchmark. Our results demonstrate that LLM-based test generation, when guided by static analysis, can be competitive with, and even outperform, state-of-the-art test-generation techniques in coverage achieved while also producing considerably more natural test cases that developers find easy to understand. We also present the results of a user study, conducted with 161 professional developers, that highlights the naturalness characteristics of the tests generated by our approach.
