ASTER: Natural and Multi-language Unit Test Generation with LLMs (2409.03093v3)
Abstract: Implementing automated unit tests is an important but time-consuming activity in software development. To assist developers in this task, many techniques for automating unit test generation have been developed. However, despite this effort, usable tools exist for very few programming languages. Moreover, studies have found that automatically generated tests suffer from poor readability and do not resemble developer-written tests. In this work, we present a rigorous investigation of how LLMs can help bridge the gap. We describe a generic pipeline that incorporates static analysis to guide LLMs in generating compilable and high-coverage test cases. We illustrate how the pipeline can be applied to different programming languages, specifically Java and Python, and to complex software requiring environment mocking. We conducted an empirical study to assess the quality of the generated tests in terms of code coverage and test naturalness -- evaluating them on standard as well as enterprise Java applications and a large Python benchmark. Our results demonstrate that LLM-based test generation, when guided by static analysis, can be competitive with, and even outperform, state-of-the-art test-generation techniques in coverage achieved while also producing considerably more natural test cases that developers find easy to understand. We also present the results of a user study, conducted with 161 professional developers, that highlights the naturalness characteristics of the tests generated by our approach.