Evaluating LLMs in Unit Testing: Enhancing Defect Detection and Efficiency
The paper, "Unit Testing Past vs. Present: Examining LLMs' Impact on Defect Detection and Efficiency," presents a detailed exploration of the incorporation of LLMs into software testing practices, with a notable focus on unit testing activities. Authored by Rudolf Ramler, Philipp Straubinger, Reinhold Plösch, and Dietmar Winkler, the paper examines the capacity of LLMs, such as ChatGPT and GitHub Copilot, to augment defect detection efficacy and the overall testing efficiency in software engineering workflows.
Research Objectives and Methodology
The primary goal of the paper is to investigate whether LLM support improves defect detection during unit testing. Building on prior studies that contrasted manual and tool-supported testing, the authors replicate and extend an earlier experiment in which participants wrote unit tests for a Java-based system containing intentionally seeded defects within a fixed time box. The replication broadens the scope by allowing participants to use LLMs when generating test cases. Participants are master's-level students with theoretical and practical training in unit testing with JUnit, providing a contemporary counterpart to the earlier study, which involved both students and professional developers.
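To make the setup concrete, the sketch below shows the kind of JUnit 5 test a participant (or an LLM prompted with the class under test) might produce against a class with a seeded defect. The class `PriceCalculator` and its defect are hypothetical illustrations, not the actual study object described in the paper.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class PriceCalculatorTest {

    // Hypothetical class under test with a seeded defect:
    // the discount is subtracted twice instead of once.
    static class PriceCalculator {
        double applyDiscount(double price, double rate) {
            double discounted = price - price * rate;
            return discounted - price * rate; // seeded defect
        }
    }

    @Test
    void applyDiscountReducesPriceByGivenRate() {
        PriceCalculator calculator = new PriceCalculator();

        // 10% off 200.00 should be 180.00; the seeded defect makes
        // this assertion fail, i.e. the defect is detected.
        assertEquals(180.00, calculator.applyDiscount(200.00, 0.10), 0.001);
    }
}
```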
Key Findings
The results indicate a significant boost in unit test creation and defect detection when participants use LLMs:
- Volume of Tests: Participants using LLMs produced an average of 59.3 unit tests each, more than double the average produced by the manual-testing control group.
- Coverage: LLM-supported testing achieved 74% branch coverage. The number of tests created correlates strongly with the branch coverage achieved (r = 0.78), suggesting that the ability of LLMs to generate more tests translates directly into higher coverage.
- Defect Detection: The LLM-supported group detected an average of 6.5 defects per participant, compared with 3.7 for the manual testing group, a substantial improvement in defect detection.
- False Positives: The larger volume of tests also produced more false positives (an average of 5.1 per participant). The authors attribute these to the sheer number of tests rather than to the use of LLMs itself; see the illustrative sketch after this list.
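To illustrate what a false positive can look like in this setting, the hypothetical test below encodes a wrong expectation: it fails against a correct implementation, so the failure points to a flaw in the test rather than a defect in the code. The class and values are assumptions for illustration, not taken from the study.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class ShippingCostTest {

    // Hypothetical, correct implementation: a flat fee of 4.99
    // is waived for orders of 50.00 or more.
    static class ShippingCost {
        double feeFor(double orderTotal) {
            return orderTotal >= 50.00 ? 0.00 : 4.99;
        }
    }

    // False positive: the test assumes the fee is waived only
    // strictly above 50.00, so it fails even though the
    // implementation matches its specification.
    @Test
    void chargesFeeForOrderOfExactlyFifty() {
        assertEquals(4.99, new ShippingCost().feeFor(50.00), 0.001);
    }
}
```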
Implications and Future Directions
The empirical results strongly suggest that LLMs can become an integral part of modern software testing by making test generation and defect detection more efficient. These gains could substantially modernize unit testing practice, arguably the most significant development in this area in the past decade. Nevertheless, the paper notes open challenges, in particular balancing the productivity gained through automation against the effort required to weed out false positives.
In terms of practical implications, the paper underscores the need to develop best practices for integrating LLMs into testing workflows. As future work, the authors propose scaling up the study by extending the replication to multiple sites with larger participant samples, providing broader validation of the benefits and challenges LLMs bring to software testing. A deeper analysis of specific LLM tools and their utility across varied testing environments could further delineate their role in advancing software quality assurance.
Overall, the paper makes a significant contribution to the ongoing discourse on the role of artificial intelligence in software engineering, particularly in the unit testing domain, by presenting empirical evidence on both the benefits and the limitations of applying LLMs to defect detection and testing efficiency.