- The paper presents a large-scale evaluation of 11 Java repair tools using a unified framework, highlighting benchmark overfitting.
- It assesses tool performance across five diverse benchmarks, showing significantly better results on Defects4J than on the other four.
- It identifies six key causes of failed patch generation and outlines future directions to enhance fault localization and repair strategies.
Empirical Review of Java Program Repair Tools: A Large-Scale Experiment
The paper "Empirical Review of Java Program Repair Tools" undertakes a comprehensive evaluation of the efficacy of 11 Java test-suite-based repair tools across five different benchmarks of bugs. This large-scale experiment encompasses 2,141 bugs and 23,551 repair attempts, offering a detailed insight into the landscape of current automatic program repair tools.
Overview and Methodology
The research assesses the extent to which repair tools generalize beyond the commonly used Defects4J benchmark. The authors selected 11 repair tools based on inclusion criteria such as availability and compatibility with the benchmarks. The paper is unique in its scope, comparing a broader range of tools and benchmarks than previous evaluations, which have typically focused exclusively on Defects4J. The benchmarks chosen, Bears, Bugs.jar, Defects4J, IntroClassJava, and QuixBugs, together cover a diverse array of Java projects. All eleven tools follow the same test-suite-based repair scheme, sketched below.
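Concretely, test-suite-based repair is a generate-and-validate loop: produce candidate patches for suspicious statements and keep the first one that makes the whole test suite pass. The following is a minimal sketch of that loop; all type and method names are hypothetical and not taken from any specific tool.

```java
import java.util.List;
import java.util.Optional;

// Minimal sketch of the generate-and-validate loop shared by the
// test-suite-based tools in the study. All names are illustrative.
public class GenerateAndValidate {

    interface Patch {}                        // a candidate source-code change

    interface PatchGenerator {                // e.g. mutation- or synthesis-based
        List<Patch> candidates(List<String> suspiciousStatements);
    }

    interface TestSuite {                     // the bug's developer-written tests
        boolean allTestsPass(Patch applied);  // re-runs the suite on the patched program
    }

    /**
     * Returns the first "test-suite adequate" patch: one that makes the
     * failing tests pass without breaking the previously passing ones.
     */
    static Optional<Patch> repair(PatchGenerator generator,
                                  TestSuite suite,
                                  List<String> suspiciousStatements) {
        for (Patch candidate : generator.candidates(suspiciousStatements)) {
            if (suite.allTestsPass(candidate)) {
                return Optional.of(candidate); // adequate, but not necessarily correct
            }
        }
        return Optional.empty(); // search space exhausted without a patch
    }
}
```

Note that a test-suite adequate patch is only as trustworthy as the test suite itself, which is exactly why cross-benchmark evaluation matters.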
To evaluate the tools uniformly, the authors developed RepairThemAll, a framework that standardizes the execution of repair tools across benchmarks. The framework provides a consistent environment for repair attempts, accounting for issues such as differences in fault localization and execution settings; the interface sketch after this paragraph illustrates the idea. The experiment was run at large scale on grid computing resources to handle the extensive computational requirements.
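The uniformity layer such a framework needs can be pictured as a small per-benchmark contract, so that one driver can hand any bug to any tool. The sketch below is a hypothetical illustration of that idea; the names are assumptions, not RepairThemAll's actual API.

```java
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch of the uniformity layer a framework like
// RepairThemAll provides: every benchmark answers the same questions,
// so a single driver can run any tool on any bug. Illustrative names only.
interface Benchmark {
    Path checkout(String bugId);              // materialize the buggy revision
    void compile(Path workingCopy);           // build with the benchmark's own toolchain
    String classpath(Path workingCopy);       // dependencies the repair tool must see
    List<String> failingTests(String bugId);  // the tests exposing the bug
}

interface RepairTool {
    // Launching every tool from the same normalized inputs is what makes
    // the 23,551 repair attempts comparable across the five benchmarks.
    void attemptRepair(Path sources, String classpath, List<String> failingTests);
}
```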
Findings
The results reveal that the repair tools vary widely in their ability to generate test-suite adequate patches. Some tools, such as Nopol, DynaMoth, and ARJA, are notably prolific, generating patches for roughly 213, 206, and 146 bugs, respectively. In contrast, NPEFix patches far fewer bugs, which is expected given its deliberately narrow scope: it targets a single bug class, null pointer exceptions (illustrated below).
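NPEFix's narrow scope is easiest to see concretely: it applies runtime strategies such as skipping the faulty statement or substituting a default value when a null dereference occurs. The before/after below is a hypothetical illustration of one such strategy, not actual NPEFix output.

```java
// Hypothetical before/after mimicking one NPEFix runtime strategy
// (substituting a type-compatible default value); illustrative only.
class OrderPrinter {

    interface Order { String customerName(); }

    // Before: throws NullPointerException whenever order is null.
    String labelBuggy(Order order) {
        return order.customerName().toUpperCase();
    }

    // After: a patch in the style NPEFix explores, returning a
    // default value instead of dereferencing null.
    String labelPatched(Order order) {
        if (order == null) {
            return "";
        }
        return order.customerName().toUpperCase();
    }
}
```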
The findings also underscore a phenomenon termed "benchmark overfitting": the repair tools perform markedly better on Defects4J than on the other benchmarks. Statistical analysis confirms that repairability is significantly higher for Defects4J bugs, suggesting that prior evaluations may have overestimated the tools' generality. Three potential explanations are discussed: tools being tuned specifically for Defects4J, the careful isolation of bug-fix changes in Defects4J, and differences in the distribution of bug types across benchmarks.
Moreover, six major causes of failed patch generation were identified, including incorrect fault localization (see the sketch below) and technical limitations such as classpath issues. These insights provide guidance for developing more robust repair tools and highlight areas in need of refinement.
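Incorrect fault localization deserves a concrete note: these tools typically rank suspicious statements with a spectrum-based formula such as Ochiai before attempting any patch, so if the truly faulty statement never rises in the ranking, the tool searches the wrong part of the program. Below is a minimal sketch of the Ochiai score; the class name and example values are illustrative.

```java
public class Ochiai {

    /**
     * Ochiai suspiciousness of one statement.
     * ef = failing tests that execute it, ep = passing tests that execute it,
     * totalFailing = all failing tests in the suite (ef + nf).
     */
    static double suspiciousness(int ef, int ep, int totalFailing) {
        if (ef == 0) return 0.0; // never covered by a failing test
        return ef / Math.sqrt((double) totalFailing * (ef + ep));
    }

    public static void main(String[] args) {
        // A statement covered by all 3 failing tests and 1 passing test
        // outranks one covered by 1 failing and 10 passing tests.
        System.out.println(suspiciousness(3, 1, 3));  // ~0.866
        System.out.println(suspiciousness(1, 10, 3)); // ~0.174
    }
}
```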
Implications and Future Directions
The implications of this paper are multifaceted, impacting both practical applications and theoretical advancements in automatic program repair. On the practical side, the recognition of benchmark overfitting challenges researchers to consider more diverse benchmarks in future evaluations of repair tools. For theoretical development, the insights gained into causes of non-patch generation may facilitate innovations targeting more sophisticated fault localization techniques and adaptable repair strategies capable of dealing with multi-location bugs and complex test environments.
Future research might test the hypotheses about why Defects4J is more easily patched by existing tools, for instance by examining the role of bug type distribution or the influence of isolated bug-fix changes on patch generation success. Additionally, extending the framework with new repair tools and an even broader array of benchmarks would further strengthen the external validity of repair tool evaluations.
In conclusion, this paper advances the field of automatic program repair by providing a detailed empirical review of the state of current Java repair tools, offering valuable insights into their limitations and possibilities for future improvement. The introduction of benchmark diversification sets a new standard for comprehensive tool evaluation, guiding future studies towards more holistic assessments of repair tool efficacy.