Automatic Repair of Real Bugs in Java: A Large-Scale Experiment on the Defects4J Dataset (1811.02429v1)

Published 4 Nov 2018 in cs.SE

Abstract: Defects4J is a large, peer-reviewed, structured dataset of real-world Java bugs. Each bug in Defects4J comes with a test suite and at least one failing test case that triggers the bug. In this paper, we report on an experiment to explore the effectiveness of automatic test-suite based repair on Defects4J. The result of our experiment shows that the considered state-of-the-art repair methods can generate patches for 47 out of 224 bugs. However, those patches are only test-suite adequate, which means that they pass the test suite and may potentially be incorrect beyond the test-suite satisfaction correctness criterion. We have manually analyzed 84 different patches to assess their real correctness. In total, 9 real Java bugs can be correctly repaired with test-suite based repair. This analysis shows that test-suite based repair suffers from under-specified bugs, for which trivial or incorrect patches still pass the test suite. With respect to practical applicability, it takes on average 14.8 minutes to find a patch. The experiment was done on a scientific grid, totaling 17.6 days of computation time. All the repair systems and experimental results are publicly available on Github in order to facilitate future research on automatic repair.

Authors (5)

Matias Martinez
Thomas Durieux
Romain Sommerard
Jifeng Xuan
Martin Monperrus

Summary

Overview of Automatic Repair of Real Bugs in Java: An Experiment with Defects4J

This paper reports an empirical study of how well test-suite based automated repair works on real-world Java bugs from the Defects4J dataset. It asks how many real, non-trivial bugs, drawn from four Java projects in the dataset, current repair tools can patch, and how many of those patches are actually correct. The results inform both practical software engineering and the theoretical understanding of automated repair.

Key Findings and Methodology

The researchers ran three state-of-the-art automated repair systems (jGenProg, jKali, and Nopol) on 224 real-world Java bugs from Defects4J, spread over 231K lines of code. Each bug comes with a test suite that contains at least one failing test case triggering the bug. The experiment addressed four research questions (RQs):

Synthesize patches for bugs:
- Across all systems, patches were successfully generated for 47 out of the 224 bugs (21%).
- Nopol showed the highest individual success, generating patches for 35 bugs.
Patch correctness:
- A manual analysis of 84 generated patches found that only 11 (about 13%) were correct. In total, 9 bugs were correctly repaired. This confirms a strong overfitting tendency: a patch can satisfy the provided test cases without fixing the bug beyond them.
Under-specified bugs identification:
- The paper identifies about 21 bugs as under-specified by their test suites, meaning that trivial or incorrect patches still pass because the tests do not constrain the expected behavior enough.
- These bugs are a key challenge for future repair techniques that aim to go beyond the limits of current ones.
Execution time measurement:
- Finding a patch took 14.8 minutes on average. The whole experiment ran on a scientific grid and used 17.6 days of computation time in total. The per-patch time suggests that the approach is fast enough for practical use.
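
The percentages above follow directly from the reported counts. As a quick check, the Java sketch below recomputes them. The class and method names are illustrative only; the counts are the ones quoted in the abstract and in this summary.

```java
import java.util.Locale;

// Illustrative helper, not part of the paper's tooling.
final class ReportedRates {

    // Formats part/whole as a percentage with one decimal place.
    static String percent(int part, int whole) {
        return String.format(Locale.ROOT, "%.1f%%", 100.0 * part / whole);
    }

    public static void main(String[] args) {
        // 47 of 224 bugs received at least one test-suite adequate patch.
        System.out.println("Bugs patched:            " + percent(47, 224)); // 21.0%
        // 11 of the 84 manually analyzed patches were judged correct.
        System.out.println("Patches judged correct:  " + percent(11, 84));  // 13.1%
        // 9 of 224 bugs were correctly repaired.
        System.out.println("Bugs correctly repaired: " + percent(9, 224));  // 4.0%
    }
}
```

The last figure is the one to keep in mind: about a fifth of the bugs get some patch, but only about 4% of the bugs get a correct one.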

Implications for Future Research

The study shows that current automatic repair systems have potential but are held back by weak test suites, which let many incorrect patches pass as fixes. This has several implications for future work on automatic repair:

Enhancing Test Suites:

Effective repair largely depends on test suite quality. Strengthening test suites with more comprehensive and rigorous test cases can directly impact the correctness and reliability of generated patches.
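
To make the link between test suite quality and patch correctness concrete, here is a minimal Java sketch. It is a toy example, not a bug from Defects4J: the absolute-value function, the patches, and all names are hypothetical, and the patches are written by hand rather than produced by jGenProg, jKali, or Nopol. It shows a patch that is test-suite adequate on a weak suite but is rejected once a single extra test is added.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical toy example; nothing here comes from the paper or Defects4J.
final class TestSuiteAdequacyDemo {

    // Buggy absolute value: negative inputs are returned unchanged.
    static final IntUnaryOperator BUGGY = x -> x;

    // Overfitting patch: special-cases the only failing test input.
    static final IntUnaryOperator OVERFITTING = x -> (x == -3) ? 3 : x;

    // Correct patch: negates every negative input.
    static final IntUnaryOperator CORRECT = x -> (x < 0) ? -x : x;

    // Each row is {input, expectedOutput}.
    static final int[][] WEAK_SUITE = { {5, 5}, {-3, 3} };
    static final int[][] STRONGER_SUITE = { {5, 5}, {-3, 3}, {-7, 7} };

    // A candidate is test-suite adequate if it passes every test in the suite.
    static boolean passesAll(IntUnaryOperator candidate, int[][] suite) {
        for (int[] test : suite) {
            if (candidate.applyAsInt(test[0]) != test[1]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(passesAll(BUGGY, WEAK_SUITE));           // false
        System.out.println(passesAll(OVERFITTING, WEAK_SUITE));     // true
        System.out.println(passesAll(OVERFITTING, STRONGER_SUITE)); // false
        System.out.println(passesAll(CORRECT, STRONGER_SUITE));     // true
    }
}
```

With the weak suite, a repair tool has no way to tell the overfitting patch from the correct one, since both pass. One additional negative input is enough to separate them here, which is the sense in which a bug can be under-specified by its tests.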

Reducing Overfitting:

Repair algorithms need to synthesize patches that generalize beyond the provided test cases. Reducing the dependence on specific test inputs would in turn reduce overfitting.

Broader Reasoning Capabilities:

Future work should focus on empowering repair algorithms with reasoning capabilities that interpret and infer desired program behavior even when it's not explicitly specified, potentially through integrating additional sources of information or heuristics.

Exploring and Comparing Efficiency:

This paper does not cover all repair techniques, so extending the comparison to repair systems not evaluated here could yield additional insights. Benchmarks and models that rank and prioritize candidate patches would also help focus research effort and computing resources.

Conclusion

This paper documents the current capabilities and limits of automated repair systems through a large experiment on real-world bugs. It underscores how much test-suite based repair depends on the quality of the test suite and points to several directions for improving automated repair. As these areas progress, automation could reduce the human effort spent on patching software during maintenance and evolution.
