Insights from "A Comprehensive Study of Automatic Program Repair on the QuixBugs Benchmark"
The paper "A Comprehensive Study of Automatic Program Repair on the QuixBugs Benchmark" presents a thorough evaluation of program repair tools on QuixBugs, a suite of bugs that had not previously been studied in depth. The choice of QuixBugs is motivated by the risk of overfitting to the field's commonly used benchmarks, which can undermine the external validity of research findings on automatic program repair.
Key Findings and Methodology
The cornerstone of the paper is an empirical study that applies ten state-of-the-art program repair tools to QuixBugs, a collection of forty buggy programs, each implemented in both Python and Java. The investigation is structured around several research questions covering the characteristics of QuixBugs, the repairability of its bugs, and the prevalence of overfitting patches.
- Benchmark Characteristics: QuixBugs comprises a diverse array of bug types within small programs. The paper thoroughly categorizes characteristics such as the nature of each bug (e.g., incorrect logical operators, incorrect comparison operators) and its failure symptoms (e.g., stack overflow, incorrect output).
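To make the "incorrect comparison operator" bug class concrete, here is a hypothetical Python sketch modeled on the single-line granularity of QuixBugs defects (the function and values are illustrative, not taken from the benchmark):

```python
# Hypothetical single-line defect in the style of QuixBugs bugs.
def find_first_at_least(items, threshold):
    """Return the index of the first item >= threshold, or -1 if none exists."""
    for i, x in enumerate(items):
        if x > threshold:  # BUG: incorrect comparison operator, should be `>=`
            return i
    return -1

# The fixed version differs in exactly one operator, the typical QuixBugs granularity.
def find_first_at_least_fixed(items, threshold):
    for i, x in enumerate(items):
        if x >= threshold:  # FIX: `>=` includes items equal to the threshold
            return i
    return -1

find_first_at_least([1, 5, 5, 9], 5)        # → 3 (skips the 5s, wrong)
find_first_at_least_fixed([1, 5, 5, 9], 5)  # → 1 (correct)
```

Defects of this shape are attractive for benchmarking because the ground-truth fix is unambiguous and minimal, so a generated patch can be compared against it directly.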
- Repair Effectiveness: A notable result is that 16 of the 40 buggy programs were repaired by at least one tool, confirming the versatility of these tools across multiple bug types. In addition, because some QuixBugs programs have no passing tests, the benchmark enables assessing fix generation without relying on passing tests; four such programs were repaired by JGenProg, Arja, and RSRepair.
- Patch Correctness and Overfitting: To assess patch correctness, the paper combines three techniques: manual assessment, engineered test generation, and automated correctness checking. It finds that a significant proportion of the generated patches (53.3%) are overfitting, i.e., they pass the provided test suite but fail to generalize beyond it. This focus on overfitting is crucial, as it underlines the limitations of relying solely on test suites as oracles for program correctness.
- Automated Patch Correctness Assessments: Three automated assessment techniques were analyzed for their efficacy in discerning overfitting patches. RGTEvosuite emerged as the most accurate, at 98.2%, highlighting its potential for robust patch validation.
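The intuition behind regression-test-based assessment can be sketched in plain Python, assuming a simple random-input comparison against a reference version (RGTEvosuite itself applies EvoSuite-generated tests to Java programs; all names below are illustrative):

```python
import random

# Hypothetical sketch: use the human-written reference program as the oracle
# and flag a candidate patch as likely overfitting if it ever disagrees.
def median3_ref(a, b, c):
    """Reference (ground-truth) implementation: median of three values."""
    return sorted((a, b, c))[1]

def median3_patched(a, b, c):
    """Candidate patch: wrong whenever c is strictly the smallest value."""
    return min(max(a, b), c)

def rgt_check(candidate, oracle, trials=1000, seed=0):
    """Return False (likely overfitting) if any sampled input disagrees with the oracle."""
    rng = random.Random(seed)
    for _ in range(trials):
        args = [rng.randint(-100, 100) for _ in range(3)]
        if candidate(*args) != oracle(*args):
            return False
    return True

rgt_check(median3_patched, median3_ref)  # flags the patch as likely overfitting
```

The design choice is the oracle: instead of the original (possibly weak) test suite, the generated tests encode the reference behavior, which is why such techniques can expose patches the suite alone cannot.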
Implications and Future Directions
By studying the QuixBugs benchmark comprehensively, the paper extends automatic program repair research beyond its traditional benchmarks and strengthens the generalizability of existing insights on overfitting and tool performance. Moreover, the publication of all empirical results on GitHub underscores a commitment to transparency and catalyzes further research.
The paper sets a precedent for leveraging underexplored benchmarks in the evaluation of automated program repair tools, thereby reducing the field's reliance on conventional datasets that may not fully challenge existing techniques. Future research might expand upon this by exploring multi-bug fixes and enhancements in patch assessment methodologies to further mitigate overfitting risks and improve reliability. These expansions could substantially enrich the current understanding and capability of automatic program repair systems.
In summary, the paper sheds light on the intricacies of automatic program repair with a focus on overcoming prevalent challenges such as overfitting. It encourages the community to diversify the benchmarks used in experimental evaluations, ensuring a comprehensive exploration of the efficacy and limitations of repair tools in varied contexts.