- The paper introduces a novel approach using PATCH-SIM and TEST-SIM behavior similarity heuristics to assess the correctness of patches generated by test-based program repair systems.
- Evaluation demonstrated that the approach filtered out 56.3% of incorrect patches from a dataset without removing any correct patches, surpassing existing patch classification metrics.
- The approach has broader implications for addressing the oracle problem in software testing and enabling future research in areas like fault localization and human patch evaluation.
Identifying Patch Correctness in Test-Based Program Repair
The paper "Identifying Patch Correctness in Test-Based Program Repair" presents an approach to improve the precision of automated test-based program repair. It addresses a significant issue in automatic program repair systems: the generation of incorrect patches due to weak test suites. Automated repair techniques modify a faulty program until all given tests pass, but those test suites are often too weak to confirm that a fix is actually correct, leading to many plausible yet incorrect patches.
Core Methodology
The novel approach proposed in the paper utilizes heuristics based on behavior similarity to assess patch correctness. The authors introduce two main heuristics:
- PATCH-SIM: This heuristic posits that a correct patch should substantially change the execution behavior of originally failing tests, while originally passing tests should show minimal behavioral change after patching. Comparing the execution behavior of the original and patched programs on each test therefore gives a signal of patch correctness.
- TEST-SIM: This heuristic suggests that test inputs with similar runtime behavior are likely to produce the same test results. TEST-SIM classifies newly generated test inputs (which augment the weak test suite) as likely to pass or likely to fail based on their execution similarity to existing tests with known outcomes.
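The intuition behind both heuristics can be illustrated with a minimal sketch. This assumes execution traces are recorded as sequences of executed statement IDs and uses a simple positional distance in place of the paper's actual trace-comparison machinery; all function names here are hypothetical, not the authors' implementation:

```python
def trace_distance(trace_a, trace_b):
    """Crude normalized distance between two execution traces:
    fraction of positions that differ, counting length mismatch."""
    longest = max(len(trace_a), len(trace_b))
    if longest == 0:
        return 0.0
    diffs = sum(1 for a, b in zip(trace_a, trace_b) if a != b)
    diffs += abs(len(trace_a) - len(trace_b))
    return diffs / longest

def classify_patch(passing_traces, failing_traces, threshold=0.3):
    """PATCH-SIM sketch: flag a patch as likely incorrect when a failing
    test's behavior barely changes, or a passing test's behavior changes
    a lot. Each argument maps a test name to a pair
    (original_trace, patched_trace). The threshold is illustrative."""
    for orig, patched in failing_traces.values():
        if trace_distance(orig, patched) < threshold:
            return "likely incorrect"  # failing test behavior barely changed
    for orig, patched in passing_traces.values():
        if trace_distance(orig, patched) >= threshold:
            return "likely incorrect"  # passing test behavior changed too much
    return "likely correct"

def classify_generated_test(new_trace, known_pass, known_fail):
    """TEST-SIM sketch: label a generated test input by whichever
    existing test's trace it most resembles."""
    best_label, best_dist = None, float("inf")
    for trace in known_pass:
        d = trace_distance(new_trace, trace)
        if d < best_dist:
            best_label, best_dist = "likely pass", d
    for trace in known_fail:
        d = trace_distance(new_trace, trace)
        if d < best_dist:
            best_label, best_dist = "likely fail", d
    return best_label

# A patch that changes failing-test behavior but not passing-test behavior
# is accepted; one that leaves failing tests unchanged is rejected.
print(classify_patch({"t1": ([1, 2], [1, 2])},
                     {"t2": ([1, 2, 3], [4, 5, 6])}))  # → likely correct
print(classify_patch({}, {"t2": ([1, 2, 3], [1, 2, 3])}))  # → likely incorrect
```

The real technique works on richer spectra than this positional comparison, but the decision structure is the same: behavioral deltas on the two test classes drive patch classification, and trace similarity to known tests drives the pass/fail labeling of generated inputs.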
The approach is demonstrated on patches produced by several program repair systems, including jGenProg and Nopol. It filtered out 56.3% of incorrect patches from the dataset without excluding any correct patches, a substantial precision improvement.
Evaluation and Results
The paper evaluates the methodology on a dataset of 139 patches. The approach improved patch precision across the different automated repair systems, outperforming traditional metrics such as syntactic and semantic distance. It also surpassed existing methods such as anti-patterns and implicit-oracle-based detection (Opad), validating behavior-similarity heuristics as a basis for patch classification.
Implications and Future Work
The implications of the paper extend beyond automated program repair, offering insights into addressing the oracle problem in software testing frameworks. By establishing behavior similarity as a viable measure for patch correctness, the authors open pathways for future research in integrating these heuristics in broader testing scenarios, perhaps even in fault localization and human patch evaluation.
While the approach excluded no correct patches in the evaluation, the paper acknowledges that this could happen in principle. It suggests exploring more sophisticated spectrum analysis and classification techniques to distinguish correct from incorrect patches more precisely.
In summary, the paper contributes to software engineering by refining how repair results are validated, moving toward systems that can autonomously assess patch correctness and thereby making automated program repair more usable and reliable. Further work could refine the behavior-comparison methodology to better handle complex scenarios involving large codebase modifications.