A Survey of Learning-based Automated Program Repair (2301.03270v3)

Published 9 Jan 2023 in cs.SE

Abstract: Automated program repair (APR) aims to fix software bugs automatically and plays a crucial role in software development and maintenance. With the recent advances in deep learning (DL), an increasing number of APR techniques have been proposed to leverage neural networks to learn bug-fixing patterns from massive open-source code repositories. Such learning-based techniques usually treat APR as a neural machine translation (NMT) task, where buggy code snippets (i.e., source language) are translated into fixed code snippets (i.e., target language) automatically. Benefiting from the powerful capability of DL to learn hidden relationships from previous bug-fixing datasets, learning-based APR techniques have achieved remarkable performance. In this paper, we provide a systematic survey to summarize the current state-of-the-art research in the learning-based APR community. We illustrate the general workflow of learning-based APR techniques and detail the crucial components, including fault localization, patch generation, patch ranking, patch validation, and patch correctness phases. We then discuss the widely-adopted datasets and evaluation metrics and outline existing empirical studies. We discuss several critical aspects of learning-based APR techniques, such as repair domains, industrial deployment, and the open science issue. We highlight several practical guidelines on applying DL techniques for future APR studies, such as exploring explainable patch generation and utilizing code features. Overall, our paper can help researchers gain a comprehensive understanding about the achievements of the existing learning-based APR techniques and promote the practical application of these techniques. Our artifacts are publicly available at https://github.com/QuanjunZhang/AwesomeLearningAPR.

Learning-based Automated Program Repair: A Comprehensive Survey

"A Survey of Learning-based Automated Program Repair," authored by Quanjun Zhang and colleagues, provides a robust overview of the integration of deep learning (DL) into automated program repair (APR). This paper meticulously surveys recent advancements in learning-based APR techniques, offering an in-depth exploration of the methodologies, datasets, metrics, and challenges intrinsic to this emerging field.

Automated program repair aims to identify and rectify software bugs autonomously, thereby significantly reducing manual debugging efforts. With the advent of DL, learning-based APR techniques have gained momentum, leveraging large corpora of source code to learn bug-fixing patterns. These approaches typically model APR as a neural machine translation (NMT) task, transforming buggy code snippets (source language) into corrected code snippets (target language).
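To make the translation framing concrete, the following minimal sketch turns a buggy snippet and its fix into a source/target token pair. It is an illustration only: the function name `abstract_tokens`, the `VAR_n` placeholder scheme, and the use of Python's standard `tokenize` module are assumptions made here, not the pre-processing of any specific technique covered by the survey.

```python
import io
import keyword
import tokenize

# Token types that carry layout rather than content; dropped in this sketch.
_SKIP = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}

def abstract_tokens(source: str, mapping: dict) -> list:
    """Tokenize a Python snippet and abstract identifiers.

    Every non-keyword name (including builtins such as `len`) is replaced
    by a placeholder VAR_1, VAR_2, ... in order of first appearance.
    `mapping` is shared between the buggy and fixed snippet so the same
    identifier receives the same placeholder on both sides of the pair.
    """
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIP:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            placeholder = mapping.setdefault(
                tok.string, f"VAR_{len(mapping) + 1}"
            )
            tokens.append(placeholder)
        else:
            tokens.append(tok.string)
    return tokens

# Hypothetical bug-fix pair: an off-by-one comparison.
buggy = "if i <= len(items): total += items[i]\n"
fixed = "if i < len(items): total += items[i]\n"

shared = {}
source_seq = abstract_tokens(buggy, shared)   # "source language"
target_seq = abstract_tokens(fixed, shared)   # "target language"
print(" ".join(source_seq))
print(" ".join(target_seq))
```

A model trained on many such pairs learns to map the source sequence to the target sequence; here the two sequences differ only in the comparison operator.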

Methodological Framework

The paper delineates a typical learning-based APR workflow comprising several key phases: fault localization, data pre-processing, patch generation, patch ranking, patch validation, and patch correctness assessment. This structured workflow underscores the complex interdependencies and technical requirements of successfully integrating DL into APR.

Fault Localization: Effective bug localization is crucial as it guides the subsequent stages of repair. Through spectrum-based fault localization techniques, buggy code elements are identified, forming the foundation upon which learning-based models operate.
Data Pre-processing: This involves context extraction, code abstraction, and tokenization, which significantly influence the model's ability to discern the latent bug-fixing patterns within large data sets.
Patch Generation: At this stage, sequence-to-sequence models built on architectures such as RNN, LSTM, and Transformer are employed to predict fixes. The choice of architecture affects how well the model captures long-distance relationships within code sequences, which in turn influences repair accuracy.
Patch Ranking and Validation: Beam search is commonly used to keep the candidate patches to which the model assigns the highest probability. The candidates are then run against the existing test suite, and those that fail are discarded.
Patch Correctness: A substantive concern in APR is overfitting, which refers to generated patches that pass available tests but fail to generalize. Addressing this involves employing both static and dynamic analysis techniques.
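As an illustration of the spectrum-based fault localization mentioned in the list above, the sketch below scores each line with the Ochiai formula, one widely used spectrum-based metric: ef / sqrt((ef + nf) * (ef + ep)), where ef and ep count the failing and passing tests that execute the line and nf counts the failing tests that do not. The function name, the input format, and the toy coverage data are assumptions made for this example.

```python
import math

def ochiai_scores(coverage: dict, outcomes: dict) -> dict:
    """Compute an Ochiai suspiciousness score for every covered line.

    coverage: test name -> set of line numbers the test executes
    outcomes: test name -> True if the test passed, False if it failed
    """
    total_failed = sum(1 for passed in outcomes.values() if not passed)
    all_lines = set().union(*coverage.values())
    scores = {}
    for line in all_lines:
        ef = sum(1 for t, cov in coverage.items()
                 if line in cov and not outcomes[t])
        ep = sum(1 for t, cov in coverage.items()
                 if line in cov and outcomes[t])
        # ef + nf equals the total number of failing tests.
        denom = math.sqrt(total_failed * (ef + ep))
        scores[line] = ef / denom if denom else 0.0
    return scores

# Toy spectrum (hypothetical): three tests over four lines.
coverage = {"t1": {1, 2, 3}, "t2": {1, 2, 4}, "t3": {1, 4}}
outcomes = {"t1": True, "t2": False, "t3": False}

scores = ochiai_scores(coverage, outcomes)
for line, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(line, round(score, 3))
```

In this toy spectrum, line 4 is executed by both failing tests and by no passing test, so it receives the highest score and would be passed to the repair model first.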
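The ranking step in the list above can be sketched with a small beam search. This is a toy under stated assumptions: `toy_model` is a hypothetical stand-in for a trained decoder that returns a next-token probability distribution, and the `<eos>` end marker and the choice to drop unfinished sequences at the length limit are simplifications made for this example.

```python
import math

def beam_search(step_fn, beam_width: int, max_len: int, eos: str = "<eos>"):
    """Return finished sequences as (log_probability, tokens), best first.

    step_fn(prefix) must return a dict mapping each possible next token to
    a strictly positive probability. Sequences that have not produced
    `eos` after max_len steps are dropped in this sketch.
    """
    beams = [(0.0, [])]          # (cumulative log-probability, tokens)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            for token, prob in step_fn(seq).items():
                candidates.append((logp + math.log(prob), seq + [token]))
        # Keep only the beam_width most probable extensions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for logp, seq in candidates[:beam_width]:
            if seq[-1] == eos:
                finished.append((logp, seq))
            else:
                beams.append((logp, seq))
        if not beams:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    return finished

def toy_model(prefix):
    """Hypothetical decoder: the distribution depends only on position."""
    table = [
        {"i": 1.0},
        {"<": 0.6, "<=": 0.3, "!=": 0.1},
        {"n": 0.7, "m": 0.3},
        {"<eos>": 1.0},
    ]
    return table[len(prefix)]

ranked = beam_search(toy_model, beam_width=2, max_len=4)
for logp, tokens in ranked:
    print(round(math.exp(logp), 2), " ".join(tokens))
```

The ranking reflects model likelihood only. Whether a highly ranked candidate is actually correct still has to be established by the validation and correctness phases.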

Empirical Evaluation

Across various benchmarks, including Defects4J and QuixBugs, learning-based techniques have demonstrated notable success, reflecting the potential that DL brings to APR. However, these approaches require considerable computational resources during training, as well as high-quality datasets, which are often mined from open-source repositories.

Future Directions

The survey identifies critical areas for continued research and improvement:

Code Representation: Developing optimal representations that capture both syntax and semantics while reducing processing overhead.
Patch Validation Accelerations: Enhancing efficiency in validating candidate patches to expedite the overall repair process.
Pre-Trained Models: Leveraging pre-trained models with robust fine-tuning to cater specifically to APR tasks, optimizing both accuracy and scalability.
Broader Repair Domains: Extending methodologies to diverse repair domains, including API misuse, syntax errors, and security vulnerabilities, to improve the generalizability of learning-based APR approaches.

Conclusion

The paper by Zhang et al. offers a comprehensive synthesis of the current state of learning-based APR, elucidating both its triumphs and its challenges. By dissecting the intricacies of each component within the APR process and emphasizing the synergy between traditional and learning-based techniques, the paper paves the way for future innovations. As the field progresses, integrating insights from both software engineering and artificial intelligence will be indispensable in overcoming the persistent challenges of automated program repair, ultimately advancing the efficiency and robustness of software maintenance practices.

Authors

Quanjun Zhang
Chunrong Fang
Yuxiang Ma
Weisong Sun
Zhenyu Chen

GitHub

GitHub - iSEngLab/AwesomeLearningAPR: [TOSEM 2023] A Survey of Learning-based Automated Program Repair (69 stars)