Copiloting the Copilots: Fusing Large Language Models with Completion Engines for Automated Program Repair (2309.00608v3)

Published 1 Sep 2023 in cs.SE, cs.LG, and cs.PL

Abstract: During Automated Program Repair (APR), it can be challenging to synthesize correct patches for real-world systems in general-purpose programming languages. Recent LLMs have been shown to be helpful "copilots" in assisting developers with various coding tasks, and have also been directly applied for patch synthesis. However, most LLMs treat programs as sequences of tokens, meaning that they are ignorant of the underlying semantics constraints of the target programming language. This results in plenty of statically invalid generated patches, impeding the practicality of the technique. Therefore, we propose Repilot, a general code generation framework to further copilot the AI "copilots" (i.e., LLMs) by synthesizing more valid patches during the repair process. Our key insight is that many LLMs produce outputs autoregressively (i.e., token by token), resembling human writing programs, which can be significantly boosted and guided through a Completion Engine. Repilot synergistically synthesizes a candidate patch through the interaction between an LLM and a Completion Engine, which 1) prunes away infeasible tokens suggested by the LLM and 2) proactively completes the token based on the suggestions provided by the Completion Engine. Our evaluation on a subset of the widely-used Defects4j 1.2 and 2.0 datasets shows that Repilot outperforms state-of-the-art techniques by fixing 27% and 47% more bugs, respectively. Moreover, Repilot produces more valid and correct patches than the base LLM with the same budget. While we focus on leveraging Repilot for APR in this work, the overall approach is also generalizable to other code generation tasks.

An Essay on "Copiloting the Copilots: Fusing LLMs with Completion Engines for Automated Program Repair"

Automated Program Repair (APR) presents a challenging problem requiring sophisticated methods to synthesize valid and correct patches for software systems. LLMs have gained traction as "copilots" in this domain due to their efficacy in generating code after training on substantial programming corpora. However, these models operate on sequences of tokens and neglect the static semantics of the target programming language, thus producing a considerable rate of statically invalid patches. To address this limitation, the discussed paper introduces "Repilot," a framework that synergizes LLMs with a Completion Engine to significantly enhance the validity of APR-generated patches.

Contribution

The paper's primary contribution is Repilot, a framework explicitly designed to combine LLMs with Completion Engines. This integration enables the framework to prune infeasible tokens and leverage real-time suggestions from the Completion Engine, thereby producing more valid patches. The approach treats code generation much like the way human developers write code, relying on incremental edits and autocompletion. By actively engaging a Completion Engine, Repilot not only enforces the static validity of patches but also expedites synthesis by directly completing complex identifiers that are cumbersome for LLMs to generate token by token.

Key Methodology

Repilot adopts a two-pronged approach. During token-by-token synthesis, it uses the LLM to predict probable continuations of the code sequence while concurrently invoking a Completion Engine to check each continuation's feasibility from a static-semantics perspective. This lets Repilot sift through the model's suggestions, discarding those that are statically invalid. Moreover, when only one valid continuation exists, the system proactively completes the token sequence with the Completion Engine's suggestion, enhancing both performance and accuracy.
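To make the interaction concrete, below is a minimal, self-contained Python sketch of such a guided decoding loop. It is an illustration under stated assumptions, not the authors' implementation: `ToyLLM`, `ToyCompletionEngine`, and their lookup tables are hypothetical stand-ins (the paper pairs real LLMs such as CodeT5 and InCoder with an actual completion engine), and the matching between LLM subtokens and engine completions is deliberately simplified to a prefix check.

```python
# A minimal sketch of completion-engine-guided decoding in the spirit of
# Repilot. ToyCompletionEngine and ToyLLM are hypothetical table-backed stubs;
# a real deployment would query a language server and an autoregressive LLM.
from typing import Dict, List


class ToyCompletionEngine:
    """Stand-in for a completion engine that knows which continuations
    keep the current code prefix statically valid."""

    def __init__(self, table: Dict[str, List[str]]):
        self.table = table  # code prefix -> statically valid continuations

    def valid_continuations(self, prefix: str) -> List[str]:
        return self.table.get(prefix, [])


class ToyLLM:
    """Stand-in for an autoregressive LLM's ranked next-token proposals."""

    def __init__(self, table: Dict[str, List[str]]):
        self.table = table  # code prefix -> candidate subtokens, best first

    def top_k_tokens(self, prefix: str, k: int = 5) -> List[str]:
        return self.table.get(prefix, [])[:k]


def generate_guided(llm: ToyLLM, engine: ToyCompletionEngine,
                    prefix: str, max_steps: int = 16) -> str:
    """Greedy decoding where the engine 1) prunes infeasible LLM tokens and
    2) proactively completes when only one valid continuation remains."""
    for _ in range(max_steps):
        completions = engine.valid_continuations(prefix)
        if not completions:
            break  # treat "no continuation" as a finished candidate
        if len(completions) == 1:
            # Proactive completion: splice in the unique valid continuation
            # (e.g., a long identifier) instead of sampling it subtoken by
            # subtoken from the LLM.
            prefix += completions[0]
            continue
        # Pruning: take the highest-ranked LLM proposal that is a prefix of
        # some statically valid continuation; discard the rest.
        for token in llm.top_k_tokens(prefix):
            if any(c.startswith(token) for c in completions):
                prefix += token
                break
        else:
            break  # every proposal was infeasible; abandon this candidate
    return prefix


if __name__ == "__main__":
    # Toy repair scenario: after "value." only toString() and hashCode()
    # are in scope, so the LLM's other proposals must be pruned.
    engine = ToyCompletionEngine({
        "return value.": ["toString()", "hashCode()"],
        "return value.toStr": ["ing()"],
        "return value.toString()": [";"],
    })
    llm = ToyLLM({
        "return value.": ["toStr", "getVal", "len"],
    })
    print(generate_guided(llm, engine, "return value."))
    # -> return value.toString();
```

Running the sketch prints `return value.toString();`: the engine prunes the infeasible proposals, accepts `toStr`, then splices in the unique continuations `ing()` and `;` wholesale. The design point is that the engine is consulted at every step, so statically invalid tokens never enter the prefix, and long identifiers can be completed without the LLM spelling them out subtoken by subtoken.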

Empirical Evidence

The empirical results, obtained on subsets of the widely used Defects4J 1.2 and 2.0 datasets, demonstrate Repilot's advantage over state-of-the-art techniques: it fixes 27% and 47% more bugs on Defects4J 1.2 and 2.0, respectively, compared to the closest competitors. These improvements underscore the effectiveness of pairing LLMs with complementary static-analysis tools such as Completion Engines.

Implications and Future Work

The implications of this research extend beyond APR to broader code synthesis and generation tasks. By showcasing its adaptability to two different LLM architectures, CodeT5 and InCoder, the authors illustrate the framework's versatility and potential applicability to a range of programming languages, particularly those with robust type systems. Future work could target dynamically typed languages by enriching the type-analysis capabilities of Completion Engines.

Conclusion

The paper articulates a compelling thesis on enhancing APR through the use of LLMs in conjunction with static semantics analysis tools. This synergistic approach addresses existing bottlenecks in producing valid and correct patches by guiding LLMs via Completion Engines, making notable strides in the field of software engineering. Future explorations and expanded evaluations across diverse programming environments could further cement Repilot's role as a pivotal component in automated code generation and repair systems.

Authors (3)
  1. Yuxiang Wei (40 papers)
  2. Chunqiu Steven Xia (13 papers)
  3. Lingming Zhang (48 papers)
Citations (76)