ChatRepair: Conversational Code Repair

Updated 2 October 2025
  • ChatRepair is a conversational automated program repair framework that interleaves candidate patch generation with dynamic validation and feedback.
  • The methodology integrates detailed error diagnostics and iterative context management to reduce redundant fixes and improve repair accuracy.
  • Empirical results on standard benchmarks demonstrate improved repair efficiency and cost-effectiveness compared to static prompt-based methods.

ChatRepair refers to a class of conversational, feedback-driven techniques for automated program repair (APR), exemplified by leveraging LLMs such as ChatGPT in a loop that interleaves candidate patch generation with real-time validation and iterative improvement. The paradigm represents a departure from traditional “generate-and-validate” APR by exploiting LLMs’ long-context capabilities and providing structured, multi-turn interactions that incorporate failure diagnostics and prior attempts to enhance repair accuracy, efficiency, and generality.

1. Principled Paradigm: Conversation-driven Automated Program Repair

ChatRepair operationalizes a conversational APR framework in which the repair process alternates patch generation and validation in iterative “turns.” In each turn, the LLM receives input comprising the original buggy code, previously generated candidate patches (and possibly their outcomes), and detailed validation feedback—such as failing test cases, outputs, and error messages—thereby constructing an evolving conversational context. After a patch is sampled from the LLM, immediate validation is performed (e.g., by running tests); the result, whether failure or success, is appended with contextual feedback for the next interaction. This feedback loop continues until a correct patch is found or conversation termination criteria (e.g., maximum turns) are met (Xia et al., 2023, Xia et al., 2023).

This conversation-driven strategy stands in stark contrast to previous APR approaches where the same static prompt is repeatedly presented to the LLM—often resulting in duplicate or semantically identical incorrect patches. By serializing the history of repair attempts and validation outcomes, ChatRepair leverages the long-context window of modern LLMs, enabling the system to avoid redundant fixes and utilize the semantic richness of error messages and test results.
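
To make the notion of an evolving conversational context concrete, the following minimal sketch serializes each repair turn as a pair of chat messages so that earlier patches and their validation outcomes remain visible to the model. The helper names and message schema are illustrative assumptions, not code from the ChatRepair implementation.

```python
# Minimal sketch of an evolving conversational repair context (hypothetical
# helpers and message schema; not the authors' implementation).

def initial_messages(buggy_code: str, failing_test: str, error_message: str) -> list:
    """Build the first turn: buggy code plus salient failure diagnostics."""
    return [
        {"role": "system", "content": "You are an automated program repair assistant."},
        {"role": "user", "content": (
            f"The following function is buggy:\n{buggy_code}\n"
            f"It fails test `{failing_test}` with error:\n{error_message}\n"
            "Please provide a fixed version of the function."
        )},
    ]

def append_turn(messages: list, candidate_patch: str, feedback: str) -> list:
    """Record one generate-and-validate turn in the conversation history."""
    messages.append({"role": "assistant", "content": candidate_patch})
    messages.append({"role": "user", "content": feedback})  # e.g. failing-test output
    return messages
```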

2. Methodology and Implementation Details

The canonical implementation of ChatRepair employs a Python-based orchestration layer that invokes LLMs, primarily through APIs (e.g., OpenAI gpt-3.5-turbo), following these steps:

  1. Initialization: An initial prompt is constructed, combining the buggy code and salient information about the failure (test name, failure line, and error message).
  2. Patch Generation: ChatGPT is instructed to produce a code patch, optionally through an infilling token or by rewriting an entire function/method.
  3. Validation: Once a patch is returned, it is compiled/executed and validated against a full suite of tests.
  4. Feedback Incorporation: Failure details are automatically crafted into a feedback message (e.g., “the patch does not resolve test XYZ, see error: ...”) and appended to the running chat context for subsequent queries.
  5. Iterative Refinement: This process iterates, maintaining a sliding window over conversation history to fit within context window limits. If a patch passes all tests (a plausible fix), the system may instruct the LLM to produce alternative valid variants, increasing the chance of an exact or correct fix.
  6. Formal Model: The iterative process is formally captured by a conditional sequence model:

$p_i \sim C(p_i \mid I, H_{<i})$, where $I$ is the initial prompt and $H_{<i}$ is the feedback-and-patch history accumulated over the preceding turns.

(Xia et al., 2023)

Algorithmic summaries and prompt design templates are provided to codify these steps into robust, reproducible repair pipelines.
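
The sketch below condenses steps 1–5 into a single loop. It is an illustrative reconstruction under stated assumptions rather than the released ChatRepair pipeline: query_llm, run_tests (assumed to return an object with a passed flag and failure details), and build_feedback are hypothetical stand-ins for the LLM API call, the test harness, and the feedback templating.

```python
# Illustrative conversational repair loop (not the official ChatRepair code).
# query_llm, run_tests, and build_feedback are hypothetical helpers.

MAX_TURNS = 10          # conversation termination criterion
MAX_CONTEXT_TURNS = 6   # sliding window over the dialogue history

def conversational_repair(buggy_code, failing_test, error_message,
                          query_llm, run_tests, build_feedback):
    # 1. Initialization: buggy code plus salient failure information.
    messages = [{
        "role": "user",
        "content": (f"Fix the following buggy function:\n{buggy_code}\n"
                    f"Failing test: {failing_test}\nError: {error_message}")
    }]
    plausible = []

    for _ in range(MAX_TURNS):
        # 2. Patch generation: sample a candidate patch from the LLM.
        patch = query_llm(messages)
        messages.append({"role": "assistant", "content": patch})

        # 3. Validation: run the candidate against the full test suite.
        result = run_tests(patch)
        if result.passed:
            # Plausible fix: ask for alternative valid variants to raise the
            # chance that one of them matches the developer (oracle) fix.
            plausible.append(patch)
            messages.append({"role": "user", "content":
                             "That patch passes all tests. Please produce an "
                             "alternative correct fix that differs from it."})
        else:
            # 4. Feedback incorporation: turn the failure into a concise message.
            messages.append({"role": "user", "content": build_feedback(result)})

        # 5. Iterative refinement: keep only the most recent turns so the
        # conversation fits within the model's context window.
        if len(messages) > 2 * MAX_CONTEXT_TURNS + 1:
            messages = messages[:1] + messages[-2 * MAX_CONTEXT_TURNS:]

    return plausible
```

Patch deduplication, compilation checks, and API cost caps are deliberately omitted here for brevity.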

3. Performance Metrics, Empirical Results, and Scalability

ChatRepair has been evaluated on benchmarks such as Defects4J (v1.2, v2.0) and QuixBugs, using standard metrics: the number of correct patches (semantically equivalent to the developer fix), the number of plausible patches (those passing all tests), and efficiency (e.g., average tries/API calls per fix) (Xia et al., 2023). Empirical results include:

Dataset          Bugs Correctly Fixed    Cost Per Bug   Notable Baselines
Defects4J v1.2   114                     ~$0.42         AlphaRepair, CodexRepair
Defects4J v2.0   48                      ~$0.42         SelfAPR, CodexRepair
QuixBugs         High correct-fix rate   N/A            Base ChatGPT

Notably, ChatRepair achieves state-of-the-art results (e.g., 162 of 337 bugs fixed in total across Defects4J v1.2 and v2.0 at roughly $0.42 per bug via ChatGPT), requiring substantially fewer generations than static prompt baselines and outperforming earlier template-based and LLM-based APR methods.

Scalability is enhanced through automated conversation management and dynamic prompt reduction, though context window size remains a limiting factor. The ReduceFix enhancement shows that automatic input reduction (shrinking test input size by 89.1% on average) yields a 21.3% relative increase in fix rate for ChatRepair on long-input benchmarks such as LFTBench (Yang et al., 21 Jul 2025).
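
The benefit of input reduction can be illustrated with a generic greedy, delta-debugging-style pass that removes chunks of the failing input as long as the failure still reproduces. This is a simplified stand-in for the concept only; ReduceFix itself prompts an LLM to synthesize a reducer program, and still_fails below is a hypothetical oracle that reruns the failing test.

```python
# Generic greedy input reduction (delta-debugging style), shown only to illustrate
# why shrinking failure-inducing inputs helps them fit into an LLM prompt.
# This is not ReduceFix's LLM-generated reducer approach.

def reduce_failing_input(failing_input: str, still_fails) -> str:
    """Greedily drop chunks of the input while the bug still reproduces.

    still_fails(candidate) -> bool is a hypothetical oracle that reruns the
    failing test on the candidate input.
    """
    chunk = max(1, len(failing_input) // 2)
    current = failing_input
    while chunk >= 1:
        i = 0
        while i < len(current):
            candidate = current[:i] + current[i + chunk:]
            if candidate and still_fails(candidate):
                current = candidate      # keep the smaller failing input
            else:
                i += chunk               # this chunk is needed; move on
        chunk //= 2
    return current
```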

4. Impact of Prompt Design, Feedback Integration, and Iteration

The effectiveness of ChatRepair is strongly influenced by prompt engineering and validation feedback construction. Explicit inclusion of rich, functional test failure feedback (e.g., “when input X, got Y, expected Z”) outperforms generic error messages or static code diffs (Xia et al., 2023, Xia et al., 2023). Prompt templates that encode code intention and task context (as inspired by requirement engineering) substantially boost repair success, as demonstrated in deep learning code repair (Cao et al., 2023).
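
A minimal sketch of such feedback construction is shown below; the TestResult container and its field names are illustrative assumptions rather than ChatRepair's actual data structures.

```python
# Sketch of rich, test-grounded feedback construction (illustrative field names).
from dataclasses import dataclass

@dataclass
class TestResult:
    test_name: str
    test_input: str
    expected: str
    actual: str
    error_message: str

def build_feedback(result: TestResult) -> str:
    """Prefer concrete input/expected/actual facts over generic error text."""
    return (
        f"The patch still fails test `{result.test_name}`.\n"
        f"When given input {result.test_input}, it produced {result.actual}, "
        f"but {result.expected} was expected.\n"
        f"Error message: {result.error_message}\n"
        "Please provide a different fix."
    )
```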

Iterative dialogue—where each failed or successful patch is appended with targeted, concise feedback—encourages the LLM to diversify its attempts and internalize previous mistakes, thus converging more rapidly to plausible and correct repairs. Alternative plausible patches (variants built atop successful repairs) further improve the chance of matching the oracle, surfacing fix diversity that would be absent in generate-and-validate-only methods.

However, recent empirical analyses highlight that naively increasing iterations does not always yield superior repair outcomes. For example, 65% of patches across iterations may be duplicates, and simple repeated independent runs can occasionally match or surpass the iterative refinement in patch quality (Chen et al., 19 Mar 2025). The diversity and precision of feedback, not simply iteration count, are critical.
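
One inexpensive mitigation is to detect syntactic duplicates before spending test executions on them. The sketch below assumes purely textual normalization and hypothetical helpers; it is not the cited study's tooling.

```python
# Syntactic duplicate detection for candidate patches (illustrative only).
import hashlib
import re

def normalize(patch: str) -> str:
    """Strip comments and collapse whitespace so trivially rephrased patches collide."""
    no_block = re.sub(r"/\*.*?\*/", "", patch, flags=re.DOTALL)   # block comments
    no_line = re.sub(r"//[^\n]*|#[^\n]*", "", no_block)           # line comments
    return re.sub(r"\s+", " ", no_line).strip()

def is_duplicate(patch: str, seen_hashes: set) -> bool:
    """Return True if an equivalent (normalized) patch was already validated."""
    digest = hashlib.sha256(normalize(patch).encode()).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```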

5. Challenges, Limitations, and Analysis of Failure Modes

Despite improvements over traditional baselines, ChatRepair exhibits the following limitations:

  • Fault Localization Assumption: ChatRepair typically presumes the buggy region is known/provided. It is not a replacement for localization techniques.
  • Memorization and Generalization Gaps: Evaluations using the BloomAPR framework reveal heavy reliance on memorization at the “Remember” cognitive layer (fixing up to 81.57% of known bugs), but substantial performance drops when minor syntactic variations or context shifts are introduced (e.g., only 13.46–41.34% of bugs solved in the “Analyze” layer with injected bugs in real projects) (Ma et al., 29 Sep 2025).
  • Sensitivity to Prompt and Validation Structure: The approach is sensitive to the way test cases, error messages, or context are presented. Providing overly long failure-inducing inputs may trigger the “lost-in-the-middle” effect in LLMs, compromising repair performance (Yang et al., 21 Jul 2025).
  • Limited Reasoning on Complex Repairs: Failures frequently stem from LLMs misunderstanding the root cause of the failure (66% of failed repairs), failing to infer the expected program behavior from tests, or lacking access to “fix ingredients” outside the immediate code context (only 45% of repairs are correct when the necessary information lies outside the function) (Chen et al., 19 Mar 2025).
  • Catastrophic Forgetting and Prompt Length: Iterative dialogue sometimes suffers from forgetting earlier hints, especially when the conversation window is long or too much auxiliary information is included (Cao et al., 2023).
  • Data Contamination Risks: Prevalent use of static benchmarks in LLM pretraining may artificially inflate observed repair performance (Ma et al., 29 Sep 2025).

These findings indicate a need for robust context management, enhanced problem understanding mechanisms, adaptive prompt construction, and integration with explicit localization or context-providing modules.

6. Extensions, Applications, and Future Research

The conversational repair paradigm underlying ChatRepair is extensible to a variety of repair and refinement domains:

  • Beyond Classical Code Repair: Adaptations of ChatRepair have demonstrated repair effectiveness in deep learning code (Cao et al., 2023), quantum program repair (with 76.3% success on Bugs4Q under the TOP-5 metric) (Guo et al., 26 Jan 2024), and web UI test script repair where LLM explanation consistency checks enhance locator repair accuracy (Xu et al., 2023).
  • Contrastive Feedback and Enhanced Prompting: Techniques such as ContrastRepair, which augment prompts with minimal-difference contrastive test pairs (high Damerau–Levenshtein similarity between failing/passing tests), improve both bug localization and fix rates, achieving new state-of-the-art repair numbers on multiple benchmarks (Kong et al., 4 Mar 2024); a sketch of contrastive pair selection appears after this list.
  • Requirement and Specification Refinement: Prompt refinement via chat-guided clarification (as in ChatCoder) enables LLMs to generate higher-quality code by iteratively soliciting and integrating clarification questions and corrections from human users (Wang et al., 2023).
  • Organizational Knowledge Repair: ChatRepair-inspired conversational frameworks are also deployed in collaborative information management, e.g., CHOIR, where LLMs interleave chat consensus and document synchronization in organizational memory systems (Lee et al., 20 Feb 2025).
  • Input Preprocessing for Long Test Cases: Automatic input minimization via LLM-generated reducers, as in ReduceFix, specifically mitigates context window issues for large, failure-inducing tests and demonstrably improves repair accuracy when integrated into ChatRepair pipelines (Yang et al., 21 Jul 2025).
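
As an illustration of the contrastive-pair idea mentioned in the ContrastRepair item above, the sketch below selects, for a failing test, the most similar passing test using the restricted Damerau–Levenshtein (optimal string alignment) distance. It is a simplified stand-in for ContrastRepair's actual pair-selection pipeline.

```python
# Selecting a minimal-difference passing test for a failing test (illustrative).

def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def closest_passing_test(failing_test: str, passing_tests: list) -> str:
    """Pick the passing test most similar to the failing one (contrastive pair)."""
    return min(passing_tests, key=lambda t: osa_distance(failing_test, t))
```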

Future directions include the development of dynamic, cognitively structured benchmarks (e.g., BloomAPR) that explicitly test reasoning ability beyond memorized fixes, integration with richer context retrieval and program analysis modules, adaptive sampling and feedback loops to maximize patch diversity and reduce computational cost, and further studies on mitigating memorization and improving generalization in LLM-based repair.

7. Significance and Outlook

ChatRepair marks a shift from static, one-shot repair approaches toward interactive, feedback-driven LLM-based repair frameworks that more closely mimic the iterative, context-rich workflow of human debugging. While it sets new standards in cost efficiency and repair rates on static bug benchmarks, its generalization remains constrained by sensitivity to context, overfitting to known fix patterns, and limitations in reasoning under syntactic or semantic perturbations (Ma et al., 29 Sep 2025).

Addressing these challenges will require research into better problem understanding, richer prompt designs, and context-aware, dynamically adaptive repair workflows. These insights are foundational for the evolution of trustworthy, context-robust LLM-powered APR and for broad applications in code synthesis, debugging, maintenance, and organizational knowledge management.
