RelRepair: Retrieval-Augmented Program Repair

Updated 24 September 2025
  • RelRepair is a retrieval-augmented automated program repair framework that integrates project-specific function signatures and code snippets to compensate for LLMs' limited knowledge of the target codebase.
  • It employs a hierarchical repair process by first retrieving function signatures (SigRepair) and then supplementing with code snippets (SnipRepair) using embedding-based similarity.
  • Empirical evaluations on benchmarks like Defects4J and ManySStuBs4J demonstrate significant improvements in patch accuracy and repair success rates.

RelRepair in Automated Program Repair

RelRepair is a retrieval-augmented automated program repair (APR) framework designed to overcome limitations of general-purpose large language model (LLM)-based bug-fixing systems by systematically incorporating project-specific information into the patch-generation process. This design responds to the observation that LLMs, even large ones, often lack sufficient knowledge of custom types, APIs, and local code context to generate correct patches, particularly for defects involving domain-specific identifiers or nuanced project relationships. RelRepair augments LLM capabilities by retrieving and integrating semantically relevant function signatures and code snippets from the target codebase, making the model’s generation process both contextually aware and project-sensitive (Liu et al., 20 Sep 2025).

1. Motivation and Framework Design

RelRepair’s core motivation is to rectify the often inadequate project-awareness of LLM-based APR tools. Standard approaches rely on patterns learned during pre-training on broad corpora, which means that novel types, methods, or patterns unique to a given project are underrepresented in the LLM’s weights. This results in suboptimal repairs, as LLMs may hallucinate non-existent API calls or misuse project-specific constructs. RelRepair addresses this by incorporating a retrieval-augmented generation (RAG) paradigm into the repair loop:

  • Retrieval of function signatures: identifies relevant functions in the same or nearby files, using both names and documentation.
  • Deeper code snippet retrieval: for more complex defects, retrieves semantically similar function bodies from the same file or other structurally close files, accounting for code and comment similarity.
  • Prompt construction: integrates both base context (buggy function, failure details) and retrieved artifacts directly in the LLM prompt, refining the model’s repair reasoning.

This division into “SigRepair” (signature-based) and “SnipRepair” (snippet-based) submodules is particularly salient, as it allows for a hierarchical repair process: fast, function-level repairs are attempted first, followed by deeper, structure-aware repairs if needed (Liu et al., 20 Sep 2025).

2. Retrieval and Rewriting Mechanism

The retrieval process in RelRepair consists of three main stages:

  • Query Rewriting: An LLM reformulates the input bug context to derive a candidate set of function names and a concise natural language root cause summary. These queries are enriched with information from the buggy function, test failures, and fault localization outputs.
  • Project Dataset Construction: The codebase is indexed at two granularities:
    • Function signatures (name, parameter list, comments) relevant to the query function (same file, variable-based method lookup).
    • Code snippets (entire function bodies) from the same file (“intra-file”) and from structurally similar files within the same directory (“inter-file”).
    • At both granularities, textual features are embedded using pre-trained models (e.g., SentenceBERT for signatures, CodeBERT for code and comments).
  • Similarity-Based Retrieval: For signature queries, cosine similarity between the embedding of the query and each candidate signature is used to select the top-25 most relevant signatures. For code-level queries, the similarity score is a weighted sum:

$$\operatorname{Sim}(f) = \alpha\,\frac{E_{qf} \cdot E_f}{\|E_{qf}\|\,\|E_f\|} + \beta\,\frac{E_{qc} \cdot E_c}{\|E_{qc}\|\,\|E_c\|}$$

where $E_{qf}$ and $E_{qc}$ are the embeddings of the buggy function’s code and comment, $E_f$ and $E_c$ are the corresponding embeddings of the candidate function, and the weights $\alpha$ and $\beta$ balance code-versus-comment similarity.
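
The scoring rule above translates directly into a few lines of code once embeddings are available. The following is a minimal sketch, assuming the code and comment embeddings have already been produced by encoders such as SentenceBERT or CodeBERT as described above; the helper names and the default $\alpha$/$\beta$ values are illustrative, not settings reported for RelRepair.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def composite_similarity(query: dict, candidate: dict,
                         alpha: float = 0.7, beta: float = 0.3) -> float:
    """Weighted code/comment similarity Sim(f) from the formula above.

    `query` and `candidate` hold pre-computed embeddings,
    e.g. {"code": np.ndarray, "comment": np.ndarray}.
    The alpha/beta defaults are placeholders, not the paper's values.
    """
    return (alpha * cosine(query["code"], candidate["code"])
            + beta * cosine(query["comment"], candidate["comment"]))

def top_k(query: dict, candidates: list, k: int = 25) -> list:
    """Rank candidates by composite similarity (mirrors the top-25 cut-off)."""
    return sorted(candidates,
                  key=lambda c: composite_similarity(query, c),
                  reverse=True)[:k]
```

The same ranking pattern covers both granularities: signature retrieval reduces to plain cosine similarity between query and signature embeddings, while snippet retrieval uses the weighted sum.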

This design ensures that the retrieval process adapts to both structural and semantic features of code, not merely surface n-gram similarity (Liu et al., 20 Sep 2025).
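
Complementing the scoring sketch above, the first stage (query rewriting) can be illustrated with a simple prompt template that asks the LLM for candidate function names and a root-cause summary. The wording below is an assumption for illustration only; the paper's actual template is not reproduced here.

```python
# Illustrative query-rewriting prompt; field names and phrasing are hypothetical.
QUERY_REWRITE_PROMPT = """\
You are analyzing a buggy Java function.

Buggy function:
{buggy_function}

Failing test and error message:
{failure_info}

Fault localization (suspicious lines):
{fault_localization}

1. List the names of project functions likely relevant to fixing this bug.
2. Summarize the root cause of the bug in one sentence.
"""

def build_rewrite_query(buggy_function: str, failure_info: str,
                        fault_localization: str) -> str:
    """Fill the template with the bug context gathered before retrieval."""
    return QUERY_REWRITE_PROMPT.format(buggy_function=buggy_function,
                                       failure_info=failure_info,
                                       fault_localization=fault_localization)
```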

3. Integration with LLM Prompting

RelRepair constructs a composite prompt for the LLM that, in addition to the buggy function, failing test, and error message, includes retrieved signatures and/or code snippets. This augmented prompt has the following properties:

  • Explicit exposure to local APIs: By providing relevant function signatures, the LLM is less likely to invent calls to undefined functions or misuse library calls.
  • Semantically rich local context: Code snippets from the same or similar files reveal usage idioms, variable types, and patterns of check or error handling that are specific to the codebase.
  • Hierarchical repair: RelRepair attempts repairs in three phases:
    • BaseRepair: basic prompt, no retrieval.
    • SigRepair: prompt augmented with retrieved signatures.
    • SnipRepair: prompt further augmented with top retrieved code snippets.

Each failed phase triggers the next, with the aim of supplying the minimally sufficient context for a successful repair; a rough sketch of this escalation follows.
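
The sketch below wires the three phases together. It is a minimal illustration: `generate_patch`, `passes_tests`, and the two retrieval callables are hypothetical stand-ins for the LLM call, the project's test harness, and the retrieval stages of Section 2, and the prompt layout only approximates the paper's templates.

```python
from dataclasses import dataclass

@dataclass
class Bug:
    function: str   # buggy function source
    test: str       # failing test
    error: str      # error message / failure details

def build_prompt(bug: Bug, signatures=None, snippets=None) -> str:
    """Compose the repair prompt; retrieved artifacts are appended when present."""
    parts = [f"Buggy function:\n{bug.function}",
             f"Failing test:\n{bug.test}",
             f"Error message:\n{bug.error}"]
    if signatures:
        parts.append("Relevant function signatures:\n" + "\n".join(signatures))
    if snippets:
        parts.append("Relevant code snippets:\n" + "\n\n".join(snippets))
    return "\n\n".join(parts)

def repair(bug, generate_patch, passes_tests, retrieve_signatures, retrieve_snippets):
    """BaseRepair -> SigRepair -> SnipRepair, stopping at the first validated patch."""
    # Phase 1: BaseRepair (no retrieval).
    patch = generate_patch(build_prompt(bug))
    if passes_tests(patch):
        return patch
    # Phase 2: SigRepair (add retrieved signatures).
    sigs = retrieve_signatures(bug)
    patch = generate_patch(build_prompt(bug, signatures=sigs))
    if passes_tests(patch):
        return patch
    # Phase 3: SnipRepair (add retrieved snippets on top of the signatures).
    snips = retrieve_snippets(bug)
    return generate_patch(build_prompt(bug, signatures=sigs, snippets=snips))
```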

4. Empirical Evaluation and Results

RelRepair’s performance is evaluated on Defects4J V1.2 and ManySStuBs4J—standard benchmarks in Java APR research:

| Dataset | RelRepair | Baseline | Unique fixes by RelRepair |
|---|---|---|---|
| Defects4J V1.2 | 101 bugs repaired | 47 (BaseChatGPT) | 28 |
| ManySStuBs4J | 48.3% fix rate | 31.2% (prior state of the art) | not reported |
  • On Defects4J V1.2, RelRepair repaired 101 bugs, compared to 47 by the BaseChatGPT baseline, including 28 bugs fixed only by RelRepair.
  • On ManySStuBs4J, RelRepair achieved a 48.3% fix rate, a 17.1 percentage-point improvement over the best prior result, substantially outperforming all compared LLM-based APR techniques.
  • The inclusion of project-specific retrieval was critical for categories requiring nuanced identifier or file-level reasoning (e.g., DIFFERENT_METHOD_SAME_ARGS).

These results empirically demonstrate the impact of context augmentation on repair effectiveness (Liu et al., 20 Sep 2025).

5. Technical Foundations: Embedding and Similarity

The retrieval process is mathematically anchored in embedding-based nearest neighbor search:

  • Cosine similarity is computed as:

$$\operatorname{sim}(e_q, e_i) = \frac{e_q \cdot e_i}{\|e_q\|\,\|e_i\|}$$

  • For code snippets, the composite similarity uses two weights, $\alpha$ and $\beta$, to combine structural (code) and descriptive (comment) features; these are tuned, for instance via grid search on labeled validation data, to reflect their relative importance in particular projects or code categories (a sketch of such a grid search follows this list).
  • Embedders such as SentenceBERT and CodeBERT are chosen for their effectiveness in capturing deep code semantics rather than merely syntactic similarity.
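
One simple way to fit $\alpha$ and $\beta$, assuming a small validation set of queries each paired with a known relevant candidate, is a grid search over the weights. Constraining $\alpha + \beta = 1$ and the 11-point grid are assumptions of this sketch, not choices documented for RelRepair.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def tune_weights(validation_set):
    """Grid-search alpha/beta on retrieval accuracy.

    validation_set: list of (query, candidates, relevant_index) tuples, where
    query and each candidate are dicts of "code" / "comment" embeddings.
    """
    best_weights, best_hits = (0.5, 0.5), -1
    for alpha in np.linspace(0.0, 1.0, 11):
        beta = 1.0 - alpha                       # assume weights sum to 1
        hits = 0
        for query, candidates, relevant_index in validation_set:
            scores = [alpha * cosine(query["code"], c["code"])
                      + beta * cosine(query["comment"], c["comment"])
                      for c in candidates]
            hits += int(int(np.argmax(scores)) == relevant_index)
        if hits > best_hits:
            best_weights, best_hits = (float(alpha), float(beta)), hits
    return best_weights
```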

This mathematical apparatus enables efficient, scalable retrieval and is directly responsible for the high context relevance of injected code.

6. Impact and Implications for LLM-based Program Repair

The main implication of RelRepair is that LLMs in software engineering must be complemented with explicit project-aware retrieval mechanisms for maximal repair effectiveness:

  • Project-specific information is irreplaceable: Integrating function signatures and code snippets tailored to the codebase allows APR systems to generate patches that respect both naming conventions and project idioms.
  • Retrieval+Generation outperforms generation alone: Empirical evidence across multiple benchmarks suggests that even advanced LLMs cannot substitute for targeted context when facing codebase-unique bugs.
  • Hierarchical retrieval enables scalable repair: By sequencing from light (signatures) to heavy (code snippet) retrieval, RelRepair balances computational cost and patch quality.

This approach signals a future research direction where retrieval-augmented strategies become standard for all LLM-based automated software reasoning tasks, especially for settings requiring adaptation to specific project semantics (Liu et al., 20 Sep 2025).

7. Limitations and Directions for Future Work

A prominent limitation is the dependence of repair effectiveness on the quality of the retrieval set: if the codebase’s comments or names are non-discriminative, retrieval may be less helpful. The computational overhead for embedding calculation and similarity ranking can also be nontrivial for very large projects. Further, adjustments of weighting factors for code/comment similarity may require project-specific tuning. A plausible implication is that future work will explore more advanced embedding models, potentially integrate static analysis for more precise variable/type context, and optimize retrieval efficiency either via approximate nearest neighbor search or hierarchical indexing.
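
As one illustration of the efficiency direction mentioned above, approximate nearest-neighbor search could replace exhaustive similarity ranking for large codebases. The sketch below uses FAISS purely as an example; FAISS is not part of RelRepair, and the HNSW parameters are illustrative.

```python
import numpy as np
import faiss  # pip install faiss-cpu

def build_ann_index(embeddings: np.ndarray) -> faiss.Index:
    """embeddings: (n, d) matrix of snippet/signature embeddings."""
    x = np.ascontiguousarray(embeddings.astype("float32"))
    faiss.normalize_L2(x)  # cosine similarity == inner product on unit vectors
    index = faiss.IndexHNSWFlat(x.shape[1], 32, faiss.METRIC_INNER_PRODUCT)
    index.add(x)
    return index

def query_ann(index: faiss.Index, query_emb: np.ndarray, k: int = 25):
    """Approximate top-k retrieval by cosine similarity."""
    q = np.ascontiguousarray(query_emb.astype("float32").reshape(1, -1))
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)
    return list(zip(ids[0].tolist(), scores[0].tolist()))
```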

RelRepair represents a substantial step in the evolution of program repair systems—one that enables LLMs to move from generic template application to project-specific, semantically-informed repair, setting a strong precedent for future research in retrieval-augmented code intelligence.

References

  • Liu et al., “RelRepair: Retrieval-Augmented Program Repair,” 20 September 2025.