
CodeR: Multi-Agent Issue Resolution & Retrieval

Updated 13 March 2026
  • CodeR is a research system that integrates multi-agent task-graph frameworks for automated GitHub issue resolution with a dense vector code retrieval model.
  • The multi-agent design decomposes tasks into specialized roles such as reproducer, fault localizer, and verifier, ensuring reliable patch submissions based on explicit planning.
  • The retrieval model leverages a dual-encoder architecture with synthetic data and a three-stage curriculum, yielding state-of-the-art performance on diverse code search tasks.

CodeR designates several distinct research systems in software engineering and machine learning. The term notably refers to (1) a multi-agent task-graph framework for automated GitHub issue resolution (Chen et al., 2024), (2) a general-purpose code retrieval model utilizing synthesized data and curriculum learning (Li et al., 19 May 2025), as well as (3) earlier systems for code recommendation (Jin et al., 2022) and cross-lingual medical term normalization (Yuan et al., 2020). This entry focuses on the two most prominent contemporary CodeR instantiations: the task-graph-driven issue resolver and the retrieval-oriented code embedding model. Both advance the state-of-the-art in reasoning about code repositories, exploiting advances in LLM orchestration and synthetic data generation, respectively.

1. Task Graph-based Multi-Agent Issue Resolution

The CodeR system for automatic issue resolution departs from monolithic, improvisational LLM approaches by introducing a structured, multi-agent architecture governed by pre-defined task graphs (Chen et al., 2024). The end-to-end pipeline targets repair and feature addition tasks in code repositories, operating as follows:

  • Agent Decomposition: The issue-resolving process is modularized into five role-specialized agents:
    • Manager: Orchestrates plan selection and final submission
    • Reproducer: Attempts to reproduce the reported bug, generating test cases
    • Fault Localizer: Pinpoints suspicious code units using a hybrid spectrum-based fault localization (SBFL) and BM25 retrieval
    • Editor: Proposes edits to the codebase based on localized faults
    • Verifier: Validates bug resolution by executing tests, including both original and synthesized cases
  • Task Graph Formalism: A plan is realized as a directed, JSON-serializable graph where nodes encode (agent, subtask) pairs, edges are labeled “Success” or “Failure,” and the “entry” node launches execution with the Manager. Traversal continues until the “submit” node, which delivers a final patch as a “git diff.”

The system’s runtime loop deterministically traverses graph nodes by loading the appropriate agent/prompt, executing the subtask, collecting reports, and selecting the next state based on Success/Failure. This explicit planning ensures both agent role fidelity and faithful execution of complex, multi-stage workflows.
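The traversal loop above can be sketched as follows. This is a minimal illustration under simplifying assumptions: the graph layout is a plain dictionary, and `run_agent` is a stub standing in for a real LLM-driven agent; neither is the paper's API.

```python
# Minimal sketch of task-graph traversal: nodes encode (agent, subtask) pairs,
# edges are labeled "Success"/"Failure", and execution runs entry -> submit.
# Both the plan dictionary and run_agent are illustrative stand-ins.

def run_agent(agent, subtask, state):
    # Placeholder: a real agent would prompt the LLM, act on the repository,
    # and return a report. Here every subtask is simply marked successful.
    state.setdefault("log", []).append((agent, subtask))
    return "Success"

def traverse(graph, state):
    node = "entry"
    while node != "submit":
        agent, subtask = graph[node]["task"]
        outcome = run_agent(agent, subtask, state)   # "Success" or "Failure"
        node = graph[node]["edges"][outcome]         # follow the labeled edge
    return state["log"]                              # a real run emits `git diff` here

plan = {
    "entry":     {"task": ("Manager", "select plan"),       "edges": {"Success": "reproduce", "Failure": "submit"}},
    "reproduce": {"task": ("Reproducer", "reproduce bug"),  "edges": {"Success": "localize",  "Failure": "localize"}},
    "localize":  {"task": ("Fault Localizer", "rank code"), "edges": {"Success": "edit",      "Failure": "edit"}},
    "edit":      {"task": ("Editor", "propose patch"),      "edges": {"Success": "verify",    "Failure": "verify"}},
    "verify":    {"task": ("Verifier", "run tests"),        "edges": {"Success": "submit",    "Failure": "edit"}},
}

print([agent for agent, _ in traverse(plan, {})])
```

Because the plan is a JSON-serializable graph rather than free-form LLM improvisation, the same loop deterministically replays any of the pre-defined plan topologies.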

2. Techniques and Algorithmic Components

Each CodeR agent leverages the ReAct paradigm—interleaving explicit reasoning and action—by repeatedly querying GPT-4 (gpt-4-1106-preview) via engineered prompts specifying capabilities and allowed actions. All inference is prompt-driven; no model retraining occurs.

A core algorithmic advance lies in the Fault Localizer, whose suspiciousness scoring function is:

\operatorname{Score}(F_i) = \lambda \cdot \operatorname{Score}_{\mathrm{Ochiai}}(F_i) + (1-\lambda)\cdot\operatorname{Score}_{\mathrm{BM25}}(F_i)

with

\operatorname{Score}_{\mathrm{BM25}}(F_i) = \frac{\operatorname{Relevance}_{\mathrm{BM25}}(F_i)}{\sum_{F_j}\operatorname{Relevance}_{\mathrm{BM25}}(F_j)}

Empirically, λ = 0.99 yields the highest top-k fault-localization precision, ensuring that test failures synthesized by the Reproducer agent effectively constrain edit localization. Combining spectrum-based (Ochiai) scores with lexical BM25 relevance pairs classic symbolic localization signals with the LLM-driven agents, improving robustness.
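The hybrid score can be sketched as follows. This is a minimal illustration, not the paper's implementation: the Ochiai formula is the standard SBFL definition, and the BM25 relevance values are toy stand-ins.

```python
import math

# Sketch of the hybrid suspiciousness score: a convex combination of
# spectrum-based Ochiai scores and sum-normalized BM25 relevance
# (lambda = 0.99, per the paper). Toy inputs; not the paper's code.

def ochiai(failed_cov, passed_cov, total_failed):
    """Standard Ochiai suspiciousness for one code unit."""
    denom = math.sqrt(total_failed * (failed_cov + passed_cov))
    return failed_cov / denom if denom else 0.0

def hybrid_scores(sbfl, bm25, lam=0.99):
    """Combine Ochiai scores with sum-normalized BM25 relevance per unit."""
    total = sum(bm25.values()) or 1.0
    return {u: lam * sbfl[u] + (1 - lam) * (bm25[u] / total) for u in sbfl}

# Toy example: unit "b" is covered by the single failing test, so it ranks first.
coverage = {"a": (0, 3), "b": (1, 1), "c": (0, 2)}  # unit -> (failed_cov, passed_cov)
sbfl = {u: ochiai(f, p, total_failed=1) for u, (f, p) in coverage.items()}
bm25 = {"a": 2.0, "b": 5.0, "c": 1.0}               # illustrative relevance scores
ranked = sorted(hybrid_scores(sbfl, bm25).items(), key=lambda kv: -kv[1])
print(ranked[0][0])  # -> b
```

With λ = 0.99 the SBFL signal dominates whenever coverage discriminates, while the BM25 term breaks ties among units the failing tests cannot separate.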

Other distinguishing characteristics:

  • Inference Hyperparameters: All agents use GPT-4 with nucleus sampling (top_p = 0.95, temperature = 0); cost is bounded at $8 per issue, and conversation history is truncated per role (e.g., last five turns for the Reproducer and Fault Localizer; full history for the Manager).
  • Patch Submission: When the Verifier agent confirms resolution (passing all tests), Manager issues a “submit” action, emitting a patch via “git diff.”
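The per-role history policy can be illustrated as follows. The helper and its policy table are a sketch mirroring the description above, not the paper's code.

```python
# Sketch of per-role history truncation: most agents keep only recent turns
# for cost control, while the Manager retains the full conversation.
# Role names follow the paper; the helper itself is illustrative.

HISTORY_WINDOW = {
    "Reproducer": 5,        # last five turns
    "Fault Localizer": 5,   # last five turns
    "Manager": None,        # full history
}

def truncate_history(turns, role):
    n = HISTORY_WINDOW.get(role, 5)
    return turns if n is None else turns[-n:]

turns = [f"turn-{i}" for i in range(8)]
print(len(truncate_history(turns, "Reproducer")))  # -> 5
print(len(truncate_history(turns, "Manager")))     # -> 8
```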

3. Empirical Evaluation and Design Insights

The primary benchmark is the 300-issue SWE-bench lite suite. CodeR achieves 28.33% resolution on first submission, with a unified reproduction rate of 27.33% (Chen et al., 2024). These results surpass contemporaneous automated agents:

System              Success Rate (%)
CodeR               28.33
Aider               ~26
SWE-agent + GPT-4   ~16
AutoCodeRover       19
Explicit RAG + LLM  <5

On average, 30.4 API requests and 299K tokens ($3.09) are consumed per issue.

Ablation studies reveal that removing the multi-agent/task-graph scaffold reduces performance (22%→10%), while disabling SBFL+BM25 in fault localization causes a drop from 22% to 14%. These findings highlight the essentiality of explicit workflow decomposition and hybrid symbolic/neural localization.

4. Generalist Code Retrieval: Architecture, Data, and Curriculum

A separate, state-of-the-art “CodeR” system addresses the challenge of general-purpose code retrieval at scale (Li et al., 19 May 2025). Distinct from the LLM-orchestrated pipeline above, this instantiation focuses on dense vector retrieval for diverse software engineering tasks.

  • Encoder Architecture: Two-armed dual-encoder based on Qwen-2.5-Coder-1.5B, fine-tuned via Low-Rank Adaptation (LoRA) inserted into self-attention projections (rank = 32, α = 64). The [EOS] token’s final hidden state serves as the embedding.
  • Training Objective: Standard InfoNCE contrastive loss with temperature τ = 0.02 pulls positive query–code pairs together and pushes negatives apart:

\mathcal{L} = -\log\frac{\exp(\mathrm{sim}(q,d^+)/\tau)}{\exp(\mathrm{sim}(q,d^+)/\tau) + \sum_{d\in D^-}\exp(\mathrm{sim}(q,d)/\tau)}

  • Synthetic Data Engine (CodeR-Pile): A 2.9M-example dataset synthesized under the DRU (Diversity, Reliability, Usability) principle. Coverage spans 47 retrieval tasks (text2code, code2text, code2code, hybrid), 20 programming languages, and two natural languages. Negatives are mined to enforce nontrivial discriminative power.
  • Annealing Curriculum: A three-stage, data-driven curriculum:

    1. “Warm-up”: text-only matching (e.g., MS MARCO)
    2. “Intensive”: mixture of text, existing code, and synthetic code retrieval data
    3. “Cool-down”: code-only data, filtered for medium/hard positives via E5-base and GPT-4o-mini labels

This schedule yields superior transfer of semantic knowledge and robust specialization for code tasks.
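The InfoNCE objective above can be sketched with toy embeddings as follows. The cosine similarity and the vectors are illustrative assumptions, not real model outputs.

```python
import math

# Sketch of the InfoNCE loss with temperature tau = 0.02, as in the paper.
# sim() is cosine similarity; the toy vectors below stand in for [EOS]
# embeddings from the dual encoder.

def sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(q, pos, negs, tau=0.02):
    """-log( exp(sim(q,d+)/tau) / (exp(sim(q,d+)/tau) + sum over d- of exp(sim(q,d-)/tau)) )"""
    s_pos = math.exp(sim(q, pos) / tau)
    s_neg = sum(math.exp(sim(q, d) / tau) for d in negs)
    return -math.log(s_pos / (s_pos + s_neg))

q = [1.0, 0.0]
pos = [0.9, 0.1]                   # close to the query -> near-zero loss
negs = [[0.0, 1.0], [-1.0, 0.2]]   # far from the query
print(info_nce(q, pos, negs) < info_nce(q, negs[0], [pos]))  # -> True
```

The low temperature (τ = 0.02) sharpens the softmax, so even modest similarity gaps between the positive and the mined negatives translate into strong gradients.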

5. Comparative Performance and Empirical Outcomes

Evaluation encompasses both in-domain (CoIR/CoIR-filter; 10 datasets, 8 retrieval tasks) and out-of-domain (CodeRAG; tasks for retrieval-augmented code generation) benchmarks. CodeR-1.5B achieves:

  • CoIR (NDCG@10, full): 81.77 (vs. prior best 78.53)
  • CoIR-filter: 93.00 (vs. 88.73)
  • CodeRAG (OOD): 72.8 (vs. best open-source ≈67.0)

Ablations confirm that full-stage Annealing outperforms naïve data mixing, that text exposure is essential for transfer, and that aggressive hard-negative mining improves generalization (Li et al., 19 May 2025).

Key performance enablers include (a) breadth and quality of CodeR-Pile, (b) a curriculum that anneals from semantic matching to code-specific discrimination, and (c) multi-stage negative filtering.
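One common way to realize negative filtering is a similarity band: candidates far from the query are trivial, while candidates too close risk being false negatives. The sketch below illustrates this idea; the thresholds and scorer are assumptions for illustration, not the paper's values.

```python
# Sketch of hard-negative mining via a similarity band. Precomputed
# (document, score) pairs stand in for retrieval scores from a model
# such as E5-base; thresholds are illustrative, not the paper's.

def mine_hard_negatives(scored, lo=0.3, hi=0.8):
    """Keep candidates whose retrieval score falls in the [lo, hi) band."""
    return [doc for doc, s in scored if lo <= s < hi]

candidates = [("trivial", 0.05), ("hard-1", 0.45), ("hard-2", 0.70), ("near-dup", 0.95)]
print(mine_hard_negatives(candidates))  # -> ['hard-1', 'hard-2']
```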

6. Limitations and Prospective Developments

The multi-agent CodeR (Chen et al., 2024) is currently constrained to four static plan topologies, Python-only projects, and requires self-contained tests. Issues demanding dynamic plan composition or non-test-based verification are not addressed. Future prospects include expansion of the plan library, human-in-the-loop plan synthesis, and enhancement of fault localization for broader codebase contexts.

The code retrieval CodeR (Li et al., 19 May 2025) is limited by the annotation quality of its synthetic data (negative-label accuracy ranges from 38% to 73% depending on the labeling LLM), the lack of real-world retrieval annotations at this scale, and unknown long-term coverage of new programming paradigms or languages.

Both research programs suggest trajectories toward more robust, generalizable, and fully autonomous systems for software engineering problems via explicit workflow modeling and curriculum-driven pretraining.
