Repository-Level Code Generation
- Repository-level code generation is the automated synthesis and modification of multi-file codebases, ensuring that new features and cross-file dependencies are consistently integrated.
- Benchmarks like FEA-Bench, RepoExec, and SolEval use metrics such as pass@k and Dependency Invocation Rate to evaluate the performance of automated multi-file modifications.
- Recent innovations—including graph-based context modeling, AST-guided memory, and tool-integrated decoding—improve coherence and functionality, yet challenges in maintaining repository-wide consistency remain.
Repository-level code generation refers to the automatic synthesis or modification of code across a multi-file codebase, targeting the implementation of new features, completion of unfinished functions, or the correction of multi-file dependencies and issues. This task differs fundamentally from file-level or snippet-level generation by demanding that models coordinate changes across components, respect repository-wide invariants, and integrate new elements in a manner consistent with the broader software system. Current research highlights repository-level feature implementation, cross-file dependency resolution, multi-language support, and real-world incremental development as key challenges for automated code generation frameworks and LLMs.
1. Formal Definition and Task Scope
Repository-level code generation requires LLMs to perform non-local reasoning: adding new functions, classes, or modules while simultaneously updating related files such as imports, call sites, tests, and documentation. This contrasts with file-level synthesis tasks (e.g., HumanEval, MBPP), where models only fill isolated blanks or generate single functions (Li et al., 9 Mar 2025).
Models must:
- Manipulate multiple files simultaneously (adding/editing).
- Preserve cross-file dependencies and invariants (naming, APIs, type hierarchies).
- Integrate new features with existing test suites, ensuring verifiable correctness.
- Respect real-world constraints, such as security (in smart contracts), resource usage, and system-wide semantic consistency (Peng et al., 26 Feb 2025).
The input for evaluation typically includes:
- A repository snapshot (all source files).
- A natural language feature request or specification.
- Documentation and signatures of new components (sometimes supplied as hints).
- Associated unit tests or acceptance criteria.
The expected output is a set of diffs or patches which, when applied, produce a repository state that satisfies the feature requirements and passes all relevant tests.
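Concretely, a benchmark instance can be viewed as a small structured record. The following is a minimal, hypothetical schema in Python; the field names are illustrative and not taken from any specific benchmark:

```python
from dataclasses import dataclass, field

@dataclass
class RepoTaskInstance:
    """One repository-level generation task (illustrative schema)."""
    repo_snapshot: str                  # path or commit hash of the frozen codebase
    feature_request: str                # natural-language specification
    component_hints: list[str] = field(default_factory=list)  # optional signatures/docs of new components
    test_commands: list[str] = field(default_factory=list)    # commands whose exit codes define success

@dataclass
class ModelSubmission:
    """Expected model output: unified diffs to apply on top of the snapshot."""
    patches: list[str]                  # one unified-diff string per modified or added file
```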
2. Benchmarking and Dataset Construction
Recent benchmarks formalize the evaluation of repository-level code generation:
- FEA-Bench focuses on incremental feature implementation, extracting pull requests from 83 widely-used repositories and filtering by rule-based and intent-based criteria. Each benchmark task couples code changes with unit tests to enable execution-based verification, setting a new standard for measuring repository-scale LLM performance on feature growth (Li et al., 9 Mar 2025). Experimental results show that state-of-the-art LLMs such as DeepSeek-R1 achieve only 9.92% SuccessRate (unit test pass rate), with most models under 6%.
- RepoExec emphasizes executable code with functional correctness and dependency invocation. Each sample provides a function skeleton and a controlled prompt encoding all necessary imports and in-file or cross-file dependencies. Dependency Invocation Rate (DIR) is introduced as a metric to quantify the fraction of supplied dependencies utilized in generated code, capturing context reuse rather than only functional correctness (Hai et al., 2024).
- SolEval provides a repository-level benchmark for Solidity smart contracts, incorporating correctness (pass@k), gas fee analysis, and vulnerability rates via static analysis. The highest-performing model (DeepSeek-V3, 671B params) reaches only 26.29% Pass@10, with security and gas efficiency further constraining practical utility (Peng et al., 26 Feb 2025).
- MRG-Bench offers multi-language evaluation (Python, Java, Go), with project-level runnable tests and detailed failure categorization (distinguishing "what to do" vs "how to do"). Pass@1 rates remain below 40% for all methods, and the majority of errors stem from requirement misunderstanding rather than implementation mistakes (Li, 5 Aug 2025).
Benchmark construction methodologies commonly rely on:
- Extraction of real-world code from popular repositories with extensive pull requests and tests.
- Automated parsing and human annotation to ensure high-quality and meaningful contexts.
- Execution-verified instances—only those samples where gold patches pass all relevant tests are retained for evaluation; a filtering sketch follows this list.
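A minimal sketch of the execution-verification step, assuming a git-managed snapshot and a pytest-style test command (the helper name, paths, and commands are illustrative):

```python
import subprocess

def is_execution_verified(repo_dir: str, gold_patch: str, test_cmd: list[str]) -> bool:
    """Keep a candidate benchmark instance only if its gold patch applies
    cleanly and the paired tests pass afterwards."""
    # Apply the gold patch on top of the repository snapshot (read from stdin).
    apply = subprocess.run(
        ["git", "apply", "-"], input=gold_patch, text=True, cwd=repo_dir
    )
    if apply.returncode != 0:
        return False
    # Run the paired tests, e.g. ["pytest", "tests/test_new_feature.py", "-q"].
    tests = subprocess.run(test_cmd, cwd=repo_dir)
    return tests.returncode == 0
```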
3. Retrieval, Context Construction, and Prompt Engineering
Precisely curating the context presented to an LLM is central to repository-level generation:
- Context Retrieval: Methods range from BM25 and embedding-based approaches to graph or knowledge-graph-based retrieval. For instance, GraphCoder uses coarse-to-fine retrieval over a Code Context Graph (CCG), capturing control-, data-, and control-dependence between statements and ranking candidates by subgraph edit distance (Liu et al., 2024). SaraCoder introduces Hierarchical Feature Optimization—semantic alignment, redundancy pruning, structural proximity, and diversity-aware reranking—to refine the retrieved set (Chen et al., 13 Aug 2025).
- Graph-based Context Modeling: Multiple approaches represent the codebase as a graph—be it a Repository Structural Semantic Graph (RSSG), knowledge graph, or dual static-dynamic semantic graph—with nodes for entities and edges for relationships (calls, imports, inheritance). Such representations enable the retrieval of contextually relevant subgraphs, facilitating multi-view prompt construction and cross-file coherence (Zhang et al., 10 Nov 2025, Athale et al., 20 May 2025, Liu et al., 20 Jul 2025); a minimal retrieval sketch follows this list.
- Bidirectional Inlining: InlineCoder reframes repository-level synthesis as a function-local coding task, inlining initial draft implementations into their call graph (upstream/caller and downstream/callee contexts), supported by perplexity-based confidence estimation to guide further edits (Hu et al., 1 Jan 2026).
- Type Context Extraction: CatCoder leverages static analyzers and type servers to augment code snippets with precise API knowledge and one-hop dependency graphs for statically typed languages, empirically improving pass@k metrics (Pan et al., 2024).
- Memory and Iterative Session Management: CodeMEM introduces AST-guided adaptive memory, preserving and updating repository context across multi-turn interactions, explicitly filtering and linking memory blocks to mitigate forgetting and cognitive overload in iterative developer workflows (Wang et al., 6 Jan 2026).
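To make the graph-based, multi-view retrieval idea concrete, the sketch below builds a simplified call graph over a set of Python functions with the standard ast module and gathers context from two views: structural neighbours of the target function and lexically similar functions. It is a minimal illustration of the general pattern, not the retrieval algorithm of GraphCoder, RepoScope, or any other system named above.

```python
import ast
from difflib import SequenceMatcher

def build_call_graph(functions: dict[str, str]) -> dict[str, set[str]]:
    """Map each function name to the set of other known functions it calls."""
    graph = {}
    for name, source in functions.items():
        calls = set()
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                if node.func.id in functions and node.func.id != name:
                    calls.add(node.func.id)
        graph[name] = calls
    return graph

def retrieve_context(target: str, functions: dict[str, str], top_k: int = 3) -> list[str]:
    """Two-view retrieval: call-graph neighbours first, then lexically similar code."""
    graph = build_call_graph(functions)
    # View 1: structural neighbours (callees of the target and callers of the target).
    neighbours = graph[target] | {n for n, callees in graph.items() if target in callees}
    # View 2: lexical similarity over the remaining functions.
    rest = [n for n in functions if n != target and n not in neighbours]
    rest.sort(key=lambda n: SequenceMatcher(None, functions[target], functions[n]).ratio(),
              reverse=True)
    return list(neighbours) + rest[:top_k]
```

In practice the graph would also carry import, inheritance, and data-flow edges, and candidates would be ranked jointly, e.g., by subgraph edit distance as in GraphCoder.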
4. Evaluation Metrics and Model Performance
Benchmarking repository-level generation frameworks involves several metrics:
- SuccessRate / Pass@k: The fraction of tasks (or sampled generations) where all paired unit tests pass after applying the model’s output. FEA-Bench and SolEval formalize this as the primary metric for feature development and smart contract synthesis (Li et al., 9 Mar 2025, Peng et al., 26 Feb 2025); a computation sketch follows this list.
- Dependency Invocation Rate (DIR): Quantifies the reuse of prompt-provided cross-file dependencies (Hai et al., 2024).
- Patch Application Rate: Measures the syntactic validity and git-applyability of generated diffs.
- Compilation Success (Compile@k): Used in Solidity and Java evaluation; the fraction of the n generated samples that compile.
- Gas Fee and Vulnerability Rate: In SolEval, additional metrics include contract gas cost and static vulnerability alerts (Peng et al., 26 Feb 2025).
- Edit Similarity, Identifier EM, F1: String-based or identifier-based exact match and similarity scores in code completion and synthesis tasks.
- Instruction and Conversation Accuracy: For iterative workflows, CodeMEM tracks instruction following and session-level forgetting (Wang et al., 6 Jan 2026).
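The sketch below implements the standard unbiased pass@k estimator together with a simplified Dependency Invocation Rate. The DIR formulation here is an assumption derived from the description above (the fraction of supplied dependency identifiers that are invoked in the generated code) and is not necessarily the exact definition used by RepoExec:

```python
import math
import re

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    from n generations (of which c pass) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def dependency_invocation_rate(generated_code: str, provided_deps: list[str]) -> float:
    """Simplified DIR: fraction of provided dependency identifiers that are
    actually invoked (appear followed by '(') in the generated code."""
    if not provided_deps:
        return 0.0
    used = sum(
        1 for dep in provided_deps
        if re.search(rf"\b{re.escape(dep)}\s*\(", generated_code)
    )
    return used / len(provided_deps)
```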
Across all benchmarks, repository-level tasks remain drastically more challenging than file-level or bug-fix tasks, with top-performing models struggling to surpass 10–30% pass rates. Larger context windows do not predict better performance, and precise retrieval of relevant snippets or subgraphs consistently yields higher success than brute-force context expansion (Li et al., 9 Mar 2025, Chen et al., 13 Aug 2025, Liu et al., 20 Jul 2025).
5. Challenges and Limitations
Key empirical findings and challenges from recent results:
- LLMs underperform on repository-level tasks due to difficulties in:
- Coordinating cross-file and cross-component edits (especially for new features spanning multiple files).
- Maintaining semantic consistency and invariants when adding multiple functions or modules (FEA-Bench: SuccessRate drops from ~19% for adding one new function to ~5% for ≥3).
- Generating syntactically valid patches and coherent diffs.
- Discerning truly relevant files in large codebases—retrieval precision is favored over sheer context length.
- Long context windows yield limited gains, often worsening performance by introducing noise (Li et al., 9 Mar 2025, Li, 5 Aug 2025).
- Instruct-tuned models improve dependency reuse but sometimes hallucinate complex, unnecessary structures, indicating ongoing trade-offs in prompt fidelity (Hai et al., 2024).
- Models fail more often from misunderstanding requirements ("what to do") than from implementation errors ("how to do"); the majority of failures stem from poor intent comprehension (Li, 5 Aug 2025).
6. Advances in Model Architectures and Algorithms
Recent algorithmic innovations include:
- Reinforcement Learning for Retrieval: RLCoder trains a retriever in a pipeline with a generator, using code-generation perplexity as a reward and a stop signal mechanism to filter irrelevant candidates, yielding 12.2% EM improvements over baselines on CrossCodeEval (Wang et al., 2024); the perplexity-based reward is sketched after this list.
- Constraint Satisfaction and Knowledge Graphs: SemanticForge merges dual static and dynamic semantic graphs and integrates real-time constraint verification (via SMT-solving) into beam search, pruning logical and schematic hallucinations, with a >7% gain on Pass@1 and >50% reduction in architectural errors over comparable systems (Zhang et al., 10 Nov 2025).
- Systematic Multi-View Context Integration: RepoScope builds a comprehensive RSSG and retrieves four distinct types of contextual signals—structural neighbors (callers), predicted call chains (callees), lexically similar functions, and similar file fragments—improving pass@1 by up to 36% and demonstrating the need for structurally coherent, multi-perspective prompt construction (Liu et al., 20 Jul 2025).
- Tool-Integrated Decoding: ToolGen interleaves standard LLM decoding with autocompletion tool invocations, leading to marked improvements in dependency coverage (+31.4% to 39.1%) and static validity rate (+44.9% to 57.7%), and maintaining competitiveness in BLEU and CodeBLEU (Wang et al., 2024); a simplified decoding loop is sketched after this list.
- Instruction-Aware Memory and Analyzer Integration: CodeMEM employs AST-driven selectors and detectors to adaptively manage repository context and iterative developer sessions, improving instruction accuracy by 12.2% and reducing interaction complexity (Wang et al., 6 Jan 2026).
- Surveyed Paradigms: Retrieval-augmented code generation is systematically categorized by retrieval modalities (lexical, dense, graph-based, hybrid), generation strategies (direct conditioning, fusion-in-decoder, reranking), model architectures, and training paradigms (supervised, RL, agentic) (Tao et al., 6 Oct 2025).
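The reward signal in retrieval-trained pipelines such as RLCoder can be illustrated by a perplexity comparison: a retrieved candidate is rewarded if conditioning on it lowers the perplexity of the ground-truth code. The sketch below assumes a Hugging Face causal code LM (the model name is an illustrative choice) and shows only the reward computation, not RLCoder's actual training loop:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigcode/starcoderbase-1b"   # illustrative choice of code LM
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name)

@torch.no_grad()
def code_perplexity(context: str, target_code: str) -> float:
    """Perplexity of the ground-truth code given a candidate retrieved context."""
    ids = tok(context + target_code, return_tensors="pt").input_ids
    # Approximate the context/target boundary by tokenizing the context alone.
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    labels = ids.clone()
    labels[:, :ctx_len] = -100            # score only the target code tokens
    loss = lm(ids, labels=labels).loss    # mean negative log-likelihood
    return float(torch.exp(loss))

def retrieval_reward(candidate_ctx: str, base_ctx: str, target_code: str) -> float:
    """Positive reward when the candidate lowers perplexity relative to no retrieval."""
    return code_perplexity(base_ctx, target_code) - code_perplexity(candidate_ctx + base_ctx, target_code)
```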
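Tool-integrated decoding can likewise be pictured as an interleaving loop: the model decodes until it emits a trigger (here, a member-access dot), at which point an autocompletion tool supplies a repository-consistent identifier. The interfaces next_token and autocomplete below are hypothetical placeholders rather than ToolGen's actual API:

```python
from typing import Callable

def tool_integrated_decode(
    prompt: str,
    next_token: Callable[[str], str],          # hypothetical: LLM proposes the next token
    autocomplete: Callable[[str], list[str]],  # hypothetical: IDE-style completion tool
    max_tokens: int = 256,
) -> str:
    """Interleave LLM decoding with autocompletion-tool calls at trigger points."""
    code = prompt
    for _ in range(max_tokens):
        token = next_token(code)
        if token == "<eos>":
            break
        code += token
        # Trigger: after a member access, let the tool supply a valid identifier,
        # keeping generated dependencies consistent with the repository.
        if token.endswith("."):
            suggestions = autocomplete(code)
            if suggestions:
                code += suggestions[0]
    return code
```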
7. Implications and Future Directions
Key avenues for advancing repository-level code generation:
- Enhanced context selection and filtering: Models must prioritize only the most pertinent files, leveraging static/dynamic analysis and intelligent context pruning (Li et al., 9 Mar 2025, Liu et al., 20 Jul 2025, Zhang et al., 10 Nov 2025).
- Structured edit representation: Combining AST-based diffs and natural-language planning for better patch coherence.
- Multi-stage pipelines: Employing draft, review, and debug cycles, with test-driven prompting and feedback integration (Li et al., 9 Mar 2025).
- Curriculum learning: Training LLMs on simpler incremental growth tasks prior to full repository changes.
- Gas-aware and security-aware prompting: For domains such as smart contract synthesis, incorporating performance and safety metrics in model training and evaluation (Peng et al., 26 Feb 2025).
- Cross-language and multi-modal benchmarks: Expansion to more languages, hardware design (RTL/Verilog), and code modalities (e.g., documentation, configuration files) (Li et al., 25 Feb 2025, Li, 5 Aug 2025).
- Agentic workflows and memory architectures: Enabling models to manage, adapt, and track repository context over long sessions and iterative development (Wang et al., 6 Jan 2026, Tao et al., 6 Oct 2025).
- Integration with developer tools and IDEs: Bridging research progress with practical coding assistants and CI/CD environments (Tao et al., 6 Oct 2025).
Repository-level code generation remains an active area of research, with current LLMs demonstrating considerable headroom for improvement on benchmarks such as FEA-Bench, RepoExec, SolEval, GraphCoder, and DeepCircuitX. Progress in context modeling, constraint integration, learning paradigms, and benchmarking will be central to achieving practical, robust automated software engineering at the scale of real-world repositories.