NL2Repo Bench: Repository-Scale Code Generation
- NL2Repo Bench is a benchmark that measures LLMs' ability to generate full-scale code repositories from natural language, emphasizing multi-module coherence and dependency management.
- It integrates rigorous evaluation protocols, including test pass rates, structural consistency metrics, and cross-file architectural verification.
- Empirical results highlight challenges in long-horizon planning and multi-file synthesis, demonstrating current LLM limitations in real-world software engineering tasks.
Natural Language to Code Repository Benchmarks (NL2Repo Bench)
Natural Language to Code Repository (NL2Repo) benchmarks constitute a class of evaluations measuring the ability of LLMs or autonomous coding agents to translate natural language specifications or tasks into complete, multi-module, real-world code repositories. Unlike function-level or isolated code synthesis, NL2Repo benchmarks require architectural reasoning, dependency management, sustained planning, and robust cross-file coherence, reflecting the challenges encountered in actual software engineering projects. The field has recently diversified into several rigorous benchmarks, each probing different aspects and modalities of repository-scale code understanding and generation.
1. Scope and Formal Task Definitions
The NL2Repo paradigm extends beyond single-file code synthesis to the generation, modification, or comprehension of full-scale repositories from natural language descriptions, bug reports, or diff-style specifications. Benchmarks may target end-to-end repository creation, repair, feature addition, secure code patching, or long-context navigation.
For example, in "NL2Repo-Bench" (Ding et al., 14 Dec 2025), the core task requires an agent to transform an ∼18,800-token natural language requirements specification—with project description, dependency list, API guide, and example usage—into a fully installable Python repository. Only the specification and an empty workspace are provided. Performance is validated exclusively by executing the original project's pytest suite in a reconstructed Docker environment, ensuring strict functional equivalence.
Similarly, "SecRepoBench" (Dilgren et al., 29 Apr 2025) tasks are defined as tuples , where is a natural language intent for a masked code region in a C/C++ repository , is the relevant unit tests, an OSS-Fuzz input triggering a vulnerability, and the ground-truth security patch. The LLM must generate a candidate that, when substituted into , passes all of and is robust to .
Other NL2Repo benchmarks (e.g., "CodeS"/SketchEval (Zan et al., 25 Mar 2024), "CoreCodeBench" (Fu et al., 4 Jul 2025), "NoCode-bench" (Deng et al., 24 Jul 2025)) specialize the task to feature addition, multi-scenario engineering workflows, or structured repository sketching, varying language, domain, and difficulty.
2. Benchmark Construction and Dataset Statistics
NL2Repo benchmarks derive their rigor from carefully curated datasets drawn from real-world repositories, often filtered and annotated to ensure task validity, diversity, and challenge:
- NL2Repo-Bench (Ding et al., 14 Dec 2025): 104 Python library reproduction tasks from categories including System Tools, Data Processing, Testing Frameworks, Networking, and Machine Learning. Task sizes range widely (300–120,000 LOC), with difficulty partitioned into Easy, Medium, and Hard. Each task is accompanied by a complete requirements document and upstream test suite.
- SecRepoBench (Dilgren et al., 29 Apr 2025): 318 secure code-completion tasks from 27 mature C/C++ repositories. Vulnerabilities span 15 CWEs (e.g., CWE 122 heap-based buffer overflow accounts for 54.4% of tasks). Each sample pairs an NL intent, repo context, unit tests, a reproducing fuzzer input, and a validated patch.
- SketchEval (Zan et al., 25 Mar 2024): A repository-oriented benchmark evaluating end-to-end synthesis on 19 reference Python projects (total 194 files, 37,100 LOC) grouped by complexity.
- CoreCodeBench (Fu et al., 4 Jul 2025): Automatically generated atomic and composite questions from large-scale repositories (size > 5 KLOC, high test coverage) covering scenarios such as development, bugfix, TDD, and complex multi-function aggregation.
- NoCode-bench (Deng et al., 24 Jul 2025): 634 feature-addition tasks in 10 open-source Python projects. Each instance couples a documentation diff (NL spec) to a set of code and test changes, with a human-verified, high-quality subset ("NoCode-bench Verified") for precise evaluation.
3. Evaluation Protocols and Metrics
NL2Repo benchmarks employ repository-level, execution-based, and structural metrics that move beyond line- or function-level correctness:
- Test Pass Rate: In NL2Repo-Bench, the primary metrics are Module-level PassRate (mean fraction of test cases passed per module) and Full-repository SuccessRate (fraction of tasks where all tests pass); a minimal aggregation sketch appears below this list. SecRepoBench uses pass@1 (unit-test success) and secure-pass@1 (unit-test plus security/non-crash criteria). CoreCodeBench includes both Pass@1 and PassRate, with additional BLEU and LineCoverage metrics for generated tests.
- Structural Consistency: SketchEval introduces "SketchBLEU," aggregating n-gram, structure, and dataflow matches over concatenated repository code, directory tree, and function-level ASTs.
- Patch Application and Regression: NoCode-bench tracks Success%, Applied% (clean patch application), regression test (RT%), feature-validation rates (micro/macro), and File-matched rate (cross-file accuracy).
- Early Termination and Planning: NL2Repo-Bench introduces EarlyStopRate (premature finish action) and NonFinishRate (passive failure) to capture agentic failure modes stemming from inadequate long-horizon planning.
These metrics enforce end-to-end functional, architectural, and cross-file correctness, often requiring agents to demonstrate full-stack reasoning and robust packaging.
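As a minimal illustration of how the two NL2Repo-Bench-style aggregates differ, the sketch below computes Module-level PassRate and Full-repository SuccessRate from per-module test outcomes; the input format is assumed for exposition.

```python
# Per task: {module_name: (tests_passed, tests_total)}; format assumed for illustration.
TaskResult = dict[str, tuple[int, int]]

def module_pass_rate(task: TaskResult) -> float:
    """Mean fraction of test cases passed per module within one task."""
    fractions = [p / t for p, t in task.values() if t > 0]
    return sum(fractions) / len(fractions) if fractions else 0.0

def benchmark_metrics(results: list[TaskResult]) -> dict[str, float]:
    """Aggregate over tasks: average module-level PassRate, and the stricter
    full-repository SuccessRate (every test in every module must pass)."""
    pass_rate = sum(module_pass_rate(r) for r in results) / len(results)
    success_rate = sum(
        all(t > 0 and p == t for p, t in r.values()) for r in results
    ) / len(results)
    return {"PassRate": pass_rate, "SuccessRate": success_rate}
```

The gap between the two aggregates (e.g., roughly 40 % PassRate versus ≲5 % full success in NL2Repo-Bench) reflects how PassRate accumulates credit for partially working modules, whereas SuccessRate demands end-to-end correctness.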
4. Agent Architectures, Pipelines, and Prompting
State-of-the-art NL2Repo evaluations rely on agentic frameworks capable of sustained architectural reasoning, tool invocation, and iterative self-verification:
- Tool-Oriented Agents: OpenHands CodeAct and Cursor-CLI (used in NL2Repo-Bench) provide primitives for file editing, execution, test running, browser lookup, and explicit task tracking, aiming to reflect real software development loops ("edit–test" cycles; a simplified loop is sketched after this list).
- Prompt Engineering: Extensive experiments in SecRepoBench and CoreCodeBench show that security-generic, security-specific, and policy-based reminders affect outcomes, but are dramatically less effective at the repository level than in smaller-scale benchmarks (+19 pp secure-pass@1 on synthetic benchmarks vs. +1.6 pp on SecRepoBench for the best prompt type).
- Multi-layer Sketching: CodeS/SketchEval decomposes repository generation into RepoSketcher (directory and import generation), FileSketcher (file-level skeletons), and SketchFiller (function-level filling), with each layer evaluated independently and in aggregate, and supervised fine-tuning delivering substantial performance improvements over vanilla prompting.
- Difficulty Control: CorePipe (for CoreCodeBench) exposes hyperparameters for scenario type (Dev, BugFix, TDD), call-graph depth, atomic subproblem count, and IG-filtering, supporting fine-grained benchmarking of LLM scaling and scenario variation.
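The edit–test cycle underlying these tool-oriented agents can be reduced to the following schematic loop; the propose_edits stub stands in for an LLM call, and the stopping policy and feedback truncation are simplified assumptions, not the OpenHands or Cursor-CLI interfaces.

```python
import subprocess
from pathlib import Path

def propose_edits(spec: str, feedback: str) -> dict[str, str]:
    """Hypothetical stand-in for an LLM call that maps (specification,
    latest test feedback) to a set of {relative_path: file_contents} edits."""
    raise NotImplementedError("LLM-backed edit proposal goes here")

def edit_test_loop(spec: str, workspace: Path, max_steps: int = 50) -> bool:
    feedback = "empty workspace"
    for _ in range(max_steps):
        # 1. Edit: write the proposed file contents into the workspace.
        for rel_path, content in propose_edits(spec, feedback).items():
            target = workspace / rel_path
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(content)
        # 2. Test: run the suite and feed the (truncated) log back to the model.
        run = subprocess.run(
            ["pytest", "-q", "--tb=short"], cwd=workspace,
            capture_output=True, text=True,
        )
        if run.returncode == 0:
            return True                    # all tests pass: legitimate finish
        feedback = run.stdout[-4000:]      # crude context-budget control
    return False                           # budget exhausted (NonFinish-style failure)
```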
5. Experimental Results and Failure Analysis
Across all NL2Repo benchmarks, current LLMs and coding agents exhibit limited full-repository competence, especially in hard or long-horizon tasks:
| Benchmark | Top Model(s) | Primary Metric (best score) | Full Success Rate | Representative Failure Modes |
|---|---|---|---|---|
| NL2Repo-Bench (Ding et al., 14 Dec 2025) | Claude-Sonnet-4.5 | 40.2 % PassRate | ≲5 % (Pass@1) | Overconfidence, passive stalling, fragile imports, loss of coherence, planning-execution mismatch |
| SecRepoBench (Dilgren et al., 29 Apr 2025) | Claude 3.7 | 28.0 % secure-pass@1 | — | Hallucinated identifiers, undeclared names, missing checks, incorrect logic, uninitialized reads |
| SketchEval (Zan et al., 25 Mar 2024) | CodeS SFT | 63.3 % SketchBLEU | — | Cascading errors in structure, cross-file references |
| CoreCodeBench (Fu et al., 4 Jul 2025) | Claude-3.7 | Single-function: ~55 % AC@1 | Multi-function: 5–20 % | Planning collapse, high BugFix difficulty, fragile multi-function composition |
| NoCode-bench (Deng et al., 24 Jul 2025) | Claude-4-Sonnet | 15.79 % Success (verified) | 6.15–6.94 % (full) | Cross-file edit failure, codebase misunderstanding, tool invocation breakdown |
Empirical analyses reveal that easy, single-file, or single-function tasks are tractable, but accuracy degrades sharply for multi-file, deep-composition, or cross-scenario cases. For instance, in NoCode-bench Verified, single-file tasks see ≈25 % success, while multi-file tasks drop to ≤7.6 %. On NL2Repo-Bench, only five full-repository completions were observed across all models and tasks. Even "agentic" self-repair, effective in smaller benchmarks, fails on repository-scale tasks (0/43 successes in SecRepoBench).
Common failure causes include premature action (hallucinated verification), incomplete code (omitted imports, misfiled logic), insufficient planning (low explicit task tracking), context overflow, and brittle error handling in tool-based environments.
6. Benchmark Best Practices and Future Directions
NL2Repo benchmarks have surfaced several methodological insights and open problems:
- Context Integration: Full-repository context, including build scripts and test pipelines, is essential to faithfully benchmark functional and security-related behaviors. Real-world context (e.g., fuzzing inputs for security, packaging scripts for installability) is required to expose LLM limitations (Dilgren et al., 29 Apr 2025).
- Compositional Task Design: Decomposition into layered sketches or compositional atomic/subtasks is critical to isolate failure points and reduce cascading errors (Zan et al., 25 Mar 2024, Fu et al., 4 Jul 2025).
- Supervised Fine-Tuning: Instruction tuning on multi-layer, repository-scale data provides meaningful performance improvements over prompt-only approaches.
- Cross-file and Architectural Reasoning: Effective retrieval of cross-file dependencies, architectural artifacts, and informed planning (e.g., through task-tracking tools) remains a major bottleneck (Ding et al., 14 Dec 2025).
- Evaluation Harnesses: Automated, execution-based assessment (test suites, code coverage), structural matching (SketchBLEU), and scenario coverage (multi-scenario YAML configs in CoreCodeBench) enable rigorous and reproducible evaluation.
Key research directions identified include scaling agentic frameworks for persistent state and subtask decomposition, integrating self-verification and symbolic analyzers, extending to more programming languages, and automating test or spec generation to reduce manual curation costs. NL2Repo benchmarks collectively define an objective foundation for the next generation of robust, context-aware, and autonomous code generation systems (Zan et al., 25 Mar 2024, Dilgren et al., 29 Apr 2025, Fu et al., 4 Jul 2025, Deng et al., 24 Jul 2025, Ding et al., 14 Dec 2025).
7. Relation to Long-Context Code Understanding
NL2Repo benchmarking is closely related to long-context code comprehension, as exemplified by RepoQA (Liu et al., 10 Jun 2024). RepoQA evaluates LLMs' ability to semantically retrieve functions from large, multi-language repositories using natural language descriptions, with context sizes up to 16 K tokens. While not a generative benchmark per se, RepoQA specifically highlights:
- The necessity of deep semantic matching (purpose, input, output, procedure) over pure pattern retrieval.
- Sensitivity to code context (e.g., presence/absence of comments), revealing that, in many settings, comments distract rather than help LLMs.
- Language disparity in model performance (TypeScript/Java easier than Python/C++/Rust), mirroring uneven training distributions.
A plausible implication is that fully robust NL2Repo generation will require not only synthesis capabilities but also highly scalable, cross-language code understanding grounded in deep semantic representations.
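For concreteness, a RepoQA-style "retrieve the described function" check can be approximated as follows; the prompt wording, fuzzy-matching rule, and 0.8 threshold are illustrative assumptions rather than RepoQA's published evaluation code.

```python
from difflib import SequenceMatcher

def build_retrieval_prompt(repo_context: str, description: str) -> str:
    """Long-context prompt: the model must copy out the one function in
    `repo_context` that matches the natural language description."""
    return (
        f"{repo_context}\n\n"
        "Find the single function above that matches this description and "
        f"reproduce it verbatim:\n{description}\n"
    )

def is_retrieved(model_output: str, gold_function: str, threshold: float = 0.8) -> bool:
    """Fuzzy match between the model's answer and the ground-truth function
    body; the 0.8 threshold is an assumption for illustration."""
    ratio = SequenceMatcher(None, model_output.strip(), gold_function.strip()).ratio()
    return ratio >= threshold
```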
References:
- "NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents" (Ding et al., 14 Dec 2025)
- "SecRepoBench: Benchmarking LLMs for Secure Code Generation in Real-World Repositories" (Dilgren et al., 29 Apr 2025)
- "CodeS: Natural Language to Code Repository via Multi-Layer Sketch" (Zan et al., 25 Mar 2024)
- "CoreCodeBench: A Configurable Multi-Scenario Repository-Level Benchmark" (Fu et al., 4 Jul 2025)
- "NoCode-bench: A Benchmark for Evaluating Natural Language-Driven Feature Addition" (Deng et al., 24 Jul 2025)
- "RepoQA: Evaluating Long Context Code Understanding" (Liu et al., 10 Jun 2024)