
RepoCraft Benchmark: Evaluation of Code Generation

Updated 25 September 2025
  • RepoCraft Benchmark is a comprehensive evaluation suite that uses a persistent, dual-level Repository Planning Graph (RPG) to bridge high-level specifications with low-level code structures.
  • It employs a three-stage generation framework via ZeroRepo, incorporating proposal, refinement, and graph-guided code generation to ensure coherent, dependency-aware repository construction.
  • The benchmark comprises six paraphrased open-source projects and rigorous metrics (coverage, novelty, and test compliance), on which the graph-guided approach outperforms baseline agents in generating large, correct codebases.

The RepoCraft Benchmark is a repository-scale evaluation suite for end-to-end codebase generation, designed to assess the capabilities of LLMs and automated agents in producing structurally coherent, functionally correct software repositories from high-level specifications. Utilizing a persistent, dual-level representation called the Repository Planning Graph (RPG), RepoCraft enables holistic evaluation of code generation, planning, validation, and agent localization across large, real-world codebases in diverse software domains (Luo et al., 19 Sep 2025).

1. Conceptual Foundations: Repository Planning Graph (RPG)

At the core of RepoCraft is the Repository Planning Graph (RPG), a persistent, explicitly structured graph designed to replace ambiguous natural language specifications with an executable blueprint for repository construction. The RPG encodes both proposal-level (abstract capability decomposition) and implementation-level (file, folder, class, and function mapping) planning:

  • Nodes: Dual semantics—higher-level nodes denote functional modules during specification, while, at implementation, they map onto concrete repository elements (folders, files, classes, or functions).
  • Edges: Two categories exist:
    • Inter-module edges (solid black arrows in figure representations) encode explicit data flows (e.g., between ML data loaders and downstream model modules).
    • Intra-module edges (dashed gray arrows) enforce file-level ordering constraints, reflecting dependencies necessary for valid code execution.
  • Graph Construction: Progresses from an initial capability tree (sourced from a taxonomy containing over 1.5 million features) through modularization, enrichment with structural and data flow details, down to concrete code components.

This explicit graph structure unifies high-level requirement satisfaction (“what” is being built) with low-level organizational details (“how” it is realized in code), enabling long-horizon, dependency-aware generation and guided debugging.

2. ZeroRepo and the Three-Stage Generation Framework

The reference agent for RepoCraft, ZeroRepo, operationalizes RPG through a graph-driven generation pipeline comprising three regulated phases:

  1. Proposal-Level Construction: Converts user intent into a functional capability graph via large-scale feature subtree retrieval and refactoring, anchored to an extensive functionality ontology.
  2. Implementation-Level Refinement: Enriches the capability graph with:
    • Explicit folder/file structural plans
    • Detailed data flows (mapping outputs between modules)
    • Concrete code element nodes (functions, classes) representing leaf tasks
  3. Graph-Guided Code Generation:
    • Performs a topological traversal that obeys dependency orderings.
    • Applies test-driven development (TDD) at each leaf node: tests are auto-derived and code is iteratively refined until all tests pass.
    • Incorporates graph-localization and editing, meaning failed test cases trigger targeted repairs as directed by the RPG’s structure.

The RPG thus acts both as a plan artifact and a real-time navigational guide, ensuring code generation remains coherent with high-level goals and structural constraints.
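The graph-guided generation stage can be sketched as a topological traversal with a per-node TDD loop. This is a hedged illustration of the control flow only: the callbacks `generate_code` and `run_tests` stand in for the LLM call and the auto-derived test harness, and none of these names come from the paper.

```python
from graphlib import TopologicalSorter

# Hypothetical sketch of graph-guided code generation: visit nodes in
# dependency order and apply a test-driven refinement loop at each one.

def generate_repository(dependencies, generate_code, run_tests, max_attempts=3):
    """dependencies maps each node to the set of nodes it depends on."""
    order = list(TopologicalSorter(dependencies).static_order())
    artifacts = {}
    for node in order:
        feedback = None
        for _ in range(max_attempts):
            code = generate_code(node, feedback)      # LLM-backed in practice
            passed, feedback = run_tests(node, code)  # auto-derived tests
            if passed:                                # keep code once tests pass
                artifacts[node] = code
                break
    return order, artifacts

# Toy run: the loader module is generated before the model that consumes it
deps = {"model_training": {"data_loading"}, "data_loading": set()}
order, artifacts = generate_repository(
    deps,
    generate_code=lambda node, fb: f"# stub code for {node}",
    run_tests=lambda node, code: (True, None),
)
```

In the real pipeline, a failed `run_tests` result would additionally trigger graph-localized repair rather than a simple retry; the loop above only captures the dependency-ordered, test-gated skeleton.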

3. Construction, Composition, and Design of RepoCraft

RepoCraft is constructed from six well-recognized, high-complexity open-source projects (paraphrased to avoid test leakage):

  Paraphrased Name   Original Project   Domain
  MLKit-Py           scikit-learn       Machine learning
  TableKit           pandas             Data analysis
  SymbolicMath       sympy              Symbolic computation
  StatModeler        statsmodels        Statistical modeling
  HttpEasy           requests           HTTP client
  PyWebEngine        django             Web development
  • Each project provides a gold-standard reference for structure, capabilities, and correctness.
  • Tasks (totaling 1,052) are designed to span multiple software engineering scenarios including code implementation, interface design, module composition, and test compliance.
  • The benchmark ensures agent models must synthesize code de novo, avoiding retrieval or direct copying of any original codebases.

4. Evaluation Protocols and Metrics

RepoCraft establishes multi-dimensional assessment metrics:

  • Functionality Coverage: Fraction of taxonomically defined (ground-truth) functionalities realized in the generated repository.
    • Expressed as \(\text{Coverage} = \frac{1}{|\mathcal{C}|} \sum_{j=1}^{|\mathcal{C}|} \mathbf{1}\left[\exists\, g_i \in \mathcal{G} ~\text{such that}~ f(g_i) = c_j\right]\)
    • where \(g_i\) is a generated artifact in \(\mathcal{G}\), \(c_j\) is a reference functionality in \(\mathcal{C}\), and \(f\) is the feature-matching function.
  • Functionality Novelty: Quantifies the presence of out-of-distribution functions not found in the reference, indicating agent creativity or over-generalization.
  • Functionality Accuracy: Evaluated via adapted test cases, reports both “pass rate” (percentage of tests passed) and “voting rate” (consensus among multiple semantic validation checkers).
  • Code-Level Statistics: Includes lines of code (LOC), file counts, and tokens, providing insights into codebase scale and complexity.
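The coverage metric above reduces to a short computation: count the reference functionalities matched by at least one generated artifact. The sketch below assumes an abstract `feature_match` callable standing in for the feature-matching function \(f\); the toy labels are illustrative, not benchmark data.

```python
# Functionality Coverage: fraction of reference functionalities c_j for which
# some generated artifact g_i satisfies f(g_i) = c_j.

def functionality_coverage(generated, reference, feature_match):
    matched = {c for c in reference
               if any(feature_match(g) == c for g in generated)}
    return len(matched) / len(reference)

# Toy usage with an identity matcher over feature labels
gen = ["linear_regression", "kmeans", "pca"]
ref = ["linear_regression", "kmeans", "svm", "pca", "decision_tree"]
cov = functionality_coverage(gen, ref, lambda g: g)  # 3 of 5 matched
```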

This systematic framework allows fine-grained comparison of end-to-end code generation agents, focusing on both capacity (scale) and correctness (executability and specification compliance).

5. Comparative Results and Agent Performance

ZeroRepo, leveraging RPG as the internal planning mechanism, demonstrated superior performance on RepoCraft:

  • Scale: Generated repositories averaged nearly 36,000 LOC, 3.9 times the size of the strongest baseline (Claude Code) and approximately 64 times the size of outputs from other baseline agents.
  • Coverage and Correctness:
    • Achieved 81.5% functionality coverage.
    • Achieved a 69.7% test/functional “pass rate,” exceeding Claude Code by 35.8 percentage points.
  • Baseline Comparison: Multi-agent systems such as MetaGPT and ChatDev, as well as CLI-driven LLM agents (Codex CLI, Gemini CLI, Claude Code CLI), underperformed in both repository breadth and correctness metrics compared to the graph-guided approach.
  • These results indicate that persistent, dependency-respecting planning (as embodied by RPG) substantially enhances the ability of LLM-based agents to construct large, correct, and feature-complete repositories from natural language requirements.

6. Implications and Prospective Research Directions

RepoCraft establishes a new evaluation regime emphasizing structured, transparent, end-to-end software generation and planning. The use of RPG as a planning substrate has several implications:

  • Structured Planning: Disambiguation of complex software architectures via persistent graphs facilitates both initial synthesis and subsequent agent-driven repair/debug.
  • Long-Horizon Generation: Explicit ordering and dependency annotation enable the generation of codebases approaching the structural and functional scale of modern human-made software.
  • Localization and Repair: The RPG’s graph structure supports guided localization of errors, making iterative repair both tractable and efficient.

Prospective directions suggested in the benchmark’s context include:

  • Improving graph-guided localization and debugging to reduce repair cycles and enhance precision.
  • Adapting RPG construction dynamically as repositories evolve, supporting granular and continuous software evolution.
  • Scaling to multi-language and heterogeneous domains, as well as investigating collaborative multi-agent planning atop persistent structured representations.
  • Enhancing test-generation and validation, further bridging the gap to human-engineered projects.

7. Position in Benchmark Ecosystem

RepoCraft is purpose-built for end-to-end repository construction and assessment, setting it apart from prior benchmarks such as RepoBench (Liu et al., 2023)—which focuses on repository-level code auto-completion within existing repositories—and CoreCodeBench (Fu et al., 4 Jul 2025), which targets configurable, multi-scenario engineering tasks (development, bug-fixing, test-driven coding) at the repository level but does not specifically address initial repository creation from only a specification. RepoCraft accordingly fills a critical gap for holistic, planning-centric evaluation, where code generation agents are challenged to demonstrate architectural coherence, test compliance, and cross-component completion in realistic software engineering conditions.
