RepoExec: Code Generation & Reproducibility

Updated 27 March 2026

RepoExec is a suite of benchmarks, architectures, and platforms for repository-level code generation, emphasizing comprehensive dependency handling, functional verification, and reproducibility.
It integrates an LLM-driven test pipeline that boosts average test coverage from 92.46% to 96.25% through iterative, assertion-based Python test case generation and verification.
The framework employs advanced metrics like the Dependency Invocation Rate (DIR) and supports remote execution via cloud-native APIs, ensuring reliable and collaborative computational experiments.

RepoExec denotes a suite of benchmarks, architectures, and platforms for rigorous, repository-level code generation, execution, and reproducibility studies, with emphasis on comprehensive dependency handling, functional verification, and collaborative experimentation. RepoExec originated as a benchmark for LLM-driven code generation under complex cross-file dependencies and executability constraints, but the term also covers supporting remote execution engines and reproducibility platforms tailored to computational experiments and multi-language code scenarios.

1. Formal Definition and Benchmark Construction

RepoExec, as introduced by Hai et al. (2024), is a benchmark designed for repository-level Python code generation and evaluation. Each benchmark instance is defined by:

Input context: for a target function $f$ in a repository $R$, the benchmark supplies:
- The set of direct, human-curated dependencies $D_s$ (in- and cross-file: helpers, classes, constants, imports).
- Selected context $C$ (imports, signatures, docstrings, declarations, class/function stubs).
- The function’s signature $S_f$ and natural-language docstring describing the intended functionality.
Task: Generate the body of $f$ such that, when inserted into $R$:
1. The repository installs and executes without errors (“executability”).
2. The implementation passes a high-coverage, automatically generated test suite (“functional correctness”).
3. The generated function utilizes the provided dependencies (“context utilization”).

Three prompt context variants are defined: “full-size” (all code bodies), “medium-size” (signatures + docstrings), “small-size” (signatures only), with BasePrompt or InstructPrompt formatting (see (Hai et al., 2024) Appendix C).
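The three context sizes can be illustrated with a minimal sketch built on Python's standard `ast` module. This is not the benchmark's own tooling: the function name `render_context` and the variant labels `"full"`, `"medium"`, and `"small"` are assumptions made for illustration, and the sketch only reduces function definitions (class docstrings and module-level code are left untouched).

```python
import ast

def render_context(source: str, variant: str) -> str:
    """Render dependency source at one of three context sizes (hypothetical helper).

    full   -> source returned unchanged (all code bodies)
    medium -> function signatures plus docstrings
    small  -> function signatures only
    """
    if variant == "full":
        return source
    if variant not in {"medium", "small"}:
        raise ValueError(f"unknown variant: {variant!r}")

    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            has_doc = ast.get_docstring(node, clean=False) is not None
            stub = ast.Expr(value=ast.Constant(value=...))  # a literal `...` body
            if variant == "medium" and has_doc:
                # Keep the docstring statement, replace the rest of the body.
                node.body = [node.body[0], stub]
            else:
                node.body = [stub]
    return ast.unparse(ast.fix_missing_locations(tree))

SOURCE = '''
def add(a, b=1):
    """Add two numbers."""
    total = a + b
    return total
'''

MEDIUM = render_context(SOURCE, "medium")
SMALL = render_context(SOURCE, "small")
print(MEDIUM)
print(SMALL)
```

Working on the syntax tree rather than on raw text keeps signatures intact (defaults, annotations, decorators) while dropping bodies, which is the distinction the three variants rely on.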

The RepoExec dataset consists of 355 Python repository-level tasks, each with an average context from 51 files and 96.25% average test suite coverage (see Table 1 of (Hai et al., 2024)).

2. Automated Test Creation and Validation Protocol

Unlike task suites reliant on pre-existing tests, RepoExec enforces an LLM-driven test pipeline that ensures high functional coverage and reduces evaluation bias:

Test-case generation begins by prompting CodeLlama-13B to emit up to 20 assertion-based Python test cases per function. Each candidate is syntax-checked (AST parse, function call coverage) and then execution-checked under pytest. For AssertionErrors, actual outputs are captured and assertions rewritten to check against empirical outputs. Each surviving test is executed 10 times to filter flakiness.
Coverage enhancement uses GPT-3.5, which is prompted for additional edge, corner, and input-interaction cases. Only tasks achieving ≥40% test coverage are retained.
The protocol raises average coverage from 92.46% to 96.25% ((Hai et al., 2024) Table 1).

As a result, functional correctness in RepoExec is strictly defined as passing the complete auto-generated test suite under real repository conditions.
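The syntax check and the repeated-execution filter described above can be sketched as follows. This is a simplified stand-in under stated assumptions: tests are run with `exec` instead of pytest, failing assertions are dropped rather than rewritten against empirical outputs, and the helper names (`calls_target`, `is_stable`, `filter_tests`) and the `slugify` target are hypothetical.

```python
import ast

def calls_target(test_src: str, target: str) -> bool:
    """Syntax check: the test must parse and call the target function at least once."""
    try:
        tree = ast.parse(test_src)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            # Plain call `target(...)` or attribute call `module.target(...)`.
            name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
            if name == target:
                return True
    return False

def is_stable(test_src: str, namespace: dict, runs: int = 10) -> bool:
    """Execution check: the test must pass on every one of `runs` executions."""
    for _ in range(runs):
        try:
            exec(test_src, dict(namespace))  # fresh copy so runs do not share state
        except Exception:
            return False
    return True

def filter_tests(candidates, target, namespace, runs=10):
    """Keep only candidates that pass both the syntax and the execution check."""
    return [
        t for t in candidates
        if calls_target(t, target) and is_stable(t, namespace, runs)
    ]

def slugify(text):
    return "-".join(text.lower().split())

candidates = [
    'assert slugify("Hello World") == "hello-world"',  # passes both checks
    'assert slugify("A") == "b"',                      # fails at execution
    'assert 1 + 1 == 2',                               # never calls the target
    'assert slugify("x" ==',                           # does not parse
]
kept = filter_tests(candidates, "slugify", {"slugify": slugify})
print(kept)
```

Running each surviving test ten times, as in the described protocol, guards against tests whose outcome depends on randomness, ordering, or time.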

3. Dependency Utilization Metric and Instruction-Tuned Dataset

RepoExec introduces the Dependency Invocation Rate (DIR) as a metric to quantify the degree of model utilization of provided context:

$\mathrm{DIR} = \frac{|D_g \cap D_s|}{|D_s|}$

where $D_g$ is the set of dependency identifiers (functions, classes, constants) invoked in the generated implementation, and $D_s$ is the supplied dependency set. DIR is averaged across test instances.
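As an illustration of the formula, the following sketch computes DIR for a single generated function by collecting identifiers from its syntax tree. The identifier-extraction rule (every `Name` and `Attribute` node counts as an invocation) is a simplifying assumption, and the names `invoked_names`, `dir_score`, and the example dependencies are hypothetical.

```python
import ast

def invoked_names(code: str) -> set[str]:
    """Collect identifiers referenced in the code (approximates D_g)."""
    names = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Name):
            names.add(node.id)
        elif isinstance(node, ast.Attribute):
            names.add(node.attr)
    return names

def dir_score(generated: str, supplied: set[str]) -> float:
    """DIR = |D_g ∩ D_s| / |D_s| for one benchmark instance."""
    if not supplied:
        raise ValueError("the supplied dependency set D_s must be non-empty")
    return len(invoked_names(generated) & supplied) / len(supplied)

GENERATED = '''
def clean(text):
    return normalize(text)[:MAX_LEN]
'''
SUPPLIED = {"normalize", "MAX_LEN", "Tokenizer"}  # D_s for this instance

score = dir_score(GENERATED, SUPPLIED)  # two of the three dependencies are used
mean_dir = sum([score, 1.0]) / 2        # averaging over two instances
print(round(score, 4), round(mean_dir, 4))
```

A model that solves the task without touching the supplied helpers still scores low here, which is exactly the behavior the metric is meant to expose.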

To promote dependency-sensitive code generation, RepoExec includes a 154,818-sample instruction-tuning dataset sourced from over 1,555 Python repositories. Static analysis and AST parsing yield training samples in which each function is paired with exactly its true direct dependencies and context (either full context, BasePrompt, or Small-context variant). Finetuning is performed with LoRA adapters (five epochs, 10% validation split) for the StarCoder family and CodeLlama-13B-Python.

Example instruction prompt: […]

Such explicit dependency instruction is critical for maximizing DIR and, by extension, aligning generated code with repository semantics (Hai et al., 2024).

4. Evaluation Suite and Quantitative Results

RepoExec provides a standardized evaluation suite across 13 advanced code-oriented LLMs, spanning pretrained baselines, instruction-tuned commercial models, and custom finetuned variants. The evaluation protocol is as follows:

Nucleus sampling, with $n$ outputs generated per task for pass@$k$ measurement.
Metrics: pass@1 and pass@5 denote the proportion of tasks for which at least one of the $k$ generations passes all auto-generated tests. DIR is reported for context utilization.
All runs use a controlled scenario where “direct” dependencies are provided (i.e., without retrieval noise).
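A common way to estimate pass@$k$ from $n$ samples per task is the unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$, where $c$ is the number of samples that pass all tests. The sketch below assumes that estimator; the text here does not state which estimator RepoExec uses, and the sample count of 10 and the per-task pass counts in the example are made up for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n passes, given c passing samples."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical results: number of passing samples out of n=10 for three tasks.
N_SAMPLES = 10
passing_counts = [0, 3, 10]

pass_at_1 = sum(pass_at_k(N_SAMPLES, c, 1) for c in passing_counts) / len(passing_counts)
pass_at_5 = sum(pass_at_k(N_SAMPLES, c, 5) for c in passing_counts) / len(passing_counts)
print(f"pass@1={pass_at_1:.4f} pass@5={pass_at_5:.4f}")
```

With $k = 1$ the estimator reduces to the fraction of passing samples, so pass@1 for a task with 3 of 10 passing samples is 0.3.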

Selected results (full-context, BasePrompt, see Table 3 of (Hai et al., 2024)):

Model	pass@1	pass@5	DIR (%)
CodeLlama-34B-Python	42.93	49.54	68.85
WizardCoder-Python-13B-V1.0	34.31	40.06	62.90
StarCoder (pre)	28.08	–	58.67

Multi-round debugging increases pass@1 by 14–16 percentage points for instruction-tuned models and GPT-3.5, while DIR jumps by >7% (see Table 4 and Fig. 7).

Instruction-tuned models further boost DIR post-finetuning (e.g., StarCoder DIR: 58.67 → 69.80; CodeLlama-13B-Python DIR: 62.26 → 68.89; Table 5).

5. Repository-Level Code Generation Frameworks

RepoExec directly catalyzed new code-generation and agentic frameworks for repository-scale code synthesis:

HyperAgent (Phan et al., 2024) implements a four-agent pipeline (Planner, Navigator, Code Editor, Executor) for RepoExec tasks and does not rely on “gold” contexts: agents coordinate via message queues and LLM-driven summarization. The Navigator uses IDE-style tools (go_to_definition, code_search, file-tree), so that only minimal, relevant context is passed to the generation module. In the 355-task evaluation, HyperAgent-Lite-3 outperforms the RAG baseline under auto-retrieved context:

| Model | Context | pass@1 | pass@5 |
|------------------------------|----------------|--------|--------|
| CodeLlama-34B | full | 42.93 | 49.54 |
| HyperAgent-Lite-3 | auto-retrieved | 38.33 | 53.33 |
| WizardLM2+Sparse RAG | auto-retrieved | 34.16 | 51.23 |

Hydra (Le-Anh et al., 12 Feb 2026) refines RepoExec retrieval by treating code as structured data, not natural language:
- Structure-aware parsing to AST-granular atomic units (functions/classes/variables), forming a dependency graph.
- A dependency-aware retriever (DAR, UniXCoder encoder) explicitly identifies invoked dependencies by binary classification over query-context pairs; thresholding is tuned via balanced recall penalty.
- Hybrid retrieval (DAR + BM25 sparse search) fuses essential building blocks and usage examples.
- On RepoExec, Hydra achieves new state-of-the-art pass@1 and DIR (Qwen7B+Hydra: pass@1=23.32%, DIR=53.46%; GPT-4.1 mini+Hydra: pass@1=43.55%), consistently outperforming chunk-based RAG and even allowing small models to compete with much larger baselines by improving dependency recall.
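The structure-aware parsing step in the list above can be sketched with the standard `ast` module: each top-level function, class, or assigned variable becomes a unit, and an edge is added whenever one unit references another. This is an illustrative simplification for a single file, not Hydra's implementation; `build_dependency_graph` and the example source are hypothetical.

```python
import ast

def build_dependency_graph(source: str) -> dict[str, set[str]]:
    """Map each top-level unit to the other top-level units it references."""
    tree = ast.parse(source)

    # Step 1: collect atomic units (functions, classes, simple variable assignments).
    units = {}
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            units[node.name] = node
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    units[target.id] = node

    # Step 2: add an edge unit -> dependency for every referenced unit name.
    graph = {}
    for name, node in units.items():
        referenced = {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}
        graph[name] = (referenced & set(units)) - {name}
    return graph

SOURCE = '''
MAX_LEN = 10

def normalize(text):
    return text.strip().lower()

def clean(text):
    return normalize(text)[:MAX_LEN]

class Cleaner:
    def run(self, text):
        return clean(text)
'''

graph = build_dependency_graph(SOURCE)
for unit, deps in graph.items():
    print(unit, "->", sorted(deps))
```

Retrieving whole units along these edges, instead of fixed-size text chunks, is what lets a retriever return a complete helper definition when the target function calls it.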

6. Remote Code Execution and Reproducibility Platforms

RepoExec also names robust remote code execution engines (Hafiz et al., 2021). Key architectural elements include:

Cloud-native API gateway (Flask/Kubernetes) routing to language-specific “runlang” executors.
Language/library extensibility via imagegen—a YAML DSL to specify runtime, packages, and smoke tests, emitting version-pinned Dockerfiles, deployed via CI/CD.
Predictive autoscaling (PHPA): caching HPA history, KNN/linear regression forecasting, minimizing cost-SLA penalty via short-horizon forecast optimization.
User interface for code/state sharing via Redis-permalinks.
Sandboxed execution in per-language containers, global package caches, container-level isolation for reproducibility.
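To make the imagegen idea concrete, the sketch below turns a small runtime specification into a version-pinned Dockerfile. The actual imagegen schema is not described here, so the field names (`base_image`, `packages`, `smoke_tests`) and the function `render_dockerfile` are assumptions, and the spec is written as a Python dict standing in for a parsed YAML file.

```python
def render_dockerfile(spec: dict) -> str:
    """Render a version-pinned Dockerfile from a hypothetical imagegen-style spec."""
    lines = [f"FROM {spec['base_image']}"]

    # Pin every package to an exact version so rebuilt images stay identical.
    pinned = [f"{name}=={version}" for name, version in sorted(spec["packages"].items())]
    if pinned:
        lines.append("RUN pip install --no-cache-dir " + " ".join(pinned))

    # Smoke tests run at build time, so a broken image fails before deployment.
    for command in spec.get("smoke_tests", []):
        lines.append(f"RUN {command}")
    return "\n".join(lines) + "\n"

spec = {
    "base_image": "python:3.11-slim",
    "packages": {"requests": "2.31.0", "numpy": "1.26.4"},
    "smoke_tests": ['python -c "import numpy, requests"'],
}

dockerfile = render_dockerfile(spec)
print(dockerfile)
```

Generating the Dockerfile from a declarative spec keeps language and library additions reviewable as small config changes that a CI/CD pipeline can build and test.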

System-level evaluations report median latency <150ms at 100 req/s and 22% EC2 cost reduction versus reactive autoscaling (penalty ≈17.4 with KNN vs. ≈19.6 with linear regression) (Hafiz et al., 2021).
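A KNN-style load forecast of the kind mentioned above can be sketched as a nearest-window lookup over the request-rate history. This is a toy illustration, not the PHPA implementation: the window length, the per-replica capacity of 50 req/s, and the names `knn_forecast` and `replicas_needed` are all assumptions.

```python
from math import ceil, dist

def knn_forecast(history, window=3, k=1):
    """Predict the next value as the mean successor of the k past windows closest to the latest one."""
    if len(history) < window + 1:
        raise ValueError("history must contain at least window + 1 points")
    query = history[-window:]
    candidates = []
    for start in range(len(history) - window):
        past_window = history[start:start + window]
        successor = history[start + window]
        candidates.append((dist(past_window, query), successor))
    candidates.sort(key=lambda pair: pair[0])  # smallest Euclidean distance first
    nearest = candidates[:k]
    return sum(successor for _, successor in nearest) / len(nearest)

def replicas_needed(rate, per_replica=50):
    """Replicas required to serve `rate` req/s, assuming a fixed per-replica capacity."""
    return max(1, ceil(rate / per_replica))

# Hypothetical periodic load in requests per second.
history = [100, 120, 140, 100, 120, 140, 100, 120]
forecast = knn_forecast(history, window=3, k=1)
print(forecast, replicas_needed(forecast))
```

Scaling on the forecast rather than on the current load lets replicas start before the spike arrives, which is where the cost and SLA savings over reactive autoscaling come from.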

7. Role in Reproducibility and Experiment Packaging

RepoExec-style approaches have been generalized into reproducibility backends (Costa et al., 2023). Components include:

Data management via a graph database (Neo4j), object file store, Docker Hub container registry.
Environment configuration from project metadata (language, version, dependencies, database/imageId), yielding complete, shareable Docker images.
Automated execution and reproducibility checking (SHA-256 checksums, log hashes) across repeated runs.
RESTful API for end-to-end workflow: project creation, code/data upload, environment build, execution, output and packaging for public distribution.
In practical evaluation, this architecture enabled 80% reproducibility for 25 published computational experiments (failures confined to missing data or undocumented install issues).
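The checksum-based reproducibility check can be illustrated with a short sketch: an experiment is executed several times and counts as reproducible only if every run produces the same SHA-256 digest. The function names and the two toy experiments are hypothetical; a real backend would hash output files and logs produced inside the container.

```python
import hashlib
import itertools

def digest(output: bytes) -> str:
    """SHA-256 hex digest of one run's output."""
    return hashlib.sha256(output).hexdigest()

def is_reproducible(run, repeats: int = 3) -> bool:
    """True if all repeated executions of `run` yield byte-identical output."""
    digests = {digest(run()) for _ in range(repeats)}
    return len(digests) == 1

def stable_run() -> bytes:
    # Deterministic experiment: same bytes on every execution.
    return b"mean=0.5\n"

_counter = itertools.count()

def unstable_run() -> bytes:
    # Output embeds a run counter, so digests differ between executions.
    return f"run={next(_counter)}\n".encode()

print(is_reproducible(stable_run), is_reproducible(unstable_run))
```

Byte-level hashing is strict: timestamps or unseeded randomness in the output will flag a run as non-reproducible even when the scientific result is unchanged.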

References

Hai et al. (2024). "On the Impacts of Contexts on Repository-Level Code Generation"
Hu et al. (19 Feb 2025). "Repo2Run: Automated Building Executable Environment for Code Repository at Scale"
Le-Anh et al. (12 Feb 2026). "Do Not Treat Code as Natural Language: Implications for Repository-Level Code Generation and Beyond"
Phan et al. (2024). "HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale"
Costa et al. (2023). "A Backend Platform for Supporting the Reproducibility of Computational Experiments"
Hafiz et al. (2021). "Architecture of a Flexible and Cost-Effective Remote Code Execution Engine"