RepoExEval-Exec Evaluation Framework
- RepoExEval-Exec is a dynamic evaluation framework that injects model-generated code into real repositories and verifies correctness via compilation and execution of unit tests.
- It utilizes multi-stage pipelines with rigorous dependency management and sandboxing to handle cross-file interactions and exception handling in complex codebases.
- Empirical results using metrics like Pass@1 reveal significant performance gaps, emphasizing the need for execution-based evaluations over static benchmarks.
RepoExEval-Exec denotes a class of repository-level, execution-based evaluation frameworks for code generation, code completion, and code understanding models, where correctness is verified by integrating model outputs into real code bases and validating them via actual compilation and automated test execution. Unlike function-level or static benchmarks, RepoExEval-Exec systems explicitly test code in its genuine repository context—across multiple files and dependencies—using dynamic analysis to ensure that generated code both compiles and passes existing or generated unit tests. Representative benchmarks and systems include "RepoExEval-Exec" for exception handling (Tao et al., 3 Jan 2026), ExecRepoBench for code completion (Yang et al., 2024), CodeBenchGen (Xie et al., 2024), xCodeEval/ExecEval (Khan et al., 2023), and RepoST (Xie et al., 10 Mar 2025).
1. Motivation and Conceptual Foundations
RepoExEval-Exec frameworks address the critical limitations of traditional code generation benchmarks, which typically rely on function-level tasks, static string-matching metrics, or synthetic datasets. These approaches do not reflect the realities of modern software engineering where:
- Real-world code is distributed across multiple files and modules with cross-file, cross-class dependencies.
- Correctness often cannot be verified by string similarity or static analysis alone; actual execution against a test suite is required.
- Contextual and functional dependencies—such as exception propagation or API surface interplay—are central to practical software reliability (Tao et al., 3 Jan 2026, Yang et al., 2024).
- Overfitting to memorized snippets or test leakage is a concern; repo-level randomization and decontamination are needed (Yang et al., 2024).
RepoExEval-Exec frameworks thus require that generated code snippets are (1) injected back into running code bases, (2) compiled and executed as part of the repository, and (3) evaluated using repository- or file-level unit tests or functional assertions.
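A minimal illustration of what such a framework has to track per evaluation target is sketched below; the record and field names are hypothetical rather than part of any published benchmark schema, but each field maps to one of the three requirements above.

```python
from dataclasses import dataclass, field

@dataclass
class ExecEvalInstance:
    """One repository-level, execution-verified evaluation target (illustrative schema)."""
    repo_url: str          # source repository
    commit: str            # pinned revision so dependencies resolve reproducibly
    target_file: str       # file containing the insertion point          -> requirement (1)
    insertion_span: tuple  # (start_line, end_line) replaced by model output
    build_command: str     # e.g. "./gradlew assembleDebug" or "pytest"   -> requirement (2)
    relevant_tests: list = field(default_factory=list)  # tests that must pass -> requirement (3)
```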
2. Benchmark Construction and Dataset Characteristics
RepoExEval-Exec benchmarks are constructed through multi-stage pipelines emphasizing code diversity, dependency integrity, and testability:
- Repository Selection: Repositories are selected based on recency, maintenance, language criteria, license, and test coverage. For instance, RepoExEval-Exec (for exception handling) is built from four high-profile Android projects (Aria2App, openScale, Overchan-Android, Signal-Android) (Tao et al., 3 Jan 2026). ExecRepoBench uses 50 actively maintained Python repositories with strict decontamination protocols to avoid test set contamination (Yang et al., 2024).
- Sample Extraction and Filtering: Exception handling sites or code completion spans are identified, ensuring that the insertion point is covered by an executable test—i.e., at least one developer-written or LLM-generated unit test exercises the relevant behavior (Tao et al., 3 Jan 2026, Xie et al., 2024, Xie et al., 10 Mar 2025).
- Dependency Management: For each target, all required local module functions/classes are extracted, and third-party dependencies are precisely pinned and re-resolved (typically via build systems, or through custom context retrieval for Python) (Yang et al., 2024, Xie et al., 10 Mar 2025).
- Test Harness Construction: Tests must directly trigger the target behavior (e.g., exception throw/catch, function call with dependency cross-file). For datasets like RepoST, most test generation is automated via LLMs, then debugged via iterative execution and LLM-guided patching to reach full branch or statement coverage (Xie et al., 10 Mar 2025).
RepoExEval-Exec subsets are generally smaller (RepoExEval-Exec: 100 handler sites, 174 tests, 4 Java projects; ExecRepoBench: ~1,200 code-masked samples from 50 Python repos) but are selected for high fidelity in integration and test triggering, covering a wide cross-section of real code patterns and dependencies (Tao et al., 3 Jan 2026, Yang et al., 2024).
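The sample-extraction stage above hinges on keeping only insertion sites that at least one test actually exercises. A minimal sketch of that filter is shown below, assuming a precomputed per-test line-coverage map; the data format is an assumption of this sketch (e.g., produced by a coverage run with per-test contexts).

```python
def filter_covered_sites(candidate_sites, coverage_map):
    """Keep only candidate insertion sites exercised by at least one test.

    candidate_sites: iterable of (file_path, line_no) insertion points.
    coverage_map:    dict mapping (file_path, line_no) -> set of test ids
                     that execute that line.
    """
    kept = []
    for site in candidate_sites:
        covering_tests = coverage_map.get(site, set())
        if covering_tests:  # the site is reachable from the existing suite
            kept.append((site, sorted(covering_tests)))
    return kept
```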
3. Methodological Approaches and Execution Protocols
RepoExEval-Exec frameworks follow a strict execution-based protocol:
- Code Insertion and Compilation: Model outputs (e.g., a catch block, code fragment, function) are injected at predefined locations in source code. The repository is recompiled from scratch to ensure syntactic and dependency correctness (Tao et al., 3 Jan 2026, Yang et al., 2024).
- Standardized Execution Harness: The repository’s test suite is executed in its native environment (e.g., via Gradle/Maven for Java, pytest/unittest for Python). Only if the new code passes all relevant tests without errors is it deemed correct (Tao et al., 3 Jan 2026, Yang et al., 2024, Xie et al., 10 Mar 2025).
- Sandboxing and Dependency Isolation: For repositories with complex build systems or nontrivial I/O, sandboxing is applied, isolating the target code and its minimal dependency set. Approaches such as RepoST automatically mock or stub file I/O or third-party API calls, ensuring that code is testable in isolation while preserving behavioral equivalence checked via strict AST comparison (Xie et al., 10 Mar 2025).
- Iterative Debugging and Coverage Enhancement: Examples failing initial test runs are patched via LLM-guided debugging, with additional tests or code mutations until either coverage goals are reached or a fixed debug budget is exhausted (Xie et al., 2024, Xie et al., 10 Mar 2025).
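The behavioral-equivalence check applied after sandboxing can be approximated structurally; the sketch below uses Python's ast module and is illustrative rather than RepoST's exact procedure.

```python
import ast

def structurally_equivalent(original_src: str, sandboxed_src: str) -> bool:
    """Strict AST comparison between the original and sandboxed versions of a
    target function: comments and formatting are ignored, but any structural
    change (e.g. edits inside the function body) is flagged, so mocking has to
    happen around, not inside, the evaluated code."""
    return ast.dump(ast.parse(original_src)) == ast.dump(ast.parse(sandboxed_src))
```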
A general pseudocode skeleton for such evaluation protocols is:
```python
# Evaluation skeleton: inject the model output, rebuild, then run only the
# tests relevant to the insertion site.
for site in repo_targets:
    inject(model_generated_code, site)
    if not compile_project():
        mark_incorrect(site)   # a build failure counts as a failed case
        continue
    for test in relevant_tests(site):
        if not run_test(test):
            mark_incorrect(site)
            break
    else:                      # no relevant test failed
        mark_correct(site)
```
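The helper functions in the skeleton could be realized for a Python repository roughly as follows; the specific commands are assumptions of this sketch, and Java/Android projects would invoke Gradle or Maven instead.

```python
import subprocess

def compile_project(repo_dir: str) -> bool:
    """Cheap build check for a Python repo: byte-compile every module."""
    result = subprocess.run(
        ["python", "-m", "compileall", "-q", repo_dir],
        capture_output=True,
    )
    return result.returncode == 0

def run_test(test_id: str, repo_dir: str, timeout_s: int = 300) -> bool:
    """Run a single test in the repository's native harness (pytest here);
    timeouts and nonzero exit codes both count as failures."""
    try:
        result = subprocess.run(
            ["python", "-m", "pytest", "-x", test_id],
            cwd=repo_dir,
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```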
4. Evaluation Metrics and Baseline Comparison
RepoExEval-Exec benchmarks predominantly employ execution-based metrics, augmented with static code quality and intent-aware measures if applicable.
- Pass@k: For $N$ benchmark cases, Pass@k is the proportion for which at least one of $k$ generated candidates passes all required tests (see the sketch after this list):
  $$\text{Pass@}k = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\big[\exists\, j \le k : \text{candidate}_{i,j}\ \text{passes all required tests}\big]$$
  For $k=1$, Pass@1 is the leading metric for real-world correctness (Tao et al., 3 Jan 2026, Yang et al., 2024, Xie et al., 2024, Xie et al., 10 Mar 2025).
- CodeBLEU: Weighted sum of n-gram overlap, AST similarity, and semantic match, applied at the structural level for injected blocks (applied to static splits only) (Tao et al., 3 Jan 2026).
- Intent Accuracy: Tag-wise accuracy for predicted exception handling, e.g., correct logging, retry, or fallback patterns (Tao et al., 3 Jan 2026).
- Edit Similarity (ES): F1-style n-gram overlap between generated and ground-truth code, primarily for completion (Yang et al., 2024).
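Under the definition above, Pass@k reduces to a simple aggregation over per-candidate execution verdicts; the sketch below assumes one boolean per (case, candidate) pair and is not tied to any specific benchmark's harness.

```python
def pass_at_k(per_case_results, k: int) -> float:
    """Pass@k as defined above: the fraction of benchmark cases for which at
    least one of the first k generated candidates compiles and passes every
    relevant test.

    per_case_results[i][j] is True iff candidate j for case i passed.
    """
    solved = sum(any(candidates[:k]) for candidates in per_case_results)
    return solved / len(per_case_results)

# Example: three cases, two candidates each; Pass@1 = 1/3, Pass@2 = 2/3.
results = [[True, False], [False, True], [False, False]]
assert abs(pass_at_k(results, 1) - 1 / 3) < 1e-9
assert abs(pass_at_k(results, 2) - 2 / 3) < 1e-9
```

When more than $k$ samples are drawn per case, the unbiased estimator $1 - \binom{n-c}{k}/\binom{n}{k}$ (with $n$ samples, $c$ of them passing), averaged over cases, is commonly used in place of the direct proportion.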
Empirical baseline results for key RepoExEval-Exec-style benchmarks include:
| Benchmark | Model | Pass@1 (%) | Notes |
|---|---|---|---|
| RepoExEval-Exec | CatchAll (GPT-4o) | 29 | Next-best: RepoCoder 25% (Tao et al., 3 Jan 2026) |
| ExecRepoBench | Qwen2.5-Coder-Instruct | 44.2 | Best prior open-source: DS-Coder 29.5% (Yang et al., 2024) |
| RepoST-Eval | DS-R1-Qwen-32B | 34.46 | GPT-4o: 39.53% (Xie et al., 10 Mar 2025) |
| Exec-CSN | GPT-4 | 37.21 | CodeLlama-70B: 30.92% (Xie et al., 2024) |
These values highlight the persistent gap between open-source and proprietary models in real-world executable evaluation (Yang et al., 2024, Tao et al., 3 Jan 2026, Xie et al., 10 Mar 2025, Xie et al., 2024).
5. Architectural Extensions and Integration Patterns
RepoExEval-Exec frameworks are highly modular and designed for extensibility:
- Multi-language Support: xCodeEval provides a template for scaling execution to 11 languages through a unified engine (ExecEval), using containerization and per-language runtime selection for code execution and unit testing (Khan et al., 2023).
- Specialized Execution Contexts: For exception handling (CatchAll), evaluation incorporates call-trace analysis (average call-trace depth 7.7, cross-file span 6.1), which informs both prompt design and correctness evaluation in Java/Android projects (Tao et al., 3 Jan 2026).
- AST-Conditioned Masking and Multi-level Sampling: ExecRepoBench employs AST parsing to select mask targets at the expression, statement, and function level, increasing the coverage and realism of code completion benchmarks (Yang et al., 2024).
- Automated Sandbox and Test Synthesis: RepoST and CodeBenchGen both automate the environment construction and test generation process via LLMs, ensuring that the executable context is sufficiently isolated to enable codegen evaluation at scale (Xie et al., 10 Mar 2025, Xie et al., 2024).
Such design patterns enable not just benchmarking, but also direct training (or self-improvement) of LLMs under realistic, repo-level conditions.
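The AST-conditioned masking described above can be illustrated with Python's ast module; the following is a simplified sketch of statement-level masking, not ExecRepoBench's actual pipeline, and the <MASK> placeholder is an assumption of this example.

```python
import ast
import random

def statement_spans(source: str):
    """Collect (start_line, end_line) spans of statements that could be masked;
    expression- or function-level masking would filter on other node types."""
    tree = ast.parse(source)
    return [
        (node.lineno, node.end_lineno)
        for node in ast.walk(tree)
        if isinstance(node, ast.stmt)
    ]

def make_masked_example(source: str, seed: int = 0):
    """Replace one randomly chosen statement span with a completion hole and
    return (masked_source, ground_truth_span)."""
    lines = source.splitlines()
    start, end = random.Random(seed).choice(statement_spans(source))
    ground_truth = "\n".join(lines[start - 1:end])
    masked = lines[:start - 1] + ["<MASK>"] + lines[end:]
    return "\n".join(masked), ground_truth
```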
6. Analysis, Ablations, and Implications
Ablation studies across RepoExEval-Exec-style benchmarks consistently demonstrate the necessity of repository-level context and knowledge integration:
- Removing exception-type mappings (CatchAll): –67% Pass@1, as correct handling is impossible without knowing APIs' exception semantics (Tao et al., 3 Jan 2026).
- Stripping out call-trace context: –20% Pass@1 due to missing cross-file execution flows (Tao et al., 3 Jan 2026).
- Omitting knowledge-mined historical patterns: –27.6% Pass@1 (Tao et al., 3 Jan 2026).
- In ExecRepoBench, grammar-based masking (AST-based) yields richer, more challenging benchmark instances and better aligns with real development workflows than random span masking (Yang et al., 2024).
A plausible implication is that static benchmarks—however large—will continue to underestimate the real-world error rate of code LLMs, and that practical deployment in IDEs or PR review agents must rely on execution-driven, repo-level evaluation protocols.
7. Limitations and Prospects
Despite their rigor, RepoExEval-Exec frameworks carry some limitations:
- Scale: Most executable subsets are small (e.g., RepoExEval-Exec N≈100, RepoST-Eval N=296) due to the heavy cost of guaranteed test coverage, dependency resolution, and compilation (Tao et al., 3 Jan 2026, Xie et al., 10 Mar 2025).
- Language and Platform Scope: Python and Java dominate, as dependency management, build, and test workflows are best understood for these ecosystems; extending to C/C++ requires new sandboxing and analysis techniques (Tao et al., 3 Jan 2026, Yang et al., 2024, Khan et al., 2023).
- Human-in-the-Loop Requirements: While test and sandbox construction can be LLM-augmented, high-fidelity evaluation often requires manual curation, especially for error-prone code paths or complex dependencies (Xie et al., 10 Mar 2025, Xie et al., 2024).
- Metric Coverage: Execution-based correctness does not capture code style, subtle semantics, or performance; advanced metrics like CodeBLEU and intent accuracy are only approximations of deeper qualities (Tao et al., 3 Jan 2026).
Future extensions include broader language coverage, integration of richer reasoning supervision (e.g., execution-trace-grounded CoT (Thakur et al., 28 Nov 2025)), and larger-scale automated pipelines leveraging LLM-based repair and test generation for multi-repo, multi-functional benchmarking. The conceptual blueprint of RepoExEval-Exec underpins ongoing efforts to align AI models with the real demands of practical software development, debugging, and maintenance.