RepoExEval: Repository-Level Exception Benchmark

Updated 7 January 2026
  • RepoExEval is a repository-level benchmark that evaluates LLMs' exception-handling capabilities in real-world Java (Android) projects.
  • It leverages deep call-trace analysis and API–exception mapping from 120,000 annotated try–catch instances across 3,200 Android repositories.
  • RepoExEval and its executable subset, RepoExEval-Exec, assess both static and dynamic performance using metrics like Pass@1, CodeBLEU, and IntentAcc.

RepoExEval is a large-scale, repository-level benchmark designed to evaluate exception-handling capabilities of LLMs in real-world Java (Android) codebases. Developed in the context of the CatchAll system, RepoExEval systematically assesses the accuracy, intent alignment, and functional correctness of exception-handling code generation at scale, stressing deep, cross-file context and diverse exception vocabulary. Its companion subset, RepoExEval-Exec, provides executable test scenarios for functional evaluation with real unit tests. Collectively, these benchmarks anchor the state-of-the-art in repository-aware exception handling evaluation for LLMs (Tao et al., 3 Jan 2026).

1. Data Construction and Benchmark Design

RepoExEval is built from a curated set of 3,200 well-maintained Android Java repositories, selected via GitHub queries for “Android” and “Java”, filtered by project popularity (star count), F-Droid catalog listing, and temporal cutoffs to minimize LLM data leakage. Tree-sitter’s abstract syntax tree (AST) extraction is used to locate each try–catch block. For each try block, a contextual call trace is computed: the algorithm records the imports, enclosing class/method, and recursively unrolls all API calls in the block, tracing their definitions across the repository to a configurable depth (mean call-trace depth: 7.2; mean cross-file span: 5.7 files). Each instance consists of metadata, the try/catch code, caught exception type, and the full cross-file context.
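
The pipeline is described only at a high level; the Java sketch below shows one plausible shape for the per-instance record and the recursive call-trace unrolling, with the Tree-sitter-backed AST queries and cross-file symbol resolution abstracted behind hypothetical placeholder helpers (extractCallSites, findMethodDefinition).

```java
// Minimal sketch, not the authors' implementation: one plausible per-instance record
// and the recursive call-trace unrolling. extractCallSites and findMethodDefinition
// are hypothetical stand-ins for AST queries and repository-wide symbol resolution.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

record TryCatchInstance(
        String repo,                 // repository identifier
        List<String> imports,        // imports of the enclosing file
        String enclosingClass,       // enclosing class name
        String enclosingMethod,      // enclosing method signature
        String tryBody,              // source of the try block
        String catchBody,            // source of the catch block
        String caughtException,      // exception type declared in the catch clause
        List<String> callTrace) {}   // cross-file trace of called APIs and their definitions

final class CallTraceBuilder {
    private final int maxDepth;                       // configurable unrolling depth

    CallTraceBuilder(int maxDepth) { this.maxDepth = maxDepth; }

    /** Recursively unroll API calls in a code span, following each callee's
     *  definition across the repository up to maxDepth levels. */
    List<String> unroll(String codeSpan, int depth, Set<String> visited) {
        List<String> trace = new ArrayList<>();
        if (depth >= maxDepth) return trace;
        for (String callee : extractCallSites(codeSpan)) {        // hypothetical AST query
            if (!visited.add(callee)) continue;                   // skip already-traced callees
            trace.add(callee);
            String definition = findMethodDefinition(callee);     // hypothetical cross-file lookup
            if (definition != null) {
                trace.addAll(unroll(definition, depth + 1, visited));
            }
        }
        return trace;
    }

    private List<String> extractCallSites(String code) { return List.of(); }   // placeholder
    private String findMethodDefinition(String signature) { return null; }     // placeholder
}
```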

An empirically constructed API–exception mapping is built by aggregating all try–catch sites across the dataset: each API discovered anywhere in the extended context of a catch is mapped to the exception(s) handled in that context. This produces a many-to-many API–exception table, crucial for guiding both exception type inference and handling-pattern mining. These extracted mappings, call traces, and ground-truth labels together enable repository- and API-aware evaluation.
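
As an illustration, a minimal assumed in-memory representation of that many-to-many table could look like the following; it is a sketch of the idea, not the released artifact.

```java
// Assumed representation of the many-to-many API–exception table: each API observed
// in the extended context of a catch is associated with the exception types handled there.
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class ApiExceptionMap {
    // API signature -> exception types empirically handled around it
    private final Map<String, Set<String>> table = new HashMap<>();

    /** Record one try–catch site: every API in its call-trace context maps to the caught exception. */
    void addSite(Collection<String> apisInContext, String caughtException) {
        for (String api : apisInContext) {
            table.computeIfAbsent(api, k -> new HashSet<>()).add(caughtException);
        }
    }

    /** Exceptions empirically associated with an API (empty set if never observed). */
    Set<String> exceptionsFor(String api) {
        return table.getOrDefault(api, Set.of());
    }
}
```

After aggregating the full corpus, a lookup such as exceptionsFor("okhttp3.Call.execute") would, in this sketch, surface the empirically observed candidate exceptions (e.g., IOException), which is what guides exception-type inference and handling-pattern mining downstream.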

2. Dataset Composition and Statistical Properties

RepoExEval includes approximately 120,000 annotated try–catch instances, each capturing a code context, the caught exception, and comprehensive repository-level metadata. The held-out test set consists of 1,000 randomly sampled instances; the remaining 119,000 are used as a reference corpus, informing retrieval and API–exception mapping. APIs span standard JDK/Android-SDK, common libraries (e.g., okhttp, gson), and user-defined methods; exception coverage ranges over hundreds of Java types, from NullPointerException to IllegalArgumentException.

RepoExEval-Exec is an executable subset spanning four Android apps (Aria2App, openScale, Overchan-Android, Signal-Android), comprising 100 real-world exception sites with 174 unit tests. These were selected for their existing test infrastructure and minimal dependency entanglement. Typical call-trace depth in this subset is 7.7, with cross-file propagation averaging 6.1 files and project sizes exceeding 600 Java classes.

| Subset | #Repos | #Instances | Mean Trace Depth | Mean Cross-file Span | Test Coverage (Exec) |
|---|---|---|---|---|---|
| RepoExEval | 3,200 | 120,000 | 7.2 | 5.7 | Static |
| RepoExEval-Exec | 4 | 100 | 7.7 | 6.1 | 174 unit tests |

3. Annotation Schema and Ground Truth

RepoExEval employs a dual-labeling schema: (a) exception type, the specific Java exception class declared in the catch clause; and (b) exception-handling intent, a multi-label enumeration over five canonical categories (each illustrated in the sketch after this list):

  • Logging (e.g., Log.e, logger.warn)
  • Retry (loop or recursive retry logic)
  • Recovery/fallback (default value assignment)
  • Return (early return with default/error code)
  • Rethrow (explicit propagation)
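
The categories correspond to familiar catch-body shapes. The Java snippets below are hypothetical illustrations of each intent; all identifiers (loadConfig, Config, DEFAULT_CONFIG, MAX_RETRIES, LOG) are invented for the example, and a single real catch body can of course carry several labels at once (e.g., logging and then returning a default).

```java
// Hypothetical catch bodies illustrating the five intent categories.
// All identifiers are invented; loadConfig() stands in for any throwing API call.
import java.io.IOException;
import java.util.logging.Logger;

class IntentExamples {
    private static final Logger LOG = Logger.getLogger("IntentExamples");
    private static final Config DEFAULT_CONFIG = new Config();
    private static final int MAX_RETRIES = 3;

    // 1. Logging: record the failure (Log.e / logger.warn style) and move on.
    Config loadOrLog() {
        try { return loadConfig(); }
        catch (IOException e) { LOG.warning("config load failed: " + e); return null; }
    }

    // 2. Retry: reattempt the operation a bounded number of times.
    Config loadWithRetry() throws IOException {
        for (int attempt = 1; ; attempt++) {
            try { return loadConfig(); }
            catch (IOException e) { if (attempt >= MAX_RETRIES) throw e; }
        }
    }

    // 3. Recovery/fallback: substitute a default value.
    Config loadOrDefault() {
        try { return loadConfig(); }
        catch (IOException e) { return DEFAULT_CONFIG; }
    }

    // 4. Return: bail out early with an error code or sentinel.
    int loadStatus() {
        try { loadConfig(); return 0; }
        catch (IOException e) { return -1; }
    }

    // 5. Rethrow: wrap and propagate to the caller.
    Config loadOrFail() {
        try { return loadConfig(); }
        catch (IOException e) { throw new IllegalStateException("config unavailable", e); }
    }

    private Config loadConfig() throws IOException { return new Config(); }  // placeholder
    static class Config {}
}
```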

Test set intents (1,000 instances) are manually annotated by expert raters with conflict resolution through discussion. For the training/validation portion, intent labels are derived via rule-based heuristics operating on catch bodies (e.g., Log.* invocations signify Logging), spot-verified for accuracy (>95%). Ground-truth exception types are always taken from the declared catch parameter.
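
One plausible shape for such heuristics, using simple regex cues over the catch-block text rather than the paper's exact rules, is sketched below.

```java
// Minimal sketch of rule-based intent labeling over catch-block source text
// (assumed cues; not the paper's exact rule set).
import java.util.EnumSet;
import java.util.Set;
import java.util.regex.Pattern;

final class IntentHeuristics {
    enum Intent { LOGGING, RETRY, RECOVERY, RETURN, RETHROW }

    private static final Pattern LOG_CALL  = Pattern.compile("\\bLog\\.[a-z]+\\(|\\blogger\\.(warn|error|info|debug)\\(");
    private static final Pattern RETRY_CUE = Pattern.compile("\\b(retry|attempt)\\w*\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern FALLBACK  = Pattern.compile("=\\s*(null\\b|DEFAULT|new\\s+\\w+\\()");

    static Set<Intent> label(String catchBody) {
        Set<Intent> intents = EnumSet.noneOf(Intent.class);
        if (LOG_CALL.matcher(catchBody).find())  intents.add(Intent.LOGGING);   // e.g. Log.e(...), logger.warn(...)
        if (RETRY_CUE.matcher(catchBody).find()) intents.add(Intent.RETRY);     // loop or recursive retry cues
        if (FALLBACK.matcher(catchBody).find())  intents.add(Intent.RECOVERY);  // default-value assignment
        if (catchBody.contains("return "))       intents.add(Intent.RETURN);    // early return with default/error code
        if (catchBody.contains("throw "))        intents.add(Intent.RETHROW);   // explicit propagation
        return intents;
    }
}
```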

Quality controls explicitly filter for pre-training data leakage (temporal cutoffs) and overlap with benchmarks such as CodeSearchNet (4.4% overlap), maintaining novelty for LLM evaluation.

4. Evaluation Protocols and Metrics

RepoExEval and RepoExEval-Exec are evaluated along both static and dynamic axes:

  • Pass@k: For executable cases (RepoExEval-Exec), generated exception handlers are integrated into the original project and run against the relevant unit tests; Pass@1 denotes the probability that a single model completion passes all tests. Explicitly,

$$\mathrm{Pass@1} = \frac{\#\,\text{passing instances}}{\#\,\text{total instances}}$$

  • CodeBLEU: Measures lexical, syntactic (AST), and semantic (data-flow) correspondence to ground truth. Weighted combination:

$$\mathrm{CodeBLEU} = w_{\mathrm{ng}} \sum_{n=1}^{4} P_n + w_{\mathrm{ast}}\, S_{\mathrm{ast}} + w_{\mathrm{df}}\, S_{\mathrm{df}} + w_{\mathrm{com}}\, S_{\mathrm{com}}$$

where $P_n$ denotes the $n$-gram match, $S_{\mathrm{ast}}$ the AST match, $S_{\mathrm{df}}$ the data-flow match, and $S_{\mathrm{com}}$ the comment match, combined with empirically set weights $w$.

  • IntentAcc: Handling intent accuracy, formulated as

$$\mathrm{IntentAcc} = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap \hat{Y}_i|}{|Y_i \cup \hat{Y}_i|}$$

where $Y_i$ denotes the gold intents and $\hat{Y}_i$ the predicted intents for instance $i$.
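
As a toy illustration (not the official evaluation harness), both scalar metrics reduce to a few lines of Java given per-instance pass/fail outcomes and intent sets:

```java
// Toy implementation of the two scalar metrics defined above: Pass@1 as the fraction
// of instances whose single completion passes all tests, and IntentAcc as the mean
// per-instance Jaccard overlap between gold and predicted intent sets.
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class RepoExEvalMetrics {
    /** Pass@1: fraction of instances whose single completion passes all unit tests. */
    static double passAt1(List<Boolean> passedAllTests) {
        long passing = passedAllTests.stream().filter(p -> p).count();
        return (double) passing / passedAllTests.size();
    }

    /** IntentAcc: mean per-instance Jaccard overlap between gold and predicted intents. */
    static double intentAcc(List<Set<String>> gold, List<Set<String>> predicted) {
        double sum = 0.0;
        for (int i = 0; i < gold.size(); i++) {
            Set<String> inter = new HashSet<>(gold.get(i));
            inter.retainAll(predicted.get(i));
            Set<String> union = new HashSet<>(gold.get(i));
            union.addAll(predicted.get(i));
            sum += union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
        }
        return sum / gold.size();
    }
}
```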

Multiple baselines are evaluated, including ExAssist (rule-based), Nexgen (NMT encoder–decoder), Direct-Prompting (vanilla LLM), RepoCoder (repo-level code completion), KPC (API-doc chaining), and Seeker (multi-agent).

5. Empirical Findings and Comparative Performance

On the main RepoExEval split (GPT-4o backbone), CatchAll achieves Pass@1=29%, CodeBLEU=0.31, and IntentAcc=60.1%, outperforming the next best (RepoCoder: Pass@1=25%, CodeBLEU=0.27, IntentAcc=48.0%). These gains reflect improved functional correctness (+4 percentage points Pass@1), structure/semantics alignment (+14.8% CodeBLEU, relative), and intent inference (+12.1 percentage points). On RepoExEval-Exec, CatchAll remains the strongest method, with ~30% Pass@1 compared to ~25% for the next best method.

The challenge of RepoExEval arises from its deep (mean trace depth >7), cross-file context, the large and diverse API–exception vocabulary, and repository-level propagation of exception semantics. Diverse handling-intent patterns in real code further stress model generalization. An example from WordPress Android demonstrates the need for call-trace-based API–exception inference and retrieval of historical fallback templates, whereas baseline models often catch overly generic exceptions or fail to enact meaningful recovery logic.

6. Relation to Other Repository-Level Evaluation Frameworks

RepoExEval complements a broader trend toward executable, context-rich benchmarks exemplified by ExecRepoBench (Yang et al., 2024). Both benchmarks emphasize repository-level, multi-file, execution-backed evaluation. While ExecRepoBench is Python-focused with multi-level AST masking and Pass@k assessment leveraging comprehensive project unit tests, RepoExEval targets Java (Android) and exception handling, featuring deep call-trace context, API–exception mapping, and fine-grained handling intent annotation. Methods such as Qwen2.5-Coder-Instruct-C, tuned on similar repository-level, grammar-based completion corpora, demonstrate the importance and impact of realistic, cross-file benchmarks in driving LLM code intelligence.

7. Significance and Research Impact

RepoExEval and RepoExEval-Exec constitute the largest and most challenging benchmarks to date for evaluating repository-aware exception handling in LLM code generation. Their construction methodology—combining deep, structured repository context with realistic, diverse exception scenarios—enables robust, functional assessment of advanced LLM strategies such as knowledge-guided prompt construction. The demonstrated performance gaps between CatchAll and prior baselines reveal persistent difficulties in handling repository-scale context, exception propagation, and intent alignment, motivating continued development of context- and knowledge-augmented LLM systems (Tao et al., 3 Jan 2026). The cross-pollination with related benchmarks such as ExecRepoBench further underlines the centrality of execution-based, cross-file evaluations for future LLM research in code intelligence.
