RepoExEval: Repository-Level Exception Benchmark
- RepoExEval is a repository-level benchmark that evaluates LLMs' exception-handling capabilities in real-world Java (Android) projects.
- It leverages deep call-trace analysis and API–exception mapping from 120,000 annotated try–catch instances across 3,200 Android repositories.
- RepoExEval and its executable subset, RepoExEval-Exec, assess both static and dynamic performance using metrics like Pass@1, CodeBLEU, and IntentAcc.
RepoExEval is a large-scale, repository-level benchmark designed to evaluate exception-handling capabilities of LLMs in real-world Java (Android) codebases. Developed in the context of the CatchAll system, RepoExEval systematically assesses the accuracy, intent alignment, and functional correctness of exception-handling code generation at scale, stressing deep, cross-file context and a diverse exception vocabulary. Its companion subset, RepoExEval-Exec, provides executable test scenarios for functional evaluation with real unit tests. Collectively, these benchmarks anchor the state of the art in repository-aware exception-handling evaluation for LLMs (Tao et al., 3 Jan 2026).
1. Data Construction and Benchmark Design
RepoExEval is built from a curated set of 3,200 well-maintained Android Java repositories, selected via GitHub queries for “Android” and “Java”, filtered by project popularity (star count), F-Droid catalog listing, and temporal cutoffs to minimize LLM data leakage. Tree-sitter’s abstract syntax tree (AST) extraction is utilized to locate each try–catch block. For each try block, a contextual call trace is computed: the algorithm records the imports, enclosing class/method, and recursively unrolls all API calls in the block, tracing their definitions across the repository to a configurable depth (mean call-trace depth: 7.2; mean cross-file span: 5.7 files). Each instance consists of metadata, the try/catch code, caught exception type, and the full cross-file context.
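The call-trace construction can be pictured as a bounded, depth-limited unrolling over a repository-wide method index. The following is a minimal sketch of that idea; the `RepoIndex`, `MethodDef`, and `CallTraceBuilder` names are hypothetical and not taken from the paper's implementation.

```java
// Hypothetical sketch of the recursive call-trace unrolling described above.
import java.util.*;

class CallTraceBuilder {
    /** Repository-wide index from method signature to definition, built from the project's ASTs. */
    interface RepoIndex {
        MethodDef lookup(String calleeSignature);   // null if external (JDK/SDK/library)
        List<String> calleesOf(MethodDef def);      // API calls inside the method body
    }

    record MethodDef(String signature, String file, String source) {}

    private final RepoIndex index;
    private final int maxDepth;                      // configurable unrolling depth

    CallTraceBuilder(RepoIndex index, int maxDepth) {
        this.index = index;
        this.maxDepth = maxDepth;
    }

    /** Recursively unrolls the APIs called in a try block, collecting cross-file definitions. */
    List<MethodDef> unroll(List<String> apisInTryBlock) {
        List<MethodDef> trace = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        for (String api : apisInTryBlock) {
            unrollOne(api, 0, visited, trace);
        }
        return trace;
    }

    private void unrollOne(String signature, int depth, Set<String> visited, List<MethodDef> trace) {
        if (depth >= maxDepth || !visited.add(signature)) return;  // bound depth, avoid cycles
        MethodDef def = index.lookup(signature);
        if (def == null) return;                                   // external API: kept only as a call site
        trace.add(def);
        for (String callee : index.calleesOf(def)) {
            unrollOne(callee, depth + 1, visited, trace);
        }
    }
}
```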
An empirically constructed API–exception mapping is built by aggregating all try–catch sites across the dataset: each API discovered anywhere in the extended context of a catch is mapped to the exception(s) handled in that context. This produces a many-to-many API–exception table, crucial for guiding both exception type inference and handling-pattern mining. These extracted mappings, call traces, and ground-truth labels together enable repository- and API-aware evaluation.
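A minimal sketch of this aggregation, assuming a simplified per-site representation (the `CatchSite` and `ApiExceptionMapping` names are illustrative, not the paper's code):

```java
// Builds the many-to-many API–exception table by aggregating mined try–catch sites.
import java.util.*;

class ApiExceptionMapping {
    /** One mined try–catch site: the APIs in its extended context and the exception it catches. */
    record CatchSite(List<String> apisInContext, String caughtException) {}

    /** Aggregates all sites into API -> set of exceptions observed alongside that API. */
    static Map<String, Set<String>> build(List<CatchSite> sites) {
        Map<String, Set<String>> table = new HashMap<>();
        for (CatchSite site : sites) {
            for (String api : site.apisInContext()) {
                table.computeIfAbsent(api, k -> new TreeSet<>()).add(site.caughtException());
            }
        }
        return table;
    }
}
```

Under this construction, a hypothetical entry such as `okhttp3.Call.execute` mapped to `{IOException, SocketTimeoutException}` would arise whenever those exceptions are caught in contexts that contain that call.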
2. Dataset Composition and Statistical Properties
RepoExEval includes approximately 120,000 annotated try–catch instances, each capturing a code context, the caught exception, and comprehensive repository-level metadata. The held-out test set consists of 1,000 randomly sampled instances; the remaining 119,000 are used as a reference corpus, informing retrieval and API–exception mapping. APIs span standard JDK/Android-SDK, common libraries (e.g., okhttp, gson), and user-defined methods; exception coverage ranges over hundreds of Java types, from NullPointerException to IllegalArgumentException.
RepoExEval-Exec is an executable subset spanning four Android apps (Aria2App, openScale, Overchan-Android, Signal-Android), comprising 100 real-world exception sites with 174 unit tests. These were selected for their existing test infrastructure and minimal dependency entanglement. Typical call-trace depth in this subset is 7.7, with cross-file propagation averaging 6.1 files and project sizes exceeding 600 Java classes.
| Subset | #Repos | #Instances | Mean Trace Depth | Mean Cross-file Span | Test Coverage (Exec) |
|---|---|---|---|---|---|
| RepoExEval | 3,200 | 120,000 | 7.2 | 5.7 | Static |
| RepoExEval-Exec | 4 | 100 | 7.7 | 6.1 | 174 unit tests |
3. Annotation Schema and Ground Truth
RepoExEval employs a dual-labeling schema: (a) exception type, the specific Java exception class declared in the catch clause; and (b) exception-handling intent, a multi-label annotation drawn from five canonical categories (illustrated in the sketch following this list):
- Logging (e.g., Log.e, logger.warn)
- Retry (loop or recursive retry logic)
- Recovery/fallback (default value assignment)
- Return (early return with default/error code)
- Rethrow (explicit propagation)
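The snippets below give one hypothetical handler per intent category; the method names, constants, and the `fetchRemoteConfig` helper are illustrative and not instances drawn from the benchmark.

```java
// Hypothetical Java examples of the five handling intents.
import android.util.Log;
import java.io.IOException;
import org.json.JSONException;
import org.json.JSONObject;

class IntentExamples {
    private static final String TAG = "IntentExamples";
    private static final int MAX_RETRIES = 3;
    private static final int DEFAULT_TIMEOUT_MS = 5_000;

    // Logging: record the failure and continue.
    void logging() {
        try {
            fetchRemoteConfig();
        } catch (IOException e) {
            Log.e(TAG, "config fetch failed", e);
        }
    }

    // Retry: re-attempt the failing call a bounded number of times.
    String retry() throws IOException {
        IOException last = null;
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            try {
                return fetchRemoteConfig();
            } catch (IOException e) {
                last = e;                      // remember the failure and try again
            }
        }
        throw last;
    }

    // Recovery/fallback: substitute a safe default value.
    int recovery(String raw) {
        try {
            return Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            return DEFAULT_TIMEOUT_MS;
        }
    }

    // Return: bail out early with a default/error result.
    JSONObject earlyReturn(String json) {
        try {
            return new JSONObject(json);
        } catch (JSONException e) {
            return null;
        }
    }

    // Rethrow: wrap and propagate the exception explicitly.
    void rethrow() {
        try {
            fetchRemoteConfig();
        } catch (IOException e) {
            throw new IllegalStateException("config fetch failed", e);
        }
    }

    private String fetchRemoteConfig() throws IOException { /* hypothetical network call */ return "{}"; }
}
```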
Test set intents (1,000 instances) are manually annotated by expert raters with conflict resolution through discussion. For the training/validation portion, intent labels are derived via rule-based heuristics operating on catch bodies (e.g., Log.* invocations signify Logging), spot-verified for accuracy (>95%). Ground-truth exception types are always taken from the declared catch parameter.
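A sketch of what such rule-based labeling might look like is given below; only the Log.* cue is stated in the source, so the remaining cues are assumptions added for illustration.

```java
// Minimal sketch of heuristic, multi-label intent tagging over a catch body.
import java.util.*;
import java.util.regex.Pattern;

class IntentHeuristics {
    enum Intent { LOGGING, RETRY, RECOVERY, RETURN, RETHROW }

    private static final Pattern LOGGING_CUE = Pattern.compile("\\b(Log\\.[a-z]+|logger\\.\\w+)\\s*\\(");
    private static final Pattern RETHROW_CUE = Pattern.compile("\\bthrow\\b");
    private static final Pattern RETURN_CUE  = Pattern.compile("\\breturn\\b");
    private static final Pattern RETRY_CUE   = Pattern.compile("\\b(retry|attempt)\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern ASSIGN_CUE  = Pattern.compile("\\w\\s*=(?!=)");  // crude: any plain assignment

    /** Multi-label classification: a single catch body may express several intents. */
    static Set<Intent> label(String catchBody) {
        Set<Intent> intents = EnumSet.noneOf(Intent.class);
        if (LOGGING_CUE.matcher(catchBody).find()) intents.add(Intent.LOGGING);
        if (RETHROW_CUE.matcher(catchBody).find()) intents.add(Intent.RETHROW);
        if (RETURN_CUE.matcher(catchBody).find())  intents.add(Intent.RETURN);
        if (RETRY_CUE.matcher(catchBody).find())   intents.add(Intent.RETRY);
        if (ASSIGN_CUE.matcher(catchBody).find())  intents.add(Intent.RECOVERY);
        return intents;
    }
}
```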
Quality controls explicitly guard against pre-training data leakage (temporal cutoffs) and limit overlap with benchmarks such as CodeSearchNet (4.4% overlap), maintaining novelty for LLM evaluation.
4. Evaluation Protocols and Metrics
RepoExEval and RepoExEval-Exec are evaluated along both static and dynamic axes:
- Pass@k: For executable cases (RepoExEval-Exec), generated exception handlers are integrated into the original project and submitted to the relevant unit tests; Pass@1 denotes the probability that a single model completion passes all tests. Explicitly,

$$\text{Pass@}k = \mathbb{E}_{\text{instances}}\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right],$$

where $n$ is the number of sampled completions per instance and $c$ the number of those completions that pass all associated unit tests.
- CodeBLEU: Measures lexical, syntactic (AST), and semantic (data-flow) correspondence to ground truth via the weighted combination

$$\text{CodeBLEU} = \alpha \cdot \text{BLEU} + \beta \cdot \text{BLEU}_{\text{weight}} + \gamma \cdot \text{Match}_{\text{ast}} + \delta \cdot \text{Match}_{\text{df}},$$

where $\text{BLEU}$ denotes $n$-gram matching, $\text{BLEU}_{\text{weight}}$ keyword-weighted $n$-gram matching, $\text{Match}_{\text{ast}}$ AST match, and $\text{Match}_{\text{df}}$ data-flow match, with empirically set weights $\alpha, \beta, \gamma, \delta$.
- IntentAcc: Handling-intent accuracy, formulated as

$$\text{IntentAcc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[ \hat{\mathcal{I}}_i = \mathcal{I}_i \right],$$

where $\mathcal{I}_i$ denotes the gold intents and $\hat{\mathcal{I}}_i$ the predicted intents for instance $i$, over $N$ test instances.
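A minimal sketch of the IntentAcc computation as reconstructed above; the exact-set-match criterion per instance is an assumption where the source formula is elided.

```java
// Fraction of instances whose predicted intent set equals the gold intent set.
import java.util.*;

class Metrics {
    static double intentAcc(List<Set<String>> gold, List<Set<String>> predicted) {
        int correct = 0;
        for (int i = 0; i < gold.size(); i++) {
            if (gold.get(i).equals(predicted.get(i))) correct++;
        }
        return (double) correct / gold.size();
    }
}
```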
Multiple baselines are evaluated, including ExAssist (rule-based), Nexgen (NMT encoder–decoder), Direct-Prompting (vanilla LLM), RepoCoder (repo-level code completion), KPC (API-doc chaining), and Seeker (multi-agent).
5. Empirical Findings and Comparative Performance
On the main RepoExEval split (GPT-4o backbone), CatchAll achieves Pass@1=29%, CodeBLEU=0.31, and IntentAcc=60.1%, outperforming the next best (RepoCoder: Pass@1=25%, CodeBLEU=0.27, IntentAcc=48.0%). These gains reflect improved functional correctness (+4 percentage points Pass@1), structure/semantics alignment (+14.8% CodeBLEU, relative), and intent inference (+12.1 percentage points). On RepoExEval-Exec, CatchAll remains the strongest method, with ~30% Pass@1 compared to ~25% for the next best method.
The challenge of RepoExEval arises from its deep (mean trace depth >7), cross-file context, the large and diverse API–exception vocabulary, and repository-level propagation of exception semantics. Diverse handling intent patterns in real code further stress model generalization. An example from WordPress Android demonstrates the need for call-trace-based API–exception inference and retrieval of historical fallback templates, whereas baseline models often catch overly generic exceptions or fail to enact meaningful recovery logic.
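For illustration only (this is not the WordPress Android case referenced above), the contrast between a generic baseline handler and a repository-aware one might look like the following; all class and method names here are hypothetical.

```java
// Hypothetical contrast: overly generic baseline catch vs. specific, recovery-oriented handling.
import java.io.IOException;
import java.util.List;

class SyncExample {
    // Typical baseline output: swallows everything, no meaningful recovery.
    List<Post> fetchPostsBaseline(Api api) {
        try {
            return api.fetchPosts();
        } catch (Exception e) {          // overly generic catch
            e.printStackTrace();
            return null;                 // caller must now null-check
        }
    }

    // Repository-aware handling: the specific exception implied by the call trace,
    // plus a fallback pattern of the kind mined from similar sites (cached data + logging).
    List<Post> fetchPostsRepoAware(Api api, Cache cache) {
        try {
            return api.fetchPosts();
        } catch (IOException e) {
            log("posts fetch failed, serving cached copy", e);
            return cache.lastKnownPosts();
        }
    }

    // Hypothetical collaborators, included only so the sketch is self-contained.
    interface Api { List<Post> fetchPosts() throws IOException; }
    interface Cache { List<Post> lastKnownPosts(); }
    record Post(String title) {}
    private void log(String msg, Throwable t) { System.err.println(msg + ": " + t); }
}
```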
6. Relation to Other Repository-Level Evaluation Frameworks
RepoExEval complements a broader trend toward executable, context-rich benchmarks exemplified by ExecRepoBench (Yang et al., 2024). Both benchmarks emphasize repository-level, multi-file, execution-backed evaluation. While ExecRepoBench is Python-focused with multi-level AST masking and Pass@k assessment leveraging comprehensive project unit tests, RepoExEval targets Java (Android) and exception handling, featuring deep call-trace context, API–exception mapping, and fine-grained handling intent annotation. Methods such as Qwen2.5-Coder-Instruct-C, tuned on similar repository-level, grammar-based completion corpora, demonstrate the importance and impact of realistic, cross-file benchmarks in driving LLM code intelligence.
7. Significance and Research Impact
RepoExEval and RepoExEval-Exec constitute the largest and most challenging benchmarks to date for evaluating repository-aware exception handling in LLM code generation. Their construction methodology—combining deep, structured repository context with realistic, diverse exception scenarios—enables robust, functional assessment of advanced LLM strategies such as knowledge-guided prompt construction. The demonstrated performance gaps between CatchAll and prior baselines reveal persistent difficulties in handling repository-scale context, exception propagation, and intent alignment, motivating continued development of context- and knowledge-augmented LLM systems (Tao et al., 3 Jan 2026). The cross-pollination with related benchmarks such as ExecRepoBench further underlines the centrality of execution-based, cross-file evaluations for future LLM research in code intelligence.