
CrossCodeEval Benchmark

Updated 30 January 2026
  • CrossCodeEval is a multilingual, repository-level code completion benchmark that evaluates large language models on their ability to resolve cross-file dependencies in real-world software repositories.
  • It employs a rigorous methodology using static analysis and retrieval techniques to generate prompts requiring cross-file context, ensuring that code completions rely on accurate symbol resolution.
  • Evaluation metrics such as exact match, edit similarity, and identifier-level scores demonstrate significant performance gains when integrating hybrid retrieval approaches and static-analysis enhancements.

CrossCodeEval is a multilingual, repository-level code completion benchmark specifically designed to assess the capability of LLMs to generate code that accurately leverages context spanning multiple files within real-world software repositories. Unlike traditional code completion benchmarks—typically targeting single-file, function-level completions—CrossCodeEval rigorously enforces cross-file dependency, making it a unique and challenging resource for evaluating context-aware code intelligence (Ding et al., 2023).

1. Benchmark Structure and Dataset Construction

CrossCodeEval was curated from open-source GitHub repositories licensed permissively (MIT, Apache-2.0, BSD) and created within a narrow temporal window to minimize pre-training overlap. Repositories were filtered for small size (<1 MB), moderate popularity (≥3 stars), and manageable multi-file structure (10–50 source files), spanning four primary languages: Python, Java, TypeScript, and C# (Ding et al., 2023, Ouyang et al., 2024). The benchmark strictly requires that each completion task involve a cross-file API use, detected using static analysis:

  1. For each repo file, all intra-project imports are identified (i.e., references to symbols defined in other files).
  2. Each imported symbol is temporarily stubbed, and the file is passed to a language-specific static analyzer (e.g., Pylint for Python, javac for Java).
  3. Any resulting "undefined name" or "no-member" error marks a location guaranteed to require cross-file resolution.
  4. At every such site, the preceding lines form the prompt; the remainder of the line or block is the blinded target for completion.
  5. Examples are deduplicated and filtered to eliminate trivial cases, with low-quality or guessed completions discarded.
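Steps 1–3 of this pipeline can be sketched compactly. The following is a minimal illustration, not the benchmark's actual tooling: a real implementation stubs imports and runs Pylint on the modified file, whereas this sketch approximates "undefined after stubbing" by collecting uses of names that are only bound through intra-project imports. The function names and the `project_modules` parameter are invented for illustration.

```python
import ast

def intra_project_imports(tree, project_modules):
    """Names bound by imports that resolve inside the repository."""
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom) and node.module in project_modules:
            names.update(alias.asname or alias.name for alias in node.names)
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name in project_modules:
                    names.add(alias.asname or alias.name.split(".")[0])
    return names

def cross_file_usage_sites(source, project_modules):
    """Return (lineno, name) pairs where an intra-project symbol is used."""
    tree = ast.parse(source)
    imported = intra_project_imports(tree, project_modules)
    sites = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            if node.id in imported:
                sites.append((node.lineno, node.id))
    return sorted(sites)

src = "from utils import helper\n\ndef main():\n    return helper(1)\n"
print(cross_file_usage_sites(src, {"utils"}))  # [(4, 'helper')]
```

Each reported site is a candidate evaluation example: the lines before it become the prompt, and the rest of the statement becomes the blinded target.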

This pipeline produces, for example:

  • Python: 471 repos, 1,368 files, 2,665 examples
  • Java: 239 repos, 745 files, 2,139 examples
  • TypeScript: 193 repos, 779 files, 3,356 examples
  • C#: 99 repos, 642 files, 1,768 examples

Typical prompts average ~90–115 lines and ~900–1200 tokens, while completions are short (1–2 lines, 12–17 tokens), requiring precise cross-file symbol usage (Ding et al., 2023).

2. Problem Formulation and Cross-File Context

The core CrossCodeEval task is, for a given target file $f_{\text{target}}$:

  • Input: All code prior to the blinded statement, imports preserved; the rest of the repository.
  • Output: The exact missing line or block, which depends on at least one symbol $s$ defined in some file $f' \neq f_{\text{target}}$.

Formally, the set of evaluation sites is

$$D = \{\,(f, t) \mid \exists\, f' \neq f,\ s \in S_{f'} \text{ and } s \text{ is used at } t\,\}$$

where $S_{f'}$ denotes the set of symbols defined in file $f'$.

This strict enforcement of cross-file context, combined with deduplication and context filtering, ensures that simple local or memorized completions are infeasible and that models are evaluated on their ability to retrieve and synthesize repository-wide information (Ding et al., 2023).
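The membership condition defining the evaluation set can be transcribed almost directly. This is an illustrative sketch only; the function name and the symbol-table layout are invented, not the benchmark's schema.

```python
def in_evaluation_set(f, used_symbols, symbol_tables):
    """True when some symbol used at the site is defined in a different file.

    symbol_tables: {filename: set of symbols defined in that file}.
    """
    return any(
        s in defs
        for f_prime, defs in symbol_tables.items()
        if f_prime != f
        for s in used_symbols
    )

tables = {"utils.py": {"helper"}, "main.py": {"main"}}
print(in_evaluation_set("main.py", {"helper"}, tables))  # True: cross-file
print(in_evaluation_set("main.py", {"main"}, tables))    # False: local only
```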

3. Evaluation Metrics

CrossCodeEval employs both string-level and identifier-level metrics, capturing surface correctness and semantic fidelity:

  • Code Match – Exact Match (CM-EM):

$$\mathrm{CM\text{-}EM} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\bigl(\hat{y}_i = y_i\bigr) \times 100$$

Fraction of completions exactly matching the ground truth (Ouyang et al., 2024).

  • Code Match – Edit Similarity (CM-ES):

$$\mathrm{CM\text{-}ES} = \frac{1}{N}\sum_{i=1}^{N} \left[1 - \frac{\mathrm{EditDist}(\hat{y}_i, y_i)}{\max\{|\hat{y}_i|, |y_i|\}}\right] \times 100$$

Normalized token-level edit similarity.

  • Identifier Match – Exact Match (IM-EM):

$$\mathrm{IM\text{-}EM} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\bigl(\mathrm{IDs}(\hat{y}_i) = \mathrm{IDs}(y_i)\bigr) \times 100$$

Exact identifier-level matching (variable and function names).

  • Identifier Match – F₁ Score (IM-F₁):

$$\mathrm{IM\text{-}F_1} = \frac{1}{N}\sum_{i=1}^{N} \frac{2 \cdot \mathrm{Precision}_i \cdot \mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i} \times 100$$

Harmonic mean of per-example identifier precision and recall.
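The metrics above can be implemented in a few lines. This is a simplified sketch: the official benchmark tokenizes code before scoring, whereas here character-level edit distance and a naive identifier regex stand in.

```python
import re

def levenshtein(a, b):
    """Character-level edit distance via the standard DP recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cm_em(preds, refs):
    """Exact-match percentage."""
    return 100.0 * sum(p == r for p, r in zip(preds, refs)) / len(refs)

def cm_es(preds, refs):
    """Normalized edit similarity, averaged and scaled to 100."""
    total = sum(1 - levenshtein(p, r) / max(len(p), len(r), 1)
                for p, r in zip(preds, refs))
    return 100.0 * total / len(refs)

IDENT = re.compile(r"[A-Za-z_]\w*")

def im_f1(preds, refs):
    """Per-example identifier-set F1, averaged and scaled to 100."""
    score = 0.0
    for p, r in zip(preds, refs):
        pi, ri = set(IDENT.findall(p)), set(IDENT.findall(r))
        common = len(pi & ri)
        if common == 0:
            continue
        prec, rec = common / len(pi), common / len(ri)
        score += 2 * prec * rec / (prec + rec)
    return 100.0 * score / len(refs)

preds = ["return helper(x)", "foo.bar()"]
refs  = ["return helper(x)", "foo.baz()"]
print(cm_em(preds, refs))  # 50.0
```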

Additional metrics such as edit distance, recall@k for retriever effectiveness, and aggregated multi-metric ranking are employed in advanced statistical analyses (Ackerman et al., 2025).

4. Baseline Methods and Results

Multiple solution paradigms are benchmarked on CrossCodeEval:

  • In-file only: LLMs given only the current file’s prefix. Baseline EM is consistently <11% across languages, demonstrating extreme difficulty without external context.
  • Retrieval-Augmented Generation (RAG): Top-5 retrieved code chunks (BM25, dense retrievers, analogy-based) prepended to the prompt. EM typically doubles or triples, e.g., CodeGen2.5–7B jumps from 7.7% (in-file) to 14.5% (+BM25) on Python (Ding et al., 2023).
  • Oracle retrieval: Using the ground-truth reference for context ranking, upper-bound EM remains <23%, indicating limitations even under ideal retrieval (Ding et al., 2023).
  • Static-analysis integration (STALL+): Prepending hierarchical summaries of imported modules, name-resolved token masks during decoding, and post-processing via static checks all yield significant boosts. File-level dependency prompting achieves 27.8% EM (Python) and 44.2% (Java); combinations with RAG and post-processing reach state-of-the-art EM of 34.7% and 52.1% respectively (Liu et al., 2024).
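The RAG baseline can be sketched with a self-contained BM25 scorer: score every cross-file chunk against the in-file prompt and prepend the top-k hits. Chunking and whitespace tokenization here are simplifying assumptions, and the helper names are invented for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def bm25_rank(query, chunks, k1=1.2, b=0.75):
    """Return chunk indices sorted by descending Okapi BM25 score."""
    docs = [tokenize(c) for c in chunks]
    avgdl = sum(len(d) for d in docs) / len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))
    n = len(docs)
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return sorted(range(n), key=lambda i: -scores[i])

chunks = ["def helper(x): return x + 1",
          "class Config: pass",
          "helper = make_helper()"]
order = bm25_rank("result = helper(value)", chunks)
print(order)  # [2, 0, 1]: both helper-bearing chunks outrank the unrelated one
prompt_context = "\n".join(chunks[i] for i in order[:2])  # top-2 prepended
```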

A sample comparison of methods (Python, DeepSeekCoder-6.7B) is shown below:

Method       Code EM   Edit Sim   ID EM    ID F₁
In-file       9.08%     51.3%     15.9%    48.0%
RepoFuse     26.8%      64.1%     37.0%    63.5%
RepoCoder    26.9%      64.6%     37.6%    64.4%
API-infer    35.4%      69.4%     46.4%    70.4%

RepoGraph, a plug-in module compiling repository-level code graphs, achieves similarly robust performance boosts; when incorporated into GPT-4o in the prompting phase, Code EM increases from 10.8% (baseline) to 28.5%, and ID EM from 16.7% to 36.1% (Ouyang et al., 2024). No formal significance tests were run, though observed gains are substantially above run-to-run variance.

5. Statistical Evaluation and System Ranking

CrossCodeEval supports rigorous statistical comparison of code LLMs across languages and metrics. The recommended protocol involves:

  • Paired/unpaired t-tests and Z-tests for effect sizes (Cohen’s d/h), with Holm–Šidák correction for family-wise error rates.
  • Multi-metric aggregation: Metrics (EM, ES, ID-P, ID-R) standardized and weighted, yielding aggregate rankings and boxplot-based CIs for model comparison (Ackerman et al., 2025).
  • Clique identification: Non-significance graph plots facilitate clique analysis of indistinguishable models.
  • Multi-criteria decision-making (TOPSIS, VIKOR, WSM, etc.): Empirically, aggregate means often correlate highly ($\rho > 0.95$) with these ranking schemes, while formal statistical testing offers actionable confidence intervals for deployment or system-upgrade decisions.

An observed outcome is that all tested model pairs on CrossCodeEval are significantly distinguishable at $\alpha = 0.05$; top–bottom pairs show a "large" effect size ($d > 0.8$), so model choices are robustly supported by empirical comparison.
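Two pieces of this protocol fit in a short sketch: Cohen's d over paired per-example score differences, and the Holm–Šidák step-down thresholds for family-wise error control. The sample scores are invented, and p-values are assumed to come from an external t-test; computing them is out of scope here.

```python
import statistics

def cohens_d(scores_a, scores_b):
    """Paired effect size: mean difference over the SD of differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return statistics.mean(diffs) / statistics.stdev(diffs)

def holm_sidak_reject(p_values, alpha=0.05):
    """Step-down Holm-Sidak: which hypotheses are rejected (original order)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for step, i in enumerate(order):
        # Threshold tightens with the number of hypotheses still open.
        threshold = 1 - (1 - alpha) ** (1 / (m - step))
        if p_values[i] > threshold:
            break  # all remaining (larger) p-values also fail
        rejected[i] = True
    return rejected

a = [0.31, 0.28, 0.35, 0.30, 0.33]  # per-example scores, model A
b = [0.22, 0.20, 0.24, 0.21, 0.26]  # per-example scores, model B
print(round(cohens_d(a, b), 2))             # 5.93, a "large" effect
print(holm_sidak_reject([0.001, 0.04, 0.20]))  # [True, False, False]
```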

6. Integration with Static Analysis and Retrieval Methods

Recent research demonstrates two principal integration axes for improving CrossCodeEval performance:

  • Static analysis integration (STALL+, Prompt-F, Prompt-T): By automatically extracting and summarizing definitions of imported modules and resolving valid API tokens at generation sites, models are supplied with structured context that dramatically increases exact match rates. Prompt-level hierarchical dependency summaries outperform token lists and decoding-phase hard-masks except in Java, where complementary integration is optimal (Liu et al., 2024).
  • Hybrid RAG + static analysis: Combining retrieval of similar code or utilitarian snippets (RepoCoder, BM25, analogy-based) with static analysis yields additive gains. Best results are achieved by RAG + Prompt-F + Post-processing, maximizing both accuracy and semantic correctness, at the cost of higher inference latency.
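The decoding-phase masking idea above can be illustrated at the level of candidate completions: after a receiver such as `cfg.`, keep only candidates whose leading attribute is a member that static analysis resolved for that receiver. All names here are invented for illustration; real systems apply the mask to decoder logits rather than to finished strings.

```python
def mask_candidates(candidates, receiver, resolved_members):
    """Keep only completions whose first attribute is a valid member."""
    valid = resolved_members.get(receiver, set())
    kept = []
    for cand in candidates:
        # Take the leading identifier before any call or chained access.
        member = cand.split("(")[0].split(".")[0].strip()
        if member in valid:
            kept.append(cand)
    return kept

resolved = {"cfg": {"load", "save", "path"}}       # from static analysis
candidates = ["load()", "lod()", "save(path)"]     # model proposals
print(mask_candidates(candidates, "cfg", resolved))  # ['load()', 'save(path)']
```

Filtering out `lod()` at generation time is exactly the class of hallucinated-member error that pushes in-file-only exact match below 11%.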

Failure modes remain for complex multi-hop dependencies (beyond 1-hop), highly dynamic symbols (Python), and prompt-budget overflow with aggressive context augmentation.

7. Limitations and Future Directions

Despite its rigor, CrossCodeEval does not cover:

  • Full repository-scale code synthesis or document-level reasoning, as found in xCodeEval (Khan et al., 2023).
  • Multi-dimensional code attributes (e.g., efficiency, code quality) or runtime execution-based assessment (see COMPASS (Meaden et al., 2025), xCodeEval), limiting its ability to capture algorithmic efficiency and maintainability.

Planned future directions, as argued in RepoGraph and other work, include:

  • Extension to additional languages (e.g., C++ and PHP) and richer task types (summarization, translation).
  • Incorporation of functional correctness (test-case execution), CodeBLEU, and other domain-specific metrics.
  • Adaptive context condensation via graph-based or summarization transformations to enable multi-hop dependency resolution without exceeding prompt constraints.
  • Statistical significance protocols for improved result reliability.
  • End-to-end fine-tuning on repository-level tasks for true cross-file reasoning capabilities.

A plausible implication is that systematic benchmarking frameworks like CrossCodeEval, especially when enhanced by graph, retrieval, and static analyses, provide the experimental foundation and actionable diagnostics necessary for the next generation of context-aware, repository-level AI engineering tools.
