CrossCodeEval: Cross-File Code Evaluation

Updated 22 April 2026

CrossCodeEval is a multilingual benchmark that rigorously tests code completion models using cross-file dependency challenges from real-world repositories.
It employs static analysis to pinpoint non-trivial completion targets across Python, Java, TypeScript, and C♯, ensuring realistic evaluation settings.
Its evaluation protocol reports Exact Match, Edit Similarity, Identifier Exact Match, and Identifier F1.

CrossCodeEval is a multilingual benchmark designed to evaluate how well code completion models resolve cross-file dependencies in realistic, repository-scale software engineering scenarios. Unlike earlier benchmarks limited to in-file completion, every CrossCodeEval test example requires retrieving and reasoning over code spread across multiple files. Its construction uses static analysis to locate positions where the completion explicitly depends on symbols and definitions from other files in the same repository, simulating the challenges of large-scale software development.

1. Benchmark Construction and Characteristics

CrossCodeEval was introduced by Ding et al. as the first benchmark strictly focused on repository-level, cross-file code generation. It covers four programming languages: Python, Java, TypeScript, and C♯. The data come from real-world, open-source, permissively licensed repositories selected with stringent filters (created in Mar–Jun 2023, at least 3 stars, size under 1 MB, 10–50 source files, excluded from LLM pretraining datasets) to ensure diversity and minimize train-test leakage (Ding et al., 2023).

For each repository, static analysis is used to identify completion targets that necessarily require cross-file context—e.g., statements involving classes, functions, or constants defined only in external files. Each instance is then formulated as a code-completion "hole": given all tokens up to the missing snippet in one file (local context) and the ability to retrieve from the rest of the repository (global context), the model must generate exactly the reference code (typically a single line or a concise statement). In all languages, pre-processing steps filter out trivial, auto-solvable, or overly generic completions.
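The idea of a cross-file completion target can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the helper names `cross_file_lines` and `make_hole`, the `repo_modules` argument, and the choice of a whole line as the target are all assumptions made for illustration. The sketch flags lines that use a name imported from a module inside the repository and then splits the file into a prompt and a reference.

```python
import ast


def cross_file_lines(source: str, repo_modules: set[str]) -> list[int]:
    """Return 1-based line numbers that use a name imported from the repository."""
    tree = ast.parse(source)
    local_names: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) always point inside the repository.
            top = (node.module or "").split(".")[0]
            if node.level > 0 or top in repo_modules:
                local_names.update(a.asname or a.name for a in node.names)
        elif isinstance(node, ast.Import):
            for a in node.names:
                if a.name.split(".")[0] in repo_modules:
                    # "import pkg.mod" binds "pkg"; "import pkg.mod as m" binds "m".
                    local_names.add(a.asname or a.name.split(".")[0])
    # Import statements hold alias nodes, not Name nodes, so they are not flagged.
    return sorted({
        n.lineno for n in ast.walk(tree)
        if isinstance(n, ast.Name) and n.id in local_names
    })


def make_hole(source: str, lineno: int) -> tuple[str, str]:
    """Split a file into (prompt, reference), taking one whole line as the target."""
    lines = source.splitlines()
    prompt = "\n".join(lines[: lineno - 1]) + "\n"
    return prompt, lines[lineno - 1]


# Hypothetical file content; "utils" stands for a package inside the repository.
EXAMPLE = """import os
from utils.helpers import process_data

def handle(path):
    raw = os.listdir(path)
    return process_data(raw)
"""

targets = cross_file_lines(EXAMPLE, {"utils"})
prompt, reference = make_hole(EXAMPLE, targets[0])
print(targets)  # [6]
```

Only line 6 is flagged: `os` is a standard-library import, while `process_data` comes from a repository module, so completing that line requires context from another file.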

Dataset statistics (Python, Java as typical):

Language	#Repos	#Files	#Examples	Avg. Target Tokens
Python	471	1,368	2,665	14.45
Java	239	745	2,139	16.76

No standard training/validation/test splits are provided; CrossCodeEval is widely treated as a zero-shot, evaluation-only test set (Ding et al., 2023, Wang et al., 2024, Liu et al., 2024).

2. Task Formulation and Evaluation Protocol

Each task instance consists of a prompt (all code up to a cursor position in a given file) and a ground-truth reference snippet (the missing line or statement). The core requirement is that the missing statement depends on symbols—such as imported classes, called functions, or constants—not declared in the current file but available elsewhere in the same repository (Ding et al., 2023, Zhao et al., 4 Dec 2025).

To succeed, a model must ingest the prompt, optionally incorporate retrieved cross-file context (using retrieval-augmented generation or static analysis), and emit code that matches the reference under multiple evaluation criteria. Common retrieval approaches include BM25, dense vector search, dynamic graph traversal, hybrid retrieval, and static analysis–guided augmentation (Wang et al., 2024, Shah et al., 27 Sep 2025, Jiang et al., 27 Jan 2026).
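As a rough illustration of the lexical baseline, the sketch below ranks repository chunks against the in-file context with a plain Okapi BM25 score and prepends the best chunk to the prompt as comments. The function names, the identifier-style tokenizer, the parameters `k1` and `b`, and the use of the whole in-file context as the query are assumptions for illustration, not the setup of any cited system.

```python
import math
import re
from collections import Counter


def tokenize(code: str) -> list[str]:
    """Split code into identifier-like tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code)


def bm25_rank(query: str, chunks: list[str], k1: float = 1.5, b: float = 0.75) -> list[int]:
    """Return chunk indices sorted from most to least relevant under BM25."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    if n == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n or 1.0
    doc_freq = Counter(t for d in docs for t in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


def build_prompt(local_context: str, chunks: list[str], top_k: int = 1) -> str:
    """Prepend the top-ranked chunks, commented out, to the in-file context."""
    best = bm25_rank(local_context, chunks)[:top_k]
    header = "\n".join("# " + line for i in best for line in chunks[i].splitlines())
    return header + "\n" + local_context


# Hypothetical repository chunks and in-file context.
chunks = [
    "def process_data(rows):\n    return [r.strip() for r in rows]",
    "class Config:\n    DEBUG = False",
    "def render(template):\n    return template.format()",
]
local_context = "raw = load(path)\nresult = process_data("
print(bm25_rank(local_context, chunks)[0])  # 0
```

Only the first chunk shares a token (`process_data`) with the query, so it ranks first. Purely lexical matching of this kind misses dependencies whose names do not yet appear in the prompt.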

An illustrative instance: in a Django management command written in Python, the completion target is a call to a helper, process_data, imported from another file; the model must synthesize the invocation correctly from the available context and project structure (Zhao et al., 4 Dec 2025).

3. Evaluation Metrics

CrossCodeEval employs a strict and multifaceted evaluation protocol, with four principal metrics—computed per sample and aggregated across the dataset (Ding et al., 2023, Wang et al., 2024, Liang et al., 2024, Wang et al., 30 Jan 2026, Deng et al., 28 Jul 2025):

Exact Match (EM):

$\mathrm{EM} = \frac{1}{N}\sum_{i=1}^N \mathbf{1}[\hat y_i = y_i]$

The prediction must be byte-for-byte identical to the reference, including whitespace and symbols.

Edit Similarity (ES):

$\mathrm{ES} = \frac{1}{N}\sum_{i=1}^N \left(1 - \frac{\mathrm{LevDist}(\hat y_i,\,y_i)}{\max(|\hat y_i|,\,|y_i|)}\right)$

Because ES is one minus the normalized Levenshtein distance, it gives partial credit to outputs that are close to the reference but not identical.
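The two string-level metrics can be sketched directly from the formulas above. This is a minimal sketch, assuming character-level edit distance and no whitespace normalization; official evaluation scripts may tokenize or normalize differently. The function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def exact_match(pred: str, ref: str) -> float:
    """1.0 only when prediction and reference are identical strings."""
    return float(pred == ref)


def edit_similarity(pred: str, ref: str) -> float:
    """1 - LevDist / max(len); two empty strings are treated as identical."""
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, ref) / longest


def corpus_mean(metric, preds: list[str], refs: list[str]) -> float:
    """Average a per-sample metric over the dataset."""
    return sum(metric(p, r) for p, r in zip(preds, refs)) / len(refs)


refs = ["return process_data(raw)", "return process_data(raw)"]
preds = ["return process_data(raw)", "return process(raw)"]
print(corpus_mean(exact_match, preds, refs))  # 0.5
```

The second prediction drops `_data` (5 deleted characters out of 24), so it scores 0 on EM but 1 - 5/24 on ES.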

Identifier Exact Match (ID-EM):

$\mathrm{ID\_EM} = \frac{1}{N}\sum_{i=1}^N \mathbf{1}[\operatorname{id}(\hat y_i) = \operatorname{id}(y_i)]$

Here $\operatorname{id}(\cdot)$ denotes the identifiers (variable, class, and function names) extracted from a snippet; the predicted identifiers must exactly match those of the ground truth.

Identifier F1 (ID-F1):

$\operatorname{Prec}_i = \frac{|\operatorname{id}(\hat y_i) \cap \operatorname{id}(y_i)|}{|\operatorname{id}(\hat y_i)|},\quad \operatorname{Rec}_i = \frac{|\operatorname{id}(\hat y_i) \cap \operatorname{id}(y_i)|}{|\operatorname{id}(y_i)|}$

$\mathrm{ID\_F1} = \frac{1}{N}\sum_{i=1}^N 2\cdot\frac{\operatorname{Prec}_i \cdot \operatorname{Rec}_i}{\operatorname{Prec}_i + \operatorname{Rec}_i}$
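The identifier metrics can be sketched in the same spirit. This sketch assumes that $\operatorname{id}(\cdot)$ returns a set of names, extracts them with a regular expression rather than a language-aware lexer, and handles only Python keywords; the zero-overlap and empty-set conventions are also assumptions, since the formulas leave those cases undefined.

```python
import keyword
import re


def identifiers(code: str) -> set[str]:
    """Identifier-like tokens, minus Python keywords (also matches words in strings)."""
    return {t for t in re.findall(r"[A-Za-z_]\w*", code) if not keyword.iskeyword(t)}


def id_exact_match(pred: str, ref: str) -> float:
    """1.0 when both snippets contain exactly the same set of identifiers."""
    return float(identifiers(pred) == identifiers(ref))


def id_f1(pred: str, ref: str) -> float:
    """Harmonic mean of identifier precision and recall for one sample."""
    p, r = identifiers(pred), identifiers(ref)
    if not p and not r:
        return 1.0  # assumption: two identifier-free snippets agree
    overlap = len(p & r)
    if overlap == 0:
        return 0.0  # avoids 0/0 in the F1 formula
    precision = overlap / len(p)
    recall = overlap / len(r)
    return 2 * precision * recall / (precision + recall)


ref = "return process_data(raw)"
print(id_f1("return process(raw)", ref))                 # 0.5
print(id_exact_match("return process_data( raw )", ref))  # 1.0
```

The second call shows how the identifier metrics differ from EM: extra whitespace breaks an exact string match but leaves the identifier set unchanged.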

Extended studies also report recall, pass@k, and context retrieval precision, but EM and ES remain central (Ding et al., 2023, Wang et al., 30 Jan 2026).

4. Comparative Context and Distinctive Features

CrossCodeEval is distinguished from earlier benchmarks by several design principles (Ding et al., 2023, Liu et al., 2024, Zhao et al., 4 Dec 2025):

Forced cross-file dependencies: All test instances mandate reference to code outside the immediate file. Simple in-file completions are excluded.
Multilinguality: Four languages (Python, Java, TypeScript, C♯), facilitating broad generalization studies and robust evaluation (Ding et al., 2023).
Real-world scale: Hundreds of repositories per language, spanning thousands of files.
High rigor in sample curation: Use of static analysis and AST-level checking prevents trivial and duplicative solutions.
Strict evaluation: EM is notably unforgiving, and identifier-based metrics penalize partial or semantically incorrect completions.

In contrast, RepoEval focuses on in-file or limited cross-file completions in a smaller set of large repositories, while ProjBench alters import masking strategies and is focused on internal API representation. CrossCodeEval is considered the reference standard for any method claiming cross-file code understanding (Wang et al., 2024, Deng et al., 28 Jul 2025, Chen et al., 13 Aug 2025).

5. Retrieval, Static Analysis, and System Performance

A broad array of retrieval and static analysis enhancements have been benchmarked on CrossCodeEval:

Method	Retrieval Type	Key Innovation	Max EM (Python/Java)
BM25	Lexical (BM25)	Sparse lexical ranking	~21%
RLCoder	RL policy, learned retrieval	Perplexity-based RL/sampling	~36–40%
RANGER	Graph-based + BM25	Cypher on repo KG + lexical fusion	36.3%
RepoFuse	Fusion (analogy + rationale context)	Rank-truncated prompt, context condensation	28–30%
SaraCoder	Hierarchical semantic & structural	Entropy-pruned retrieval, identifier disambig.	up to 40%
AlignCoder	Query enhancement + RL retriever	Sampled completions for query, RL refinement	34%
STALL⁺	Static analysis integration	Prompt, decode, post-phase dependency cues	28–46% (Java), 29% (Python); >50% when combined with RAG
GrepRAG	Index-free lexical + re-ranking	ripgrep-command generation, identifier refinement	42–44%
CodexGraph	LLM-driven, graph database (Neo4j)	Iterative Cypher query, structure-based selection	27.9% (GPT-4o)
CoCo	Multi-granular static extraction	Project/file/function-level static context	up to +20.2% EM over baseline

Across all methods, combining static analysis with RAG or graph/lexical retrieval yields the highest gains, particularly in statically typed languages such as Java (Liu et al., 2024, Wang et al., 30 Jan 2026). Methods based purely on BM25 retrieval are consistently outperformed by approaches that inject semantic, structural, or dependency-aware context (Wang et al., 2024, Shah et al., 27 Sep 2025, Zhao et al., 4 Dec 2025).

Current SOTA for 7B–16B LLMs with advanced retrieval and static analysis achieves EM up to 44% (Java; GrepRAG) and ~35% (Python; STALL⁺ + RAG) (Wang et al., 30 Jan 2026, Liu et al., 2024), but absolute performance remains well below single-file tasks.

6. Evaluation Procedure and Statistical Analysis

Recent work also addresses multi-metric, statistically robust evaluation across CrossCodeEval’s multilingual splits (Ackerman et al., 30 Jan 2025). For a given system and dataset $D_j$, metrics are paired per sample; paired t-tests and Cohen’s $d$ are the default tools for testing significance and measuring effect size. Aggregate system rankings use within-language score standardization and combine p-values across languages via the harmonic mean.
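The building blocks of such a comparison can be sketched with the Python standard library. This is a minimal sketch under stated assumptions, not the cited protocol: Cohen's $d$ is computed in its paired form (mean difference over the standard deviation of differences), the harmonic mean uses equal weights, and the function names and sample scores are hypothetical. Converting the t statistic into a p-value needs the t distribution, which a statistics library such as SciPy (`scipy.stats.ttest_rel`) provides.

```python
import math
import statistics


def paired_t_and_d(scores_a: list[float], scores_b: list[float]) -> tuple[float, float]:
    """Paired t statistic and paired-form Cohen's d for per-sample scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired sample by sample")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    if sd_diff == 0:
        raise ValueError("all paired differences are identical")
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    cohen_d = mean_diff / sd_diff
    return t_stat, cohen_d


def standardize(values: list[float]) -> list[float]:
    """Z-score system scores within one language so languages are comparable."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def harmonic_mean_p(p_values: list[float]) -> float:
    """Equal-weight harmonic mean of per-language p-values."""
    return len(p_values) / sum(1.0 / p for p in p_values)


# Hypothetical per-sample Edit Similarity scores for two systems on the same samples.
system_a = [0.9, 0.8, 0.7, 1.0]
system_b = [0.5, 0.6, 0.7, 0.6]
t_stat, cohen_d = paired_t_and_d(system_a, system_b)
print(round(t_stat, 2), round(cohen_d, 2))
```

Pairing per sample matters because both systems are scored on the same completion targets; the test then looks only at the per-target differences, which removes variation in target difficulty from the comparison.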

Visualization includes boxplots for value and rank, with statistical cliques for indistinguishable systems. Leading LLMs such as CodeLlama-13B-hf, CodeGemma-7B, and Granite-34B have been empirically ranked under this protocol, revealing significant and robust performance differences between contemporary models (Ackerman et al., 30 Jan 2025).

7. Limitations, Applications, and Future Directions

While CrossCodeEval has become the reference evaluation suite for cross-file code completion, the literature notes several limitations:

The benchmark is by construction zero-shot; no public training split is offered (Ding et al., 2023, Wang et al., 2024, Zhao et al., 4 Dec 2025).
Average target and prompt lengths are moderate (~15 tokens per target; prompts of ~900–1,000 tokens), which stresses context retrieval but not long-form generation (Ding et al., 2023, Liu et al., 2024).
Most methods report results only on Python and Java; TypeScript and C♯ have less extensive published results (Ding et al., 2023).
Masking policies (e.g., on imports) affect challenge severity and comparability for certain retrieval baselines (Deng et al., 28 Jul 2025).
Identifier-based metrics, while informative, do not fully guarantee semantic correctness; execution-based or functional correctness measures (e.g., unit-test pass rate) are rare (Wu et al., 2024).

Applications of CrossCodeEval span leaderboard evaluations, retrieval algorithm benchmarking, and ablation studies in code LLM and RAG systems. Future work cited in the literature converges on enhancing retrieval efficiency (hybrid lexical/semantic/graph retrievers), integrating program structure at finer granularity, maintaining up-to-date indices for evolving codebases, and improving static analysis for dynamic languages (Shah et al., 27 Sep 2025, Zhao et al., 4 Dec 2025, Liu et al., 2024).

CrossCodeEval remains a critical testbed for measuring progress in context-aware, repository-level code understanding and completion, establishing a high bar for both retrieval and generation components in modern code LLMs.