
CrossCodeEval: Cross-File Code Evaluation

Updated 22 April 2026
  • CrossCodeEval is a multilingual benchmark that rigorously tests code completion models using cross-file dependency challenges from real-world repositories.
  • It employs static analysis to pinpoint non-trivial completion targets across Python, Java, TypeScript, and C♯, ensuring realistic evaluation settings.
  • Its strict evaluation protocol, leveraging metrics like Exact Match and Identifier F1, advances state-of-the-art context-aware code generation research.

CrossCodeEval is a multilingual benchmark designed to rigorously evaluate the capability of code completion models to resolve cross-file dependencies in realistic, repository-scale software engineering scenarios. Unlike previous benchmarks constrained to in-file completion, CrossCodeEval enforces strict requirements such that each test example demands retrieval and semantic reasoning over code dispersed across multiple files. Its construction leverages static analysis to pinpoint locations where code completion explicitly depends on external, intra-repository symbols and definitions, thereby simulating authentic challenges faced in large-scale software development and pushing the boundaries of context-aware code generation.

1. Benchmark Construction and Characteristics

CrossCodeEval was introduced by Ding et al. as the first benchmark strictly focused on repository-level, cross-file code generation. It covers four major programming languages: Python, Java, TypeScript, and C♯. The data come from real-world, open-source, permissively licensed repositories, selected with stringent filters (created in Mar–Jun 2023, at least 3 stars, size <1MB, 10–50 source files, not included in LLM pretraining datasets) to ensure diversity and minimize train-test leakage (Ding et al., 2023).

For each repository, static analysis is used to identify completion targets that necessarily require cross-file context—e.g., statements involving classes, functions, or constants defined only in external files. Each instance is then formulated as a code-completion "hole": given all tokens up to the missing snippet in one file (local context) and the ability to retrieve from the rest of the repository (global context), the model must generate exactly the reference code (typically a single line or a concise statement). In all languages, pre-processing steps filter out trivial, auto-solvable, or overly generic completions.
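
The idea of flagging statements that depend on symbols defined in other files can be sketched with Python's standard `ast` module. This is a simplified approximation for illustration, not the benchmark's actual pipeline: the function name `cross_file_lines` and the `repo_modules` parameter (the set of top-level module names that live inside the repository) are assumptions made here.

```python
import ast


def cross_file_lines(source: str, repo_modules: set[str]) -> dict[int, set[str]]:
    """Map line number -> names used on that line that were imported
    from repository-local modules (hypothetical helper, simplified)."""
    tree = ast.parse(source)

    # Pass 1: collect names bound by imports that point inside the repository.
    imported: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom):
            is_local = node.level > 0 or (
                node.module is not None
                and node.module.split(".")[0] in repo_modules
            )
            if is_local:
                imported.update(a.asname or a.name for a in node.names)
        elif isinstance(node, ast.Import):
            for a in node.names:
                if a.name.split(".")[0] in repo_modules:
                    imported.add(a.asname or a.name.split(".")[0])

    # Pass 2: record every line that reads one of those names.
    hits: dict[int, set[str]] = {}
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Name)
            and isinstance(node.ctx, ast.Load)
            and node.id in imported
        ):
            hits.setdefault(node.lineno, set()).add(node.id)
    return hits


SOURCE = """\
import os
from utils import process_data

def handle(path):
    rows = os.listdir(path)
    return process_data(rows)
"""

# Only line 6 uses a symbol that comes from another file in the repository.
print(cross_file_lines(SOURCE, {"utils"}))  # {6: {'process_data'}}
```

Lines flagged this way are candidate completion targets; a real pipeline would still need to filter trivial or generic cases, as described above.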

Dataset statistics (Python, Java as typical):

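| Language | # Repos | # Files | # Examples | Avg. target tokens |
|----------|---------|---------|------------|--------------------|
| Python   | 471     | 1,368   | 2,665      | 14.45              |
| Java     | 239     | 745     | 2,139      | 16.76              |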

No standard training/validation/test splits are provided; CrossCodeEval is widely treated as a zero-shot, evaluation-only test set (Ding et al., 2023, Wang et al., 2024, Liu et al., 2024).

2. Task Formulation and Evaluation Protocol

Each task instance consists of a prompt (all code up to a cursor position in a given file) and a ground-truth reference snippet (the missing line or statement). The core requirement is that the missing statement depends on symbols—such as imported classes, called functions, or constants—not declared in the current file but available elsewhere in the same repository (Ding et al., 2023, Zhao et al., 4 Dec 2025).

To succeed, a model must ingest the prompt, optionally incorporate retrieved cross-file context (using retrieval-augmented generation or static analysis), and emit code that matches the reference under multiple evaluation criteria. Common retrieval approaches include BM25, dense vector search, dynamic graph traversal, hybrid retrieval, and static analysis–guided augmentation (Wang et al., 2024, Shah et al., 27 Sep 2025, Jiang et al., 27 Jan 2026).
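
As a rough illustration of the lexical baseline, the sketch below ranks repository chunks against the in-file prompt with the standard Okapi BM25 formula. It is a minimal pure-Python version, not the retrieval code used by any of the cited systems; the tokenizer, the function names, and the defaults `k1=1.5` and `b=0.75` are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(code: str) -> list[str]:
    # Crude identifier-level tokenizer (an assumption; real systems vary).
    return re.findall(r"[A-Za-z_][A-Za-z_0-9]*", code.lower())


def bm25_rank(query: str, chunks: list[str],
              k1: float = 1.5, b: float = 0.75) -> list[tuple[float, int]]:
    """Return (score, chunk_index) pairs, best match first."""
    if not chunks:
        return []
    docs = [Counter(tokenize(c)) for c in chunks]
    lengths = [sum(d.values()) for d in docs]
    avg_len = (sum(lengths) / len(docs)) or 1.0
    # Document frequency: in how many chunks each term appears.
    doc_freq = Counter(term for d in docs for term in d)
    n = len(docs)
    query_terms = set(tokenize(query))

    scored = []
    for idx, (doc, length) in enumerate(zip(docs, lengths)):
        score = 0.0
        for term in query_terms:
            tf = doc[term]
            if tf:
                df = doc_freq[term]
                idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
                norm = tf + k1 * (1 - b + b * length / avg_len)
                score += idf * tf * (k1 + 1) / norm
        scored.append((score, idx))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


chunks = [
    "def process_data(rows):\n    return [r.strip() for r in rows]",
    "class Config:\n    debug = False",
    "def load_rows(path):\n    return open(path).read().splitlines()",
]
ranked = bm25_rank("result = process_data(", chunks)
print(ranked[0][1])  # index of the best-matching chunk
```

The top-ranked chunks would then be added to the model's input as cross-file context.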

An illustrative instance: given a Django management command in Python requesting to complete a function using a helper process_data imported from another file, the model must correctly synthesize the invocation based on available context and project structure (Zhao et al., 4 Dec 2025).
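
A toy version of such an instance, and one plausible way to combine it with retrieved context, might look as follows. The field names (`prompt`, `reference`, `file_path`), the helper `build_prompt`, the file paths, and the character budget standing in for a token budget are all hypothetical; they are not the benchmark's actual schema.

```python
from dataclasses import dataclass


@dataclass
class CompletionInstance:
    file_path: str   # file containing the completion hole
    prompt: str      # all code up to the cursor (local context)
    reference: str   # ground-truth missing line or statement


def build_prompt(instance: CompletionInstance,
                 retrieved: list[tuple[str, str]],
                 max_chars: int = 2000) -> str:
    """Prepend retrieved cross-file snippets as comments, within a budget."""
    parts = []
    used = 0
    for path, snippet in retrieved:
        lines = [f"# From {path}:"] + [f"# {ln}" for ln in snippet.splitlines()]
        block = "\n".join(lines) + "\n"
        if used + len(block) > max_chars:
            break  # stop once the context budget is exhausted
        parts.append(block)
        used += len(block)
    return "".join(parts) + instance.prompt


instance = CompletionInstance(
    file_path="app/management/commands/run.py",
    prompt="from app.helpers import process_data\n\ndef handle(rows):\n    return ",
    reference="process_data(rows)",
)
retrieved = [
    ("app/helpers.py",
     "def process_data(rows):\n    return [r.strip() for r in rows]"),
]
print(build_prompt(instance, retrieved))
```

The model sees the helper's signature as commented context and is expected to emit the reference, here `process_data(rows)`.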

3. Evaluation Metrics

CrossCodeEval employs a strict and multifaceted evaluation protocol, with four principal metrics—computed per sample and aggregated across the dataset (Ding et al., 2023, Wang et al., 2024, Liang et al., 2024, Wang et al., 30 Jan 2026, Deng et al., 28 Jul 2025):

  • Exact Match (EM):

$$\mathrm{EM} = \frac{1}{N}\sum_{i=1}^N \mathbf{1}[\hat y_i = y_i]$$

The prediction must be byte-for-byte identical to the reference, including whitespace and symbols.

  • Edit Similarity (ES):

$$\mathrm{ES} = \frac{1}{N}\sum_{i=1}^N \left(1 - \frac{\mathrm{LevDist}(\hat y_i,\,y_i)}{\max(|\hat y_i|,\,|y_i|)}\right)$$

ES is one minus the Levenshtein distance normalized by the length of the longer string, so partially correct outputs receive partial credit.

  • Identifier Exact Match (ID-EM):

$$\mathrm{ID\_EM} = \frac{1}{N}\sum_{i=1}^N \mathbf{1}[\operatorname{id}(\hat y_i) = \operatorname{id}(y_i)]$$

Requires all predicted identifiers (variable/class/function names) to exactly match ground truth.

  • Identifier F1 (ID-F1):

$$\operatorname{Prec}_i = \frac{|\operatorname{id}(\hat y_i) \cap \operatorname{id}(y_i)|}{|\operatorname{id}(\hat y_i)|},\quad \operatorname{Rec}_i = \frac{|\operatorname{id}(\hat y_i) \cap \operatorname{id}(y_i)|}{|\operatorname{id}(y_i)|}$$

$$\mathrm{ID\_F1} = \frac{1}{N}\sum_{i=1}^N 2\cdot\frac{\operatorname{Prec}_i \cdot \operatorname{Rec}_i}{\operatorname{Prec}_i + \operatorname{Rec}_i}$$
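
The four formulas above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the official scorer: identifiers are extracted with a regular expression and treated as a set (matching the set notation in the formulas), Python keywords are excluded, and an empty overlap is scored as F1 = 0 to avoid division by zero.

```python
import keyword
import re


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance over characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def edit_similarity(pred: str, ref: str) -> float:
    longest = max(len(pred), len(ref))
    return 1.0 if longest == 0 else 1.0 - levenshtein(pred, ref) / longest


def identifiers(code: str) -> set[str]:
    # Regex approximation of a lexer; also matches words inside strings.
    return {t for t in re.findall(r"[A-Za-z_]\w*", code)
            if not keyword.iskeyword(t)}


def identifier_f1(pred: str, ref: str) -> float:
    p, r = identifiers(pred), identifiers(ref)
    overlap = len(p & r)
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(p), overlap / len(r)
    return 2 * prec * rec / (prec + rec)


def score(preds: list[str], refs: list[str]) -> dict[str, float]:
    n = len(refs)
    pairs = list(zip(preds, refs))
    return {
        "EM": sum(p == r for p, r in pairs) / n,
        "ES": sum(edit_similarity(p, r) for p, r in pairs) / n,
        "ID_EM": sum(identifiers(p) == identifiers(r) for p, r in pairs) / n,
        "ID_F1": sum(identifier_f1(p, r) for p, r in pairs) / n,
    }


preds = ["process_data(rows)", "process(rows)"]
refs = ["process_data(rows)", "process_data(rows)"]
print(score(preds, refs))
```

In this example the second prediction misses the reference by five characters and one identifier, so EM and ID-EM are both 0.5, ID-F1 is 0.75, and ES is about 0.86.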

Extended studies also report recall, pass@k, and context retrieval precision, but EM and ES remain central (Ding et al., 2023, Wang et al., 30 Jan 2026).
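
The text does not say how those studies compute pass@k. A common choice is the unbiased estimator popularized by the Codex evaluation: with n samples per problem of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch, assuming that estimator:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples, 3 correct
print(pass_at_k(10, 3, 1))  # probability that a single draw is correct
print(pass_at_k(10, 3, 5))  # probability that at least one of 5 draws is correct
```

The benchmark-level figure is the mean of this value over all problems.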

4. Comparative Context and Distinctive Features

CrossCodeEval is distinguished from earlier benchmarks by several design principles (Ding et al., 2023, Liu et al., 2024, Zhao et al., 4 Dec 2025):

  • Forced cross-file dependencies: All test instances mandate reference to code outside the immediate file. Simple in-file completions are excluded.
  • Multilinguality: Four languages (Python, Java, TypeScript, C♯), facilitating broad generalization studies and robust evaluation (Ding et al., 2023).
  • Real-world scale: Hundreds of repositories per language, spanning thousands of files.
  • High rigor in sample curation: Use of static analysis and AST-level checking prevents trivial and duplicative solutions.
  • Strict evaluation: EM is notably unforgiving, and identifier-based metrics penalize partial or semantically incorrect completions.

In contrast, RepoEval focuses on in-file or limited cross-file completions in a smaller set of large repositories, while ProjBench alters import masking strategies and is focused on internal API representation. CrossCodeEval is considered the reference standard for any method claiming cross-file code understanding (Wang et al., 2024, Deng et al., 28 Jul 2025, Chen et al., 13 Aug 2025).

5. Retrieval, Static Analysis, and System Performance

A broad array of retrieval and static analysis enhancements have been benchmarked on CrossCodeEval:

| Method | Retrieval Type | Key Innovation | Max EM (Python/Java) |
|--------|----------------|----------------|----------------------|
| BM25 | Lexical (BM25) | Sparse lexical ranking | ~21% |
| RLCoder | RL policy, learned retrieval | Perplexity-based RL/sampling | ~36–40% |
| RANGER | Graph-based + BM25 | Cypher on repo KG + lexical fusion | 36.3% |
| RepoFuse | Fusion (analogy + rationale context) | Rank-truncated prompt, context condensation | 28–30% |
| SaraCoder | Hierarchical semantic & structural | Entropy-pruned retrieval, identifier disambiguation | up to 40% |
| AlignCoder | Query enhancement + RL retriever | Sampled completions for query, RL refinement | 34% |
| STALL⁺ | Static analysis integration | Prompt, decode, post-phase dependency cues | 28–46% (Java), 29% (Python); >50% when combined with RAG |
| GrepRAG | Index-free lexical + re-ranking | ripgrep-command generation, identifier refinement | 42–44% |
| CodexGraph | LLM-driven, graph database (Neo4j) | Iterative Cypher query, structure-based selection | 27.9% (GPT-4o) |
| CoCo | Multi-granular static extraction | Project/file/function-level static context | up to +20.2% EM over baseline |

Across all methods, combining static analysis with RAG or graph/lexical retrieval yields the highest gains, particularly in static languages (e.g. Java) (Liu et al., 2024, Wang et al., 30 Jan 2026). Methods based purely on BM25 retrieval are consistently outperformed by approaches that inject semantic, structural, or dependency-aware context (Wang et al., 2024, Shah et al., 27 Sep 2025, Zhao et al., 4 Dec 2025).

The best reported results for 7B–16B LLMs with advanced retrieval and static analysis reach EM of up to 44% (Java; GrepRAG) and ~35% (Python; STALL⁺ + RAG) (Wang et al., 30 Jan 2026, Liu et al., 2024), but absolute performance remains well below that on single-file tasks.

6. Evaluation Procedure and Statistical Analysis

Recent work also addresses multi-metric and statistically robust evaluation across CrossCodeEval's multilingual splits (Ackerman et al., 30 Jan 2025). For a given system and dataset $D_j$, metrics are paired per sample; paired t-tests and Cohen's $d$ are the default tools for testing significance. Aggregate system rankings employ within-language score standardization and across-language harmonic mean p-value combination.
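
The building blocks of such an analysis can be sketched with the standard library alone. The code below is an assumption-laden illustration rather than the cited protocol: it uses the paired-difference variant of Cohen's d (mean difference divided by the standard deviation of the differences), z-score standardization, and the unweighted, unadjusted harmonic mean of p-values; the function names and the sample scores are made up for the example.

```python
import math
from statistics import mean, stdev


def paired_stats(scores_a: list[float], scores_b: list[float]) -> tuple[float, float]:
    """Return (t statistic, Cohen's d_z) for per-sample paired scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    d_z = mean(diffs) / sd              # effect size on the differences
    t = d_z * math.sqrt(len(diffs))     # paired t statistic, df = n - 1
    return t, d_z


def standardize(scores: list[float]) -> list[float]:
    """Within-language z-scores, so languages become comparable."""
    mu, sd = mean(scores), stdev(scores)
    return [(s - mu) / sd for s in scores]


def harmonic_mean_p(p_values: list[float]) -> float:
    """Unweighted harmonic mean of per-language p-values."""
    return len(p_values) / sum(1.0 / p for p in p_values)


# Hypothetical per-sample edit-similarity scores for two systems.
system_a = [0.8, 0.6, 0.9, 0.7, 0.5]
system_b = [0.7, 0.5, 0.7, 0.6, 0.5]
t, d = paired_stats(system_a, system_b)
print(round(t, 3), round(d, 3))

# Combining hypothetical p-values from two language splits.
print(harmonic_mean_p([0.01, 0.04]))
```

Turning the t statistic into a p-value requires the Student t distribution with n - 1 degrees of freedom, which is available in statistics packages but omitted here to keep the sketch dependency-free.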

Visualization includes boxplots for value and rank, with statistical cliques for indistinguishable systems. Leading LLMs such as CodeLlama-13B-hf, CodeGemma-7B, and Granite-34B have been empirically ranked under this protocol, revealing significant and robust performance differences between contemporary models (Ackerman et al., 30 Jan 2025).

7. Limitations, Applications, and Future Directions

While CrossCodeEval has become the reference evaluation suite for cross-file code completion, the literature notes several limitations:

  • The benchmark is by construction zero-shot; no public training split is offered (Ding et al., 2023, Wang et al., 2024, Zhao et al., 4 Dec 2025).
  • Average snippet and prompt lengths are moderate (~15 tokens/line; prompts ~900–1,000 tokens), which stresses context retrieval but not very long-form generation (Ding et al., 2023, Liu et al., 2024).
  • Most methods report results only on Python and Java; TypeScript and C♯ have less extensive published results (Ding et al., 2023).
  • Masking policies (e.g., on imports) affect challenge severity and comparability for certain retrieval baselines (Deng et al., 28 Jul 2025).
  • Identifier-based metrics, while informative, do not fully guarantee semantic correctness; execution-based or functional correctness measures (e.g., unit-test pass rate) are rare (Wu et al., 2024).

Applications of CrossCodeEval span leaderboard evaluations, retrieval algorithm benchmarking, and ablation studies in code LLM and RAG systems. Future work cited in the literature converges on enhancing retrieval efficiency (hybrid lexical/semantic/graph retrievers), integrating program structure at finer granularity, maintaining up-to-date indices for evolving codebases, and improving static analysis for dynamic languages (Shah et al., 27 Sep 2025, Zhao et al., 4 Dec 2025, Liu et al., 2024).

CrossCodeEval remains a critical testbed for measuring progress in context-aware, repository-level code understanding and completion, establishing a high bar for both retrieval and generation components in modern code LLMs.
