
XCodeEval Benchmark Overview

Updated 13 November 2025
  • XCodeEval is a comprehensive, multilingual benchmark assessing code understanding, generation, translation, retrieval, and automated repair using real competitive programming data.
  • It leverages 25 million code submissions from Codeforces and employs balanced data selection with techniques like min–max flow and geometric mean-based splits.
  • Evaluation relies on execution-based pass@k for generative tasks, plus accuracy and macro-F1 for classification tasks, with iterative repair protocols demonstrating state-of-the-art improvements in APR.

XCodeEval is an execution-based, multilingual, multitask benchmark designed to rigorously evaluate LLMs and program repair methods on realistic code understanding, generation, translation, retrieval, and repair tasks. It supports evaluation across eleven mainstream programming languages and provides an extensible platform for empirical research at scale in program synthesis and automated program repair (APR) (Khan et al., 2023).

1. Benchmark Design and Coverage

XCodeEval was constructed from 25 million document-level code submissions (approximately 16.5 billion code+text tokens) sourced from 7,514 distinct Codeforces competitive programming problems. The benchmark emphasizes realism and linguistic breadth, including C, C++, C#, Go, Java, JavaScript, Kotlin, PHP, Python, Ruby, and Rust. Each problem is annotated with a natural-language description, input/output specifications, sample I/O pairs, algorithmic tags (e.g., “math”, “dp”, “graphs”), a difficulty score (ranging from 800–3500), and a hidden unit test suite (∼50 tests per problem in the full dataset).
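
For concreteness, a single problem record and an associated submission can be pictured roughly as below. The class and field names are illustrative (the released dataset's exact schema may differ), but they mirror the annotations listed above.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical sketch of one XCodeEval problem record.
# Field names are illustrative; the released dataset's schema may differ.
@dataclass
class XCodeEvalProblem:
    problem_id: str                       # e.g., a Codeforces problem identifier
    description: str                      # natural-language problem statement
    input_spec: str                       # input format specification
    output_spec: str                      # output format specification
    sample_io: List[Tuple[str, str]]      # [(sample_input, expected_output), ...]
    tags: List[str]                       # algorithmic tags, e.g., ["math", "dp"]
    difficulty: int                       # Codeforces-style rating, 800-3500
    hidden_tests: List[Tuple[str, str]]   # ~50 hidden (input, output) unit tests

# A submission pairs a problem with source code in one of the 11 languages.
@dataclass
class Submission:
    problem_id: str
    language: str                         # e.g., "C++", "Python", "Ruby"
    source_code: str
    verdict: str                          # e.g., "PASSED", "WRONG_ANSWER"
```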

The benchmark includes seven discrete tasks spanning three categories:

| Category | Task Name | Description |
|---|---|---|
| Classification | Tag Classification | Predict algorithmic tags for code/problem pairs; metric: macro-F1 |
| Classification | Code Compilation | Predict compilability of code under a given runtime; metric: accuracy |
| Generative | Program Synthesis | Generate executable code from the problem description; metric: pass@k |
| Generative | Code Translation | Translate code between languages; metric: pass@k |
| Generative | Automatic Program Repair (APR) | Repair buggy submissions to pass all tests; metric: pass@k |
| Retrieval | NL-Code Retrieval | Retrieve correct code given a natural-language description; metric: Acc@k |
| Retrieval | Code-Code Retrieval | Retrieve logically equivalent code snippets; metric: Acc@k |

For APR, XCodeEval selects short, single-file, competitive-programming–style problems emphasizing well-scoped functions with defined standard I/O behaviors. The Ruby APR subset, as used in follow-up work (Akbarpour et al., 6 Nov 2025), comprises 343 buggy–fixed code pairs (6.8% of the APR validation set).

2. Sample Construction and Data Selection

The validation and test splits are tag-balanced using a geometric mean–based distribution criterion. Given a set of tags $\mathcal{T}$ and a target split ratio $\gamma = |D_{\text{valid}}| / |D_{\text{test}}|$, samples are split such that the geometric mean of per-tag ratios $\gamma_T$,

$$\mathrm{GM}(\{\gamma_T\}_{T\in \mathcal{T}}) = \left(\prod_{T \in \mathcal{T}} \gamma_T\right)^{1/|\mathcal{T}|},$$

approximates $\gamma$ as closely as possible. Splits that do not match tag coverage between the validation and test sets are rejected.
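
A minimal sketch of this criterion, assuming per-tag sample counts for a candidate split are already tallied (the function names and scoring heuristic are illustrative, not the authors' code):

```python
import math
from typing import Dict

def geometric_mean_ratio(valid_counts: Dict[str, int],
                         test_counts: Dict[str, int]) -> float:
    """Geometric mean of per-tag validation/test ratios gamma_T."""
    tags = set(valid_counts) | set(test_counts)
    # Reject splits whose tag coverage differs between valid and test.
    if any(valid_counts.get(t, 0) == 0 or test_counts.get(t, 0) == 0 for t in tags):
        raise ValueError("tag coverage mismatch between valid and test")
    log_sum = sum(math.log(valid_counts[t] / test_counts[t]) for t in tags)
    return math.exp(log_sum / len(tags))

def split_score(valid_counts: Dict[str, int],
                test_counts: Dict[str, int],
                target_gamma: float) -> float:
    """Smaller is better: distance of GM({gamma_T}) from the target ratio."""
    return abs(geometric_mean_ratio(valid_counts, test_counts) - target_gamma)
```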

To ensure balanced problem- and tag-level representation, a min–max flow circulation problem on a bipartite graph encodes bounds on sample counts per problem and tag. Integer flows are computed to yield balanced selection across tags and problem origins, resulting in a uniformly representative sample for each target size.
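
A heavily simplified sketch of this idea using networkx, with upper bounds only and each sample attributed to a single primary tag (the paper's formulation is a min–max circulation that also enforces lower bounds; all names here are illustrative):

```python
import networkx as nx
from collections import defaultdict

def balanced_selection(samples, cap_per_problem: int, cap_per_tag: int):
    """
    Simplified sketch: choose samples so that no problem or tag exceeds its cap,
    maximizing the number of samples selected. `samples` is an iterable of
    (sample_id, problem_id, tag) triples; each sample is attributed to one
    (primary) tag here for simplicity.
    """
    counts = defaultdict(int)
    for _, problem, tag in samples:
        counts[(problem, tag)] += 1

    G = nx.DiGraph()
    for (problem, tag), n in counts.items():
        G.add_edge("source", f"p:{problem}", capacity=cap_per_problem)
        G.add_edge(f"p:{problem}", f"t:{tag}", capacity=n)
        G.add_edge(f"t:{tag}", "sink", capacity=cap_per_tag)

    flow_value, flow_dict = nx.maximum_flow(G, "source", "sink")
    # flow_dict[f"p:{problem}"][f"t:{tag}"] gives how many samples to draw
    # from each (problem, tag) bucket; concrete samples can then be chosen
    # arbitrarily within each bucket.
    return flow_value, flow_dict
```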

3. Evaluation Metrics and Methodology

All generative and repair tasks in XCodeEval employ execution-based assessment via the ExecEval engine. Primary metrics are:

  • Program Synthesis, Translation, APR: pass@$k$ — expected fraction of problems solved by at least one of $k$ independently sampled candidate codes. For each problem, let $n$ be the number of samples generated and $c$ the number of correct predictions:

$$\text{pass@}k = \mathbb{E}_{\text{problems}} \left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$

For deterministic, top-1 repair (e.g., RAMP under greedy decoding), pass@1 reduces to the percentage of problems for which the single generated candidate is correct; a numerically stable estimator for the general case is sketched after this list.

  • Tag Classification: Macro-F1 across all tags.
  • Compilation: Accuracy.
  • Retrieval: Acc@$k$ — fraction of queries for which at least one correct code is retrieved in the top $k$ results.
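
The pass@$k$ expectation above is typically computed per problem with the standard unbiased, numerically stable estimator; a minimal sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """
    Unbiased estimator of pass@k for a single problem:
    1 - C(n-c, k) / C(n, k), computed stably as a running product.
    n: total candidates sampled, c: candidates passing all hidden tests.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Benchmark-level pass@k is the mean over problems, e.g.:
# np.mean([pass_at_k(n_i, c_i, k) for n_i, c_i in per_problem_counts])
```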

The ExecEval engine supports 44 compiler/interpreter versions across the 11 languages, provides resource isolation through Docker and prlimit/seccomp, and returns detailed per-test outcomes: COMPILATION_ERROR, RUNTIME_ERROR, TIME_LIMIT_EXCEEDED, MEMORY_LIMIT_EXCEEDED, WRONG_ANSWER, PASSED.
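
The per-test verdict logic can be pictured with the deliberately simplified harness below. It is not ExecEval's implementation and omits compilation, memory-limit enforcement, and sandboxing (which ExecEval handles via Docker, prlimit, and seccomp); the command and file names in the usage example are hypothetical.

```python
import subprocess

def run_one_test(cmd, test_input: str, expected: str,
                 time_limit_s: float = 2.0) -> str:
    """Run a candidate program `cmd` on one test and return a verdict string."""
    try:
        proc = subprocess.run(
            cmd,
            input=test_input,
            capture_output=True,
            text=True,
            timeout=time_limit_s,
        )
    except subprocess.TimeoutExpired:
        return "TIME_LIMIT_EXCEEDED"
    if proc.returncode != 0:
        return "RUNTIME_ERROR"
    if proc.stdout.strip() != expected.strip():
        return "WRONG_ANSWER"
    return "PASSED"

# Hypothetical usage:
# run_one_test(["python3", "solution.py"], "1 2\n", "3\n")
```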

4. Protocols and Use in Automated Program Repair

XCodeEval's APR validation split has become a central testbed for program repair agents. In RAMP (Akbarpour et al., 6 Nov 2025), a collaborative, multi-agent approach leverages the benchmark under a strictly test-driven protocol:

  • Inputs per sample: natural-language problem description $C_i$, sample I/O $S_i$, buggy code $d_i$, hidden test suite $T_{h,i}$ (with 10–20 tests), and metadata (difficulty, tags, initial bug outcome).
  • Agents:
    • Feedback Integrator produces hypotheses (self-reflection) about the bug.
    • Test Designer synthesizes a suite of 6 guiding tests.
    • Programmer Agent proposes code repairs, guided by chain-of-thought few-shot prompts.
    • Test Executor runs candidate repairs against guiding and hidden tests.
  • Loop: Each problem is processed for up to $K = 11$ repair iterations, with a single repair attempt per round. Repair terminates early if any candidate passes the hidden tests $T_h$ (see the sketch below).
  • Prompting/backbones: DeepSeek-Coder–6.7B-Instruct and Qwen2.5-Coder–7B-Instruct, with temperature and sampling parameters tuned per agent.

The hidden tests are not available to the agent until validation, ensuring realism and preventing overfitting to public I/O examples.
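
Under these assumptions, a RAMP-style loop can be sketched as follows; the agent interfaces are placeholders for the LLM-backed roles described above, not the authors' code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Test = Tuple[str, str]  # (input, expected_output)

@dataclass
class Agents:
    """Placeholder interfaces for the four RAMP roles (LLM-backed in the paper)."""
    feedback_integrator: Callable[[str, str, str], str]          # -> bug hypothesis
    test_designer: Callable[[str], List[Test]]                   # -> ~6 guiding tests
    programmer: Callable[[str, str, str], str]                   # -> repaired code
    test_executor: Callable[[str, List[Test]], Tuple[bool, str]] # -> (all_passed, feedback)

def ramp_repair(problem: str, buggy_code: str, hidden_tests: List[Test],
                agents: Agents, K: int = 11):
    """Illustrative sketch of a RAMP-style test-driven repair loop."""
    guiding_tests = agents.test_designer(problem)
    candidate, feedback = buggy_code, ""
    for _ in range(K):
        hypothesis = agents.feedback_integrator(problem, candidate, feedback)
        candidate = agents.programmer(problem, candidate, hypothesis)
        _, feedback = agents.test_executor(candidate, guiding_tests)
        # Hidden tests decide success only; they are never shown to the agents.
        passed_hidden, _ = agents.test_executor(candidate, hidden_tests)
        if passed_hidden:
            return candidate, True
    return candidate, False
```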

5. Empirical Results and Comparative Analysis

Results from RAMP and contemporaneous baselines on the Ruby APR subset illustrate the benchmark's difficulty and its sensitivity to method design:

| Method | pass@1 (%) |
|---|---|
| Zero-Shot | 24.1 |
| Few-Shot | 47.5 |
| Self-Planning | 56.0 |
| LANTERN | 61.7 |
| ChatRepair | 17.6 |
| Self-Collab. | 0.0 |
| RAMP | 67.0 |

RAMP achieves 67.0% pass@1 on Ruby APR, outperforming LANTERN by 5.3 points (absolute), representing an 8.6% relative improvement. The convergence profile demonstrates stabilizing performance by the fifth iteration (iteration 0: 55%, iteration 1: 60%, iteration 5: 67%), with subsequent iterations providing diminishing returns or small regressions.

Breakdown by initial bug type:

| Bug Outcome Before Repair | pass@1 (%) |
|---|---|
| WRONG_ANSWER (most frequent) | 68.5 |
| COMPILATION_ERROR | 66.7 |
| RUNTIME_ERROR | 60.4 |
| TIME_LIMIT_EXCEEDED | 40.0 |

RAMP exhibits the highest efficacy for wrong answers and compilation errors, with TIME_LIMIT_EXCEEDED cases being the most resistant to repair.

Difficulty stratification reveals over 80% solve rate for easy tasks (difficulty < 1200), near 50% for medium (1200–1400), and under 30% for hard problems (>1400). By domain tag, perfect accuracy is obtained on geometry and string-manipulation problems; brute force, DP, math, games, and graph tasks maintain over 60% solve rates, but no success is observed on advanced, under-represented domains (bitmasks, matrix operations, matchings).

6. Implementation Strengths and Limitations

Strengths:

  • Realistic, diverse bug scenarios with explicit I/O specifications; enables research on competitive-programming–style bugs that are not well represented in web-centric corpora.
  • Rich metadata supports granular analysis of performance by task, tag, and difficulty.
  • The hidden/public test split realistically simulates the human debugging workflow and mitigates the risk of overfitting.

Limitations:

  • Focus on small, single-file problems; large-scale, multi-file, or stateful program bugs—frequent in actual Ruby development (e.g., Rails)—are not represented.
  • Sparse coverage of specialized algorithmic domains (bitmasking, matchings), constraining generalizability.
  • Linguistic skew: several languages (notably Ruby) lack broader coverage beyond algorithmic puzzles.

A plausible implication is that results derived from XCodeEval’s APR split may not generalize to more complex, industrial settings, particularly for languages like Ruby used outside the competitive programming context.

7. Future Expansion and Research Directions

RAMP’s protocol demonstrates that XCodeEval is amenable to cross-language extension; only the test execution environment and few-shot prompts require modification for additional languages. In a small C++ trial (65 samples), RAMP achieves 32.3% pass@1 compared to LANTERN’s 23.0%, indicating portability, albeit at lower absolute performance than on Ruby.

It is suggested that future benchmarks should incorporate multi-file, repository-scale bugs and integrate richer failure types (I/O validation, code style, etc.) to bridge the gap with real-world development practice. More generally, XCodeEval’s comprehensive structure, secure/multilingual execution, and scalable metadata make it a central asset for advancing empirical research in code-centric AI, APR, and multilingual program understanding (Khan et al., 2023, Akbarpour et al., 6 Nov 2025).

