XCodeEval: Multilingual Code Benchmark
- XCodeEval is an execution-based, multilingual benchmark that rigorously evaluates tasks like program repair, synthesis, translation, and retrieval across 11 programming languages.
- It aggregates 25 million code+text pairs from competitive platforms, using detailed annotations and realistic, execution-driven evaluations to challenge large language models.
- The benchmark employs a secure, Dockerized evaluation engine with granular testing metrics such as pass@k and macro-F1 to ensure dynamic, cross-language performance analysis.
XCodeEval is an execution-based, multilingual, multitask benchmark designed to rigorously evaluate code understanding, generation, translation, retrieval, and automated program repair (APR) across diverse programming languages. Originally introduced as the largest executable dataset of its kind, XCodeEval incorporates 25 million document-level coding instances spanning 11 mainstream programming languages. The benchmark is intended to test the limits of LLMs and code-centric AI systems with realistic, end-to-end tasks under program execution constraints, rather than mere textual similarity.
1. Dataset Construction and Scope
XCodeEval aggregates 25 million code+text pairs derived from 7,514 distinct Codeforces problems, comprising approximately 16.5 billion tokens in total. Each problem averages around 3,300 submissions across the full language collection, which includes C, C++, C#, Go, Java, JavaScript, Kotlin, PHP, Python, Ruby, and Rust. In addition to solution code, every problem is annotated with a natural language description, an I/O specification, sample I/O pairs, algorithmic tags (such as "dp" or "graphs"), a scalar difficulty rating, and a hidden suite of unit tests (typically ∼50).
For the APR task in particular, XCodeEval defines a dedicated split of 5,068 validation samples covering 11 languages. As an exemplar, the Ruby subset consists of 343 buggy–fixed code pairs (∼6.8% of the APR split), with each instance comprising a problem specification, sample I/O, buggy code, and a hidden test suite of 10–20 unit tests—simulating isolated bug-fixing scenarios as encountered in competitive programming. Metadata such as difficulty, tags, and pre-repair execution outcomes (WRONG_ANSWER, COMPILATION_ERROR, RUNTIME_ERROR, TIME_LIMIT_EXCEEDED) supports fine-grained analysis and benchmarking.
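The per-instance structure can be pictured as a simple record. The sketch below is purely illustrative: the field names are chosen here for readability and are not the dataset's actual column names.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class UnitTest:
    input: str            # stdin fed to the program
    expected_output: str  # stdout required for a PASSED verdict

@dataclass
class APRInstance:
    # Field names are illustrative, not the released schema.
    problem_statement: str        # natural language description + I/O spec
    sample_tests: List[UnitTest]  # public sample I/O shown to the model
    hidden_tests: List[UnitTest]  # ~10-20 tests used only for scoring
    buggy_code: str               # incorrect submission to repair
    language: str                 # e.g. "Ruby"
    tags: List[str] = field(default_factory=list)  # e.g. ["dp", "graphs"]
    difficulty: int = 0           # Codeforces-style rating
    pre_repair_verdict: str = "WRONG_ANSWER"  # or COMPILATION_ERROR, RUNTIME_ERROR, ...
```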
2. Task Suite and Benchmark Methodology
XCodeEval supports seven primary tasks, grouped into three broad categories:
- Classification
- Tag Classification: Assigns algorithmic tags to code (Code2Tag, DesCode2Tag).
- Code Compilation: Predicts binary compilability.
- Generative
- Program Synthesis: Generates full solutions from natural language descriptions and sample I/O.
- Code Translation: Translates correct code between languages, requiring executable equivalence.
- Automatic Program Repair (APR): Repairs buggy submissions given the problem context, aiming for all tests to pass.
- Retrieval
- NL-Code Retrieval: Finds code solutions from descriptions in a multilingual corpus.
- Code-Code Retrieval: Locates functionally equivalent code snippets, enforced by shared unit tests.
Generative and retrieval tasks are evaluated solely via execution through the multilingual ExecEval engine, ensuring that correctness is assessed by hidden unit-test outcomes. Subtask design emphasizes realistic, end-to-end evaluation. For APR, the model is given a buggy submission together with the problem statement and must generate a fix that passes all hidden tests; public sample tests are visible during repair while the hidden suite is reserved for scoring, emulating real-world constraints.
3. Data Balancing and Splitting Schemes
To rigorously control for data leakage and ensure representative coverage across languages, difficulties, and algorithmic domains, XCodeEval employs two notable strategies:
- Validation/Test Split via Geometric Mean: From a held-out pool of 1,354 problems, candidate random splits are assessed by computing, for each tag $t$, the ratio $r_t$ of validation to test samples and then evaluating the geometric mean $G = \big(\prod_{t \in T} r_t\big)^{1/|T|}$ over the tag set $T$. The split whose $G$ is closest to the desired validation/test ratio is selected, enforcing a balanced distribution of tags across validation and test sets (a minimal sketch follows this list).
- Balanced Subset Selection via Network Flow: Sample selection is formulated as a lower- and upper-bounded circulation problem on a directed graph over problems and tags, ensuring minimal and maximal representation per problem and tag within specified bounds. Integer flows define the number of samples per node, yielding a balanced sample set capped at a prescribed total size $N$ (a flow-based sketch appears at the end of this section).
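The split-selection strategy can be sketched as a plain random-search loop over candidate splits. This is a minimal reconstruction that assumes per-tag valid/test count ratios and a fixed trial budget, not the authors' exact procedure.

```python
import random
from collections import Counter
from math import prod

def geometric_mean(values):
    """Geometric mean of positive numbers."""
    return prod(values) ** (1.0 / len(values))

def choose_split(problems, valid_frac=0.5, trials=1000, seed=0):
    """problems: list of (problem_id, tag_list) pairs. Tries random splits and
    keeps the one whose per-tag valid/test ratios have a geometric mean
    closest to the overall desired valid/test ratio."""
    rng = random.Random(seed)
    tags_of = dict(problems)
    ids = list(tags_of)
    desired = valid_frac / (1.0 - valid_frac)  # target valid:test ratio
    best, best_gap = None, float("inf")
    for _ in range(trials):
        rng.shuffle(ids)
        cut = int(len(ids) * valid_frac)
        valid, test = ids[:cut], ids[cut:]
        v_cnt, t_cnt = Counter(), Counter()
        for pid in valid:
            v_cnt.update(tags_of[pid])
        for pid in test:
            t_cnt.update(tags_of[pid])
        tags = set(v_cnt) | set(t_cnt)
        if any(v_cnt[t] == 0 or t_cnt[t] == 0 for t in tags):
            continue  # a tag absent from either side makes its ratio degenerate
        g = geometric_mean([v_cnt[t] / t_cnt[t] for t in tags])
        if abs(g - desired) < best_gap:
            best, best_gap = (list(valid), list(test)), abs(g - desired)
    return best
```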
This methodology ensures robust, representative testbeds for each task, decoupled from dataset scale or tag frequency, and supports large-scale cross-task and cross-language analyses.
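The circulation formulation can likewise be sketched with an off-the-shelf min-cost-flow routine. Everything below is an assumption made for illustration: the source → tag → problem → sink topology, the one-tag-per-problem simplification, and the use of `networkx.min_cost_flow` with the standard lower-bound reduction folded into node demands; the paper's exact graph construction may differ.

```python
import networkx as nx

def balanced_subset(problem_tags, tag_bounds, problem_bounds, total):
    """Choose per-problem sample counts subject to (lo, hi) bounds per tag and
    per problem, with the overall total fixed. problem_tags maps each problem
    to a single representative tag (a simplification for this sketch)."""
    G = nx.DiGraph()
    demand = {}  # node demands produced by removing edge lower bounds

    def add_edge(u, v, lo, hi):
        # Keep only the residual capacity hi - lo; the mandatory lo units are
        # accounted for by shifting node demands (standard reduction).
        G.add_edge(u, v, capacity=hi - lo, weight=0)
        demand[u] = demand.get(u, 0) + lo
        demand[v] = demand.get(v, 0) - lo

    for tag, (lo, hi) in tag_bounds.items():
        add_edge("SRC", ("tag", tag), lo, hi)
    for prob, tag in problem_tags.items():
        lo, hi = problem_bounds[prob]
        add_edge(("tag", tag), ("prob", prob), lo, hi)
        add_edge(("prob", prob), "SNK", lo, hi)
    add_edge("SNK", "SRC", total, total)  # pins the overall subset size

    nx.set_node_attributes(G, demand, "demand")
    flow = nx.min_cost_flow(G)  # all weights are 0, so this is a feasibility check
    return {prob: flow[("tag", tag)][("prob", prob)] + problem_bounds[prob][0]
            for prob, tag in problem_tags.items()}
```

`min_cost_flow` raises `NetworkXUnfeasible` when the bounds cannot be met, which is exactly the signal one wants if tag or problem quotas are inconsistent with the requested total.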
4. Evaluation Engine and Metrics
Executable correctness in XCodeEval is enforced by ExecEval, a Dockerized, HTTP-based code execution engine supporting 44 runtime versions encompassing 11 languages. Each unit test run is subjected to rigorous resource isolation (prlimit on CPU/mem/process/fd; seccomp for syscall blocking). For each execution, the pipeline returns granular outcomes: COMPILATION_ERROR, RUNTIME_ERROR, TIME_LIMIT_EXCEEDED, MEMORY_LIMIT_EXCEEDED, WRONG_ANSWER, or PASSED.
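ExecEval itself is a Dockerized HTTP service, but the verdict logic can be approximated in a few lines. The snippet below is a local, single-process stand-in that assumes an interpreted candidate and uses Python's `resource` limits in place of the engine's prlimit/seccomp sandbox.

```python
import resource
import subprocess

def _limits(cpu_s=2, mem_bytes=256 * 1024 * 1024):
    # Applied in the child just before exec (POSIX-only); a rough stand-in
    # for prlimit and much weaker than ExecEval's seccomp-backed isolation.
    resource.setrlimit(resource.RLIMIT_CPU, (cpu_s, cpu_s))
    resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))

def judge(cmd, unit_tests, wall_s=5):
    """Run `cmd` (e.g. ["python3", "solution.py"]) on each unit test and return
    a verdict label. A compile step returning COMPILATION_ERROR would precede
    this loop for compiled languages."""
    for test in unit_tests:
        try:
            run = subprocess.run(
                cmd, input=test["input"], capture_output=True, text=True,
                timeout=wall_s, preexec_fn=_limits)
        except subprocess.TimeoutExpired:
            return "TIME_LIMIT_EXCEEDED"
        if run.returncode != 0:
            # Memory-limit kills also surface here; a faithful engine would
            # report MEMORY_LIMIT_EXCEEDED separately.
            return "RUNTIME_ERROR"
        if run.stdout.strip() != test["expected_output"].strip():
            return "WRONG_ANSWER"
    return "PASSED"
```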
The primary generative metric is pass@k, defined as the expected probability that at least one of $k$ independently sampled candidates passes all hidden tests for a problem:

$$\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right],$$

where $n$ is the total number of samples drawn per problem and $c$ is the count of correct samples among them. For deterministic single-sample ($n=1$, $k=1$) settings, including APR with RAMP, this reduces to the proportion of problems for which the top-1 candidate is fully correct. Other tasks employ macro-F1 (tag classification), accuracy (compilation), or Acc@k (retrieval).
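The estimator has a standard numerically stable implementation; a minimal version, aggregated by averaging over problems, looks like this:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem, given n sampled candidates
    of which c pass every hidden test."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    # 1 - C(n-c, k) / C(n, k), expanded as a stable running product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Benchmark-level pass@k is the mean over problems:
# score = sum(pass_at_k(n_i, c_i, k) for n_i, c_i in counts) / len(counts)
```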
5. Experimental Protocols and Results
For APR evaluation (specifically in Ruby), experiments employ a team of collaborative agents (as instantiated in the RAMP framework), using LLM backbones such as DeepSeek-Coder-6.7B-Instruct and Qwen2.5-Coder-7B-Instruct. The protocol allows up to 11 multi-agent, test-driven repair iterations per sample, where each candidate repair is subjected to a tailored guiding test suite before evaluation on hidden tests. Repair attempts utilize few-shot chain-of-thought prompting and strategic resource budgeting (single candidate per iteration; early stopping upon success).
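The iteration-and-early-stopping protocol reduces to a generate-test loop. The sketch below abstracts the collaborating agents behind a `propose_fix` callable and the tailored guiding tests behind `run_guiding_tests`; both names are placeholders for illustration, not RAMP's actual interfaces.

```python
def iterative_repair(instance, propose_fix, run_guiding_tests, max_iterations=11):
    """Test-driven repair: one candidate per iteration, early stop on success.
    propose_fix(instance, feedback) wraps the agent team (few-shot CoT prompt);
    run_guiding_tests(code) returns (passed, execution_feedback)."""
    feedback = None  # iteration 0 has no execution feedback (no self-reflection)
    candidate = instance["buggy_code"]
    for iteration in range(max_iterations):
        candidate = propose_fix(instance, feedback)
        passed, feedback = run_guiding_tests(candidate)
        if passed:
            return candidate, iteration  # early stopping upon success
    return candidate, max_iterations - 1  # best effort; hidden tests decide pass@1
```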
Results on the 343 Ruby samples are summarized as follows:
| Method | Zero-Shot | Few-Shot | Self-Planning | LANTERN | ChatRepair | Self-Collab. | RAMP |
|---|---|---|---|---|---|---|---|
| pass@1 (%) | 24.1 | 47.5 | 56.0 | 61.7 | 17.6 | 0.0 | 67.0 |
- RAMP achieves 67.0% pass@1, surpassing the strongest baseline (LANTERN, 61.7%) by 5.3 percentage points, a relative gain of ≈8.6%.
- Convergence is rapid: the initial candidate (iteration 0; no self-reflection) solves ≈55% of problems; final pass@1 (67%) is reached within five iterations.
- By pre-repair bug type, repair rates are highest for WRONG_ANSWER (68.5%), COMPILATION_ERROR (66.7%), and RUNTIME_ERROR (60.4%), with TIME_LIMIT_EXCEEDED remaining challenging (40%).
Difficulty and domain analysis reveals that RAMP solves all geometry and string-manipulation tasks (100%), achieves >60% on brute force, DP, math, games, and graphs, but fails on under-represented topics (bitmasks, matrix operations, graph matchings). Repair success drops steeply with problem difficulty: >80% for easy problems (rating <1200), ≈50% for medium (1200–1400), and ≈30% for hard (>1400).
Scalability and efficiency metrics, measured on a 10% Ruby subset, indicate that RAMP's total end-to-end time (2.4×10⁴ s) is comparable to few-shot and chat-based approaches and roughly 20× lower than LANTERN's (5.3×10⁵ s).
6. Strengths, Limitations, and Cross-Language Generalization
XCodeEval's key strengths include its coverage of real-world, competitively sourced bugs, explicit algorithmic tagging, and detailed metadata for granular evaluation. The benchmark's architecture—with public versus hidden tests—facilitates realistic, iterative, test-driven repair and guards against overfitting to known test cases.
However, the APR splits (the Ruby split in particular) skew toward small, single-file algorithmic tasks, lacking the scale and complexity of the multi-file or stateful projects common in practical web development. Coverage also remains sparse for advanced algorithmic topics such as bitmasking, matrix operations, and graph matchings.
RAMP's agent-based protocol demonstrates that the XCodeEval methodology generalizes, with preliminary results in C++ indicating cross-language viability (32.3% pass@1 for RAMP vs. 23.0% for LANTERN). This suggests the benchmark can serve as a foundation for multi-agent, test-driven APR in under-studied languages.
A plausible implication is that future iterations of XCodeEval should extend to multi-file, repository-scale bugs—especially for languages such as Ruby—and enrich the failure mode repertoire (e.g., to include I/O validation and stylistic correctness).
7. Impact and Research Outlook
XCodeEval has established itself as the preeminent execution-based, multilingual testbed for code-centric AI research, with unique value in program repair, synthesis, translation, and retrieval. Evaluations using current SOTA LLMs and open-source models have yielded performance notably lower than on prior code benchmarks (e.g., HumanEval), underscoring the greater challenge presented by XCodeEval's tasks and data regime. The benchmark's structure—emphasizing real execution, cross-linguistic diversity, and subtle, high-fidelity data splits—enables nuanced diagnostic and progress-tracking studies in automated reasoning, multilingual code modeling, and robust bug repair.
Future work in this space is expected to involve richer chain-of-thought prompting schemes, hybrid training regimens leveraging multilingual and execution-based feedback, evaluation protocols robust to data contamination, and benchmarks encompassing more complex, repository-scale and stateful programming challenges. Collectively, XCodeEval and its associated infrastructure remain essential for driving advances in AI-driven code generation and repair (Akbarpour et al., 6 Nov 2025, Khan et al., 2023).