
XCodeEval Benchmark Overview

Updated 13 November 2025
  • XCodeEval is a comprehensive, multilingual benchmark assessing code understanding, generation, translation, retrieval, and automated repair using real competitive programming data.
  • It leverages 25 million code submissions from Codeforces and employs balanced data selection with techniques like min–max flow and geometric mean-based splits.
  • Evaluation relies on execution-based pass@k for generative tasks, plus accuracy and macro-F1 for classification tasks, with iterative repair protocols demonstrating state-of-the-art improvements in APR.

XCodeEval is an execution-based, multilingual, multitask benchmark designed to rigorously evaluate LLMs and program repair methods on realistic code understanding, generation, translation, retrieval, and repair tasks. It supports evaluation across eleven mainstream programming languages and provides an extensible platform for empirical research at scale in program synthesis and automated program repair (APR) (Khan et al., 2023).

1. Benchmark Design and Coverage

XCodeEval was constructed from 25 million document-level code submissions (approximately 16.5 billion code+text tokens) sourced from 7,514 distinct Codeforces competitive programming problems. The benchmark emphasizes realism and linguistic breadth, including C, C++, C#, Go, Java, JavaScript, Kotlin, PHP, Python, Ruby, and Rust. Each problem is annotated with a natural-language description, input/output specifications, sample I/O pairs, algorithmic tags (e.g., “math”, “dp”, “graphs”), a difficulty score (ranging from 800–3500), and a hidden unit test suite (∼50 tests per problem in the full dataset).
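
For concreteness, a single problem record and an associated submission can be pictured roughly as below. The class and field names are illustrative (the released dataset's exact schema may differ), but they mirror the annotations listed above.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical sketch of one XCodeEval problem record.
# Field names are illustrative; the released dataset's schema may differ.
@dataclass
class XCodeEvalProblem:
    problem_id: str                       # e.g., a Codeforces problem identifier
    description: str                      # natural-language problem statement
    input_spec: str                       # input format specification
    output_spec: str                      # output format specification
    sample_io: List[Tuple[str, str]]      # [(sample_input, expected_output), ...]
    tags: List[str]                       # algorithmic tags, e.g., ["math", "dp"]
    difficulty: int                       # Codeforces-style rating, 800-3500
    hidden_tests: List[Tuple[str, str]]   # ~50 hidden (input, output) unit tests

# A submission pairs a problem with source code in one of the 11 languages.
@dataclass
class Submission:
    problem_id: str
    language: str                         # e.g., "C++", "Python", "Ruby"
    source_code: str
    verdict: str                          # e.g., "PASSED", "WRONG_ANSWER"
```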

The benchmark includes seven discrete tasks spanning three categories:

| Category | Task Name | Description |
|---|---|---|
| Classification | Tag Classification | Predict algorithmic tags for code/problem pairs; metric: macro-F1 |
| Classification | Code Compilation | Predict compilability of code under a given runtime; metric: accuracy |
| Generative | Program Synthesis | Generate executable code from the problem description; metric: pass@k |
| Generative | Code Translation | Translate code between languages; metric: pass@k |
| Generative | Automatic Program Repair (APR) | Repair buggy submissions to pass all tests; metric: pass@k |
| Retrieval | NL-Code Retrieval | Retrieve correct code given a natural-language description; metric: Acc@k |
| Retrieval | Code-Code Retrieval | Retrieve logically equivalent code snippets; metric: Acc@k |

For APR, XCodeEval selects short, single-file, competitive-programming–style problems emphasizing well-scoped functions with defined standard I/O behaviors. The Ruby APR subset, as used in follow-up work (Akbarpour et al., 6 Nov 2025), comprises 343 buggy–fixed code pairs (6.8% of the APR validation set).

2. Sample Construction and Data Selection

The validation and test splits are tag-balanced using a geometric mean–based distribution criterion. Given a set of tags $\mathcal{T}$ and a target split ratio $\gamma = |D_{\text{valid}}| / |D_{\text{test}}|$, samples are split such that the geometric mean of per-tag ratios $\gamma_T$,

$$\mathrm{GM}(\{\gamma_T\}_{T\in \mathcal{T}}) = \left(\prod_{T \in \mathcal{T}} \gamma_T\right)^{1/|\mathcal{T}|},$$

approximates $\gamma$ as closely as possible. Splits that do not match tag coverage between the validation and test sets are rejected.
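
A minimal sketch of this criterion, assuming per-tag sample counts for a candidate split are already tallied (the function names and scoring heuristic are illustrative, not the authors' code):

```python
import math
from typing import Dict

def geometric_mean_ratio(valid_counts: Dict[str, int],
                         test_counts: Dict[str, int]) -> float:
    """Geometric mean of per-tag validation/test ratios gamma_T."""
    tags = set(valid_counts) | set(test_counts)
    # Reject splits whose tag coverage differs between valid and test.
    if any(valid_counts.get(t, 0) == 0 or test_counts.get(t, 0) == 0 for t in tags):
        raise ValueError("tag coverage mismatch between valid and test")
    log_sum = sum(math.log(valid_counts[t] / test_counts[t]) for t in tags)
    return math.exp(log_sum / len(tags))

def split_score(valid_counts: Dict[str, int],
                test_counts: Dict[str, int],
                target_gamma: float) -> float:
    """Smaller is better: distance of GM({gamma_T}) from the target ratio."""
    return abs(geometric_mean_ratio(valid_counts, test_counts) - target_gamma)
```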

To ensure balanced problem- and tag-level representation, a min–max flow circulation problem on a bipartite graph encodes bounds on sample counts per problem and tag. Integer flows are computed to yield balanced selection across tags and problem origins, resulting in a uniformly representative sample for each target size.
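
A heavily simplified sketch of this idea using networkx, with upper bounds only and each sample attributed to a single primary tag (the paper's formulation is a min–max circulation that also enforces lower bounds; all names here are illustrative):

```python
import networkx as nx
from collections import defaultdict

def balanced_selection(samples, cap_per_problem: int, cap_per_tag: int):
    """
    Simplified sketch: choose samples so that no problem or tag exceeds its cap,
    maximizing the number of samples selected. `samples` is an iterable of
    (sample_id, problem_id, tag) triples; each sample is attributed to one
    (primary) tag here for simplicity.
    """
    counts = defaultdict(int)
    for _, problem, tag in samples:
        counts[(problem, tag)] += 1

    G = nx.DiGraph()
    for (problem, tag), n in counts.items():
        G.add_edge("source", f"p:{problem}", capacity=cap_per_problem)
        G.add_edge(f"p:{problem}", f"t:{tag}", capacity=n)
        G.add_edge(f"t:{tag}", "sink", capacity=cap_per_tag)

    flow_value, flow_dict = nx.maximum_flow(G, "source", "sink")
    # flow_dict[f"p:{problem}"][f"t:{tag}"] gives how many samples to draw
    # from each (problem, tag) bucket; concrete samples can then be chosen
    # arbitrarily within each bucket.
    return flow_value, flow_dict
```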

3. Evaluation Metrics and Methodology

All generative and repair tasks in XCodeEval employ execution-based assessment via the ExecEval engine. Primary metrics are:

  • Program Synthesis, Translation, APR: pass@$k$ — expected fraction of problems solved by at least one of $k$ independently sampled candidate codes. For each problem, let $n$ be the number of samples generated and $c$ the number of correct predictions:

$$\text{pass@}k = \mathbb{E}_{\text{problems}} \left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$

For deterministic, top-1 repair (e.g., RAMP under greedy decoding), pass@1 reduces to the percentage of problems for which the single generated candidate is correct; a numerically stable estimator for the general case is sketched after this list.

  • Tag Classification: Macro-F1 across all tags.
  • Compilation: Accuracy.
  • Retrieval: Acc@$k$ — fraction of queries for which at least one correct code is retrieved in the top $k$ results.
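
The pass@$k$ expectation above is typically computed per problem with the standard unbiased, numerically stable estimator; a minimal sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """
    Unbiased estimator of pass@k for a single problem:
    1 - C(n-c, k) / C(n, k), computed stably as a running product.
    n: total candidates sampled, c: candidates passing all hidden tests.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Benchmark-level pass@k is the mean over problems, e.g.:
# np.mean([pass_at_k(n_i, c_i, k) for n_i, c_i in per_problem_counts])
```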

The ExecEval engine supports 44 compiler/interpreter versions across the 11 languages, provides resource isolation through Docker and prlimit/seccomp, and returns detailed per-test outcomes: COMPILATION_ERROR, RUNTIME_ERROR, TIME_LIMIT_EXCEEDED, MEMORY_LIMIT_EXCEEDED, WRONG_ANSWER, PASSED.
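
The per-test verdict logic can be pictured with the deliberately simplified harness below. It is not ExecEval's implementation and omits compilation, memory-limit enforcement, and sandboxing (which ExecEval handles via Docker, prlimit, and seccomp); the command and file names in the usage example are hypothetical.

```python
import subprocess

def run_one_test(cmd, test_input: str, expected: str,
                 time_limit_s: float = 2.0) -> str:
    """Run a candidate program `cmd` on one test and return a verdict string."""
    try:
        proc = subprocess.run(
            cmd,
            input=test_input,
            capture_output=True,
            text=True,
            timeout=time_limit_s,
        )
    except subprocess.TimeoutExpired:
        return "TIME_LIMIT_EXCEEDED"
    if proc.returncode != 0:
        return "RUNTIME_ERROR"
    if proc.stdout.strip() != expected.strip():
        return "WRONG_ANSWER"
    return "PASSED"

# Hypothetical usage:
# run_one_test(["python3", "solution.py"], "1 2\n", "3\n")
```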

4. Protocols and Use in Automated Program Repair

XCodeEval's APR validation split has become a central testbed for program repair agents. In RAMP (Akbarpour et al., 6 Nov 2025), a collaborative, multi-agent approach leverages the benchmark under a strictly test-driven protocol:

  • Inputs per sample: natural-language problem description $C_i$, sample I/O $S_i$, buggy code $d_i$, hidden test suite $T_{h,i}$ (with 10–20 tests), and metadata (difficulty, tags, initial bug outcome).
  • Agents:
    • Feedback Integrator produces hypotheses (self-reflection) about the bug.
    • Test Designer synthesizes a suite of 6 guiding tests.
    • Programmer Agent proposes code repairs, guided by chain-of-thought few-shot prompts.
    • Test Executor runs candidate repairs against guiding and hidden tests.
  • Loop: Each problem is processed for up to $K = 11$ repair iterations, with a single repair attempt per round. Repair terminates early if any candidate passes the hidden tests $T_h$ (see the sketch below).
  • Prompting/backbones: DeepSeek-Coder–6.7B-Instruct and Qwen2.5-Coder–7B-Instruct, with temperature and sampling parameters tuned per agent.

The hidden tests are not available to the agent until validation, ensuring realism and preventing overfitting to public I/O examples.
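
Under these assumptions, a RAMP-style loop can be sketched as follows; the agent interfaces are placeholders for the LLM-backed roles described above, not the authors' code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Test = Tuple[str, str]  # (input, expected_output)

@dataclass
class Agents:
    """Placeholder interfaces for the four RAMP roles (LLM-backed in the paper)."""
    feedback_integrator: Callable[[str, str, str], str]          # -> bug hypothesis
    test_designer: Callable[[str], List[Test]]                   # -> ~6 guiding tests
    programmer: Callable[[str, str, str], str]                   # -> repaired code
    test_executor: Callable[[str, List[Test]], Tuple[bool, str]] # -> (all_passed, feedback)

def ramp_repair(problem: str, buggy_code: str, hidden_tests: List[Test],
                agents: Agents, K: int = 11):
    """Illustrative sketch of a RAMP-style test-driven repair loop."""
    guiding_tests = agents.test_designer(problem)
    candidate, feedback = buggy_code, ""
    for _ in range(K):
        hypothesis = agents.feedback_integrator(problem, candidate, feedback)
        candidate = agents.programmer(problem, candidate, hypothesis)
        _, feedback = agents.test_executor(candidate, guiding_tests)
        # Hidden tests decide success only; they are never shown to the agents.
        passed_hidden, _ = agents.test_executor(candidate, hidden_tests)
        if passed_hidden:
            return candidate, True
    return candidate, False
```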

5. Empirical Results and Comparative Analysis

Results from RAMP and contemporaneous baselines on the Ruby APR subset illustrate the benchmark's difficulty and its sensitivity to method design:

| Method | pass@1 (%) |
|---|---|
| Zero-Shot | 24.1 |
| Few-Shot | 47.5 |
| Self-Planning | 56.0 |
| LANTERN | 61.7 |
| ChatRepair | 17.6 |
| Self-Collab. | 0.0 |
| RAMP | 67.0 |

RAMP achieves 67.0% pass@1 on Ruby APR, outperforming LANTERN by 5.3 points (absolute), representing an 8.6% relative improvement. The convergence profile demonstrates stabilizing performance by the fifth iteration (iteration 0: 55%, iteration 1: 60%, iteration 5: 67%), with subsequent iterations providing diminishing returns or small regressions.

Breakdown by initial bug type:

| Bug Outcome Before Repair | pass@1 (%) |
|---|---|
| WRONG_ANSWER (most frequent) | 68.5 |
| COMPILATION_ERROR | 66.7 |
| RUNTIME_ERROR | 60.4 |
| TIME_LIMIT_EXCEEDED | 40.0 |

RAMP exhibits the highest efficacy for wrong answers and compilation errors, with TIME_LIMIT_EXCEEDED cases being the most resistant to repair.

Difficulty stratification reveals over 80% solve rate for easy tasks (difficulty < 1200), near 50% for medium (1200–1400), and under 30% for hard problems (>1400). By domain tag, perfect accuracy is obtained on geometry and string-manipulation problems; brute force, DP, math, games, and graph tasks maintain over 60% solve rates, but no success is observed on advanced, under-represented domains (bitmasks, matrix operations, matchings).

6. Implementation Strengths and Limitations

Strengths:

  • Realistic, diverse bug scenarios with explicit I/O specifications; enables research on competitive-programming–style bugs that are not well represented in web-centric corpora.
  • Rich metadata supports granular analysis of performance by task, tag, and difficulty.
  • The hidden/public test split realistically simulates the human debugging workflow and mitigates the risk of overfitting.

Limitations:

  • Focus on small, single-file problems; large-scale, multi-file, or stateful program bugs—frequent in actual Ruby development (e.g., Rails)—are not represented.
  • Sparse coverage of specialized algorithmic domains (bitmasking, matchings), constraining generalizability.
  • Linguistic skew: several languages (notably Ruby) lack broader coverage beyond algorithmic puzzles.

A plausible implication is that results derived from XCodeEval’s APR split may not generalize to more complex, industrial settings, particularly for languages like Ruby used outside the competitive programming context.

7. Future Expansion and Research Directions

RAMP’s protocol demonstrates that XCodeEval is amenable to cross-language extension; only the test execution environment and few-shot prompts require modification for additional languages. In a small C++ trial (65 samples), RAMP achieves 32.3% pass@1 compared to LANTERN’s 23.0%, indicating portability, albeit at lower absolute performance than on Ruby.

It is suggested that future benchmarks should incorporate multi-file, repository-scale bugs and integrate richer failure types (I/O validation, code style, etc.) to bridge the gap with real-world development practice. More generally, XCodeEval’s comprehensive structure, secure/multilingual execution, and scalable metadata make it a central asset for advancing empirical research in code-centric AI, APR, and multilingual program understanding (Khan et al., 2023, Akbarpour et al., 6 Nov 2025).

