RACodeBench: Real-World Code Repair Benchmark

Updated 3 July 2026

RACodeBench is a real-world benchmark dataset and evaluation protocol that pairs user-submitted buggy code with corresponding fixes from Codeforces.
It comprises 4,200 buggy/fixed code pairs with fine-grained error annotations and stratified splits to support realistic, reproducible evaluations.
The benchmark reports Strict Accuracy and Test Pass Rate, together with inference cost, to compare retrieval-augmented, LLM-based, and self-repair methods.

RACodeBench is a benchmark dataset and evaluation protocol for measuring the effectiveness and efficiency of automated code repair techniques on real-world buggy code. Designed to overcome the limitations of prior synthetic benchmarks, it focuses on real user-submitted defects, rich algorithmic diversity, and fine-grained error annotation. These properties support rigorous comparative evaluation and diagnostic analysis of program repair, especially for LLM-based and retrieval-augmented methods (Zhao et al., 2 Sep 2025).

1. Motivation and Design Rationale

Synthetic code-repair benchmarks have historically failed to capture the complexity and distributional characteristics of real programming errors encountered by users. RACodeBench was created to bridge this gap by curating a large-scale dataset from Codeforces, a competitive programming platform with a diverse, publicly available submission history. Its design objectives include:

Representing the actual distribution of bug types and algorithmic intricacies observed in user code.
Structuring data to allow efficient, realistic assessment of retrieval-based and LLM-driven repair methods.
Enforcing strict dataset splits to prevent information leakage and enable trustworthy generalization measurements.

A plausible implication is that RACodeBench provides a more robust substrate for research in code repair than synthetic datasets, especially for benchmarking retrieval-augmented and prompt-driven repair agents (Zhao et al., 2 Sep 2025).

2. Dataset Construction and Characteristics

RACodeBench comprises 4,200 buggy/fixed code pairs selected from 1,200 Codeforces problems (an average of 3.5 pairs per problem). Each problem was chosen for having at least three distinct user-submitted incorrect solutions that were later fixed by their original authors. The processing pipeline includes:

Parsing and AST Validation: Ensures that only syntactically valid fixed versions are included.
Normalization: Removes comments and standardizes identifier names, increasing the signal-to-noise ratio for pattern-based retrieval and LLMs.
Tree-based Diffing: Extracts minimal edit scripts between buggy and fixed versions to support fine-grained diagnostic tasks.
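The three pipeline steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: it handles only simple single-file Python scripts, the names `RenameIdentifiers`, `is_valid`, `normalize`, and `edit_script` are hypothetical, and a line-level diff over the normalized source (via `difflib`) stands in for the tree-based diff described above. It assumes Python 3.9+ for `ast.unparse`.

```python
import ast
import builtins
import difflib


class RenameIdentifiers(ast.NodeTransformer):
    """Rename user identifiers to v0, v1, ... in first-seen order (illustrative only)."""

    def __init__(self):
        self.mapping = {}
        self.reserved = set(dir(builtins))  # keep print, len, range, ... unchanged

    def _canonical(self, name):
        if name in self.reserved:
            return name
        return self.mapping.setdefault(name, f"v{len(self.mapping)}")

    def visit_Import(self, node):
        # Imported names stay intact so later attribute access still resolves.
        for alias in node.names:
            self.reserved.add((alias.asname or alias.name).split(".")[0])
        return node

    visit_ImportFrom = visit_Import

    def visit_FunctionDef(self, node):
        node.name = self._canonical(node.name)
        self.generic_visit(node)
        return node

    def visit_Name(self, node):
        node.id = self._canonical(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._canonical(node.arg)
        return node


def is_valid(source):
    """Step 1: keep only sources that parse into an AST."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def normalize(source):
    """Step 2: parsing drops comments; the transformer standardizes identifiers."""
    tree = RenameIdentifiers().visit(ast.parse(source))
    return ast.unparse(tree)


def edit_script(buggy, fixed):
    """Step 3 (simplified): line-level edits between the normalized versions."""
    a, b = normalize(buggy).splitlines(), normalize(fixed).splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]


buggy = "n = int(input())  # read n\nfor i in range(n - 1):\n    print(i)\n"
fixed = "count = int(input())\nfor idx in range(count):\n    print(idx)\n"
print(edit_script(buggy, fixed))
```

Because both versions are normalized before diffing, the differing variable names disappear and only the off-by-one change in the loop bound remains in the edit script.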

Bug types are annotated via a semi-automated process and manual curation, with the following empirical distribution:

Bug Type	Percent (%)
Off-by-one	25
Logic errors	22
Incorrect loops	18
Edge-case omission	15
Null/uninitialized variable	10
Other	10

Algorithmic domains covered include dynamic programming (25%), greedy algorithms (20%), graph search (20%), sorting/searching (15%), number theory/combinatorics (10%), and miscellaneous paradigms (10%).

Dataset splits:

Train: 840 problems (≈2,940 pairs)
Validation: 120 problems (≈420 pairs)
Test: 240 problems (≈840 pairs)

All splits are stratified by domain and difficulty, and no problem appears in both the retrieval knowledge base (KB) and the benchmark splits.
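A problem-level stratified split of this kind can be sketched as follows. The 70/10/20 ratios are inferred from the problem counts above (840, 120, and 240 of 1,200). The function name, the record fields `id`, `domain`, and `difficulty`, and the toy data are assumptions for illustration, not the benchmark's actual schema.

```python
import random
from collections import defaultdict


def stratified_problem_split(problems, ratios=(0.7, 0.1, 0.2), seed=0):
    """Split problem ids into train/validation/test within each (domain, difficulty) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for p in problems:
        strata[(p["domain"], p["difficulty"])].append(p["id"])

    splits = {"train": [], "validation": [], "test": []}
    for key in sorted(strata):
        ids = sorted(strata[key])
        rng.shuffle(ids)
        n_train = round(ratios[0] * len(ids))
        n_val = round(ratios[1] * len(ids))
        splits["train"] += ids[:n_train]
        splits["validation"] += ids[n_train:n_train + n_val]
        splits["test"] += ids[n_train + n_val:]  # remainder goes to test
    return splits


# Toy data: 6 hypothetical domains x 2 difficulty levels x 100 problems = 1,200 problems.
domains = ["dp", "greedy", "graph", "sort_search", "number_theory", "misc"]
problems = [
    {"id": f"{d}-{lvl}-{i}", "domain": d, "difficulty": lvl}
    for d in domains for lvl in ("easy", "hard") for i in range(100)
]
splits = stratified_problem_split(problems)
print({name: len(ids) for name, ids in splits.items()})
```

Splitting by problem rather than by code pair keeps every buggy/fixed pair of a problem in the same split, which is what prevents leakage between training and evaluation data.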

3. Evaluation Protocol and Metrics

RACodeBench measures both the correctness of repaired code and the resource cost of achieving such repairs. The evaluation protocol includes two principal accuracy metrics:

Strict Accuracy (StrictAcc): The proportion of the $P$ test problems for which at least one of the $N$ generated repairs produces the reference output on all $K$ test cases:

$\text{StrictAcc} = \frac{1}{P}\sum_{p=1}^P \max_{1\le j\le N}\prod_{k=1}^K \mathbf{1} \{\mathrm{exec}(\text{code}_p^j, x_{p,k}) = y_{p,k}\}$

Test Pass Rate (PassRate): For each problem, the highest fraction of test cases passed by any of the $N$ repairs, averaged over all $P$ problems:

$\text{PassRate} = \frac{1}{P}\sum_{p=1}^P \max_{1\le j\le N}\frac{1}{K} \sum_{k=1}^K \mathbf{1} \{\mathrm{exec}(\text{code}_p^j, x_{p,k}) = y_{p,k}\}$
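Both formulas translate directly into code. The sketch below assumes a hypothetical nested-list layout in which `results[p][j][k]` is `True` when repair `j` of problem `p` reproduces the reference output on test case `k`; producing those booleans (that is, running `exec` on each test input) is outside the sketch.

```python
def strict_accuracy(results):
    """Fraction of problems where some repair passes every test case."""
    return sum(max(all(tests) for tests in repairs) for repairs in results) / len(results)


def pass_rate(results):
    """Mean over problems of the best per-repair fraction of passed test cases."""
    return sum(
        max(sum(tests) / len(tests) for tests in repairs) for repairs in results
    ) / len(results)


# Two problems (P = 2), two repairs each (N = 2), two test cases each (K = 2).
results = [
    [[True, True], [True, False]],    # problem 1: the first repair passes everything
    [[True, False], [False, False]],  # problem 2: best repair passes half the tests
]
print(strict_accuracy(results))  # 0.5
print(pass_rate(results))        # 0.75
```

The example shows why the two metrics diverge: problem 2 contributes nothing to StrictAcc but still contributes 0.5 to PassRate.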

Inference cost is reported as the total number of LLM calls (spanning both retrieval and generation steps), making it possible to assess repair accuracy as a function of computational expenditure.

Baselines include:

Best-of-N: $N$ independent generations; the candidate that passes the most test cases is selected.
Self-Repair: Feedback-driven sequence (generation, feedback, second attempt), total $N$ calls per problem.
Retrieval-augmented methods: For example ReCode, which allocates LLM calls explicitly to both problem classification and repair.
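As an illustration of how the call budget is counted, the Best-of-N baseline can be sketched as a loop that spends one LLM call per candidate. The `generate` and `pass_fraction` callables are hypothetical placeholders for an LLM call and a test harness; the stubs below replace both so the sketch runs on its own.

```python
def best_of_n(buggy_code, generate, pass_fraction, n=8):
    """Draw n independent repairs and keep the one passing the most test cases."""
    best_code, best_score, calls = None, -1.0, 0
    for _ in range(n):
        candidate = generate(buggy_code)  # one LLM call (stubbed below)
        calls += 1
        score = pass_fraction(candidate)  # fraction of test cases passed
        if score > best_score:
            best_code, best_score = candidate, score
    return best_code, best_score, calls


# Stubs standing in for the model and the test harness (assumptions, not real APIs).
canned = iter(["attempt_a", "attempt_b", "attempt_c"])
scores = {"attempt_a": 0.25, "attempt_b": 1.0, "attempt_c": 0.5}
code, score, calls = best_of_n("buggy", lambda _: next(canned), scores.get, n=3)
print(code, score, calls)
```

Returning the call count alongside the selected repair makes it straightforward to report accuracy at a fixed inference budget, as the protocol requires.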

4. Empirical Performance and Failure Modes

Experiments on RACodeBench with GPT-4o-mini ($N = 8$ LLM calls) show that retrieval-augmented in-context methods like ReCode outperform standard baselines at equal inference cost:

Method	Test Pass Rate (%)	Strict Accuracy (%)	Inference Calls
Best-of-N	31.09	21.25	8
Self-Repair	34.79	24.58	8
ReCode	41.06	30.41	8 (2+6)

ReCode exhibits:

Higher PassRate (+6.27 points) and StrictAcc (+5.83 points) than the strongest baseline (Self-Repair).
Faster convergence: surpasses Best-of-N by the fourth of eight calls.
Superior out-of-distribution (OOD) generalization; for instance, on AtCoder data, ReCode attains 37% PassRate with only five LLM calls, while baselines require approximately 18.
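The reported margins can be checked against the results table with a few lines of arithmetic. The snippet uses `Decimal` to avoid floating-point rounding; the dictionary layout is only an illustration.

```python
from decimal import Decimal

# (Test Pass Rate %, Strict Accuracy %) copied from the results table.
table = {
    "Best-of-N": (Decimal("31.09"), Decimal("21.25")),
    "Self-Repair": (Decimal("34.79"), Decimal("24.58")),
    "ReCode": (Decimal("41.06"), Decimal("30.41")),
}

baselines = {name: vals for name, vals in table.items() if name != "ReCode"}
strongest = max(baselines, key=lambda name: baselines[name][0])
pass_gain = table["ReCode"][0] - table[strongest][0]
strict_gain = table["ReCode"][1] - table[strongest][1]
print(strongest, pass_gain, strict_gain)  # Self-Repair 6.27 5.83
```

Against Best-of-N, the same subtraction gives 9.97 points of PassRate and 9.16 points of StrictAcc.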

Failure analyses reveal that retrieval quality is sometimes degraded by multi-paradigm problems (e.g., graph and dynamic programming combinations), rare language constructs, and logic errors requiring mathematical insight beyond exemplar patterns.

5. Diagnostic and Practical Implications

RACodeBench provides fine-grained error annotations and a hierarchical structure, supporting diagnostic analyses such as bug type- or domain-specific breakdowns. Recommendations for usage include:

Reporting both PassRate and StrictAcc to distinguish full from partial correctness.
Always disclosing inference cost (number of LLM calls or latency).
Preventing overlap between retrieval KB and evaluation splits to simulate genuine out-of-distribution repair scenarios.
Leveraging detailed error annotation for post hoc analysis of systemic strengths and weaknesses in repair strategies.
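A per-category breakdown of the kind recommended above could be computed as in the following sketch. The record fields `bug_type`, `domain`, and `solved` are hypothetical names chosen for illustration; the benchmark's actual annotation schema may differ.

```python
from collections import defaultdict


def breakdown_by(records, key):
    """Return the fraction of solved problems for each value of `key`."""
    counts = defaultdict(lambda: [0, 0])  # value -> [solved, total]
    for record in records:
        counts[record[key]][0] += int(record["solved"])
        counts[record[key]][1] += 1
    return {value: solved / total for value, (solved, total) in counts.items()}


# Toy per-problem results; "solved" means at least one repair passed every test.
records = [
    {"bug_type": "off-by-one", "domain": "dp", "solved": True},
    {"bug_type": "off-by-one", "domain": "graph", "solved": False},
    {"bug_type": "logic", "domain": "dp", "solved": True},
]
print(breakdown_by(records, "bug_type"))  # {'off-by-one': 0.5, 'logic': 1.0}
print(breakdown_by(records, "domain"))    # {'dp': 1.0, 'graph': 0.0}
```

The same function works for any annotated attribute, so bug-type and domain breakdowns share one code path.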

A plausible implication is that RACodeBench enables research not only into accuracy but also into computational efficiency and robustness, given its precisely defined resource accounting and split discipline (Zhao et al., 2 Sep 2025).

6. Relationship to Other Benchmarks

RACodeBench departs from previous approaches by grounding its instances in real user behavior, yielding a collection that is richer and more error-diverse than the synthetic, single-function, or codelet-focused datasets common in earlier work. While HumanEval, SWE-bench, and similar benchmarks provide valuable generalization targets, they lack the combination of algorithmic diversity, fine-grained error labeling, and realistic repair traces that RACodeBench supplies.

RACodeBench is not directly concerned with root-cause analysis, such as that benchmarked by RCABench (Nishimura et al., 2023); its focus is specifically on functional repair verified by I/O test suites. Nevertheless, both benchmarks rely on community-curated errors and emphasize fine-grained, interpretable evaluation metrics, responding to similar shortcomings in earlier datasets and evaluation practice.

7. Limitations and Prospective Extensions

RACodeBench, while substantially advancing the realism and diagnostic value of code repair benchmarks, has certain constraints:

Coverage is limited to C++, Python, and Java, reflecting the Codeforces population.
Its design does not directly address programming tasks outside competitive programming (e.g., large multi-module systems).
The dataset is static and may require expansion as algorithms or languages evolve.

Future improvements could include augmentations for language or domain breadth, task complexity, or annotating multi-turn repair conversations for feedback-based repair strategies.

In summary, RACodeBench stands as a rigorously constructed, real-world benchmark for code repair, enabling accurate, efficient, and interpretable evaluation of LLM-based and retrieval-augmented approaches in automated program repair (Zhao et al., 2 Sep 2025).


References

Zhao et al. (2025). ReCode: Improving LLM-based Code Repair with Fine-Grained Retrieval-Augmented Generation.

Nishimura et al. (2023). RCABench: Open Benchmarking Platform for Root Cause Analysis.
