
DebugBench: LLM Debugging Benchmark

Updated 15 October 2025
  • DebugBench is a large-scale benchmark designed to evaluate LLM debugging using realistic buggy code in C++, Java, and Python.
  • It employs rigorous zero-shot evaluations and a granular bug taxonomy to measure pass rates and differentiate performance among models.
  • The framework informs LLM development by highlighting strengths and weaknesses in automated code repair and suggesting targeted improvements.

DebugBench is a large-scale benchmark and evaluation framework purpose-built to measure and advance the debugging capability of LLMs. It provides rigorous, quantitative assessment of automated debugging using realistic, diverse code examples and robust methodology. The benchmark systematically implants and validates bugs across three major programming languages (C++, Java, Python), encompassing a detailed bug taxonomy for granular performance analysis. DebugBench sets new standards in scale, variety, and quality assurance that directly inform the design and improvement of LLMs intended for program repair, automated debugging, and code understanding.

1. Benchmark Definition, Composition, and Bug Taxonomy

DebugBench comprises 4,253 instances systematically designed to test the debugging ability of LLMs. Each instance consists of:

  • A buggy code snippet in C++, Java, or Python.
  • A set of associated test cases allowing execution-based assessment.
  • Rigorous bug annotation encompassing both major and minor types.

The benchmark divides bugs into four major categories (Syntax, Reference, Logic, and Multiple) with 18 minor types such as “misused ==/=”, “missing colons”, “undefined methods”, “condition error”, and “double bug”. This granular taxonomy enables fine-grained measurement of model performance across distinct error modes. The source data underpinning DebugBench comes from LeetCode community contributions released after the model pretraining cutoff (June/July 2022), strictly mitigating data leakage.
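To make the composition concrete, the sketch below shows what a single instance might look like once loaded into Python. The class and field names (`slug`, `language`, `major_type`, `minor_type`, `buggy_code`, `test_cases`) are illustrative assumptions for this article, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class DebugInstance:
    """Illustrative shape of a DebugBench-style instance (field names assumed)."""
    slug: str                # identifier of the underlying LeetCode problem
    language: str            # "cpp", "java", or "python"
    major_type: str          # "syntax", "reference", "logic", or "multiple"
    minor_type: str          # e.g. "misused ==/=", "condition error", "double bug"
    buggy_code: str          # code snippet with the implanted bug(s)
    test_cases: list = field(default_factory=list)  # (input, expected output) pairs

example = DebugInstance(
    slug="two-sum",
    language="python",
    major_type="syntax",
    minor_type="missing colons",
    buggy_code="def twoSum(nums, target)\n    ...",   # note the missing colon
    test_cases=[("[2,7,11,15], 9", "[0, 1]")],
)
```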

Bug implantation is performed using GPT-4, with additional double, triple, and quadruple bugs generated via rule-based combination procedures. Comprehensive quality assurance includes automated filters (requiring failure on at least one test case and absence of confounding hints) and manual review by skilled programmers, ensuring realistic and valid bug scenarios throughout the benchmark.
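The multi-bug instances are described as rule-based combinations of single bugs. A minimal sketch of that idea, assuming each single bug is represented as an independent text edit applied to the clean solution, might look like the following; the helpers and the edit representation are hypothetical, not the benchmark's actual tooling.

```python
import itertools

def apply_edits(clean_code: str, edits: list) -> str:
    """Apply a list of (old_fragment, new_fragment) substitutions to the clean solution."""
    buggy = clean_code
    for old, new in edits:
        buggy = buggy.replace(old, new, 1)
    return buggy

def merge_bugs(clean_code: str, single_bug_edits: list, k: int):
    """Combine k independent single-bug edits into one multi-bug instance each."""
    for combo in itertools.combinations(single_bug_edits, k):
        # Skip combinations whose edits would collide on the same fragment.
        fragments = [old for old, _ in combo]
        if len(set(fragments)) < k:
            continue
        yield apply_edits(clean_code, list(combo))

# Example: turn two single bugs into one "double bug" instance.
clean = "if x == 0:\n    return []\n"
singles = [("==", "="), ("return []", "return None")]
double_bugs = list(merge_bugs(clean, singles, k=2))
```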

2. Evaluation Protocols and Performance Metrics

DebugBench employs zero-shot model evaluation (no fine-tuning) using code snippets with embedded errors. Candidate LLMs are required to output fixed code for each buggy input, which is judged against the supplied test cases. The main quantitative metric is the Pass Rate (PR):

$$PR = \frac{1}{n} \sum_{i=1}^{n} \left[ \bigwedge_{j=1}^{m} a_{\theta_i^*}(x_i^j) = y_i^j \right] \times 100\%$$

where $a_{\theta_i^*}$ is the fixed code for bug instance $i$, $x_i^j$ are the test inputs, and $y_i^j$ the expected outputs. An instance only passes if all corresponding test cases are correctly handled, a rigorous requirement that strongly discriminates among code repair methods.
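A compact way to read the metric: an instance counts as solved only if every one of its test cases passes, and PR is the percentage of solved instances. The sketch below computes this from per-test outcomes; the `results` structure is an assumption made for illustration.

```python
def pass_rate(results: list[list[bool]]) -> float:
    """results[i][j] is True iff the repaired code for instance i passes test case j.

    An instance is counted only when all of its test cases pass (logical AND),
    matching the conjunction inside the PR formula.
    """
    if not results:
        return 0.0
    solved = sum(all(per_test) for per_test in results)
    return 100.0 * solved / len(results)

# Example: 3 instances; only the first passes every test case.
print(pass_rate([[True, True], [True, False], [False]]))  # ~33.3
```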

Closed-source models such as GPT-4 and GPT-3.5-turbo achieve overall pass rates of 75.0% and 62.1%, respectively, while open-source models (CodeLlama-34b, CodeLlama-34b-Instruct, BLOOM) score 0% under comparable zero-shot conditions, a stark performance gap. Human pass rates from expert programmers remain well above closed-source model scores, setting an upper bound for LLM performance.

3. Experimental Methodology and Data Leakage Mitigation

DebugBench’s construction methodology is carefully tailored to avoid confounding artifacts and contamination. The LeetCode source data is drawn exclusively from solutions released after known LLM cutoff dates. GPT-4 is prompted with type-specific instructions for each bug type, and its output is checked for realism, test case failure, and absence of explicit bug hints.

Quality control includes both automated and manual procedures. Each candidate instance is automatically filtered for basic validity (triggering test failures, absence of inline clues) and then manually vetted for correctness, sensitive information, and relevance. The prevalence of multiple error scenarios—generated through systematic rule-based bug merging—provides additional complexity for more robust, comprehensive evaluation of debugging ability.
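A minimal sketch of the automated portion of that filter, assuming the candidate's per-test outcomes have already been computed; the hint-marker list is an illustrative placeholder rather than the benchmark's actual criteria.

```python
HINT_MARKERS = ("bug", "fixme", "error here", "todo: fix")  # illustrative hint strings

def passes_automatic_filter(buggy_code: str, test_results: list[bool]) -> bool:
    """Keep a candidate instance only if it is genuinely broken and carries no giveaway.

    - It must fail at least one associated test case.
    - It must not contain comments or strings that hint at the implanted bug.
    """
    fails_some_test = not all(test_results)
    lowered = buggy_code.lower()
    has_hint = any(marker in lowered for marker in HINT_MARKERS)
    return fails_some_test and not has_hint
```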

4. Analysis of Results Across Bug Types and Languages

Performance analysis with DebugBench is structured to expose detail at multiple axes: bug category, minor bug type, programming language, and error complexity (single versus multiple errors).

  • Syntax and Reference bugs are easiest for current LLMs, especially when runtime feedback (outputs, tracebacks) is provided.
  • Logic and Multiple bugs remain challenging; pass rates for these categories are notably lower and fluctuate depending on model and language.
  • Adding runtime feedback often improves LLM repair precision on Syntax and Reference errors, but in Logic error scenarios may introduce irrelevant detail, sometimes impairing the debugging process.
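To make this kind of breakdown concrete, the sketch below aggregates per-instance outcomes into pass rates per major bug category and language. The record fields are assumed for illustration and do not mirror the benchmark's actual output format.

```python
from collections import defaultdict

def breakdown(records: list[dict]) -> dict:
    """Group instance-level outcomes into pass rates per (major_type, language) cell.

    Each record is assumed to look like:
        {"major_type": "logic", "language": "python", "solved": False}
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [solved, total]
    for r in records:
        cell = (r["major_type"], r["language"])
        counts[cell][0] += int(r["solved"])
        counts[cell][1] += 1
    return {cell: 100.0 * s / t for cell, (s, t) in counts.items()}

demo = [
    {"major_type": "syntax", "language": "python", "solved": True},
    {"major_type": "logic", "language": "cpp", "solved": False},
    {"major_type": "logic", "language": "cpp", "solved": True},
]
print(breakdown(demo))  # {('syntax', 'python'): 100.0, ('logic', 'cpp'): 50.0}
```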

Inter-model comparison also reveals that debugging capability correlates positively with code generation performance for closed-source models (Phi-coefficient ≈ 0.1–0.3), consistent with the intuition that the underlying code understanding required for repair overlaps with generative programming proficiency.
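The phi coefficient for two binary outcomes, such as "solved the debugging instance" versus "solved the corresponding generation task", can be computed from a 2x2 contingency table. The pairing of tasks in the example is an illustrative assumption.

```python
import math

def phi_coefficient(xs: list[bool], ys: list[bool]) -> float:
    """Phi coefficient between two binary outcome vectors of equal length."""
    n11 = sum(x and y for x, y in zip(xs, ys))            # both succeed
    n10 = sum(x and not y for x, y in zip(xs, ys))        # only first succeeds
    n01 = sum((not x) and y for x, y in zip(xs, ys))      # only second succeeds
    n00 = sum((not x) and (not y) for x, y in zip(xs, ys))
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return 0.0 if denom == 0 else (n11 * n00 - n10 * n01) / denom

# Example: debugging success vs. generation success on four paired tasks.
print(phi_coefficient([True, True, False, False], [True, False, True, False]))  # 0.0
```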

5. Impact on LLM Development and Future Research Trajectories

DebugBench highlights clear shortfalls and opportunities in existing LLM-based debugging:

  • Models trained primarily for code generation manifest significant weaknesses in systematic debugging, especially for logical and multi-error scenarios.
  • The strong correlation between debugging and code generation abilities (modulo bug complexity) in closed-source models suggests that improvements in one area may translate to advances in the other.
  • The mixed effectiveness of runtime feedback calls for adaptive, context-sensitive debugging strategies in future LLM designs, wherein diagnostic signals are tuned to problem type rather than presented wholesale.
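One way to read that last suggestion as code is to attach runtime diagnostics to the repair prompt only when the predicted bug category is one where feedback tends to help. Everything in this sketch (the category policy, the prompt wording, and the classifier that would supply the category) is hypothetical, not a method proposed by the benchmark.

```python
from typing import Optional

FEEDBACK_HELPS = {"syntax", "reference"}  # categories where runtime feedback tends to help

def build_repair_prompt(buggy_code: str, predicted_category: str,
                        traceback_text: Optional[str] = None) -> str:
    """Compose a debugging prompt, attaching runtime feedback only where it is likely useful."""
    parts = ["Fix the bug in the following code and return the corrected program.", buggy_code]
    if traceback_text and predicted_category in FEEDBACK_HELPS:
        parts += ["Observed runtime feedback:", traceback_text]
    return "\n\n".join(parts)
```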

DebugBench informs practical LLM development, both by providing a standardized reference for model tuning and evaluation and by highlighting the need for more nuanced datasets and training regimes specifically crafted for debugging.

6. Comparison and Relationship to Other Benchmarks

MdEval (Liu et al., 4 Nov 2024) expands the scope of code debugging benchmarks to 18 languages and multitask debugging (APR, code review, bug identification), whereas DebugBench is focused on Python, Java, and C++ with deep coverage of bug types and error complexity. DebugBench complements repository-level benchmarks such as DI-Bench (Zhang et al., 23 Jan 2025), which address dependency inference and end-to-end code operability. DSDBench (Yang et al., 28 Mar 2025) targets multi-hop and multi-bug error tracing in data science code—an orthogonal domain emphasizing runtime logic and library interactions.

Taken together, DebugBench and its successors delineate a robust benchmarking ecosystem in code repair, automated debugging, and program understanding for LLMs.

7. Limitations and Considerations

DebugBench’s scope is primarily algorithmic problems from the LeetCode corpus, typically solved within one class and not extending to broader programming paradigms (such as front-end development or use of domain-specific libraries). While its careful data curation strongly reduces contamination risk, latent effects from model pretraining cannot be strictly ruled out. Extraction of code outputs may introduce minor artifacts, especially when models include extraneous text; custom parsing logic mitigates this effect.
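A minimal sketch of the kind of parsing logic alluded to here, assuming models wrap their answer in a fenced code block; the fallback behaviour is an assumption, not the benchmark's documented extraction procedure.

```python
import re

TICKS = "`" * 3  # literal triple backtick, built programmatically to keep this snippet fence-safe
FENCE = re.compile(TICKS + r"(?:\w+)?\n(.*?)" + TICKS, re.DOTALL)

def extract_code(model_output: str) -> str:
    """Pull the repaired program out of a model response.

    Prefer the first fenced code block; otherwise fall back to the raw text,
    which is where stray explanatory prose can introduce artifacts.
    """
    match = FENCE.search(model_output)
    return match.group(1).strip() if match else model_output.strip()

sample = "Here is the fix:\n" + TICKS + "python\nprint('ok')\n" + TICKS
print(extract_code(sample))  # print('ok')
```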

The benchmark’s success at discriminating debugging capability among models and languages anchors its value, but generalization to broader software engineering challenges plausibly requires further expansion to include diverse codebases and richer interaction patterns.


DebugBench represents a definitive, modern instrument for benchmarking and advancing automated program repair and debugging in LLMs. Its scale, bug taxonomy, methodological rigor, and quantitative standards collectively set a high bar for future research and development in AI-assisted debugging.
