LiveCodeBench: LLM Code Benchmark
LiveCodeBench (LCB) is a continuously updated, contamination-resistant benchmark suite designed for holistic, multi-scenario evaluation of LLMs on code-centric tasks. It addresses limitations of previous datasets by systematically collecting problems from competitive programming platforms post-model cutoff, calibrating difficulty, and extending assessment beyond generation to self-repair, code execution, and test-output reasoning. LCB and its extension, LiveCodeBench Pro, provide granular diagnostics for both syntactic and semantic model weaknesses, while supporting robust cross-model and human parity comparisons (Jain et al., 2024; Zheng et al., 13 Jun 2025).
1. Motivation and Rationale
Legacy code evaluation benchmarks such as HumanEval and MBPP exhibit fundamental shortcomings: they are static (enabling data contamination), narrowly scoped (focusing on single-step natural-language-to-code generation), and rely on uneven or weakly validated test cases. Problems from these datasets often appear, sometimes paraphrased, in model training corpora; paraphrasing defeats de-duplication and inflates reported generalization. Test coverage is shallow (HumanEval averages 12 tests per problem), and difficulty is unbalanced, complicating model comparisons (Jain et al., 2024).
LCB is constructed to overcome these limitations using four guiding principles:
- Live, contamination-aware collection: Problems are retrieved in real time from recurring contests, and only those released after each model’s cutoff date are used for evaluation.
- Holistic scenario coverage: The benchmark measures multiple axes of code reasoning—generation, debugging (self-repair), "mental execution," and test-output prediction.
- Vetting for quality and diversity: All problems are sourced from LeetCode, AtCoder (abc contests), and CodeForces (Div 3/4), ensuring clarity and correctness via crowd-sourced solutions.
- Balanced coverage and difficulty: Each problem carries an explicit Easy/Medium/Hard tag; the distribution is 35.5% Easy, 42% Medium, and 22.5% Hard.
2. Data Collection, Filtering, and Contamination Avoidance
LCB’s dataset encompasses 400 problems (May 2023–May 2024), acquired through automated scraping:
- 181 from LeetCode (weekly/biweekly contests)
- 210 from AtCoder abc (rating < 500)
- 9 from CodeForces Div 3/4
Exclusion criteria systematically remove multi-answer, graphical, or interactive I/O problems. Images and unsupported markdown are stripped, and mathematical notation is standardized to plain text. Public test cases are included when available; in their absence, validated input sets are synthesized via a two-stage, LLM-driven input generator employing both random and adversarial sampling (Jain et al., 2024).
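The two-stage idea can be illustrated with a plain-Python stand-in (the helper names here are hypothetical, and in LCB the sampling stage is LLM-driven rather than uniform-random): candidate inputs are generated, validated against the problem's stated constraints, and then labeled with outputs from a reference solution.

```python
import random

def random_array_input(rng, max_n=10, lo=-100, hi=100):
    """Stage 1 stand-in: sample a candidate test input (LCB uses an LLM here)."""
    n = rng.randint(1, max_n)
    return [rng.randint(lo, hi) for _ in range(n)]

def validate_input(arr, max_n=10, lo=-100, hi=100):
    """Stage 2: reject candidates that violate the problem's stated constraints."""
    return 1 <= len(arr) <= max_n and all(lo <= x <= hi for x in arr)

def build_tests(reference_solution, n_tests=5, seed=0):
    """Pair each validated input with the oracle output of a reference solution."""
    rng = random.Random(seed)
    tests = []
    while len(tests) < n_tests:
        arr = random_array_input(rng)
        if validate_input(arr):
            tests.append((arr, reference_solution(arr)))
    return tests

# e.g., synthesize tests for a "return the maximum element" problem:
tests = build_tests(lambda a: max(a))
```

An adversarial sampler would replace `random_array_input` with a generator biased toward boundary values (empty-adjacent sizes, extremes of the value range), while the validation and labeling stages stay the same.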
A central anti-contamination safeguard: every problem is timestamped. For any LLM M (documented cutoff C), only problems with release date D > C are evaluated. Empirical pass@1 drops for specific models on problems released post-cutoff—e.g., DeepSeek-Instruct's pass@1 falls from ~60% (May–Aug) to ~0% (post-Sep 2023)—confirming the effectiveness of this procedure (Jain et al., 2024).
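A minimal sketch of this timestamp filter (the `Problem` record below is hypothetical; LCB's actual schema may differ):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    slug: str
    release_date: date  # contest date, recorded at scrape time

def eval_window(problems, model_cutoff):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.release_date > model_cutoff]

problems = [
    Problem("two-sum-variant", date(2023, 6, 10)),
    Problem("abc-312-c", date(2023, 9, 30)),
]
# A model with a Sep 1, 2023 cutoff is scored only on the later problem.
clean = eval_window(problems, model_cutoff=date(2023, 9, 1))
```

Because the window is recomputed per model, each model is compared only on problems it could not have seen during training.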
3. Problem Composition and Evaluation Scenarios
LCB’s 400-problem corpus is structured as:
- 142 Easy, 168 Medium, 90 Hard
- Platform breakdown: AtCoder—210 (avg. 28 tests/problem), LeetCode—181 (avg. 96 tests/problem), CodeForces—9 (avg. 44 tests/problem).
All problem scenarios are Python-based, requiring either stdin or function-based interfaces. LCB supports four principal evaluation settings:
| Scenario | Input | Output | Evaluation Criterion |
|---|---|---|---|
| Code generation | NL prompt, starter code/API, I/O examples | Full program | Functional correctness (hidden tests) |
| Self-repair | NL prompt, incorrect code, feedback (error/TLE/I/O) | Repaired code | Passes final hidden tests |
| Code execution | `def f(...): ...`, `assert f(args) == ??` | Assert with literal result | Exact match |
| Test-output prediction | NL statement, function sig., test input | Output (assert statement) | Exact match |
Test input and output predictions require exact string match after canonical parsing; code generation and repair are scored via secret, unexposed test suites (Jain et al., 2024).
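"Exact match after canonical parsing" can be sketched as follows (illustrative only; `outputs_match` is not LCB's actual API): predicted and expected outputs are parsed as literals before comparison, so formatting differences do not cause spurious failures.

```python
import ast

def outputs_match(predicted: str, expected: str) -> bool:
    """Compare two predicted outputs as parsed Python literals,
    falling back to stripped-string equality when parsing fails."""
    try:
        return ast.literal_eval(predicted) == ast.literal_eval(expected)
    except (ValueError, SyntaxError):
        return predicted.strip() == expected.strip()

# '[1, 2,3]' and '[1, 2, 3]' denote the same literal, so they match;
# semantically different literals do not.
```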
4. Methodology, Metrics, and Experimental Setup
Evaluation uses the pass@1 success rate: for each problem, n completions are sampled and c of them pass all hidden tests; pass@1 is the per-problem fraction c/n averaged over the problem set P, i.e., pass@1 = (1/|P|) Σ_{p∈P} c_p/n (Jain et al., 2024).
Prompts and sampling protocols are tightly standardized per task. Models (9 base, 20 instruction-tuned, and multiple closed-access APIs: GPT-3.5/4, Claude 2/3, Gemini-Pro, Mistral-L) are evaluated with 10 sampled outputs per problem, using nucleus sampling (temperature 0.2, top_p 0.95). Generation is performed either locally (via vLLM) or through official APIs; all prompts and completions are publicly released.
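Under this protocol, pass@1 reduces to averaging per-problem pass fractions over the sampled completions. A minimal sketch (assuming boolean pass/fail outcomes per sample):

```python
def pass_at_1(results):
    """results: one list per problem of boolean sample outcomes
    (True = that sampled completion passed all hidden tests).
    pass@1 is the mean per-problem fraction of passing samples."""
    per_problem = [sum(r) / len(r) for r in results]
    return sum(per_problem) / len(per_problem)

# Two problems, 10 samples each (matching LCB's 10 samples/problem):
score = pass_at_1([[True] * 6 + [False] * 4, [False] * 10])
# per-problem fractions are 0.6 and 0.0, so pass@1 = 0.3
```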
5. Empirical Performance and Model Comparisons
Pass@1 and Overfitting Patterns
Sharp pass@1 declines on post-cutoff items empirically confirm LCB’s resistance to contamination. For instance, DeepSeek-Instruct-33B's pass@1 on LeetCode generation drops from ~60% (pre-cutoff) to 0% (post-cutoff), whereas closed models (e.g., GPT-4-Turbo, Claude-3) do not display such a drop, suggesting no contamination (Jain et al., 2024).
Cross-scenario performance correlations are high (≥0.9), but closed models maintain absolute superiority. For LCB-Easy code generation:
| Model | pass@1 (%) |
|---|---|
| GPT-4-Turbo | 81.9 |
| GPT-4 | 74.6 |
| Claude-3-Opus | 76.5 |
| DS-Ins-33B | 56.5 |
| Phind-34B | 55.4 |
Open-source models lag but approach parity for easy generation tasks. Overfitting to legacy benchmarks is demonstrated: e.g., DS-Ins-1.3B scores 60% on HumanEval+ but just 26% on LCB-Easy, revealing poor generalization (Jain et al., 2024).
6. Extensibility Tools and Community Infrastructure
LCB provides an extensible evaluation toolkit:
- Modular scrapers for LeetCode, AtCoder, CodeForces.
- LLM-based input generators for random/adversarial test creation.
- Scenario harnesses for auto-evaluating templates in code generation, repair, execution, and test-output prediction.
- An extensible JSON schema for new tasks, a model registry for simple endpoint addition, and a comprehensive web UI for dataset exploration.
Users can incorporate new problems using pseudocode such as:
```python
from livecodebench import ProblemCollector, TaskRunner

pc = ProblemCollector(platform='LeetCode', contest_id='weekly-324')
new_problem = pc.scrape()
new_inputs = generate_random_inputs(new_problem)
TaskRunner.register_problem(new_problem, inputs=new_inputs)
```
This suggests a framework designed for rapid, transparent community extension and robust reproducibility (Jain et al., 2024).
7. Implications, Extended Benchmarks, and Human Comparisons
Continuous, contamination-aware, scenario-rich benchmarks such as LCB are essential as LLMs and their training data evolve. Visualizations in LCB demonstrate that single-task metrics fail to diagnose cross-competency weaknesses in modern models; for example, models may perform well at code generation but poorly at "test-output prediction" or "mental execution," skills increasingly relevant to code synthesis and debugging (Jain et al., 2024).
LiveCodeBench Pro (LCB Pro) extends LCB to 584 problems from premier algorithmic contests (Codeforces, ICPC, IOI), collected in real time before editorials or discussion threads are published, further deterring contamination. Each problem is annotated with a detailed taxonomy of algorithmic categories and cognitive foci (Knowledge, Logic, Observation). Olympiad medalists conduct line-by-line error analyses, establishing that models excel at implementation-heavy tasks but underperform on nuanced algorithmic reasoning and case analysis. For instance, frontier models achieve only 53% pass@1 on Medium problems and 0% on Hard, compared to human grandmaster performance at Elo ≥ 2600. Error-spectrum analysis indicates LLMs make far more logic and sample-input errors than human comparators, but fewer implementation mistakes (Zheng et al., 13 Jun 2025).
A plausible implication is that evaluation frameworks spanning multiple code reasoning modalities and contamination-resilient pipelines are critical for meaningful LLM progress tracking and for aligning future models with human-level expertise across the competitive programming spectrum.