
LiveCodeBench: LLM Code Benchmark

Updated 23 March 2026
  • LiveCodeBench (LCB) is a continuously updated, contamination-resistant benchmark suite designed for holistic evaluation of LLMs on code-centric tasks.
  • It collects problems in real time from competitive programming platforms such as LeetCode, AtCoder, and CodeForces, balancing difficulty and avoiding data contamination.
  • LCB provides granular diagnostics and extensibility tools to assess syntactic and semantic weaknesses, enabling robust cross-model and human parity comparisons.

LiveCodeBench (LCB) is a continuously updated, contamination-resistant benchmark suite designed for holistic, multi-scenario evaluation of LLMs on code-centric tasks. It addresses limitations of previous datasets by systematically collecting problems from competitive programming platforms post-model cutoff, calibrating difficulty, and extending assessment beyond generation to self-repair, code execution, and test-output reasoning. LCB and its extension, LiveCodeBench Pro, provide granular diagnostics for both syntactic and semantic model weaknesses, while supporting robust cross-model and human parity comparisons (Jain et al., 2024, Zheng et al., 13 Jun 2025).

1. Motivation and Rationale

Legacy code-evaluation benchmarks such as HumanEval and MBPP exhibit fundamental shortcomings: they are static (enabling data contamination), narrowly scoped (focusing on single-step natural-language-to-code generation), and possess uneven or weakly validated test cases. Problems from these datasets often appear, sometimes paraphrased, in model training corpora, making de-duplication infeasible and inflating reported generalization capabilities. Test coverage is shallow (HumanEval averages 12 tests per problem), and difficulty is unbalanced, complicating model comparisons (Jain et al., 2024).

LCB is constructed to overcome these limitations using four guiding principles:

  1. Live, contamination-aware collection: Problems are retrieved in real time from recurring contests, and only those released after each model’s cutoff date are used for evaluation.
  2. Holistic scenario coverage: The benchmark measures multiple axes of code reasoning—generation, debugging (self-repair), "mental execution," and test-output prediction.
  3. Vetting for quality and diversity: All problems are sourced from LeetCode, AtCoder (abc contests), and CodeForces (Div 3/4), ensuring clarity and correctness via crowd-sourced solutions.
  4. Balanced coverage and difficulty: Each problem carries an explicit Easy/Medium/Hard tag; the distribution is 35.5% Easy, 42% Medium, and 22.5% Hard.

2. Data Collection, Filtering, and Contamination Avoidance

LCB’s dataset encompasses 400 problems (May 2023–May 2024), acquired through automated scraping:

  • 181 from LeetCode (weekly/biweekly contests)
  • 210 from AtCoder abc (rating < 500)
  • 9 from CodeForces Div 3/4

Exclusion criteria systematically remove multi-answer, graphical, or interactive I/O problems. Images and unsupported markdown are stripped, and mathematical notation is standardized to plain text. Public test cases are included when available; in their absence, validated input sets are synthesized via a two-stage, LLM-driven input generator employing both random and adversarial sampling (Jain et al., 2024).
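The two-stage generator is described only at a high level, but the random-plus-adversarial idea can be sketched in a few lines. Everything below is illustrative: the `spec` format, the stage names, and the toy reference solution (standing in for a crowd-sourced one) are assumptions, and a real system would ask an LLM to propose the adversarial cases from the problem statement.

```python
import random

def random_inputs(spec, n=20, seed=0):
    """Stage 1: sample uniformly within the constraint bounds given in `spec`."""
    rng = random.Random(seed)
    lo, hi = spec["min"], spec["max"]
    return [[rng.randint(lo, hi) for _ in range(spec["length"])] for _ in range(n)]

def adversarial_inputs(spec):
    """Stage 2: boundary cases that random sampling rarely hits.
    A real pipeline would derive these via LLM prompting, not hard-coding."""
    lo, hi = spec["min"], spec["max"]
    return [
        [lo] * spec["length"],                      # all-minimum values
        [hi] * spec["length"],                      # all-maximum values
        [lo, hi] * (spec["length"] // 2) or [lo],   # alternating extremes
    ]

def validate(inputs, reference_solution):
    """Keep only inputs the reference solution handles without error."""
    kept = []
    for x in inputs:
        try:
            reference_solution(x)
            kept.append(x)
        except Exception:
            pass
    return kept

spec = {"min": 1, "max": 10**9, "length": 4}   # toy constraint spec
candidates = random_inputs(spec) + adversarial_inputs(spec)
validated = validate(candidates, reference_solution=max)  # toy reference solution
```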

A central anti-contamination safeguard: every problem is timestamped. For any LLM M (documented cutoff C), only problems with release date D > C are evaluated. Empirical pass@1 drops for specific models on problems released post-cutoff—e.g., DeepSeek-Instruct's pass@1 falls from ~60% (May–Aug) to ~0% (post-Sep 2023)—confirming the effectiveness of this procedure (Jain et al., 2024).
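The cutoff rule D > C reduces to a simple date filter. The sketch below illustrates it; the model names, cutoff dates, and problem IDs are invented placeholders, not values from the paper.

```python
from datetime import date

# Hypothetical cutoff registry; dates are illustrative only.
MODEL_CUTOFFS = {
    "deepseek-instruct-33b": date(2023, 8, 31),
    "gpt-4-turbo": date(2023, 12, 31),
}

def eval_window(problems, model):
    """Keep only problems released strictly after the model's cutoff (D > C)."""
    cutoff = MODEL_CUTOFFS[model]
    return [p for p in problems if p["release_date"] > cutoff]

problems = [
    {"id": "abc-301-c", "release_date": date(2023, 6, 10)},
    {"id": "lc-weekly-370-q2", "release_date": date(2023, 11, 5)},
    {"id": "cf-div3-912-b", "release_date": date(2024, 1, 14)},
]
```

Each model thus gets its own evaluation window, so reported scores never include problems it could have seen during training.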

3. Problem Composition and Evaluation Scenarios

LCB’s 400-problem corpus is structured as:

  • 142 Easy, 168 Medium, 90 Hard
  • Platform breakdown: AtCoder—210 (avg. 28 tests/problem), LeetCode—181 (avg. 96 tests/problem), CodeForces—9 (avg. 44 tests/problem).

All scenarios are Python-based, with either stdin-based or function-call interfaces. LCB supports four principal evaluation settings:

Scenario overview (input → output; evaluation criterion):

  • Code generation: NL prompt, starter code/API, I/O examples → full program; functional correctness on hidden tests.
  • Self-repair: NL prompt, incorrect code, execution feedback (error, TLE, or failing I/O) → repaired code; passes the final hidden tests.
  • Code execution: def f(...): ... plus an incomplete assertion assert f(args) == ?? → the assertion completed with the literal result; exact match.
  • Test-output prediction: NL statement, function signature, test input → expected output as an assert statement; exact match.
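To make the code-execution scenario concrete, here is a toy instance (the function and inputs are invented for illustration): the model is shown the function and the incomplete assertion, and must "mentally execute" the code to supply the literal result.

```python
# Toy code-execution instance (illustrative, not from the LCB dataset).
def f(nums):
    """Sum the even elements of nums."""
    total = 0
    for x in nums:
        if x % 2 == 0:
            total += x
    return total

# The model is given:  assert f([3, 4, 7, 10]) == ??
# and must produce the completed assertion:
assert f([3, 4, 7, 10]) == 14   # 4 + 10
```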

Test input and output predictions require exact string match after canonical parsing; code generation and repair are scored via secret, unexposed test suites (Jain et al., 2024).
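Canonical parsing before exact match can be sketched as follows. This is an illustrative reimplementation, not LCB's actual checker: the idea is that superficial formatting differences (whitespace, quote style) should not count against the model.

```python
import ast

def outputs_match(predicted: str, expected: str) -> bool:
    """Exact match after canonical parsing: '[1, 2]' and '[1,2]' both
    canonicalize to the same Python value."""
    try:
        return ast.literal_eval(predicted.strip()) == ast.literal_eval(expected.strip())
    except (ValueError, SyntaxError):
        # Unparseable prediction falls back to raw string comparison.
        return predicted.strip() == expected.strip()
```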

4. Methodology, Metrics, and Experimental Setup

Evaluation uses the pass@1 success rate. Formally, for $N$ problems, $\mathrm{pass}@1 = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{\text{model's output passes all tests or matches ground truth}\}$ (Jain et al., 2024).

Prompts and sampling protocols are tightly standardized per task. Models (9 base, 20 instruction-tuned, and multiple closed-access APIs: GPT-3.5/4, Claude 2/3, Gemini-Pro, Mistral-L) are evaluated with 10 sampled outputs/problem, using nucleus sampling (temperature 0.2, top_p 0.95). Generation is performed either locally (vLLM) or via official APIs; all completions and prompts are publicly released.
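Under this protocol, pass@1 is estimated by averaging, over problems, the fraction of the 10 sampled outputs that pass. A minimal sketch (the function name and numbers are illustrative):

```python
def pass_at_1(per_problem_passes, n_samples=10):
    """Estimate pass@1 from n sampled outputs per problem.
    per_problem_passes[i] = how many of problem i's samples passed all tests."""
    return sum(k / n_samples for k in per_problem_passes) / len(per_problem_passes)

# Toy example: 4 problems, 10 samples each.
est = pass_at_1([10, 5, 0, 3])  # (1.0 + 0.5 + 0.0 + 0.3) / 4, about 0.45
```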

5. Empirical Performance and Model Comparisons

Pass@1 and Overfitting Patterns

Sharp pass@1 declines on post-cutoff items empirically confirm LCB’s resistance to contamination. For instance, DeepSeek-Instruct-33B's pass@1 on LeetCode generation drops from ~60% (pre-cutoff) to 0% (post-cutoff), whereas closed models (e.g., GPT-4-Turbo, Claude-3) do not display such a drop, suggesting no contamination (Jain et al., 2024).
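The windowed pass-rate computation behind such contamination plots can be sketched as follows; the grouping key and the toy numbers are invented for illustration.

```python
from collections import defaultdict

def monthly_pass_rate(records):
    """records: (release_month, passed) pairs. Returns pass@1 per month,
    so a sharp drop after a model's cutoff is visible at a glance."""
    hits, totals = defaultdict(int), defaultdict(int)
    for month, passed in records:
        totals[month] += 1
        hits[month] += int(passed)
    return {m: hits[m] / totals[m] for m in sorted(totals)}

# Toy records: strong pre-cutoff performance, collapse afterwards.
records = [("2023-07", True)] * 6 + [("2023-07", False)] * 4 + \
          [("2023-10", False)] * 10
```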

Cross-scenario performance correlations are high (≥0.9), but closed models maintain absolute superiority. For LCB-Easy code generation:

Model pass@1 (%):

  • GPT-4-Turbo: 81.9
  • GPT-4: 74.6
  • Claude-3-Opus: 76.5
  • DS-Ins-33B: 56.5
  • Phind-34B: 55.4

Open-source models lag but approach parity for easy generation tasks. Overfitting to legacy benchmarks is demonstrated: e.g., DS-Ins-1.3B scores 60% on HumanEval+ but just 26% on LCB-Easy, revealing poor generalization (Jain et al., 2024).

6. Extensibility Tools and Community Infrastructure

LCB provides an extensible evaluation toolkit:

  • Modular scrapers for LeetCode, AtCoder, CodeForces.
  • LLM-based input generators for random/adversarial test creation.
  • Scenario harnesses for auto-evaluating templates in code generation, repair, execution, and test-output prediction.
  • An extensible JSON schema for new tasks, a model registry for simple endpoint addition, and a comprehensive web UI for dataset exploration.

Users can incorporate new problems using pseudocode such as:

# Illustrative pseudocode; exact module and function names may differ.
from livecodebench import ProblemCollector, TaskRunner
from livecodebench.generators import generate_random_inputs  # hypothetical import

pc = ProblemCollector(platform='LeetCode', contest_id='weekly-324')
new_problem = pc.scrape()                         # fetch and clean the statement
new_inputs = generate_random_inputs(new_problem)  # synthesize validated test inputs
TaskRunner.register_problem(new_problem, inputs=new_inputs)

This workflow reflects a framework designed for rapid, transparent community extension and robust reproducibility (Jain et al., 2024).

7. Implications, Extended Benchmarks, and Human Comparisons

Continuous, contamination-aware, scenario-rich benchmarks such as LCB are essential as LLMs and their training data evolve. Visualizations in LCB demonstrate that single-task metrics fail to diagnose cross-competency weaknesses in modern models; for example, models may perform well at code generation but poorly at "test-output prediction" or "mental execution," skills increasingly relevant to code synthesis and debugging (Jain et al., 2024).

LiveCodeBench Pro (LCB Pro) extends LCB to 584 problems from premier algorithmic contests (Codeforces, ICPC, IOI), filtering in real time before release of solutions or discussion threads to further deter contamination. Each problem is annotated with a detailed taxonomy of algorithmic categories and cognitive foci (Knowledge, Logic, Observation). Olympiad medalists conduct line-by-line error analyses, establishing that models excel at implementation-heavy tasks but underperform on nuanced algorithmic reasoning and case analysis. For instance, best open models achieve only 53% pass@1 on Medium problems and 0% on Hard, compared to human grandmaster performance at Elo ≥2600. Error spectra analysis indicates LLMs make far more logic and sample-input errors than human comparators but fewer implementation mistakes (Zheng et al., 13 Jun 2025).

A plausible implication is that evaluation frameworks spanning multiple code reasoning modalities and contamination-resilient pipelines are critical for meaningful LLM progress tracking and for aligning future models with human-level expertise across the competitive programming spectrum.
