
LiveCodeBench: LLM Code Evaluation Benchmark

Updated 15 August 2025
  • LiveCodeBench is a dynamic benchmark for evaluating code-centric LLM performance with continuously updated competitive programming tasks.
  • It systematically collects and filters problems from major competitive platforms to mitigate training data contamination and overfitting.
  • Its extensible toolkit and public data release foster reproducible research and community-driven advances in LLM evaluation.

LiveCodeBench is a holistic, contamination-controlled evaluation benchmark for LLMs in code-related tasks. It was created to address the limitations of legacy benchmarks by offering a continuously updated, rigorously filtered, and difficulty-balanced set of competitive programming problems. LiveCodeBench systematically measures multiple code-centric model capabilities and is accompanied by an extensible toolkit and a public data release that support reproducibility and future extension.

1. Motivation and Design Rationale

The rapid evolution of LLMs in code generation and reasoning has rendered existing benchmarks, such as HumanEval and MBPP, inadequate for accurately assessing current and emerging models. These older benchmarks predominantly focus on natural language to code generation, and suffer from contamination—problems present in the training data—and overfitting by fine-tuned models. LiveCodeBench directly addresses these vulnerabilities by:

  • Continuously collecting and curating new problems as they are released, mitigating training-time contamination via rigorous release-date tagging.
  • Systematically sampling challenging problems from LeetCode, AtCoder, and CodeForces, resulting in a dynamic, competitive, and up-to-date evaluation set.
  • Broadening evaluation scenarios beyond code synthesis to encompass debugging and programmatic reasoning capabilities.

This design enables a fair and evolving basis for community-wide LLM evaluation, where model improvements can be attributed to true advances in reasoning and programming, rather than memorization or overfitting to fixed datasets (Jain et al., 12 Mar 2024).

2. Data Collection and Curation Methodology

LiveCodeBench employs automated HTML scrapers to collect problems from the three major competitive programming platforms. Each instance in the dataset comprises:

  • The natural language problem statement,
  • Starter code (if present),
  • Public test cases and ground-truth solutions,
  • Contest and release metadata.

To ensure minimal contamination, problems are filtered so that only those published after a given LLM’s training cutoff are used for its evaluation. This is implemented via the "live" update mechanism, which ties each problem to its platform release date.
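
A minimal sketch of this release-date filtering is shown below; the field names and types are illustrative, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    """One benchmark instance; field names here are illustrative."""
    statement: str                        # natural language problem statement
    starter_code: str | None              # starter code, if the platform provides one
    public_tests: list[tuple[str, str]]   # (input, expected output) pairs
    release_date: date                    # platform release date used for the "live" filter
    platform: str                         # e.g. "leetcode", "atcoder", "codeforces"

def contamination_free_slice(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.release_date > training_cutoff]

# Example: evaluate a model whose pretraining data ends in September 2023.
# eval_set = contamination_free_slice(all_problems, date(2023, 9, 1))
```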

Additional curation steps include:

  • Exclusion of ambiguous or unsuited problems via automated and manual filters.
  • Balancing of difficulty distribution, using platform-provided ratings to cover the spectrum from easy to hard.
  • Robust test suite construction: an average of over 59 generated tests per problem, employing both random and adversarial generator-based synthesis where necessary to boost functional coverage.
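
Generator-based test synthesis can be pictured as pairing randomly generated inputs with outputs produced by a ground-truth solution. The sketch below assumes stdin/stdout-style problems and a hypothetical, problem-specific input generator; it is not the benchmark's actual generator code.

```python
import random
import subprocess

def random_input(max_n: int = 100) -> str:
    """Hypothetical generator for an 'array of integers' style problem."""
    n = random.randint(1, max_n)
    values = " ".join(str(random.randint(-10**9, 10**9)) for _ in range(n))
    return f"{n}\n{values}\n"

def run_program(path: str, stdin_text: str, timeout_s: float = 5.0) -> str:
    """Run a solution on one input and capture its stdout."""
    result = subprocess.run(
        ["python3", path], input=stdin_text,
        capture_output=True, text=True, timeout=timeout_s,
    )
    return result.stdout.strip()

def synthesize_tests(reference_solution: str, num_tests: int = 50) -> list[tuple[str, str]]:
    """Label random inputs with the ground-truth solution's outputs."""
    return [(inp, run_program(reference_solution, inp))
            for inp in (random_input() for _ in range(num_tests))]
```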

These methodological choices distinguish LiveCodeBench from static benchmarks and ensure that evaluation remains robust, fair, and dynamically resistant to data leakage (Jain et al., 12 Mar 2024).

3. Evaluated Capabilities and Scenarios

LiveCodeBench is designed to assess a broader array of code-centered LLM capabilities, moving beyond conventional code synthesis:

| Capability | Input Modality | Evaluation Metric |
| --- | --- | --- |
| Code Generation | Problem statement + example tests | Pass@1 |
| Self-Repair | Problem statement + faulty code + error feedback | Pass@1 (after repair) |
| Code Execution | Code snippet + input | Output correctness |
| Test Output Prediction | Problem statement + test input | Output correctness |
  • Code Generation: The model is given a problem and required to synthesize code that passes all provided tests, with Pass@1 as the principal metric (fraction of problems where at least one solution passes all tests).
  • Self-Repair: The LLM must fix its own generated code given explicit error feedback, measuring debugging capacity.
  • Code Execution: Instead of generating code, the LLM must predict the output of given code on specific inputs, approximating programmatic reasoning.
  • Test Output Prediction: The LLM predicts case results directly from the problem and input, measuring capacity for simulation and reasoning under test constraints.
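
To make the scoring concrete, the following is a minimal sketch of the functional-correctness check behind the code generation and self-repair scenarios, assuming stdin/stdout-style problems; details such as sandboxing and function-call style tests are omitted here.

```python
import subprocess

def passes_all_tests(candidate_path: str,
                     tests: list[tuple[str, str]],
                     timeout_s: float = 6.0) -> bool:
    """Return True iff the candidate program matches every expected output."""
    for stdin_text, expected in tests:
        try:
            proc = subprocess.run(
                ["python3", candidate_path], input=stdin_text,
                capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        if proc.returncode != 0 or proc.stdout.strip() != expected.strip():
            return False
    return True
```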

This multi-scenario evaluation provides a more comprehensive lens on model abilities than prior benchmarks, which focused nearly exclusively on code writing (Jain et al., 12 Mar 2024).

4. Empirical Results and Comparative Analysis

Empirical evaluation across 18 base LLMs and 34 instruction-tuned LLMs yields several findings:

  • Contamination Sensitivity: Some models (e.g., DeepSeek) show a marked performance drop on problems released after their training cutoff, indicating that earlier benchmarks likely suffer from contamination-driven score inflation.
  • Ranking Consistency and Task Shifts: Scores across tasks are highly correlated (often >0.9), though some models (e.g., GPT-4-Turbo, Claude-3-Opus) excel in self-repair or test output prediction relative to plain code generation.
  • Overfitting Detection: Fine-tuned open LLMs can achieve high scores on HumanEval or HumanEval+ but perform significantly worse on LiveCodeBench, indicating overfitting to legacy data distributions.
  • Closed vs. Open Models: State-of-the-art, closed-access models consistently outperform open models across most tasks. However, fine-tuned large open LLMs, such as DeepSeek-Instruct-33B or Phind-34B, can approach top-tier closed model performance.

The principal metric is:

$$\mathrm{Pass@1} = \frac{\text{Number of problems where at least one solution passes all tests}}{\text{Total problems}}$$

This unified metric is reported across all evaluated modalities (Jain et al., 12 Mar 2024).
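
In code, the aggregation could be expressed as below, where each problem maps to the pass/fail outcomes of its sampled solutions; this is a sketch of the metric as defined above, not the benchmark's implementation.

```python
def pass_at_1(per_problem_outcomes: dict[str, list[bool]]) -> float:
    """Fraction of problems where at least one sampled solution passes all tests."""
    solved = sum(1 for outcomes in per_problem_outcomes.values() if any(outcomes))
    return solved / len(per_problem_outcomes)

# Example: two of three problems solved -> 0.666...
# pass_at_1({"p1": [True], "p2": [False, False], "p3": [False, True]})
```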

5. Toolkit, Data Release, and Community Integration

A central feature of LiveCodeBench is its extensible evaluation framework:

  • The benchmark is accompanied by a toolkit for adding new scenarios, models, and custom filtering logic.
  • All prompts, completions, and detailed results are publicly released, supporting reproducibility and detailed post hoc analyses.
  • The live evaluation user interface enables researchers to “scroll” through problems by release date, facilitating contamination checks and dynamic slicing of the evaluation set (see the sketch after this list).
  • Researchers are encouraged to build on LiveCodeBench’s infrastructure for studying both contamination and benchmark-specific fine-tuning and for extending the ecosystem to novel tasks.
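
For instance, assuming the public release is hosted on the Hugging Face Hub (the dataset identifier, split, and field names below are assumptions to be checked against the official release), date-based slicing might look like this:

```python
from datasets import load_dataset  # pip install datasets

# Dataset name, split, and field names are assumptions, not verified identifiers.
problems = load_dataset("livecodebench/code_generation_lite", split="test")

# Dynamic slicing by release date, mirroring the "live" contamination check;
# assumes the release date is stored as an ISO-formatted string.
recent = problems.filter(lambda row: row["contest_date"] >= "2024-06-01")
print(f"{len(recent)} problems released on or after 2024-06-01")
```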

This open, flexible approach contrasts with proprietary or fixed datasets and positions LiveCodeBench as a continuously improving resource for the code LLM community (Jain et al., 12 Mar 2024).

6. Limitations, Impact, and Future Directions

While LiveCodeBench introduces significant methodological improvements, certain challenges remain:

  • The live-update and problem-filtering mechanisms minimize but do not completely eliminate the possibility of subtle contamination, especially as model and pretraining data timeframes shift.
  • Automated test suite generation, while robust, may miss exotic corner cases unless augmented by human-in-the-loop or adversarial test generation strategies.
  • The strong correlation of sub-metrics may imply residual redundancy in the types of problems currently collected, though differential ranking shifts across tasks indicate multidimensional benchmarking value.

Nonetheless, LiveCodeBench has already contributed to the identification of overfitting in open models, exposed clear performance gaps relative to proprietary systems, and demonstrated a framework for benchmarking future code-centric LLMs as the field evolves. Its rigorous contamination controls, scenario breadth, and tooling ecosystem set a new community standard for fair and comprehensive LLM evaluation in code.

A plausible implication is that as tooling and test suite augmentation improve, LiveCodeBench will play an increasingly important role in training, RL-based fine-tuning, and robustly benchmarking LLMs in code—potentially forming the basis for more advanced, adaptive evaluation platforms.

References

  • Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. (2024). LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. arXiv:2403.07974, 12 March 2024.