LiveCodeBench v5 Benchmark
- LiveCodeBench v5 is a comprehensive benchmark offering real-world contest-style problems to evaluate LLM code synthesis, debugging, and reasoning skills.
- It supports multiple evaluation scenarios—including code generation, self-repair, code execution, and test-output prediction—with rigorous contamination control.
- LCB v5 drives methodological innovations such as Guided Asymmetric Self-Play and Critique Reinforcement Learning, leading to improved state-of-the-art performance.
LiveCodeBench v5 (LCB v5) is a large-scale, continuously updated, contamination-aware benchmark for evaluating LLMs on competitive programming and code reasoning tasks. Designed to address the limitations of prior static, contamination-prone code benchmarks, it provides a rigorous, scenario-diverse evaluation environment for models with advanced code synthesis, debugging, and reasoning capabilities. LCB v5 is widely used to assess post-2024 code LLMs and to report state-of-the-art contest-level coding results in both academic and industry research.
1. Benchmark Structure and Dataset Properties
LCB v5 consists entirely of real-world contest-style problems collected between August 2024 and February 2025 from platforms including LeetCode, AtCoder, and CodeForces. Each problem is a natural-language prompt accompanied by public samples and an extensive hidden test suite (≥50 tests per problem), generated through a combination of generator-based randomization and adversarial case construction. This design provides broad coverage of edge cases and reduces the risk of overfitting or data leakage.
LCB v5 expands beyond traditional code-generation evaluation by supporting four complementary scenarios:
- Code Generation: Synthesize a fully functional program from a natural-language specification and test against hidden cases.
- Self-Repair: Iterative debugging, where the model is prompted to repair its own failing solutions based on provided error feedback.
- Code Execution: Given code and an input, predict the output.
- Test-Output Prediction: Given a problem description and a test input, directly predict the output, measuring reasoning divorced from implementation.
Problems are carefully time-stamped and balanced across Easy, Medium, and Hard difficulty levels and across platforms. Dataset-wide, LCB v5 hosts 400 problems; performance metrics are typically reported on time-filtered evaluation splits to prevent contamination from model training data (Jain et al., 2024).
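The time-filtered splits described above can be illustrated with a minimal sketch. The `Problem` record and its field names are hypothetical and chosen only for illustration; they are not the benchmark's actual schema. The sketch keeps only problems whose contest date falls inside a chosen window, for example one that starts after a model's training cutoff.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; field names are illustrative, not the real schema.
    problem_id: str
    platform: str
    difficulty: str
    contest_date: date


def time_filtered_split(problems, start, end):
    """Keep problems whose contest date lies in [start, end], inclusive."""
    return [p for p in problems if start <= p.contest_date <= end]


problems = [
    Problem("a", "leetcode", "easy", date(2024, 7, 15)),
    Problem("b", "atcoder", "medium", date(2024, 9, 1)),
    Problem("c", "codeforces", "hard", date(2025, 1, 20)),
]

# Window matching the collection period stated in the text.
window = time_filtered_split(problems, date(2024, 8, 1), date(2025, 2, 28))
print([p.problem_id for p in window])  # ['b', 'c']
```

Choosing `start` later than a model's training cutoff is what makes the split contamination-aware: problems published before the cutoff are simply excluded from that model's evaluation.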
2. Evaluation Protocols and Metrics
LCB v5 uses stringent evaluation metrics centered on correctness against all hidden test cases. The dominant protocol samples multiple independent candidate solutions per problem, with a solution counted as correct only if it passes the entire suite.
- pass@k: Probability that at least one of $k$ sampled solutions is fully correct. The standard unbiased estimator is:
  $$\text{pass@}k = \mathbb{E}_{\text{problems}}\left[\,1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\,\right]$$
  where $n \ge k$ is the number of candidate solutions sampled per problem, and $c$ counts the correct ones (Jana et al., 16 Mar 2026).
- avg@k: For some experiments, especially with smaller $k$, pass@1 is computed over $k$ independent model runs per task; the arithmetic mean over these runs yields avg@k (Su et al., 11 Aug 2025).
- Single-scenario metrics: For non-synthesis tasks (e.g., test-output prediction), accuracy is reported as the proportion of correct model outputs.
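As a minimal sketch of the two sampling-based metrics above: `pass_at_k` implements the standard unbiased estimator 1 - C(n-c, k)/C(n, k) for a single problem with n samples of which c are correct, and `avg_at_k` averages per-run pass@1 outcomes. The function names and the boolean input layout are assumptions made for illustration, not part of any official harness.

```python
from math import comb
from statistics import fmean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples, c of them fully correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 whenever fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(counts, k: int) -> float:
    """Average the per-problem estimator over (n, c) pairs."""
    return fmean([pass_at_k(n, c, k) for n, c in counts])


def avg_at_k(outcomes) -> float:
    """outcomes[i][j] is True if run j solved problem i (all tests passed)."""
    per_problem = [sum(runs) / len(runs) for runs in outcomes]
    return fmean(per_problem)


print(round(pass_at_k(n=10, c=3, k=5), 4))  # 0.9167
print(avg_at_k([[True, False, True, True], [False] * 4]))  # 0.375
```

Computing the estimator with exact integer binomials avoids the numerical issues of multiplying many small probabilities; for very large n, a running-product formulation is a common alternative.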
Evaluation is typically performed in “thinking” mode (long context, temperature 0.6, top-p ≈ 0.95 to 0.98, token budgets of up to 64K), with grading against all test cases. For contest evaluation, only solutions passing all tests are accepted; no partial credit is given.
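The all-or-nothing grading rule can be sketched as follows. This is a simplified illustration under stated assumptions: candidates are modeled as in-process Python callables mapping input text to output text, whereas a real harness would run untrusted code in a sandboxed subprocess with time and memory limits. The names `passes_all`, `add_two`, and `buggy_add` are hypothetical.

```python
def passes_all(candidate, tests) -> bool:
    """Return True only if the candidate matches every expected output."""
    for stdin_text, expected in tests:
        try:
            actual = candidate(stdin_text)
        except Exception:
            # A crash on any test counts as a failure of the whole problem.
            return False
        if actual.strip() != expected.strip():
            return False
    return True


def add_two(stdin_text: str) -> str:
    a, b = map(int, stdin_text.split())
    return str(a + b)


def buggy_add(stdin_text: str) -> str:
    a, b = map(int, stdin_text.split())
    return str(abs(a) + abs(b))  # wrong whenever an input is negative


tests = [("1 2", "3"), ("-4 1", "-3"), ("0 0", "0")]
print(passes_all(add_two, tests))    # True
print(passes_all(buggy_add, tests))  # False: two of three passing earns no credit
```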
3. Methodological Innovations Enabled by v5
LCB v5’s design enables comprehensive benchmarking for both conventional and cutting-edge algorithmic training methodologies:
- Contamination-free assessment: Strict time-based splits and 9-gram overlap filtering ensure model evaluation is free from training set leakage (Jain et al., 2024, Su et al., 11 Aug 2025).
- Multi-format support: Code generation, self-repair, code execution, and output prediction allow for holistic evaluation of both synthesis and comprehension (Jain et al., 2024, Jana et al., 16 Mar 2026).
- Test diversity and generator-based evaluation: Programmatic test synthesis (with generator and adversarial coverage) surfaces model limitations on hard or edge-case scenarios (Jain et al., 2024, Wu et al., 11 Jan 2026).
- Difficulty stratification: Enables ablations by task hardness, critical for quantifying progress in high-difficulty, real-contest settings.
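The 9-gram overlap filtering mentioned in the list above can be sketched as a simple set-intersection check. The tokenization (lowercasing and whitespace splitting) and the function names are assumptions for illustration; the cited works may normalize text differently.

```python
def ngrams(text: str, n: int = 9) -> set:
    """Set of word-level n-grams, using lowercase whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(training_texts, benchmark_texts, n: int = 9):
    """Drop training texts that share any n-gram with a benchmark text."""
    benchmark_grams = set()
    for text in benchmark_texts:
        benchmark_grams |= ngrams(text, n)
    return [t for t in training_texts
            if ngrams(t, n).isdisjoint(benchmark_grams)]


benchmark = [
    "given an array of integers return the length of the longest "
    "strictly increasing subsequence"
]
training = [
    "Problem: given an array of integers return the length of the longest path",
    "write a function that reverses a linked list in place and returns the new head",
]
kept = decontaminate(training, benchmark)
print(len(kept))  # 1: the first training text shares a 9-gram and is dropped
```

Texts shorter than nine tokens produce no n-grams and are therefore always kept under this scheme, which is one reason overlap filtering is usually combined with time-based splits rather than used alone.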
This benchmark enables advanced RL methods (e.g., RLVR, Cascade RL, GPPO), critique-augmented training, multi-agent frameworks (Xolver), and training/model selection grounded in rigorous, up-to-date evaluation (Jana et al., 16 Mar 2026, Wang et al., 15 Dec 2025, Hosain et al., 17 Jun 2025, Ruan et al., 26 Sep 2025, Su et al., 11 Aug 2025, Wu et al., 11 Jan 2026).
4. Comparative Results: State-of-the-Art on LCB v5
LCB v5 has been pivotal in benchmarking and driving progress for code LLMs spanning both open and closed weights, synthetic and real-data training, and a variety of RL enhancements. Notable performance results, grouped by reported metric and sample regime, include:
| Model / Methodology | Metric / Pass Rate | Training Source | Params | Reference |
|---|---|---|---|---|
| Xolver (+) (o3-mini-high) | pass@1: 91.6% | Multi-agent, holistic | Proprietary | (Hosain et al., 17 Jun 2025) |
| OpenAI o4-mini (high) | pass@1: 82.8% (avg@8) | Closed | N/A | (Wang et al., 15 Dec 2025) |
| Nemotron-Cascade-14B | pass@1: 77.5% (avg@8) | Cascade RL, open | 14B | (Wang et al., 15 Dec 2025) |
| Klear-Reasoner | avg@8: 66.0% | Long CoT SFT + GPPO | 8B | (Su et al., 11 Aug 2025) |
| Critique-Coder-8B | pass@1: 60.8% (top-20) | RL + CRL, open | 8B | (Ruan et al., 26 Sep 2025) |
| X-Coder-Qwen2.5 (Synth) | avg@8: 62.9% | SFT→RL, synthetic | 7B | (Wu et al., 11 Jan 2026) |
| DeepCoder-14B | pass@1: 60.6% (top-20) | RL, open | 14B | (Ruan et al., 26 Sep 2025) |
| Qwen3-8B Baseline | pass@1: 57.5% (top-20) | SFT, open | 8B | (Ruan et al., 26 Sep 2025) |
These results demonstrate both the challenge of LCB v5 and the diversity of successful approaches. For proprietary, inference-time frameworks with memory (Xolver), pass@1 exceeds 90%. Advanced open-weight RL models trained with a curriculum (Nemotron-Cascade) achieve nearly 78%. RL models with gradient-preserving objectives (Klear-Reasoner) and synthetic-only RL (X-Coder) surpass 60% avg@8 with modest parameter counts, and in X-Coder's case without access to human-written data (Hosain et al., 17 Jun 2025, Wang et al., 15 Dec 2025, Su et al., 11 Aug 2025, Wu et al., 11 Jan 2026).
5. Key Methodologies and Ablation-Driven Insights
LCB v5 has catalyzed new forms of RL, curriculum design, and data generation:
- Guided Asymmetric Self-Play (GASP): GASP leverages LCB v5 to define goalposts—hard, previously unsolved problems—used to ground the self-play process. Teachers generate curriculum chains of automatically constructed “lemma” and “lift” problems targeting these goalposts, guided by learnability-based rewards and diversity sampling. This approach surpasses unguided AZR by ∼2.5% pass@20 and enables models to solve previously unsolved tasks (Jana et al., 16 Mar 2026).
- Critique Reinforcement Learning (CRL): Critique-Coder augments standard RL with critique rewards on (question, solution) pairs, explicitly training for self-assessment fidelity. The optimal regime is 20% CRL data combined with 80% RL, yielding richer reasoning traces and better generalization, and achieving state-of-the-art results among similarly sized models (Ruan et al., 26 Sep 2025).
- Multi-Agent Holistic Reasoning (Xolver): Xolver applies multi-agent collaboration combined with episodic memory retrieval and iterative code refinement, substantially raising pass@1 and solving problems across a broader range of difficulty (Hosain et al., 17 Jun 2025).
- Cascaded Domain-wise RL (Nemotron-Cascade): Sequential RL curriculum spanning alignment, instruction following, math, code, and software engineering tasks leads to incremental and transferable gains without catastrophic forgetting. Code RL delivers the largest incremental gain on LCB v5 (Wang et al., 15 Dec 2025).
- Fully Synthetic Data Regimes (X-Coder): Generating and training on large-scale, feature-rich synthetic data (SynthSmith) enables code LLMs to reach performance competitive with real-data training, suggesting that reliance on human-written data can plausibly be reduced at large scale (Wu et al., 11 Jan 2026).
- Long Chain-of-Thought SFT and GPPO (Klear-Reasoner): Carefully curated long-CoT traces in SFT and gradient-preserving PPO RL objectives, along with soft token-level reward shaping, accelerate learning from difficult, noisy, and negative samples (Su et al., 11 Aug 2025).
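The 20% CRL / 80% RL data regime reported for Critique-Coder can be illustrated with a small mixing sketch. Whether the ratio is applied per batch or over the whole dataset is not stated here, so per-batch mixing is an assumption, as are the function name `mix_batch` and the item representation.

```python
import random


def mix_batch(rl_items, crl_items, batch_size, crl_fraction=0.2, seed=0):
    """Sample one batch with roughly crl_fraction critique (CRL) items."""
    rng = random.Random(seed)
    n_crl = round(batch_size * crl_fraction)
    n_rl = batch_size - n_crl
    # Sampling without replacement; each pool must be large enough.
    batch = rng.sample(crl_items, n_crl) + rng.sample(rl_items, n_rl)
    rng.shuffle(batch)
    return batch


rl_pool = [("rl", i) for i in range(100)]
crl_pool = [("crl", i) for i in range(100)]
batch = mix_batch(rl_pool, crl_pool, batch_size=10)
print(sum(kind == "crl" for kind, _ in batch))  # 2
```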
Ablation studies across these works consistently favor reward shaping (soft pass rates over hard binary success), rigorous filtering for training/evaluation signal purity, and the scaling of unique, high-difficulty tasks for sample-efficient improvement.
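The contrast between hard binary success and soft pass rates can be made concrete with a short sketch. The exact shaping used in each cited work is not specified here; the two functions below are generic illustrations over a list of per-test pass/fail outcomes.

```python
def hard_reward(test_results) -> float:
    """1.0 only if every test passed, otherwise 0.0."""
    return 1.0 if test_results and all(test_results) else 0.0


def soft_reward(test_results) -> float:
    """Fraction of tests passed, giving partial signal for near misses."""
    return sum(test_results) / len(test_results) if test_results else 0.0


near_miss = [True, True, False, True]
print(hard_reward(near_miss))  # 0.0
print(soft_reward(near_miss))  # 0.75
```

The soft variant distinguishes a solution that fails one edge case from one that fails everything, which gives a denser training signal on hard problems. Note that this applies to training rewards only; benchmark scoring remains all-or-nothing.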
6. Broader Impact, Limitations, and Future Directions
LCB v5 has established a robust, contamination-controlled benchmark infrastructure for measuring progress in code reasoning LLMs. It enables open and closed-source models to be evaluated on a level field and drives methods development in RL, curriculum learning, synthetic data, and holistic tooling (Jain et al., 2024).
Significant findings include:
- Synthetic data, when sufficiently scaled and curated, can drive code RL beyond what is attainable with smaller real-world datasets (Wu et al., 11 Jan 2026).
- Critique and reflection, when integrated with RL, lead to more robust and explainable model behavior (Ruan et al., 26 Sep 2025).
- Multi-agent and memory-augmented inference is powerful for contest-level code reasoning but may require substantive compute and system integration (Hosain et al., 17 Jun 2025).
A persistent limitation is the reliance on fixed test suites: models may overfit to prevalent test-generation heuristics, although active refreshing with real contest data and the ongoing expansion of LCB are expected to mitigate this risk. Future releases are anticipated to support emerging code tasks (input generation, code summarization) and to facilitate broader generalization analysis.
References
- (Jain et al., 2024) "LiveCodeBench: Holistic and Contamination Free Evaluation of LLMs for Code"
- (Jana et al., 16 Mar 2026) "GASP: Guided Asymmetric Self-Play For Coding LLMs"
- (Ruan et al., 26 Sep 2025) "Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning"
- (Hosain et al., 17 Jun 2025) "Xolver: Multi-Agent Reasoning with Holistic Experience Learning Just Like an Olympiad Team"
- (Wu et al., 11 Jan 2026) "X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests"
- (Wang et al., 15 Dec 2025) "Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models"
- (Su et al., 11 Aug 2025) "Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization"