LiveCodeBench v5/v6/Pro Benchmarks
- LiveCodeBench v5/v6/Pro is a family of continuously updated, contest-style benchmarks designed to assess large language models on competitive programming tasks with varying complexity.
- The benchmarks use strict evaluation protocols and pass@k metrics, integrating human calibration and detailed diagnostic labeling to measure algorithmic and cognitive performance.
- Advanced methods such as cascaded reinforcement learning and dynamic tool integration drive performance gains, while exposing persistent challenges in complex, observation-heavy problems.
LiveCodeBench v5, v6, and Pro constitute a family of rigorous, continuously updated benchmarks for evaluating LLMs on competitive programming tasks at varying levels of algorithmic sophistication. These benchmarks cover diverse problem formats, input-output protocols, and cognitive demands, and are central to state-of-the-art studies in code-centric LLM reasoning. Distinctions across versions reflect advances in problem selection, evaluation protocols, diagnostic depth, and human calibration.
1. Benchmark Genesis, Structure, and Problem Sources
LiveCodeBench v5 and v6 are constructed as “contest-style” benchmarks, each drawing problems from established competitive programming platforms such as LeetCode, AtCoder, and Codeforces. The v5 benchmark (August 2024–February 2025, 279 problems) and v6 benchmark (August 2024–May 2025, 454 problems) mark major collection windows spanning recent real-world contests, with a strict contamination-free policy (i.e., problems are verified as unseen during model training) (Wang et al., 15 Dec 2025).
In typical reported evaluations, a LiveCodeBench v6 subset of approximately 175 tasks is used, stratified by difficulty (75 Easy, 75 Medium, 25 Hard). Task taxonomy follows a LeetCode-based categorization: Array, Dynamic Programming (DP), Graph, Greedy, Hash Map, Math, Simulation, and String. Each task uses standard input/output via stdin/stdout, includes hidden test suites that assess correctness on edge cases and enforce competition-grade time/space bounds, and features varied “real-world” input-formatting scenarios (Zhang et al., 12 Aug 2025). A toy harness illustrating this protocol is sketched below.
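The following minimal harness is a sketch of that evaluation style, not the official LiveCodeBench judge: it runs a candidate Python solution over a list of hidden stdin/stdout cases with a wall-clock limit, counting any crash, timeout, or wrong answer as a failure. The test data, file names, and time limit are illustrative assumptions.

```python
import subprocess

# Hypothetical hidden test suite: (stdin_text, expected_stdout) pairs.
HIDDEN_TESTS = [
    ("3\n1 2 3\n", "6\n"),
    ("1\n-5\n", "-5\n"),
]

TIME_LIMIT_S = 2.0  # assumed competition-style wall-clock limit


def run_solution(solution_path: str) -> bool:
    """Return True only if the solution passes every hidden case."""
    for stdin_text, expected in HIDDEN_TESTS:
        try:
            proc = subprocess.run(
                ["python3", solution_path],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=TIME_LIMIT_S,
            )
        except subprocess.TimeoutExpired:
            return False  # time-limit exceeded counts as a failure
        if proc.returncode != 0:
            return False  # runtime error / crash counts as a failure
        if proc.stdout.strip() != expected.strip():
            return False  # wrong answer on a hidden case
    return True


if __name__ == "__main__":
    print(run_solution("candidate_solution.py"))  # hypothetical file name
```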
LiveCodeBench Pro, introduced in 2025, extends the benchmark paradigm by integrating dynamic, real-time problem ingestion from premier competitions (Codeforces rated rounds, ICPC, IOI, MITIT, THUPC). Each Pro problem passes multi-layer human vetting and annotation by Olympiad medalists, exposes a complete test suite after contest resolution, and is mapped to a difficulty-aware Elo rating. Problems are tagged for algorithmic paradigm and “cognitive focus” (knowledge-heavy, logic-heavy, or observation-heavy), supporting precise analysis of LLM capabilities and deficits (Zheng et al., 13 Jun 2025).
2. Evaluation Protocols and Metric Definitions
The canonical metric for LiveCodeBench is pass@k, defined as the probability that at least one of k sampled model responses solves the task (passes all hidden tests). For v5/v6/Pro, primary results are reported with pass@1 or pass@8, depending on generation settings:
- pass@1: $\text{pass@1} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\big[\text{the sampled solution for problem } i \text{ passes all hidden tests}\big]$, where $N$ is the total number of problems. A solution passes only if it produces correct outputs on all hidden cases (up to 128 per problem in v6), with any runtime error, crash, or time/space-limit violation counted as a failure (Zhang et al., 12 Aug 2025).
- For pass@k in general, with $n$ samples drawn per problem and $c$ of them correct, the unbiased estimator is $\text{pass@}k = \mathbb{E}_{\text{problems}}\!\left[1 - \binom{n-c}{k}\Big/\binom{n}{k}\right]$ (CodeGrad reports k=1); a computational sketch follows this list (Zhang et al., 12 Aug 2025).
- In Pro, pass@k is complemented by Elo-based calibration: for each model, the MAP estimate of its rating is derived via maximization of the log-likelihood of observed accepts/rejects against problem-specific official Elo difficulties (Zheng et al., 13 Jun 2025).
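As referenced above, a minimal sketch of the standard unbiased pass@k estimator; the per-problem sample counts in the example are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem:
    n = samples drawn, c = samples that pass all hidden tests."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Aggregate over a benchmark as the mean of per-problem estimates.
results = [(8, 3), (8, 0), (8, 8)]  # hypothetical (n, c) per problem
k = 1
print(sum(pass_at_k(n, c, k) for n, c in results) / len(results))
```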
Evaluation enforces single-language (commonly C++ or Python) prompt environments, disables tool calls by default (unless evaluating tool-augmented settings), and uses only the public post-contest or official hidden test suites.
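The Elo-based calibration described for Pro can be sketched as follows: under the usual logistic Elo model, the model's rating is the value that maximizes the likelihood of its observed accepts/rejects against the problems' official difficulty ratings, optionally regularized by a Gaussian prior. The grid search and the prior parameters below are assumptions, not the paper's exact procedure.

```python
import math


def solve_prob(rating: float, difficulty: float) -> float:
    """Logistic Elo model: probability that a model with `rating`
    solves a problem with official Elo `difficulty`."""
    return 1.0 / (1.0 + 10 ** ((difficulty - rating) / 400.0))


def map_rating(outcomes, prior_mean=1500.0, prior_sd=400.0):
    """MAP estimate of a model's rating from (difficulty, solved) pairs,
    found by a coarse grid search (illustrative only)."""
    best_r, best_ll = None, -math.inf
    for r in range(0, 4001, 5):
        ll = -((r - prior_mean) ** 2) / (2 * prior_sd ** 2)  # Gaussian prior
        for difficulty, solved in outcomes:
            p = solve_prob(r, difficulty)
            ll += math.log(p if solved else 1.0 - p)
        if ll > best_ll:
            best_r, best_ll = r, ll
    return best_r


# Hypothetical outcomes: (problem Elo, solved?)
print(map_rating([(1200, True), (1600, True), (2000, False), (2400, False)]))
```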
3. Diagnostic Features and Annotation Procedures (Pro)
LiveCodeBench Pro incorporates specialized annotation and analysis mechanisms:
- Olympiad medalists tag each problem by both principal algorithmic technique and cognitive focus (e.g., whether success depends on stepwise logic, prior technical knowledge, or “aha”-style insight).
- For model and human failures, line-by-line diagnostic labeling is performed. Each failure is categorized as an idea-level (conceptual/algorithmic) or implementation-level (coding/formatting) error. Special flagging occurs for “Fails Sample” errors—where the submission fails even on the sample input from the problem statement. This enables fine-grained comparison between LLM and human error profiles (Zheng et al., 13 Jun 2025).
- These diagnostics expose qualitative differences: LLMs exhibit a high ratio of idea-level to implementation-level failures (34:1 versus human solvers), with frequent confidently incorrect algorithmic justifications, especially in observation-heavy and case-based tasks.
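As a schematic of what such line-by-line diagnostic labeling produces, the record below uses hypothetical field names and values; it is not the released Pro annotation format.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class FailureAnnotation:
    """Hypothetical schema for one annotated failed submission."""
    problem_id: str
    cognitive_focus: Literal["knowledge-heavy", "logic-heavy", "observation-heavy"]
    error_level: Literal["idea", "implementation"]
    fails_sample: bool          # failed even on the statement's sample input
    annotator_note: str = ""    # free-text rationale from the medalist annotator


example = FailureAnnotation(
    problem_id="cf_1900_D",     # hypothetical identifier
    cognitive_focus="observation-heavy",
    error_level="idea",
    fails_sample=False,
    annotator_note="Greedy exchange argument is invalid for duplicate values.",
)
```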
4. Model Performance Across Versions
Recent evaluations provide key quantitative baselines:
- On v5, DeepSeek-R1-0528 achieves 74.8% pass@8, while Nemotron-Cascade-14B-Thinking achieves 77.5% (+2.7 pp absolute); on v6, these scores are 73.3% and 74.6%, respectively (+1.3 pp) (Wang et al., 15 Dec 2025).
- On LiveCodeBench Pro, Easy problems yield 68.9% pass@8 for Nemotron-Cascade-14B versus 63.9% for DeepSeek-R1, whereas Medium problems see only 10.5% pass@8, reflecting the steep difficulty gradient (see the table below).
| Model | v5 (avg@8) | v6 (avg@8) | Pro Easy (avg@8) | Pro Med (avg@8) |
|---|---|---|---|---|
| DeepSeek-R1-0528 (671B) | 74.8% | 73.3% | 63.9% | 7.0% |
| Nemotron-Cascade-14B-Thinking | 77.5% | 74.6% | 68.9% | 10.5% |
| Δ | +2.7 pp | +1.3 pp | +4.8 pp | +3.5 pp |
On v6, CodeGrad’s multi-step, constrained LLM refinement yields a 40.6% relative gain over a strong forward-only baseline (0.201 vs. 0.143 pass@1), especially in categories like Simulation (+152%) and Math (+120%) and for Medium-difficulty problems (0.019 → 0.115, +505%). However, even frontier models achieve near-0% success on Hard Pro problems, where expert humans routinely succeed (Zhang et al., 12 Aug 2025, Zheng et al., 13 Jun 2025).
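The relative gains quoted above follow directly from the reported pass@1 values:

```python
# Relative gain = (new - old) / old, applied to the reported pass@1 values.
overall = (0.201 - 0.143) / 0.143      # ≈ 0.406 → +40.6% overall on v6
medium = (0.115 - 0.019) / 0.019       # ≈ 5.05  → +505% on Medium problems
print(f"{overall:.1%}, {medium:.0%}")
```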
5. Technical Advances in LCB-Oriented Model Development
Recent model development for LiveCodeBench emphasizes several components:
- Cascaded Domain-wise RL: Models like Nemotron-Cascade apply staged fine-tuning and RL, sequentially specializing for human alignment (RLHF), instruction-following (IF-RL), mathematical reasoning (Math RL), code generation (Code RL), and code repair (SWE RL). The on-policy Group Relative Policy Optimization (GRPO) algorithm with group-normalized rewards balances stagewise learning; a toy sketch of the group-normalization step follows this list (Wang et al., 15 Dec 2025).
- Isolation of Domains: By partitioning RL, curricula and verifiers become domain-specific, reducing catastrophic forgetting and tuning overhead. Each RL stage (e.g., Math RL, Code RL) delivers distinct gains (e.g., Math RL +2.2 pp on v6, Code RL +4 pp on Med/Hard) without degrading earlier domain performance.
- Tool Integration and Ablation: External tool support (compilation, local checking, brute-force solvers) produces a ~600 Elo improvement on Codeforces-style tasks but reveals that such gains are orthogonal to “pure” reasoning. This suggests current LLMs’ high performance is driven largely by test-passing heuristics refined by RL, not intrinsic algorithmic insight (Zheng et al., 13 Jun 2025).
- Efficiency Enhancements: Use of asynchronous parallel verification (VeRL) reduces batch evaluation time on v5/v6/Pro and enables scaling to evaluation regimes with long reasoning (up to 64 K tokens per prompt) (Wang et al., 15 Dec 2025).
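As referenced above, a minimal sketch of the group-normalized reward step at the core of GRPO; the rollout count and binary reward scheme are simplified assumptions, not the Cascade training configuration.

```python
import statistics


def grpo_advantages(group_rewards):
    """GRPO-style advantages: normalize each sampled response's reward by the
    mean and standard deviation of its own group (responses to the same prompt)."""
    mu = statistics.mean(group_rewards)
    sigma = statistics.pstdev(group_rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in group_rewards]


# Example: 8 rollouts for one coding prompt, reward 1.0 if all hidden tests pass.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
print(grpo_advantages(rewards))
```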
6. Cognitive and Algorithmic Coverage
LiveCodeBench (esp. Pro) is stratified by both algorithmic and cognitive focus:
- Knowledge-heavy: template-based tasks and standard data structure use (segment trees, implementation).
- Logic-heavy: tasks demanding explicit derivation (DP, combinatorics, number theory).
- Observation-heavy: “aha” problems requiring subtle pattern recognition or case analysis.
Empirical results indicate LLMs are most successful in knowledge- and logic-heavy styles but deficient in observation-heavy paradigms, where even top models produce confidently incorrect arguments and miss crucial test cases—a trend not evident in expert human solvers (Zheng et al., 13 Jun 2025).
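A short sketch of how per-focus pass rates can be aggregated from tagged results; the records and tag strings are hypothetical but mirror the knowledge/logic/observation split above.

```python
from collections import defaultdict

# Hypothetical per-problem records: (cognitive_focus_tag, solved?)
records = [
    ("knowledge-heavy", True), ("knowledge-heavy", True),
    ("logic-heavy", True), ("logic-heavy", False),
    ("observation-heavy", False), ("observation-heavy", False),
]

totals, solved = defaultdict(int), defaultdict(int)
for tag, ok in records:
    totals[tag] += 1
    solved[tag] += int(ok)

for tag in totals:
    print(f"{tag}: {solved[tag] / totals[tag]:.0%} solved")
```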
7. Prospects and Limitations
Although LiveCodeBench v5/v6/Pro have catalyzed rapid algorithmic LLM advances, several limitations persist:
- Even with domain-specialized RL and staged curricula, model performance on Pro Hard problems remains near 0%, highlighting a significant gap to expert human reasoning.
- LLMs’ weaknesses cluster around conceptual understanding and complex case decomposition; Pro’s diagnostic labeling suggests that targeted augmentation (e.g., chain-of-thought prompting, interactive reasoning architectures) is required to close the human-model gap.
- The “live,” contamination-free nature and continuous expert re-annotation of Pro ensure that this gap reflects true generalization rather than memorization or distribution leakage (Zheng et al., 13 Jun 2025).
LiveCodeBench v5/v6/Pro provide a rigorous, evolving standard for code-focused LLM evaluation, supporting both benchmarking and the precise identification of failure modes, with implications for future research directions in model architecture and evaluation methodology (Zhang et al., 12 Aug 2025, Zheng et al., 13 Jun 2025, Wang et al., 15 Dec 2025).