LiveCodeBench Pro: Benchmarking LLMs in Competitive Programming
Last updated: June 16, 2025
LLMs have demonstrated impressive progress on conventional code generation tasks, sparking claims that they might match or surpass elite human performers in competitive programming. LiveCodeBench Pro provides a rigorous, expert-driven framework for probing these claims, offering a contamination-resistant, continuously updated benchmark curated and analyzed by international Olympiad medalists. This article presents the origins, design, methodology, and findings of LiveCodeBench Pro, strictly grounded in its 2025 peer-reviewed paper, and situates its contributions in the evolving context of code evaluation benchmarks.
Significance and Background
LLMs perform strongly on legacy code benchmarks such as HumanEval and MBPP, but the validity of these results is increasingly questioned due to pervasive data contamination, limited scenario coverage, and a focus on rote implementation over algorithmic reasoning or creativity (Jain et al., 12 Mar 2024). Classic benchmarks frequently feature samples present in LLM training corpora and emphasize template-based or function-level tasks, failing to capture the diversity and complexity of real competitive programming challenges.
LiveCodeBench Pro addresses these shortcomings with an active, contamination-minimized dataset, real-time curation, and systematic expert annotation. By harvesting problems from ongoing competitive programming contests before any solution is publicly released—and explicitly excluding highly leaked sources such as LeetCode—the framework ensures that evaluation reflects genuine novelty for both models and humans (Zheng et al., 13 Jun 2025). This proactive approach, together with hands-on error analysis by Olympiad medalists, supports a granular assessment of problem-solving ability beyond implementation correctness.
Foundational Concepts and Methodology
Data Curation and Real-Time Updates
LiveCodeBench Pro comprises 584 problems (as of April 25, 2025) sourced from Codeforces, ICPC, and IOI. Problems are systematically collected as contests conclude—prior to the release of solutions, discussions, or editorials—ensuring that the test set remains unknown to both the modeling community and potential LLM training pipelines. LeetCode and similar platforms are purposely excluded due to the risk of training data contamination. All problems include robust, adversarially crafted test inputs, with validation by contest coordinators and expert community members (Zheng et al., 13 Jun 2025).
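In practice, the curation rule reduces to a simple filter: accept a problem only if it comes from an allowed platform and was harvested after its contest ended but before any editorial or discussion appeared. The sketch below illustrates that rule; the schema, field names, and timestamps are hypothetical, not the benchmark's actual pipeline.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

EXCLUDED_SOURCES = {"leetcode"}  # excluded due to high leakage/contamination risk

@dataclass
class ContestProblem:
    source: str              # e.g., "codeforces", "icpc", "ioi"
    contest_end: datetime    # when the contest concluded
    editorial_published: bool

def is_eligible(problem: ContestProblem, harvested_at: datetime) -> bool:
    """Keep a problem only if its platform is allowed and it was harvested
    after the contest ended but before any editorial was released."""
    return (
        problem.source not in EXCLUDED_SOURCES
        and harvested_at >= problem.contest_end
        and not problem.editorial_published
    )

# Usage sketch with made-up timestamps.
p = ContestProblem("codeforces", datetime(2025, 4, 20, tzinfo=timezone.utc), False)
print(is_eligible(p, datetime(2025, 4, 21, tzinfo=timezone.utc)))  # True
```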
Expert Tagging and Annotation
Each problem undergoes manual annotation by a panel of Olympiad medalists and competitive programming experts. Annotation covers both algorithmic tags (e.g., Number Theory, Dynamic Programming, Segment Tree) and cognitive focus:
- Knowledge-heavy: Application of established tools or templates.
- Logic-heavy: Systematic derivation, casework, or stepwise reasoning.
- Observation-heavy: Creative or "aha" insights beyond routine solution patterns.
A triple-blind adjudication process resolves discrepancies and maintains annotation integrity. This enables the construction of a taxonomy that supports fine-grained diagnostic analysis ° by skill area and problem type.
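As a rough illustration of this taxonomy, the sketch below models one annotated problem as a record combining algorithmic tags, a cognitive-focus label, and a difficulty rating. Class names, fields, and example values are hypothetical and not drawn from the released dataset schema.

```python
from dataclasses import dataclass, field
from enum import Enum

class CognitiveFocus(Enum):
    KNOWLEDGE_HEAVY = "knowledge-heavy"      # established tools or templates
    LOGIC_HEAVY = "logic-heavy"              # derivation, casework, stepwise reasoning
    OBSERVATION_HEAVY = "observation-heavy"  # creative "aha" insights

@dataclass
class ProblemAnnotation:
    problem_id: str
    algorithmic_tags: list[str] = field(default_factory=list)  # e.g., ["number theory"]
    cognitive_focus: CognitiveFocus = CognitiveFocus.LOGIC_HEAVY
    elo: int = 0  # Codeforces-style difficulty rating

# Hypothetical example annotation; identifiers and values are illustrative only.
example = ProblemAnnotation(
    problem_id="cf-2025-example",
    algorithmic_tags=["dynamic programming", "combinatorics"],
    cognitive_focus=CognitiveFocus.OBSERVATION_HEAVY,
    elo=2400,
)
```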
Difficulty Calibration
Problems are stratified using Codeforces-style Elo ratings:
- Easy: Elo ≤ 2000
- Medium: 2000 < Elo ≤ 3000
- Hard: Elo > 3000
This calibration enables direct skill benchmarking not just by pass rate, but relative to human competitive tiers.
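A minimal sketch of this stratification, using the thresholds listed above:

```python
def difficulty_bucket(elo: int) -> str:
    """Map a Codeforces-style Elo rating to LiveCodeBench Pro's difficulty tier."""
    if elo <= 2000:
        return "easy"
    if elo <= 3000:
        return "medium"
    return "hard"

# Quick checks against the stated thresholds.
assert difficulty_bucket(2000) == "easy"
assert difficulty_bucket(2500) == "medium"
assert difficulty_bucket(3200) == "hard"
```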
Distribution of Problem Types
Category | Tag | Cognitive Focus | % of Problems
---|---|---|---
Mathematics | Number Theory | Logic | 13%
Mathematics | Combinatorics | Logic | 11%
Dynamic Programming | — | Logic | 23%
Greedy | — | Observation | 28%
Data Structures | Segment Tree | Knowledge | 7%
Implementation | — | Knowledge | 18%
Evaluation Process and Metrics
Expert-Led Error Analysis
All model-generated submissions that fail are subject to line-by-line analysis by Olympiad medalists, who record the verdict type (logic error, implementation bug, sample input failure, etc.) and identify the root cause at a fine-grained level. This process allows the benchmark to distinguish between failures rooted in incomplete reasoning versus technical mistakes, and to profile distinct error modes for LLMs and human contestants (Zheng et al., 13 Jun 2025).
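A simplified view of how such verdicts could be tabulated into an error profile is sketched below. The verdict categories paraphrase those named above; the enum and the profiling helper are illustrative only, since the paper's annotation scheme is finer-grained and applied manually by medalists.

```python
from collections import Counter
from enum import Enum

class FailureVerdict(Enum):
    CONCEPTUAL_ERROR = "wrong algorithm / missed observation"
    IMPLEMENTATION_BUG = "correct idea, faulty code"
    SAMPLE_INPUT_FAILURE = "fails the statement's sample inputs"
    RUNTIME_OR_IO_ERROR = "runtime or I/O error"

def error_profile(verdicts: list[FailureVerdict]) -> dict[str, float]:
    """Fraction of annotated failures falling into each verdict category."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    return {verdict.value: n / total for verdict, n in counts.items()}

# Toy example: three annotated failures for one model.
print(error_profile([
    FailureVerdict.CONCEPTUAL_ERROR,
    FailureVerdict.CONCEPTUAL_ERROR,
    FailureVerdict.SAMPLE_INPUT_FAILURE,
]))
```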
Metrics
Pass@1
The primary metric reflects the proportion of problems the model solves in its first (single-shot) submission, mirroring realistic coding contest pressure.
Elo-Based Rating
To correct for skew from the problem difficulty distribution, LiveCodeBench Pro calculates a model's Elo rating analogously to Codeforces, making results directly comparable to human performance. For a submission to problem $i$ with difficulty $d_i$, outcome $y_i \in \{0, 1\}$ (incorrect/correct), and model rating $r$, the solve probability follows the Codeforces-style logistic model

$$P(y_i = 1 \mid r, d_i) = \frac{1}{1 + 10^{(d_i - r)/400}},$$

and the reported rating is the maximum a posteriori estimate

$$\hat{r} = \arg\max_r \left[ \sum_i \log P(y_i \mid r, d_i) - \frac{(r - \mu)^2}{2\sigma^2} \right],$$

where $\mu$ and $\sigma$ are prior parameters. Elo ratings allow percentile comparison with the live human leaderboard.
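A minimal sketch of this estimation, assuming the logistic solve-probability model and Gaussian prior written above; the prior parameters and grid-search fitting are illustrative choices, not the paper's exact procedure.

```python
import math

def solve_prob(rating: float, difficulty: float) -> float:
    """Codeforces-style logistic model: probability that a contestant with
    the given rating solves a problem of the given difficulty."""
    return 1.0 / (1.0 + 10.0 ** ((difficulty - rating) / 400.0))

def map_rating(outcomes, mu=1000.0, sigma=800.0, lo=0.0, hi=4000.0, steps=8000):
    """Maximum a posteriori rating via grid search.

    outcomes: list of (difficulty, solved) pairs with solved in {0, 1}.
    mu, sigma: Gaussian prior parameters (illustrative values).
    """
    best_r, best_obj = mu, -math.inf
    for i in range(steps + 1):
        r = lo + (hi - lo) * i / steps
        obj = -((r - mu) ** 2) / (2 * sigma ** 2)  # log prior (up to a constant)
        for d, y in outcomes:
            p = solve_prob(r, d)
            obj += math.log(p if y else 1.0 - p)
        if obj > best_obj:
            best_r, best_obj = r, obj
    return best_r

# Example: a model that solves easier problems but none of the hard ones.
outcomes = [(1500, 1), (1800, 1), (2200, 1), (2600, 0), (3100, 0), (3300, 0)]
print(round(map_rating(outcomes)))
```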
Pass@k is also reported: the proportion of problems solved within $k$ attempts, although performance patterns indicate that a higher $k$ does not bridge the difficulty gap on hard problems.
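The text does not specify which estimator is used for pass@k; the sketch below assumes the standard unbiased estimator popularized by the HumanEval work, with Pass@1 recovered as the k = 1 special case.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations (c of which are correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# (n generations, c correct) per problem; values are made up.
results = [(10, 3), (10, 0), (10, 10)]
print(sum(pass_at_k(n, c, 1) for n, c in results) / len(results))  # mean Pass@1
```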
Empirical Results and Analysis
LLM vs. Human Performance
Model | Hard (Pass@1) | Medium (Pass@1) | Easy (Pass@1) | Elo | Human Percentile (top %)
---|---|---|---|---|---
o4-mini-high | 0% | 53.5% | 83.1% | 2116 | 1.5%
Gemini 2.5 Pro | 0% | 25.4% | 70.4% | 1992 | 2.3%
DeepSeek R1 | 0% | 9.9% | 56.3% | 1442 | 18.0%
Claude 3.7 Sonnet MR | 0% | 1.4% | 36.6% | 992 | 56.5%
- The top LLM (o4-mini-high, without tool augmentation) achieves 0% Pass@1 on all hard problems and only 53.5% on medium, placing it in roughly the top 1.5% of human participants—still substantially below the “Grandmaster” threshold (∼2400 Elo).
- Gemini 2.5 Pro and other leading models similarly fail to solve any hard problems and do not reach champion-level status.
- Multiple attempts (higher pass@k) increase performance on easier problems but do not close the gap for hard tasks.
Error Patterns
- LLMs excel at technical implementation—demonstrating lower rates of syntax, runtime, or I/O errors than comparably rated humans.
- On “logic-heavy” and “observation-heavy” problems, LLMs make significantly more algorithmic and conceptual errors.
- Unlike human experts—whose strengths are distributed across problem types—LLMs are far more variable, showing marked drops on creative, novel, or intricate case analysis tasks.
- Models frequently fail even simple sample inputs, a failure mode rarely seen among top human competitors.
Figure: Fine-grained error breakdowns reveal that LLMs accrue more conceptual and logic errors, whereas humans are more likely to falter on implementation details.
Diagnostic Insights
- High LLM performance is associated with implementation-heavy, template-rich tasks, not with advanced reasoning or creativity.
- Models “reward hack” by defaulting to familiar templates or ignoring nuanced requirements in interactive or ambiguous problems.
- Observation-heavy tags are the most challenging for LLMs; here, Elo scores can drop well below Candidate Master thresholds.
Applications and State of the Art
LiveCodeBench Pro provides an authoritative leaderboard (www.livecodebenchpro.com), open evaluation code (GitHub), and the complete problem set (HuggingFace). Public results include task-wise breakdowns, Elo percentiles, and detailed failure modalities, forming a rich basis for benchmarking research and practical deployments in LLM-powered developer tools.
Findings demonstrate:
- No LLM tested has solved a single hard (Elo > 3000) problem without tool augmentation.
- All top-performing LLMs cluster within the top 1–2% of human competitors, yet remain well below the highest-achieving grandmasters.
- Superior performance is attributable to robust implementation, not to surpassing humans in problem-solving creativity or abstract reasoning.
Emerging Trends and Future Directions
The paper highlights several research priorities:
- Automated, Adversarial Test Generation: Development of in-house, adversarially designed test suites extends robustness beyond public judge datasets.
- Category-Weighted and Attempt-Sensitive Metrics: Applying Elo-by-category and pass@k further exposes gaps in core reasoning skills versus effective implementation.
- Decoupling Model-Intrinsic and Tool-Augmented Performance: Explicitly separating tool-augmented solutions (e.g., code execution, sample-driven debugging) from native LLM reasoning is essential for fair benchmarking.
- Addressing Overconfidence and Reward Hacking: LLMs must be engineered to recognize uncertainty and reduce the production of plausible yet incorrect solutions.
- Continuous, Adversarial Benchmark Updates: As contest organizers adapt to known LLM weaknesses, robust benchmarking demands that test sets evolve to preserve relevance and diagnostic power.
Conclusion
LiveCodeBench Pro establishes a high-standard, contamination-resistant, and expertly diagnosed benchmark for evaluating the true algorithmic reasoning capabilities of code LLMs. Despite advances in technical implementation and pass rates on certain problem classes, a substantial and clearly diagnosed gap persists between LLMs and top human grandmasters in creative, complex, and observation-heavy domains. By leveraging fine-grained annotation and live, adversarial evaluation, LiveCodeBench Pro provides actionable diagnostics and a transparent leaderboard, catalyzing ongoing research focused on narrowing this gap and refining the claims of human-level code intelligence.
Speculative Note
As tool-augmented workflows become increasingly prevalent, future research in LLM evaluation will likely emphasize the disentanglement of intrinsic reasoning skill from improvements driven by tool assistance. The evolution of benchmarks like LiveCodeBench Pro may play a pivotal role in tracking this boundary while advancing the capabilities of future code-generation models.