AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions (2508.16402v1)

Published 22 Aug 2025 in cs.SE and cs.CL

Abstract: Competitive programming has emerged as a critical benchmark for evaluating the reasoning and coding capabilities of LLMs. Despite impressive progress on existing benchmarks, we argue that current evaluations overstate model proficiency, masking a substantial gap between LLMs and elite human programmers. This gap arises from two key limitations: insufficient difficulty and scope of benchmark problems, and evaluation bias from low-quality test cases. To address these shortcomings, we present AetherCode, a new benchmark that draws problems from premier programming competitions such as IOI and ICPC, offering broader coverage and higher difficulty. AetherCode further incorporates comprehensive, expert-validated test suites built through a hybrid of automated generation and human curation, ensuring rigorous and reliable assessment. By combining challenging problem design with robust evaluation, AetherCode provides a more faithful measure of LLM capabilities and sets a new standard for future research in code reasoning.

Summary

  • The paper introduces AetherCode, a benchmark sourced from elite programming contests to rigorously evaluate LLM performance.
  • It employs a hybrid approach combining automated and expert-crafted test cases to ensure 100% TPR/TNR on over 30,000 human solutions.
  • Results reveal that even the strongest models reach at most 35.5% Pass@1 (46.6% Pass@4), exposing a significant gap compared to elite human programmers.

AetherCode: A Rigorous Benchmark for LLMs in Competitive Programming

Motivation and Limitations of Existing Benchmarks

The evaluation of LLMs' code reasoning capabilities has traditionally relied on benchmarks such as HumanEval, MBPP, and LiveCodeBench. While these datasets have driven progress, their limitations have become increasingly apparent as LLMs achieve near-saturation performance (e.g., >90% Pass@1 on HumanEval and MBPP). The paper identifies two primary deficiencies in these benchmarks:

  1. Insufficient Difficulty and Scope: Existing datasets predominantly feature problems that are either too elementary or lack the complexity and diversity found in premier programming competitions. Many benchmarks are sourced from platforms like LeetCode or CodeForces, which, due to their contest formats and problem selection, do not fully capture the breadth and depth of algorithmic challenges present in top-tier competitions such as IOI and ICPC.
  2. Evaluation Bias from Low-Quality Test Cases: The reliability of code evaluation is undermined by incomplete or naively generated test suites. Many benchmarks use a small set of handwritten or randomly mutated test cases, which fail to detect subtle errors or corner cases. Some recent efforts have attempted to leverage official judging services (e.g., CodeForces), but this introduces compliance and scalability issues.

These limitations result in an overestimation of LLM proficiency and obscure the substantial gap between current models and elite human programmers.

AetherCode Benchmark Design

Problem Sourcing and Curation

AetherCode systematically curates problems from the most prestigious programming competitions worldwide, including the Olympiad in Informatics (OI) series (e.g., IOI, NOI, USACO) and the International Collegiate Programming Contest (ICPC) series. The curation process involves:

  • Manual Conversion and Proofreading: Problem statements are converted from PDF to Markdown+LaTeX and manually proofread for accuracy.
  • Comprehensive Metadata: Each problem is annotated with difficulty (Easy, Medium, Hard, Extreme), contest year, competition type, scope, and algorithmic/domain tags.
  • Exclusion of Non-Standard Formats: Problems requiring visual input or special judges are either excluded or explicitly labeled.

This approach ensures that AetherCode covers a wide spectrum of algorithmic domains and problem formats, reflecting the true diversity and rigor of competitive programming.
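
As a concrete illustration of the curation output described above, the sketch below shows what an annotated problem record might look like. The field names and example values are hypothetical, chosen only to mirror the metadata listed above; they are not AetherCode's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProblemRecord:
    """Hypothetical problem record mirroring the metadata described above."""
    problem_id: str
    statement: str               # Markdown+LaTeX statement, manually proofread
    difficulty: str              # "Easy" | "Medium" | "Hard" | "Extreme"
    contest_year: int
    competition: str             # e.g., "IOI", "NOI", "USACO", "ICPC"
    scope: str                   # e.g., international / national / regional
    tags: List[str] = field(default_factory=list)   # algorithmic/domain tags
    special_judge: bool = False  # problems needing one are labeled or excluded

example = ProblemRecord(
    problem_id="icpc-2023-example",
    statement=r"Given an array $a_1,\dots,a_n$ ...",
    difficulty="Hard",
    contest_year=2023,
    competition="ICPC",
    scope="international",
    tags=["graphs", "dynamic programming"],
)
```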

High-Quality Test Case Construction

AetherCode introduces a hybrid methodology for test case generation:

  • Automated Generation: Utilizes the Generator-Validator Agent System to produce initial test cases, with manual verification of the validator to ensure adherence to problem constraints.
  • Expert Annotation: A team of 67 competitive programming experts, including International Grandmasters, constructs targeted test cases to "hack" incorrect solutions. For problems with few incorrect submissions, a specialized review team of ICPC gold medalists conducts manual audits to further enhance robustness.
  • Custom Checkers: For problems with multiple valid outputs, custom judging scripts (special judges) are provided and reviewed; a minimal checker sketch follows this list.

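To make the custom-checker idea concrete, here is a minimal special-judge sketch for a hypothetical task that accepts any derangement of 1..n, so many outputs are valid and plain diffing against one reference answer would be wrong. The file arguments and exit-code convention are assumptions for this example, not AetherCode's actual checker interface.

```python
import sys

def check(input_path: str, output_path: str) -> bool:
    """Accept any valid derangement of 1..n (hypothetical example task)."""
    n = int(open(input_path).read().split()[0])
    tokens = open(output_path).read().split()
    if len(tokens) != n:
        return False
    try:
        perm = [int(t) for t in tokens]
    except ValueError:
        return False
    if sorted(perm) != list(range(1, n + 1)):
        return False                                  # not a permutation of 1..n
    return all(perm[i] != i + 1 for i in range(n))    # no fixed points allowed

if __name__ == "__main__":
    # Assumed convention: exit 0 = accepted, exit 1 = wrong answer.
    sys.exit(0 if check(sys.argv[1], sys.argv[2]) else 1)
```
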
Test suite quality is assessed directly against a corpus of over 30,000 human solutions (both correct and incorrect). The test suites achieve 100% True Positive Rate (TPR) and 100% True Negative Rate (TNR) on this corpus, meaning every correct solution is accepted and every incorrect solution is rejected, which sets a new standard for benchmark reliability.
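
Treating correct human solutions as positives and incorrect ones as negatives, the check is straightforward, as in the sketch below; run_suite is a hypothetical helper that executes one solution against the full test suite under the usual time and memory limits and reports whether it passes.

```python
from typing import Callable, Iterable, Tuple

def suite_quality(
    run_suite: Callable[[str], bool],   # hypothetical: True iff the solution passes all tests
    correct_solutions: Iterable[str],
    incorrect_solutions: Iterable[str],
) -> Tuple[float, float]:
    correct = list(correct_solutions)
    incorrect = list(incorrect_solutions)
    tpr = sum(run_suite(s) for s in correct) / len(correct)          # correct solutions accepted
    tnr = sum(not run_suite(s) for s in incorrect) / len(incorrect)  # incorrect solutions rejected
    return tpr, tnr

# AetherCode's bar: both rates equal 1.0 on the 30,000-solution corpus.
# Any incorrect solution that still passes flags a missing "hacking" test case.
```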

Evaluation of LLMs on AetherCode

Experimental Setup

AetherCode evaluates both reasoning and non-reasoning LLMs, including o4-mini-high, Gemini-2.5-Pro/Flash, Seed-1.6-Thinking, DeepSeek-R1, Qwen3, GPT-4.1, GPT-4o, Kimi-K2, DeepSeek-V3, and Qwen3-Coder. Each model is tested with up to four sampling attempts per problem, and results are averaged.
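
The reported Pass@1 and Pass@4 figures are consistent with the standard unbiased Pass@k estimator computed from n sampled attempts per problem; the sketch below assumes that estimator, though the paper's exact aggregation may differ.

```python
from math import comb
from typing import List, Tuple

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k for one problem with n attempts, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average Pass@k over problems; results holds (n_attempts, n_correct) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# With n = 4 attempts per problem, Pass@1 reduces to the mean fraction of
# correct attempts, and Pass@4 is simply whether any attempt solved the problem.
```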

Key Findings

  • Substantial Performance Gap: Even the best models (o4-mini-high, Gemini-2.5-Pro) achieve only 32.7–35.5% Pass@1 (roughly 46% Pass@4) on AetherCode, with performance dropping sharply on Hard and Extreme problems; these are the only two models that solve any "Extreme" problems.
  • Reasoning Models Outperform Non-Reasoning Models: Reasoning models consistently surpass non-reasoning models across all difficulty levels and algorithmic domains. Non-reasoning models show limited improvement even with increased sampling (Pass@4).
  • Exploration Potential: Top models benefit more from increased sampling, indicating greater solution diversity and exploration capability.
  • Algorithmic Category Breakdown: All models perform best on "Basic Algorithms" and "Strings", but struggle with "Computational Geometry", "Tree Structures", and advanced "Dynamic Programming". The performance of non-reasoning models is particularly poor in domains requiring deep logical reasoning.

Quantitative Highlights

  • o4-mini-high: Pass@1 = 35.5%, Pass@4 = 46.6%
  • Gemini-2.5-Pro: Pass@1 = 32.7%, Pass@4 = 46.0%
  • Non-reasoning models: Pass@1 < 11% across all categories
  • Category-specific: o4-mini-high achieves 38.1% on "Basic Algorithms" but only 7.3% on "Tree Structures"

Implications and Future Directions

Practical Implications

AetherCode exposes the persistent gap between LLMs and top human programmers in competitive programming. The benchmark's rigor and diversity make it a more faithful measure of code reasoning and synthesis capabilities, and its open-source, self-contained nature facilitates reproducible and scalable evaluation. The 100% TPR/TNR test suites eliminate evaluation artifacts, ensuring that reported model performance reflects true problem-solving ability.

Theoretical Implications

The results indicate that current LLMs, even with advanced reasoning architectures, are far from mastering the algorithmic abstraction, compositionality, and error-avoidance required for elite-level programming. The sharp drop in performance on complex domains and higher difficulty levels suggests that further advances in model architecture, training data, and reasoning strategies are necessary.

Future Research Directions

  • Model Architecture: Investigate architectures that better capture algorithmic reasoning, recursion, and mathematical abstraction.
  • Training Regimes: Incorporate curriculum learning, self-play, and reinforcement learning from feedback on hard, diverse problems.
  • Evaluation Methodology: Extend AetherCode with new problem types (e.g., interactive, multi-stage, or visual problems) and longitudinal tracking of model progress.
  • Human-AI Collaboration: Explore hybrid systems where LLMs assist human programmers or vice versa, leveraging complementary strengths.

Conclusion

AetherCode establishes a new standard for evaluating LLMs in competitive programming by combining high-difficulty, diverse problems from premier contests with rigorously validated test suites. The benchmark reveals that, despite recent progress, state-of-the-art LLMs remain significantly behind top human performers, especially on complex algorithmic tasks. AetherCode will serve as a critical resource for driving future advances in code reasoning, model evaluation, and the development of more capable AI systems for programming.
