CodeElo: Competitive Code Generation Benchmark
- CodeElo is a standardized code generation benchmark that evaluates large language models on competitive programming problems using human-comparable Elo ratings.
- It integrates directly with the CodeForces API, enforcing strict, zero-false-positive judging protocols to ensure reliable and authoritative solution assessments.
- The system adapts an Elo rating protocol for contest rankings, enabling direct percentile-based comparisons between models and human competitors.
CodeElo is a standardized competition-level code generation benchmark and rating system designed to evaluate LLMs on real-world programming contest problems with human-comparable Elo ratings. Grounded in problems and judging protocols from the CodeForces competitive programming platform, CodeElo provides rigorous, zero-false-positive assessment of LLM reasoning, solution correctness, and algorithmic generalization (Quan et al., 2 Jan 2025). The benchmark is accompanied by an Elo calculation protocol directly comparable to that used for human participants, making it the first system to enable direct percentile-based ranking of models and humans within the same contest environment.
1. Dataset Construction and Properties
The CodeElo dataset consists of all rated CodeForces contests held between May 4, 2024 and November 4, 2024, totaling 54 contests and 398 problems. Only contest divisions accessible to most models (Div. 1+2, Div. 2, Div. 3, and Div. 4) are retained; pure Div. 1 contests are excluded because models gain no traction even on their easiest problems. Each problem inherits a human-derived difficulty rating (ranging from 800 to 3500) and algorithmic tags drawn from the official CodeForces taxonomy of roughly 35 tags, with the top 16 tags accounting for approximately 90% of instances. Problems are bucketed into easy ([800, 1000)), medium ([1000, 1300)), and hard ([1300, 3500]) classes. On average, each problem carries 3.9 tags, such as greedy, dynamic programming (DP), graphs, implementation, sorting, and trees.
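As a concrete illustration, the following minimal sketch (not the official CodeElo tooling; the record layout and helper names are assumptions) shows how problems can be bucketed by the stated rating ranges and how tag frequencies can be tallied:

```python
# Minimal sketch (not the official CodeElo tooling): the record layout and
# helper names below are illustrative assumptions.
from collections import Counter

def difficulty_bucket(rating: int) -> str:
    """Map a CodeForces problem rating to CodeElo's difficulty class."""
    if 800 <= rating < 1000:
        return "easy"
    if 1000 <= rating < 1300:
        return "medium"
    if 1300 <= rating <= 3500:
        return "hard"
    raise ValueError(f"rating {rating} outside the 800-3500 range")

def tag_distribution(problems: list[dict]) -> Counter:
    """Tally tag frequencies over problem records of the form
    {'rating': int, 'tags': list[str]}."""
    counts: Counter = Counter()
    for problem in problems:
        counts.update(problem["tags"])
    return counts

# Toy usage
problems = [{"rating": 1200, "tags": ["greedy", "dp"]},
            {"rating": 1700, "tags": ["dp", "trees"]}]
print(difficulty_bucket(1200))     # -> "medium"
print(tag_distribution(problems))  # -> Counter({'dp': 2, 'greedy': 1, 'trees': 1})
```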
By division, the dataset is summarized in the following table:
| Division | Contests | Avg. Problems per Contest | Avg. Problem Rating | Participant Rating Ceiling |
|---|---|---|---|---|
| Div. 1+2 | 8 | 9.1 | 2106 | none (open to all) |
| Div. 2 | 33 | 6.5 | 1779 | ≤ 2100 |
| Div. 3 | 10 | 7.5 | 1436 | ≤ 1600 |
| Div. 4 | 3 | 8.3 | 1276 | ≤ 1400 |
Solutions are evaluated primarily in C++, with Python also supported, permitting fine-grained cross-language comparison and alignment with prevailing human contest usage.
2. Judging Pipeline and Execution Environment
CodeElo integrates directly with the CodeForces platform, automatically submitting generated solutions via a bot; a problem is marked as solved only if the submission receives an "Accepted" verdict after passing all hidden test cases, and partial or visible-test-only successes do not count, strictly matching CodeForces judging discipline. This architecture ensures zero false positives and authoritative adjudication, which is especially important for problems involving special judges (≈30% of sampled problems) and interactive protocols. Submissions inherit the official platform environment: standardized compiler versions, exact time/memory constraints, and the same adversarial test suites faced by human contestants. This eliminates discrepancies arising from local simulation or mismatched system resources.
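A hedged sketch of this submit-and-verify loop is given below. Submission itself is assumed to go through a custom bot account (`submit_via_bot` is a hypothetical placeholder, not an official endpoint), while verdicts are polled from the real public CodeForces `user.status` API method; only a full "OK" (Accepted) verdict counts as solved.

```python
# Hedged sketch of the submit-and-verify loop. `submit_via_bot` is a
# hypothetical placeholder (CodeForces has no official submission endpoint;
# the benchmark submits through a bot account). Verdicts are read back from
# the public API method `user.status`.
import time
import requests

API = "https://codeforces.com/api"

def submit_via_bot(contest_id: int, problem_index: str, source: str, lang: str) -> int:
    """Hypothetical: submit `source` through a logged-in bot account and
    return the platform-assigned submission id."""
    raise NotImplementedError("depends on the bot / browser-automation setup")

def wait_for_verdict(handle: str, submission_id: int, poll_seconds: float = 5.0) -> str:
    """Poll user.status until the submission has a final verdict."""
    while True:
        resp = requests.get(f"{API}/user.status",
                            params={"handle": handle, "from": 1, "count": 50})
        resp.raise_for_status()
        for sub in resp.json()["result"]:
            if sub["id"] == submission_id and sub.get("verdict", "TESTING") != "TESTING":
                return sub["verdict"]      # "OK" == Accepted on all hidden tests
        time.sleep(poll_seconds)

def is_solved(handle: str, submission_id: int) -> bool:
    # Only a full "OK" verdict counts; partial passes are failures.
    return wait_for_verdict(handle, submission_id) == "OK"
```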
3. Elo Rating System for Code Generation
The CodeElo system adapts the Elo rating protocol to competitive programming for LLMs, ensuring direct comparability to human participants. The expected score of model $A$ with rating $r_A$ against a human $B$ with rating $r_B$ is:

$$E_A = \frac{1}{1 + 10^{(r_B - r_A)/400}}$$
Rather than aggregating over individual pairwise matches, CodeElo computes the model's rating per contest. If the model ranks $k$-th among $n$ human competitors with known ratings $r_1, \dots, r_n$, the model's rating $m$ is the value at which its expected rank equals its achieved rank:

$$k = 1 + \sum_{i=1}^{n} \frac{1}{1 + 10^{(m - r_i)/400}}$$
This equation is solved via binary search for $m$, since the expected rank is monotonically decreasing in the rating. Ratings are computed independently for each contest and then averaged, reducing the estimator's variance to $1/54$ of its single-contest value when averaged over the 54 contests. The rating scale is precisely aligned with published CodeForces percentiles: for example, a rating of 1578 corresponds to roughly the top 10% of human contestants (the 89.2nd percentile), with about 1603 marking the 90th-percentile cutoff.
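A minimal sketch of this per-contest computation (function names and search bounds are illustrative choices, not the reference implementation) is:

```python
# Minimal sketch of the per-contest rating estimate described above.
# Function names and search bounds are illustrative, not the reference code.
from typing import Sequence

def expected_rank(m: float, human_ratings: Sequence[float]) -> float:
    """Expected rank of a participant rated m against the listed humans."""
    return 1.0 + sum(1.0 / (1.0 + 10 ** ((m - r) / 400.0)) for r in human_ratings)

def contest_rating(achieved_rank: int, human_ratings: Sequence[float],
                   lo: float = -2000.0, hi: float = 6000.0, iters: int = 60) -> float:
    """Binary search for m: expected_rank(m) is monotonically decreasing in m."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if expected_rank(mid, human_ratings) > achieved_rank:
            lo = mid   # achieved rank is better than expected at mid -> rating above mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def codeelo_rating(contests: Sequence[tuple[int, Sequence[float]]]) -> float:
    """Average the independent per-contest estimates (rank, human ratings)."""
    estimates = [contest_rating(rank, ratings) for rank, ratings in contests]
    return sum(estimates) / len(estimates)
```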
4. Leaderboard Outcomes and Model Competence
In the inaugural evaluation of 30 open-source and 3 proprietary models, performance differences are stark. The two standout models, OpenAI's proprietary o1-mini and the open-source QwQ-32B-Preview (Qwen), achieve ratings of 1578 (89.2nd percentile) and 1261 (63.6th percentile), respectively, with the former matching the skill of experienced human contestants. The remaining open-source models largely fall below the 25th human percentile, with almost all scoring under 700. The detailed percentile mapping is as follows:
| Model | Elo | Percentile among humans |
|---|---|---|
| o1-mini | 1578 | 89.2 |
| QwQ-32B-Preview | 1261 | 63.6 |
| Claude-3-5-Sonnet | 710 | 24.1 |
| ChatGPT-4o | 668 | 22.2 |
| Other open-source models | <700 | <20 |
Typical open-source LLMs cannot reliably solve even the easiest contest problems and remain uncompetitive with mid-tier human programmers.
5. Algorithmic Analysis and Model Behavior Insights
Tag-specific analysis reveals that LLM success is unevenly distributed across algorithmic domains. Top models attain pass@1 rates exceeding 20% on math, implementation, and sorting problems; dynamic programming (DP) and tree problems, however, pose substantial challenges (often <5% pass rates), even for best-in-class models. Submitting in C++ confers a 200–300 Elo advantage over Python across all models, owing to its runtime efficiency under tight time limits and its closer alignment with human contest practice; yet when the prompt does not specify a language, more than 95% of model generations default to Python, whereas humans use C++ in approximately 80% of submissions.
Model scale robustly predicts better performance: pass@n metrics (for n = 1, 2, 4, 8) increase with model size and sample count, but even with 8 samples, lower-tier models rarely exceed trivial success rates. Error analysis indicates frequent "wrong answer" outcomes on adversarial edge cases, especially for DP and tree problems, and time-limit-exceeded verdicts are common for Python implementations under the platform's strict constraints.
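To make the pass@n metric concrete, the following is a hedged illustration using the standard unbiased pass@k estimator (in the style of Chen et al., 2021); CodeElo's exact evaluation scripts are not reproduced here.

```python
# Hedged illustration: the standard unbiased pass@k estimator, shown only to
# make the pass@n metric concrete (not CodeElo's exact evaluation code).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k drawn samples is correct), given that
    c of the n generated samples received an Accepted verdict."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 8 samples generated, 2 accepted
print(round(pass_at_k(n=8, c=2, k=1), 3))  # 0.25
print(round(pass_at_k(n=8, c=2, k=4), 3))  # 0.786
```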
6. Extensions: Consensus, Semantic Triangulation, and Specialized Training
Beyond baseline evaluation, CodeElo serves as a testbed for advanced selection and training protocols. Major developments include:
- Semantic Triangulation: Applied to "inexact" CodeElo problems (those admitting multiple non-equivalent correct solutions), semantic triangulation increases reliable code-selection accuracy by 76% over high-confidence majority voting, with abstention-aware consensus metrics improved across the board. The method, implemented in the "just-tri-it" tool, employs non-semantics-preserving prompt transformations and program-pair bijective correspondence to decouple common correlated error modes and mechanically surface correct low-probability completions (Dai et al., 15 Nov 2025). A minimal sketch of the majority-voting baseline it is compared against appears after this list.
- Test-Time Curriculum Reinforcement Learning (TTC-RL): TTC-RL, when applied to CodeElo, doubles pass@1 performance of the Qwen3-8B LLM and raises pass@8 from 28% to 43%. The protocol involves an MDP formulation over data-selection actions, reward-driven on-policy updates, and curriculum construction using information-theoretic sampling (SIFT), all tuned for the algorithmic distribution and constraints of CodeElo problems (Hübotter et al., 6 Oct 2025).
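For context only, here is a hedged sketch of an output-agreement majority-voting selector with abstention, i.e., the kind of baseline that semantic triangulation is reported to improve upon; it is not the triangulation method itself, and `run_program` is a hypothetical execution hook.

```python
# For context only: output-agreement majority voting with abstention, the kind
# of baseline semantic triangulation is compared against. NOT the triangulation
# method itself; `run_program` is a hypothetical execution hook.
from collections import Counter
from typing import Callable, Optional, Sequence

def majority_vote_select(candidates: Sequence[str],
                         test_inputs: Sequence[str],
                         run_program: Callable[[str, str], str],
                         min_agreement: float = 0.5) -> Optional[str]:
    """Group candidate programs by their outputs on shared inputs; return one
    member of the largest group, abstaining if agreement is too low."""
    signatures = {src: tuple(run_program(src, x) for x in test_inputs)
                  for src in candidates}
    best_sig, best_count = Counter(signatures.values()).most_common(1)[0]
    if best_count / len(candidates) < min_agreement:
        return None  # abstain: no sufficiently confident consensus
    return next(src for src, sig in signatures.items() if sig == best_sig)
```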
7. Implications, Recommendations, and Directions for Future Development
CodeElo establishes for the first time a human-referenced, zero-false-positive, platform-aligned benchmarking protocol for code generation models, providing a comprehensive quantitative and qualitative account of current LLM abilities and limitations. Its findings highlight the promise of recent high-performing proprietary reasoning models but also underscore the substantial accuracy gap relative to broader human proficiency—most models underperform even on basic competitive problems.
Key recommendations arising from CodeElo include:
- Design algorithm-centric and chain-of-thought focused curricula, particularly for difficult domains such as DP and Trees
- Prefer C++ output in competitive settings to maximize runtime efficiency and alignment with the evaluation environment
- Continue monthly ingestion of new contest data to track progress and further suppress variance
- Expand to multi-language and adaptive problem selection for broader skill assessment and fair comparison
Planned extensions involve the release of open-source scaffolds for reproducible submission and judgment, as well as techniques for incorporating domain-specific and hybrid consensus methodologies. CodeElo thus provides a foundational infrastructure for rigorous, ongoing measurement and improvement of LLM code reasoning at competitive human levels (Quan et al., 2 Jan 2025, Dai et al., 15 Nov 2025, Hübotter et al., 6 Oct 2025).