ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests
(2506.04894v1)
Published 5 Jun 2025 in cs.CL
Abstract: With the significant progress of large reasoning models in complex coding and reasoning tasks, existing benchmarks, like LiveCodeBench and CodeElo, are insufficient to evaluate the coding capabilities of LLMs in real competition environments. Moreover, current evaluation metrics such as Pass@K fail to capture the reflective abilities of reasoning models. To address these challenges, we propose ICPC-Eval, a top-level competitive coding benchmark designed to probe the frontiers of LLM reasoning. ICPC-Eval includes 118 carefully curated problems from 11 recent ICPC contests held in various regions of the world, offering three key contributions: 1) A challenging realistic ICPC competition scenario, featuring a problem type and difficulty distribution consistent with actual contests. 2) A robust test case generation method and a corresponding local evaluation toolkit, enabling efficient and accurate local evaluation. 3) An effective test-time scaling evaluation metric, Refine@K, which allows iterative repair of solutions based on execution feedback. The results underscore the significant challenge of evaluating complex reasoning abilities: top-tier reasoning models like DeepSeek-R1 often rely on multi-turn code feedback to fully unlock their in-context reasoning potential when compared to non-reasoning counterparts. Furthermore, despite recent advancements in code generation, these models still lag behind top-performing human teams. We release the benchmark at: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs
ICPC-Eval (Xu et al., 5 Jun 2025) introduces a new benchmark designed to push the boundaries of evaluating LLMs on complex reasoning tasks, specifically within the domain of competitive programming. The paper highlights the limitations of existing benchmarks like LiveCodeBench (Jain et al., 12 Mar 2024) and CodeElo (Quan et al., 2 Jan 2025), which lack either the difficulty or the realistic evaluation methodology needed to discriminate among current state-of-the-art LLMs. Problems sourced from active coding platforms are increasingly solvable by powerful models, diminishing their discriminative power. Furthermore, existing metrics like Pass@K often fail to capture the iterative refinement process that is inherent in human problem-solving and common in multi-turn interactions with models.
To address these issues, ICPC-Eval curates 118 challenging problems from 11 recent International Collegiate Programming Contest (ICPC) contests, including World Finals, Continental Finals, and Regional contests. The selection emphasizes recency (to minimize the risk of data contamination), representativeness of real contest difficulty and problem types, and suitability for local evaluation. Problems that depend on essential non-textual images, require interaction, or lack standard solutions are filtered out. The remaining problems, including 12 that required the development of custom Special Judges (SPJs) for outputs such as floating-point values or multiple valid answers, are standardized into a unified LaTeX format. This curated set establishes a significantly higher difficulty baseline than previous benchmarks, as shown by the performance gap between top models on ICPC-Eval and on other benchmarks. The problems cover a wide range of advanced algorithmic domains, with many involving areas like Dynamic Programming, Mathematics, and Search Algorithms, posing substantial challenges even to advanced models.
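The paper does not reproduce its Special Judges, but a minimal sketch illustrates what an SPJ for floating-point answers has to do: compare the contestant's output to the reference output under a numeric tolerance rather than byte-for-byte. The file-based interface, function names, and 1e-6 tolerance below are illustrative assumptions, not the benchmark's actual checker.

```python
# Illustrative special-judge (SPJ) sketch, not the benchmark's actual checker.
# Accepts an answer if every token matches the reference, treating numeric tokens
# as equal when |a - b| <= EPS * max(1, |b|).
import sys

EPS = 1e-6  # assumed tolerance; real SPJs set this per problem


def tokens(path: str) -> list[str]:
    with open(path) as f:
        return f.read().split()


def numeric_close(a: str, b: str) -> bool:
    try:
        x, y = float(a), float(b)
    except ValueError:
        return False
    return abs(x - y) <= EPS * max(1.0, abs(y))


def check(contestant_file: str, reference_file: str) -> bool:
    out, ref = tokens(contestant_file), tokens(reference_file)
    if len(out) != len(ref):
        return False
    return all(a == b or numeric_close(a, b) for a, b in zip(out, ref))


if __name__ == "__main__":
    # usage: python spj.py contestant_output.txt reference_output.txt
    ok = check(sys.argv[1], sys.argv[2])
    print("Accepted" if ok else "Wrong Answer")
    sys.exit(0 if ok else 1)
```

Problems with multiple valid answers need a problem-specific variant of the same idea (re-verifying the answer's constraints instead of comparing to a single reference), which is why those 12 problems required hand-written judges.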
A key practical contribution of ICPC-Eval is its robust test case generation and local evaluation toolkit. Recognizing the difficulty of accessing private test cases from online judges, the benchmark uses an LLM (specifically, Gemini 2.5 Pro) to synthesize C++ input-data generators for each problem. Two types of generators are produced: G_rand, which samples random inputs uniformly from the ranges defined in the problem statement, and G_corner, which produces challenging inputs focused on edge cases and specially structured instances derived from the statement. To generate the corresponding outputs, the pipeline runs known Accepted solutions collected from platforms like QOJ. The crucial step is the rigorous validation of these synthesized test cases: known incorrect programs (submissions that failed on online judges with Wrong Answer, Time Limit Exceeded, etc.) are executed on the generated cases to check that the suite exposes their errors. If the generated test cases fail to differentiate correct and incorrect programs as expected, the LLM is prompted to regenerate the input generators. This yields high confidence in the test suite's ability to identify errors and enables accurate, efficient local evaluation without relying on external online judges.
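A rough sketch of that validation step might look like the following; the binary paths, two-second timeout, and the regeneration decision being returned as a boolean are assumptions for illustration, not the paper's actual toolkit.

```python
# Sketch of validating synthesized test cases against known-incorrect submissions.
# Inputs come from LLM-written generators; expected outputs come from an Accepted
# reference solution. If any known-incorrect program passes every case, the suite
# is too weak and the generators should be regenerated.
import subprocess


def run(binary: str, stdin_text: str, timeout_s: float = 2.0):
    """Run a compiled program on the given input; return (status, stdout)."""
    try:
        proc = subprocess.run(
            [binary], input=stdin_text, capture_output=True,
            text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "TLE", ""
    if proc.returncode != 0:
        return "RE", ""
    return "OK", proc.stdout


def validate(inputs: list[str], accepted: str, wrong_programs: list[str]) -> bool:
    """Return True if every known-incorrect program fails on at least one case."""
    cases = []
    for stdin_text in inputs:
        status, expected = run(accepted, stdin_text)
        assert status == "OK", "the Accepted reference must pass its own cases"
        cases.append((stdin_text, expected))

    for wrong in wrong_programs:
        exposed = False
        for stdin_text, expected in cases:
            status, out = run(wrong, stdin_text)
            if status != "OK" or out.split() != expected.split():
                exposed = True  # this case catches the known bug
                break
        if not exposed:
            return False  # suite too weak: prompt the LLM to regenerate generators
    return True
```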
The paper also introduces a novel evaluation metric called Refine@K, which better reflects realistic problem-solving scenarios in which models iteratively refine their solutions based on execution feedback. Unlike Pass@K, which estimates the probability that at least one of K independently sampled solutions is correct, Refine@K measures the ability to pass the tests within a budget of K interaction turns. In the first turn, the model receives only the problem statement. If the generated code fails, the model receives feedback that depends on the failure mode: for compilation errors, the compiler message is provided; for failures on the example tests, the mismatched outputs are shown; for failures on the hidden tests, only the verdict type (Wrong Answer, Time Limit Exceeded, etc.) is given, mimicking real competition feedback. The model is then prompted to revise its code based on this feedback, and the process repeats for up to K turns. The Refine@K score is the percentage of problems for which the model produces a correct solution within the K-turn budget.
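In code, the per-problem loop behind Refine@K can be sketched as below; `generate_code`, `judge`, and the exact feedback strings are hypothetical placeholders standing in for the model call and the local evaluation toolkit, not the paper's implementation.

```python
# Sketch of a Refine@K evaluation loop for a single problem.
# `generate_code` wraps the LLM call on the running conversation; `judge` wraps
# the local toolkit (compile, run example tests, run hidden tests). Feedback
# mirrors the setup described above: compiler messages, example-test mismatches,
# or only the verdict type for hidden tests.
def refine_at_k(problem: dict, k: int, generate_code, judge) -> bool:
    history = [{"role": "user", "content": problem["statement"]}]
    for _turn in range(k):
        code = generate_code(history)          # model proposes a candidate solution
        verdict = judge(problem, code)         # compile + run example and hidden tests
        if verdict["status"] == "Accepted":
            return True                        # solved within the K-turn budget

        if verdict["status"] == "Compile Error":
            feedback = f"Compilation failed:\n{verdict['message']}"
        elif verdict["status"] == "Example Failed":
            feedback = f"Wrong output on the example tests:\n{verdict['diff']}"
        else:
            # Hidden tests reveal only the verdict type, as in a real contest.
            feedback = f"Verdict on hidden tests: {verdict['status']}"

        history.append({"role": "assistant", "content": code})
        history.append({"role": "user", "content": feedback + "\nPlease fix your code."})
    return False


# Refine@K over the benchmark is then the fraction of problems solved within K turns:
# score = sum(refine_at_k(p, K, gen, judge) for p in problems) / len(problems)
```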
The experimental results on 15 state-of-the-art LLMs (spanning reasoning, hybrid, and non-reasoning models) demonstrate the difficulty of ICPC-Eval. The top-performing model, o3-mini High, achieved a Refine@5 score of 28.8%, far below the performance of human medalists and well below these models' scores on easier benchmarks such as LiveCodeBench (e.g., 67.4% for o3-mini High). This confirms the challenging nature and strong discriminative power of ICPC-Eval. The evaluation also revealed that reasoning models generally benefit more from the iterative feedback loop captured by Refine@K than non-reasoning models do: non-reasoning models often showed minimal or even negative improvement across refinement turns relative to simple sampling (Pass@K), suggesting they lack the reflective capabilities needed to use feedback effectively. The paper also found a correlation between a model's Refine@5 score and its average output length, indicating that models generating longer, potentially more detailed thought processes tend to perform better.
In summary, ICPC-Eval provides a challenging, realistic, and locally evaluable benchmark for probing the advanced reasoning capabilities of LLMs in competitive programming. Its novel LLM-based test case generation pipeline enables practical offline assessment with robust test suites. The Refine@K metric offers a more suitable evaluation approach for reasoning models by simulating iterative debugging based on execution feedback, aligning better with real-world interactive usage. While current LLMs still lag significantly behind human experts on this benchmark, ICPC-Eval provides a valuable tool for tracking progress in complex algorithmic reasoning and code generation. The authors plan to expand the dataset in the future.