Quantify the impact of tool access on o4-mini’s Codeforces Elo gap
Determine whether enabling terminal access and tool calls for the OpenAI o4-mini model accounts for most of the roughly 400-point Elo gap on Codeforces between its pass@10 rating without tools (which converges around 2334) and its reported pass@k rating with tools (~2719). Quantify how much of that gap is due to tool access as opposed to allowing multiple attempts alone.
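The arithmetic behind this question can be sketched in a few lines of Python. This is only an illustration, not the paper's method. The function names are hypothetical, the expected-score conversion assumes the standard logistic Elo formula with a 400-point scale, and the additive split in `tool_share` assumes the three ratings were measured under comparable conditions (same problem set, same k), which the source does not confirm. The single-attempt baseline `r_single` is a parameter because the text above does not state it.

```python
# Ratings quoted in the problem statement above.
RATING_NO_TOOLS_PASS10 = 2334   # pass@10 without tools (approximate convergence point)
RATING_WITH_TOOLS = 2719        # reported pass@k rating with tools (approximate)


def rating_gap(r_with_tools: float, r_without_tools: float) -> float:
    """Raw rating difference still unexplained after multiple attempts."""
    return r_with_tools - r_without_tools


def expected_score(delta: float) -> float:
    """Expected score of the higher-rated side under the standard Elo
    logistic model (assumed here; 400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** (-delta / 400.0))


def tool_share(r_single: float, r_multi: float, r_tools: float) -> float:
    """Fraction of the total gain over a single-attempt baseline that is NOT
    explained by multiple attempts, under a naive additive decomposition.

    r_single is a hypothetical input: the tool-free single-attempt rating.
    """
    total_gain = r_tools - r_single
    if total_gain <= 0:
        raise ValueError("r_tools must exceed r_single for a meaningful share")
    return (r_tools - r_multi) / total_gain


if __name__ == "__main__":
    gap = rating_gap(RATING_WITH_TOOLS, RATING_NO_TOOLS_PASS10)
    print(f"Residual gap: {gap} rating points")
    print(f"Expected score at that gap: {expected_score(gap):.3f}")
```

With the quoted figures the residual gap is 385 points, which is the "approximately 400" in the statement. Attributing that residual to tools, as `tool_share` does, is itself the conjecture under test: an actual answer would need a controlled run that varies tool access while holding the number of attempts fixed.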
References
While these gains from multiple attempts are significant, the converged rating still falls approximately 400 points short of the reported 2719. We, therefore, conjecture that the remaining difference is largely attributable to the benefits of tool calls and terminal access.
— LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?
(2506.11928 - Zheng et al., 13 Jun 2025) in Section 3.3 (Impact of Multiple Attempts (Pass@k) on Model Performance)