Quantify the impact of tool access on o4-mini’s Codeforces Elo gap

Determine whether enabling terminal access and tool calls for OpenAI's o4-mini model primarily accounts for the roughly 400-point Elo gap on Codeforces between its pass@10 rating without tools (converging around 2334) and OpenAI's reported rating with tools (~2719), and quantify how much of that gap is due to tool access versus simply allowing multiple attempts.

Background

The paper evaluates OpenAI’s o4-mini variants and observes substantial rating gains as k increases, yet notes that even pass@10 without tools converges to ~2334, still ~400 Elo below OpenAI’s reported ~2719 score achieved with terminal access and tool calls. The authors attribute part of the improvement to multiple attempts but suggest that tool augmentation likely explains the remaining gap.
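
For reference, pass@k is conventionally computed per problem with the unbiased estimator of Chen et al. (2021); a minimal sketch is below. How LiveCodeBench Pro aggregates per-problem pass@k values into a Codeforces-style Elo rating is a separate step not shown here, so this is an illustration of the metric rather than the paper's exact evaluation code.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that
    at least one of k attempts drawn from n total samples, of which c are
    correct, solves the problem."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 10 samples, 3 of them correct
print(pass_at_k(n=10, c=3, k=10))  # 1.0 -- with k = n, any correct sample counts
print(pass_at_k(n=10, c=3, k=1))   # 0.3 -- equals the raw per-sample accuracy
```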

They highlight that their evaluation setup excluded terminal access and tool calls for consistency and cost reasons, and posit that tool-enabled capabilities (local compilation, sample checking, brute-force stress testing, heuristic exploration) can materially boost performance in competitive programming settings.
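
As an illustration of what terminal access enables, the sketch below shows a minimal brute-force stress-testing loop of the kind competitive programmers (and tool-augmented models) commonly run locally. The paths `candidate` and `brute` and the generator `gen` are hypothetical placeholders; this is not part of the paper's evaluation harness, which excluded tools.

```python
import random
import subprocess

def stress_test(candidate: str, brute: str, gen, trials: int = 200) -> None:
    """Run a fast candidate solution against a slow brute-force reference on
    random small inputs and report the first mismatch. `candidate` and `brute`
    are paths to compiled solution binaries; `gen` maps a random.Random
    instance to an input string (all hypothetical placeholders)."""
    for t in range(trials):
        test_input = gen(random.Random(t))  # seeded per trial for reproducibility
        fast = subprocess.run([candidate], input=test_input,
                              capture_output=True, text=True, timeout=2)
        slow = subprocess.run([brute], input=test_input,
                              capture_output=True, text=True, timeout=10)
        if fast.stdout.split() != slow.stdout.split():
            print(f"Mismatch on trial {t}:\n{test_input}")
            print(f"candidate: {fast.stdout!r}\nbrute-force: {slow.stdout!r}")
            return
    print(f"No mismatches in {trials} trials.")
```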

References

While these gains from multiple attempts are significant, the converged rating still falls approximately 400 points short of the reported 2719. We, therefore, conjecture that the remaining difference is largely attributable to the benefits of tool calls and terminal access.

Zheng et al., “LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?” (arXiv:2506.11928, 13 Jun 2025), Section 3.3 (Impact of Multiple Attempts (Pass@k) on Model Performance).