
Identify causes of o4-mini-high’s compilation errors without tool calls

Investigate whether reinforcement learning calibration overshoot and training-time reliance on local compilation are the primary causes of the high compilation error rate observed when evaluating OpenAI o4-mini-high without tool calls, and empirically assess their impact on emitted code correctness.


Background

In their tool-free evaluation, the authors observed an unusually high number of compilation errors from o4-mini-high compared to other models. They hypothesize that training-related factors, specifically RL calibration overshoot and dependence on local compilation feedback, may encourage confident but syntactically incorrect outputs.

They suggest that removing tool-call safety nets (e.g., immediate compilation feedback) exposes latent weaknesses, motivating a focused study to verify these suspected causes and measure their contribution to the observed errors.
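To make the tool-free measurement concrete, the following is a minimal sketch (not from the paper) of how one might quantify compilation error rates for emitted code: extract each C++ solution from a model transcript, compile it exactly once with no retry or feedback loop, and tally the failures. The `solutions/` directory layout, compiler flags, and helper names are assumptions for illustration only.

```python
"""Sketch: single-shot compilation check for model-emitted C++ solutions.

Assumes one .cpp file per model response in a `solutions/` directory
(hypothetical layout) and a local g++ toolchain.
"""
import pathlib
import subprocess
import tempfile


def compiles(source: str, std: str = "c++17", timeout_s: int = 30) -> bool:
    """Return True if `source` compiles cleanly with g++, else False."""
    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp) / "main.cpp"
        src.write_text(source)
        result = subprocess.run(
            ["g++", f"-std={std}", "-O2",
             "-o", str(pathlib.Path(tmp) / "a.out"), str(src)],
            capture_output=True,
            timeout=timeout_s,
        )
        return result.returncode == 0


def compile_error_rate(solution_dir: str) -> float:
    """Fraction of emitted solutions that fail to compile on a single attempt."""
    paths = sorted(pathlib.Path(solution_dir).glob("*.cpp"))
    if not paths:
        return 0.0
    failures = sum(not compiles(p.read_text()) for p in paths)
    return failures / len(paths)


if __name__ == "__main__":
    print(f"compile error rate: {compile_error_rate('solutions'):.1%}")
```

Running the same tally on transcripts produced with and without tool calls would let one compare how much of the error rate disappears once compilation feedback is available, which is the comparison the conjecture implies.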

References

In particular, in our evaluation without tool calls, o4-mini-high exhibited an unprecedented number of compilation errors. We conjecture that two factors likely contribute: (i) its reinforcement learning may have overshot calibration, encouraging confident but sometimes syntactically incorrect guesses; (ii) during training, the model may have overly relied on local compilation to auto-correct surface-level errors, reducing the learning pressure to emit flawless code.

LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? (2506.11928 - Zheng et al., 13 Jun 2025) in Appendix Section 6 (The Impact of Tool Usage in o4-mini)