Identify causes of o4-mini-high’s compilation errors without tool calls
Investigate whether two conjectured factors, overshoot in reinforcement learning calibration and reliance on local compilation during training, are the primary causes of the high compilation error rate observed when OpenAI o4-mini-high is evaluated without tool calls. Empirically assess the impact of each factor on the correctness of the emitted code.
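The quantity under study, the compilation error rate, can be measured with a small harness. The following is a minimal sketch under stated assumptions, not the evaluation code behind the observation: the function names are hypothetical, and the default checker uses Python's built-in `compile()` as a stand-in for whatever compiler the benchmark actually invokes. For a compiled language, one would pass in a checker that calls the real compiler instead.

```python
from typing import Callable, Iterable


def python_syntax_ok(source: str) -> bool:
    """Stand-in compile check (assumption): True if `source` parses as Python."""
    try:
        compile(source, "<emitted>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def compile_error_rate(
    samples: Iterable[str],
    compiles: Callable[[str], bool] = python_syntax_ok,
) -> float:
    """Fraction of emitted programs that fail the supplied compile check."""
    programs = list(samples)
    if not programs:
        raise ValueError("at least one sample is required")
    failures = sum(1 for program in programs if not compiles(program))
    return failures / len(programs)
```

Keeping the checker as a parameter lets the same harness be reused across languages and across evaluation conditions.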
References
In particular, in our evaluation without tool calls, o4-mini-high exhibited an unprecedented number of compilation errors. We conjecture that two factors contribute: (i) its reinforcement learning may have overshot calibration, encouraging confident but sometimes syntactically incorrect guesses; (ii) during training, the model may have relied too heavily on local compilation to auto-correct surface-level errors, reducing the learning pressure to emit flawless code.
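One way to probe conjecture (ii) would be to compare compilation error rates between two evaluation conditions, for example with and without access to a local compile step, and check whether the gap exceeds what sampling noise would explain. The sketch below uses a standard two-proportion z-test. The function name is hypothetical, the choice of test is an assumption rather than the excerpt's method, and any counts passed in must come from an actual evaluation run.

```python
import math
from typing import Tuple


def two_proportion_z(err_a: int, n_a: int, err_b: int, n_b: int) -> Tuple[float, float]:
    """Two-sided z-test for a difference in compile error rates.

    err_a / n_a and err_b / n_b are the error counts and sample sizes
    for conditions A and B. Returns (z statistic, p-value).
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    p_a, p_b = err_a / n_a, err_b / n_b
    # Pooled error rate under the null hypothesis of equal rates.
    pooled = (err_a + err_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("test is undefined when the pooled rate is 0 or 1")
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A small p-value would only indicate that the two conditions differ. Attributing the difference to training-time reliance on compilation, rather than to calibration overshoot, would need further controls.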