Reasoning Depth as Primary Cause of Frontier AI Failure on FormulaOne
Ascertain whether the depth of multi-step reasoning required by the FormulaOne benchmark, whose problems are dynamic programming tasks on tree-like graphs derived from monadic second-order (MSO) logic, is the principal reason frontier AI models achieve near-zero success and flat-line on these tasks.
References
We conjecture that this reasoning depth, typical of cutting edge real world research problems, is the main characteristic due to which frontier AI models "flat-line" on FormulaOne.
— FormulaOne: Measuring the Depth of Algorithmic Reasoning Beyond Competitive Programming
(2507.13337 - Beniamini et al., 17 Jul 2025) in Section 1, Introduction