Reasoning Depth as Primary Cause of Frontier AI Failure on FormulaOne

Ascertain whether the depth of multi-step reasoning required by the FormulaOne benchmark—comprising dynamic programming problems on tree-like graphs derived from monadic second-order (MSO) logic—is the principal factor causing frontier AI models to exhibit near-zero success and flat-line performance on these tasks.

Background

The paper introduces FormulaOne, a benchmark of dynamic programming problems on graphs generated from MSO logic, intended to measure deep algorithmic reasoning. The authors report that frontier models such as OpenAI’s o3 achieve less than 1% success on these tasks despite substantial assistance and in-distribution problem settings.

To illustrate the complexity, the authors detail a representative problem requiring at least 15 interdependent reasoning steps and argue that such depth is typical of real-world research challenges. They explicitly conjecture that this depth is the main characteristic behind current models’ failure on FormulaOne, posing an empirical and methodological question about the relationship between reasoning depth and model performance.
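
For intuition about the problem family only (this example is not taken from the paper): maximum independent set is an MSO-definable property, and on a tree it admits the classic two-state bottom-up dynamic program. The minimal Python sketch below shows that toy instance; actual FormulaOne tasks compose many more interacting states and demand far longer chains of case analysis, which is precisely the depth the conjecture refers to.

import sys

def max_independent_set(n, edges):
    """Size of a maximum independent set of a tree on n nodes (toy tree DP)."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    sys.setrecursionlimit(max(1000, 2 * n))

    def dfs(u, parent):
        take = 1   # best value for the subtree of u if u is included
        skip = 0   # best value for the subtree of u if u is excluded
        for w in adj[u]:
            if w == parent:
                continue
            t, s = dfs(w, u)
            take += s           # children of an included node must be excluded
            skip += max(t, s)   # children of an excluded node may go either way
        return take, skip

    return max(dfs(0, -1)) if n else 0

# Example: a path on 5 vertices has a maximum independent set of size 3.
print(max_independent_set(5, [(0, 1), (1, 2), (2, 3), (3, 4)]))  # -> 3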

References

We conjecture that this reasoning depth, typical of cutting edge real world research problems, is the main characteristic due to which frontier AI models "flat-line" on FormulaOne.

FormulaOne: Measuring the Depth of Algorithmic Reasoning Beyond Competitive Programming (2507.13337 - Beniamini et al., 17 Jul 2025) in Section 1, Introduction