- The paper introduces FormulaOne, a benchmark that probes deep algorithmic reasoning in AI systems beyond traditional competitive programming tasks.
- It employs MSO logic and tree decompositions to generate dynamic programming problems, some requiring at least 15 intertwined reasoning steps.
- Empirical tests reveal that state-of-the-art LLMs solve less than 1% of the hard problems, highlighting significant model limitations.
The FormulaOne benchmark introduces a rigorous and principled framework for evaluating the depth of algorithmic reasoning in AI systems, with a focus on dynamic programming over tree decompositions for graph properties defined in Monadic Second-Order (MSO) logic. The benchmark is motivated by the observation that, despite recent advances in LLMs and their strong performance on competitive programming platforms, these models remain fundamentally limited in their ability to solve problems requiring the multi-step, abstract, and combinatorial reasoning characteristic of real-world research and industrial optimization tasks.
Benchmark Construction and Theoretical Foundations
FormulaOne is constructed around three core principles:
- Depth of Reasoning: The problems are designed to require the synthesis of topological, geometric, and combinatorial insights, as well as precise implementation. Unlike competitive programming problems, which often emphasize clever tricks or isolated algorithmic ideas, FormulaOne tasks demand the orchestration of numerous interdependent reasoning steps. The authors provide concrete evidence that some problems require at least 15 intertwined mathematical steps for a complete solution.
- Unbounded, Principled Problem Generation: Leveraging Courcelle’s theorem, which states that any property expressible in MSO logic can be decided in linear time on graphs of bounded treewidth, the benchmark enables the semi-automatic generation of a vast and diverse set of problems (see the example formula after this list). This approach allows for scalable creation of high-depth algorithmic challenges, suitable both for evaluation and for reinforcement learning with verifiable rewards (RLVR).
- Connection to Theoretical Computer Science: Many problems in FormulaOne are closely related to central conjectures in fine-grained complexity, such as the Strong Exponential Time Hypothesis (SETH). Any significant algorithmic progress on these problems would have direct implications for complexity theory, as the best-known algorithms are believed to be optimal under SETH.
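For a flavor of how such properties are specified, here is a standard MSO1 formulation of the dominating-set property; this is a textbook rendering, and the benchmark's exact formula syntax may differ:

```latex
\mathrm{DomSet}(S) \;\equiv\; \forall v \,\bigl( v \in S \;\lor\; \exists u \,( u \in S \,\land\, \mathrm{adj}(u, v) ) \bigr)
```

By Courcelle's theorem (and its counting extensions), the weighted models of any fixed formula of this kind can be counted in time linear in the graph size on bounded-treewidth inputs, which is what makes principled, large-scale problem generation possible.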
Dataset Design and Problem Structure
The dataset consists of two main components:
- FormulaOne: 120 challenging dynamic programming problems on graphs, each defined by an MSO formula and accompanied by a tree decomposition and weight function. The problems span a wide range of graph-theoretic properties, including connectivity, domination, forbidden subgraphs, and extremal structures.
- FormulaOne-Warmup: 100 simpler problems from the same distribution, intended to facilitate research and model development.
Each problem is formulated as a weighted model counting (WMC) task: given a graph, its tree decomposition, and vertex weights, compute the sum of weights of all subsets satisfying the MSO-defined property, modulo a large prime. The input format and evaluation environment are standardized, with the environment handling parsing, graph representation, and tree decomposition traversal. The model’s task is restricted to implementing the core dynamic programming logic via five callback functions (leaf, introduce, forget, join, extract_solution), isolating the reasoning challenge from engineering overhead.
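To make the callback interface concrete, here is a minimal sketch for one simple property, weighted counting of independent sets, on a nice tree decomposition whose root bag is empty. The five callback names come from the paper; the signatures, the state encoding (a dict from the set of chosen bag vertices to a weight-sum), and the product-of-vertex-weights convention are illustrative assumptions rather than the benchmark's exact API.

```python
# State: dict mapping frozenset("bag vertices chosen into S") to the summed
# weight of all compatible partial solutions, modulo MOD.

MOD = 998244353  # illustrative prime; the benchmark fixes its own modulus

def leaf():
    # Empty bag: a single partial solution (the empty set) with weight 1.
    return {frozenset(): 1}

def introduce(states, v, bag_neighbors_of_v):
    # Vertex v enters the bag: branch on excluding v, or including v when no
    # already-chosen bag vertex is adjacent to it (the independence check).
    new = {}
    for chosen, w in states.items():
        new[chosen] = (new.get(chosen, 0) + w) % MOD        # v stays out of S
        if not (chosen & bag_neighbors_of_v):               # v joins S
            key = chosen | {v}
            new[key] = (new.get(key, 0) + w) % MOD
    return new

def forget(states, v, weight_of_v):
    # Vertex v leaves the bag; its membership is now final, so this is where
    # its weight is applied.
    new = {}
    for chosen, w in states.items():
        if v in chosen:
            w = w * weight_of_v % MOD
        key = chosen - {v}
        new[key] = (new.get(key, 0) + w) % MOD
    return new

def join(left, right):
    # Two child subtrees over the same bag: partial solutions must agree on
    # the bag, and their weights multiply, since all weight accumulated so
    # far comes from disjoint sets of forgotten vertices.
    new = {}
    for chosen, wl in left.items():
        wr = right.get(chosen)
        if wr is not None:
            new[chosen] = (new.get(chosen, 0) + wl * wr) % MOD
    return new

def extract_solution(states):
    # Root (empty bag): every vertex has been forgotten, so the lone
    # surviving entry holds the final weighted count.
    return states.get(frozenset(), 0)
```

Note the deliberate choice to apply vertex weights at forget time rather than at introduce time: applying them at introduce would double-count the weights of shared bag vertices when two subtrees are combined at a join node, one of the pitfalls discussed in the next section.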
Complexity and State Design
The paper provides a detailed exposition of the challenges inherent in MSO-based dynamic programming on tree decompositions. Key implementation considerations include:
- State Representation: Designing a minimal yet sufficient state profile for each bag in the tree decomposition is non-trivial. The state must capture all relevant information about partial solutions, including connectivity partitions, domination status, forbidden subgraph configurations, and global invariants.
- Transition Logic: Correctly handling the transitions at introduce, forget, and join nodes is critical. The authors highlight common pitfalls, such as premature invalidation of states, failure to cap solution cardinality, incorrect merging of connectivity partitions, and double-counting in join operations (the sketch after this list illustrates the partition-merging case).
- Geometric and Logical Reasoning: Many problems require tracking complex geometric configurations (e.g., induced paths or cycles) and reasoning about the interaction between local and global properties. The benchmark includes problems that necessitate the use of non-trivial graph-theoretic theorems and logical reductions to achieve efficient solutions.
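As one concrete illustration of the state-design and join-node issues above, the sketch below canonicalizes a connectivity partition of the chosen bag vertices and merges two such partitions at a join node with a union-find. The encoding, blocks sorted by their smallest element, is a common convention assumed here for illustration; it is not taken from the benchmark.

```python
def canonical_partition(blocks):
    """Normalize a partition (an iterable of vertex sets) into a hashable,
    order-independent key: each block sorted, blocks ordered by minimum."""
    return tuple(sorted((tuple(sorted(b)) for b in blocks),
                        key=lambda b: b[0]))

def merge_partitions(p1, p2):
    """Join-node merge of two partitions over the same bag vertices: a vertex
    connected to u on one side and to w on the other connects u and w
    globally, so blocks are unioned with a small union-find."""
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for partition in (p1, p2):
        for block in partition:
            for v in block:
                parent.setdefault(v, v)
            root = find(block[0])
            for v in block[1:]:
                parent[find(v)] = root

    groups = {}
    for v in parent:
        groups.setdefault(find(v), set()).add(v)
    return canonical_partition(groups.values())
```

Canonicalization is what lets two partial solutions inducing the same partition collapse into a single DP entry; a non-canonical encoding silently fragments the table, which the paper identifies as a recurring model failure mode.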
Evaluation Methodology
The evaluation framework is comprehensive and rigorous:
- Automated Verification: Each problem is accompanied by a verifier, enabling automatic checking of candidate solutions against a suite of tests, including correctness (via brute force on small instances; a minimal sketch follows this list), consistency (invariance to the choice of tree decomposition), and efficiency (scaling to large graphs).
- Problem Annotation: Problems are labeled according to the algorithmic skills and state-design techniques required, allowing for fine-grained analysis of model performance across categories such as adjacency, connectivity, extremal properties, logic, topology, and more.
- Model Assessment: Leading LLMs (OpenAI o3, o3-Pro, Gemini 2.5 Pro, Grok 4 Heavy) were evaluated under generous prompting and support. Despite this, the best models solved less than 1% of the hard problems, with only marginally better performance on the warmup set. The analysis identifies central failure modes, including premature finalization, incomplete geometric reasoning, local-to-global errors, and non-canonical state representation.
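As referenced above, the correctness tests compare against exhaustive enumeration on small instances. A minimal sketch of such a reference solver, reusing the product-of-weights convention assumed in the earlier callback sketch (function names here are hypothetical):

```python
from itertools import combinations

def brute_force_wmc(vertices, edges, weight, holds, mod):
    """Reference answer: over every subset S with holds(S, edges) true, sum
    the product of the vertex weights in S, modulo `mod`."""
    total = 0
    for r in range(len(vertices) + 1):
        for subset in combinations(vertices, r):
            S = frozenset(subset)
            if holds(S, edges):
                w = 1
                for v in S:
                    w = w * weight[v] % mod
                total = (total + w) % mod
    return total

# Property predicate matching the earlier independent-set example:
def is_independent(S, edges):
    return not any(u in S and v in S for u, v in edges)
```

This is exponential in the number of vertices, so it is usable only on the small instances the correctness suite targets, but it is trivially trustworthy as ground truth.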
Numerical Results and Claims
- Empirical Finding: All evaluated frontier models, including those that achieve superhuman performance on competitive programming benchmarks, fail almost entirely on FormulaOne, solving at most 1 out of 120 problems even with 10 attempts and extensive few-shot examples.
- Claim: The gap between current model capabilities and the reasoning required for these problems is fundamental and not addressable by prompt engineering or minor architectural tweaks.
- Implication: The results demonstrate that current benchmarks are insufficient for measuring the depth of algorithmic reasoning needed for real-world research and optimization tasks. FormulaOne exposes a regime where LLMs’ emergent capabilities are inadequate, and where progress may require fundamentally new approaches, such as explicit search or symbolic reasoning.
Practical and Theoretical Implications
Practical Implications:
- Benchmarking: FormulaOne provides a principled, extensible, and scalable benchmark for evaluating and training AI systems on deep algorithmic reasoning tasks. Its design is well-suited for RLVR and for driving progress in automated scientific discovery.
- Model Development: The benchmark highlights the need for models that can perform structured, multi-step reasoning, and that can synthesize combinatorial, geometric, and logical insights. It suggests that future systems may need to integrate symbolic methods, search, or program synthesis techniques.
- Industrial Relevance: The problems are directly motivated by real-world applications in routing, scheduling, and network design, making progress on this benchmark relevant for commercial optimization.
Theoretical Implications:
- Complexity Theory: Any significant improvement on the benchmark’s hard problems would have direct consequences for fine-grained complexity, potentially refuting SETH-based lower bounds for certain problem classes.
- Open-Ended Problem Generation: The MSO-based framework enables the creation of an essentially unbounded suite of high-depth problems, supporting research into open-ended scientific discovery and automated theorem proving.
Future Directions
- Expanding Problem Classes: The current release focuses on WMC objectives and MSO1 logic. Extensions to MSO2, optimization variants, and other graph parameters (e.g., clique-width, pathwidth) are natural next steps.
- Integrated Tree Decomposition: Requiring models to generate tree decompositions, rather than providing them, would further increase the challenge and realism of the benchmark (a heuristic-decomposition sketch follows this list).
- Hybrid Approaches: The results motivate research into hybrid neuro-symbolic systems, explicit search, and program synthesis as means to bridge the observed gap in reasoning depth.
- RL Environments: The benchmark’s structure is ideal for constructing RL environments with verifiable rewards, supporting the development of agents capable of open-ended algorithmic discovery.
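For a sense of what producing one's own decomposition involves, a heuristic tree decomposition can be computed with networkx's min-degree elimination heuristic; the choice of networkx and of the Petersen graph here is purely illustrative:

```python
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_degree

G = nx.petersen_graph()  # known treewidth 4; the heuristic gives an upper bound
width, decomposition = treewidth_min_degree(G)

print("heuristic width:", width)
for bag in decomposition.nodes:  # bags are frozensets of vertices of G
    print(sorted(bag))
```

Computing exact treewidth is NP-hard, which is part of what would make this extension substantially more demanding than the current setting, where decompositions are supplied.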
Conclusion
FormulaOne establishes a new standard for evaluating and advancing the algorithmic reasoning capabilities of AI systems. By focusing on problems that are both theoretically deep and practically relevant, and by providing a scalable, principled framework for problem generation and evaluation, the benchmark exposes fundamental limitations of current models and charts a path for future research at the intersection of AI, algorithms, and complexity theory.