- The paper introduces FormulaOne, a benchmark that probes deep algorithmic reasoning in AI systems beyond traditional competitive programming tasks.
- It employs MSO logic and tree decompositions to generate dynamic programming problems, some requiring at least 15 intertwined reasoning steps.
- Empirical tests reveal that state-of-the-art LLMs solve less than 1% of the hard problems, highlighting significant model limitations.
The FormulaOne benchmark introduces a rigorous and principled framework for evaluating the depth of algorithmic reasoning in AI systems, with a focus on dynamic programming over tree decompositions for graph properties defined in Monadic Second-Order (MSO) logic. The benchmark is motivated by the observation that, despite recent advances in LLMs and their strong performance on competitive programming platforms, these models remain fundamentally limited in their ability to solve problems requiring the multi-step, abstract, and combinatorial reasoning characteristic of real-world research and industrial optimization tasks.
Benchmark Construction and Theoretical Foundations
FormulaOne is constructed around three core principles:
- Depth of Reasoning: The problems are designed to require the synthesis of topological, geometric, and combinatorial insights, as well as precise implementation. Unlike competitive programming problems, which often emphasize clever tricks or isolated algorithmic ideas, FormulaOne tasks demand the orchestration of numerous interdependent reasoning steps. The authors provide concrete evidence that some problems require at least 15 intertwined mathematical steps for a complete solution.
- Unbounded, Principled Problem Generation: Leveraging Courcelle’s theorem, which states that any property expressible in MSO logic can be decided in linear time on graphs of bounded treewidth, the benchmark enables the semi-automatic generation of a vast and diverse set of problems (see the example formula after this list). This approach allows for scalable creation of high-depth algorithmic challenges, suitable both for evaluation and for reinforcement learning with verifiable rewards (RLVR).
- Connection to Theoretical Computer Science: Many problems in FormulaOne are closely related to central conjectures in fine-grained complexity, such as the Strong Exponential Time Hypothesis (SETH). Any significant algorithmic progress on these problems would have direct implications for complexity theory, as the best-known algorithms are believed to be optimal under SETH.
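For a flavor of how such properties are specified, here is a standard MSO1 formulation of the dominating-set property; this is a textbook rendering, and the benchmark's exact formula syntax may differ:

```latex
\mathrm{DomSet}(S) \;\equiv\; \forall v \,\bigl( v \in S \;\lor\; \exists u \,( u \in S \,\land\, \mathrm{adj}(u, v) ) \bigr)
```

By Courcelle's theorem (and its counting extensions), the weighted models of any fixed formula of this kind can be counted in time linear in the graph size on bounded-treewidth inputs, which is what makes principled, large-scale problem generation possible.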
Dataset Design and Problem Structure
The dataset consists of two main components:
- FormulaOne: 120 challenging dynamic programming problems on graphs, each defined by an MSO formula and accompanied by a tree decomposition and weight function. The problems span a wide range of graph-theoretic properties, including connectivity, domination, forbidden subgraphs, and extremal structures.
- FormulaOne-Warmup: 100 simpler problems from the same distribution, intended to facilitate research and model development.
Each problem is formulated as a weighted model counting (WMC) task: given a graph, its tree decomposition, and vertex weights, compute the sum of weights of all subsets satisfying the MSO-defined property, modulo a large prime. The input format and evaluation environment are standardized, with the environment handling parsing, graph representation, and tree decomposition traversal. The model’s task is restricted to implementing the core dynamic programming logic via five callback functions (leaf, introduce, forget, join, extract_solution), isolating the reasoning challenge from engineering overhead.
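To make the callback interface concrete, here is a minimal sketch for one simple property, weighted counting of independent sets, on a nice tree decomposition whose root bag is empty. The five callback names come from the paper; the signatures, the state encoding (a dict from the set of chosen bag vertices to a weight-sum), and the product-of-vertex-weights convention are illustrative assumptions rather than the benchmark's exact API.

```python
# State: dict mapping frozenset("bag vertices chosen into S") to the summed
# weight of all compatible partial solutions, modulo MOD.

MOD = 998244353  # illustrative prime; the benchmark fixes its own modulus

def leaf():
    # Empty bag: a single partial solution (the empty set) with weight 1.
    return {frozenset(): 1}

def introduce(states, v, bag_neighbors_of_v):
    # Vertex v enters the bag: branch on excluding v, or including v when no
    # already-chosen bag vertex is adjacent to it (the independence check).
    new = {}
    for chosen, w in states.items():
        new[chosen] = (new.get(chosen, 0) + w) % MOD        # v stays out of S
        if not (chosen & bag_neighbors_of_v):               # v joins S
            key = chosen | {v}
            new[key] = (new.get(key, 0) + w) % MOD
    return new

def forget(states, v, weight_of_v):
    # Vertex v leaves the bag; its membership is now final, so this is where
    # its weight is applied.
    new = {}
    for chosen, w in states.items():
        if v in chosen:
            w = w * weight_of_v % MOD
        key = chosen - {v}
        new[key] = (new.get(key, 0) + w) % MOD
    return new

def join(left, right):
    # Two child subtrees over the same bag: partial solutions must agree on
    # the bag, and their weights multiply, since all weight accumulated so
    # far comes from disjoint sets of forgotten vertices.
    new = {}
    for chosen, wl in left.items():
        wr = right.get(chosen)
        if wr is not None:
            new[chosen] = (new.get(chosen, 0) + wl * wr) % MOD
    return new

def extract_solution(states):
    # Root (empty bag): every vertex has been forgotten, so the lone
    # surviving entry holds the final weighted count.
    return states.get(frozenset(), 0)
```

Note the deliberate choice to apply vertex weights at forget time rather than at introduce time: applying them at introduce would double-count the weights of shared bag vertices when two subtrees are combined at a join node, one of the pitfalls discussed in the next section.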
Complexity and State Design
The paper provides a detailed exposition of the challenges inherent in MSO-based dynamic programming on tree decompositions. Key implementation considerations include:
- State Representation: Designing a minimal yet sufficient state profile for each bag in the tree decomposition is non-trivial. The state must capture all relevant information about partial solutions, including connectivity partitions, domination status, forbidden subgraph configurations, and global invariants.
- Transition Logic: Correctly handling the transitions at introduce, forget, and join nodes is critical. The authors highlight common pitfalls, such as premature invalidation of states, failure to cap solution cardinality, incorrect merging of connectivity partitions, and double-counting in join operations (the sketch after this list illustrates the partition-merging case).
- Geometric and Logical Reasoning: Many problems require tracking complex geometric configurations (e.g., induced paths or cycles) and reasoning about the interaction between local and global properties. The benchmark includes problems that necessitate the use of non-trivial graph-theoretic theorems and logical reductions to achieve efficient solutions.
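As one concrete illustration of the state-design and join-node issues above, the sketch below canonicalizes a connectivity partition of the chosen bag vertices and merges two such partitions at a join node with a union-find. The encoding, blocks sorted by their smallest element, is a common convention assumed here for illustration; it is not taken from the benchmark.

```python
def canonical_partition(blocks):
    """Normalize a partition (an iterable of vertex sets) into a hashable,
    order-independent key: each block sorted, blocks ordered by minimum."""
    return tuple(sorted((tuple(sorted(b)) for b in blocks),
                        key=lambda b: b[0]))

def merge_partitions(p1, p2):
    """Join-node merge of two partitions over the same bag vertices: a vertex
    connected to u on one side and to w on the other connects u and w
    globally, so blocks are unioned with a small union-find."""
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for partition in (p1, p2):
        for block in partition:
            for v in block:
                parent.setdefault(v, v)
            root = find(block[0])
            for v in block[1:]:
                parent[find(v)] = root

    groups = {}
    for v in parent:
        groups.setdefault(find(v), set()).add(v)
    return canonical_partition(groups.values())
```

Canonicalization is what lets two partial solutions inducing the same partition collapse into a single DP entry; a non-canonical encoding silently fragments the table, which the paper identifies as a recurring model failure mode.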
Evaluation Methodology
The evaluation framework is comprehensive and rigorous:
- Automated Verification: Each problem is accompanied by a verifier, enabling automatic checking of candidate solutions against a suite of tests, including correctness (via brute force on small instances; a minimal sketch follows this list), consistency (invariance to the choice of tree decomposition), and efficiency (scaling to large graphs).
- Problem Annotation: Problems are labeled according to the algorithmic skills and state-design techniques required, allowing for fine-grained analysis of model performance across categories such as adjacency, connectivity, extremal properties, logic, topology, and more.
- Model Assessment: Leading LLMs (OpenAI o3, o3-Pro, Gemini 2.5 Pro, Grok 4 Heavy) were evaluated under generous prompting and support. Despite this, the best models solved less than 1% of the hard problems, with only marginally better performance on the warmup set. The analysis identifies central failure modes, including premature finalization, incomplete geometric reasoning, local-to-global errors, and non-canonical state representation.
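As referenced above, the correctness tests compare against exhaustive enumeration on small instances. A minimal sketch of such a reference solver, reusing the product-of-weights convention assumed in the earlier callback sketch (function names here are hypothetical):

```python
from itertools import combinations

def brute_force_wmc(vertices, edges, weight, holds, mod):
    """Reference answer: over every subset S with holds(S, edges) true, sum
    the product of the vertex weights in S, modulo `mod`."""
    total = 0
    for r in range(len(vertices) + 1):
        for subset in combinations(vertices, r):
            S = frozenset(subset)
            if holds(S, edges):
                w = 1
                for v in S:
                    w = w * weight[v] % mod
                total = (total + w) % mod
    return total

# Property predicate matching the earlier independent-set example:
def is_independent(S, edges):
    return not any(u in S and v in S for u, v in edges)
```

This is exponential in the number of vertices, so it is usable only on the small instances the correctness suite targets, but it is trivially trustworthy as ground truth.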
Numerical Results and Claims
- Empirical Finding: All evaluated frontier models, including those that achieve superhuman performance on competitive programming benchmarks, fail almost entirely on FormulaOne, solving at most 1 out of 120 problems even with 10 attempts and extensive few-shot examples.
- Claim: The gap between current model capabilities and the reasoning required for these problems is fundamental and not addressable by prompt engineering or minor architectural tweaks.
- Implication: The results demonstrate that current benchmarks are insufficient for measuring the depth of algorithmic reasoning needed for real-world research and optimization tasks. FormulaOne exposes a regime where LLMs’ emergent capabilities are inadequate, and where progress may require fundamentally new approaches, such as explicit search or symbolic reasoning.
Practical and Theoretical Implications
Practical Implications:
- Benchmarking: FormulaOne provides a principled, extensible, and scalable benchmark for evaluating and training AI systems on deep algorithmic reasoning tasks. Its design is well-suited for RLVR and for driving progress in automated scientific discovery.
- Model Development: The benchmark highlights the need for models that can perform structured, multi-step reasoning, and that can synthesize combinatorial, geometric, and logical insights. It suggests that future systems may need to integrate symbolic methods, search, or program synthesis techniques.
- Industrial Relevance: The problems are directly motivated by real-world applications in routing, scheduling, and network design, making progress on this benchmark relevant for commercial optimization.
Theoretical Implications:
- Complexity Theory: Any significant improvement on the benchmark’s hard problems would have direct consequences for fine-grained complexity, potentially refuting SETH-based lower bounds for certain problem classes.
- Open-Ended Problem Generation: The MSO-based framework enables the creation of an essentially unbounded suite of high-depth problems, supporting research into open-ended scientific discovery and automated theorem proving.
Future Directions
- Expanding Problem Classes: The current release focuses on WMC objectives and MSO1 logic. Extensions to MSO2, optimization variants, and other graph parameters (e.g., clique-width, pathwidth) are natural next steps.
- Integrated Tree Decomposition: Requiring models to generate tree decompositions, rather than providing them, would further increase the challenge and realism of the benchmark (a heuristic-decomposition sketch follows this list).
- Hybrid Approaches: The results motivate research into hybrid neuro-symbolic systems, explicit search, and program synthesis as means to bridge the observed gap in reasoning depth.
- RL Environments: The benchmark’s structure is ideal for constructing RL environments with verifiable rewards, supporting the development of agents capable of open-ended algorithmic discovery.
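For a sense of what producing one's own decomposition involves, a heuristic tree decomposition can be computed with networkx's min-degree elimination heuristic; the choice of networkx and of the Petersen graph here is purely illustrative:

```python
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_degree

G = nx.petersen_graph()  # known treewidth 4; the heuristic gives an upper bound
width, decomposition = treewidth_min_degree(G)

print("heuristic width:", width)
for bag in decomposition.nodes:  # bags are frozensets of vertices of G
    print(sorted(bag))
```

Computing exact treewidth is NP-hard, which is part of what would make this extension substantially more demanding than the current setting, where decompositions are supplied.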
Conclusion
FormulaOne establishes a new standard for evaluating and advancing the algorithmic reasoning capabilities of AI systems. By focusing on problems that are both theoretically deep and practically relevant, and by providing a scalable, principled framework for problem generation and evaluation, the benchmark exposes fundamental limitations of current models and charts a path for future research at the intersection of AI, algorithms, and complexity theory.