- The paper evaluates OpenAI's new o1 Large Reasoning Models (LRMs) for AI planning and scheduling tasks, demonstrating significant performance improvements over traditional Large Language Models (LLMs) on standard benchmarks.
- While achieving high success rates in simpler planning scenarios (e.g., 97.8% in Blocksworld), o1 models face challenges in complex instances and are computationally much more intensive than traditional planning systems.
- The study proposes an LRM-Modulo framework integrating external verifiers, which substantially enhances performance and provides soundness guarantees, highlighting the need for rigorous new benchmarks and improved efficiency for real-world LRM deployment.
The paper "Planning in Strawberry Fields: Evaluating and Improving the Planning and Scheduling Capabilities of LRM o1" explores the capabilities of OpenAI’s new models, o1-preview and o1-mini, in planning and scheduling tasks. These models, dubbed Large Reasoning Models (LRMs), signify a departure from traditional autoregressive LLMs by promising enhanced reasoning capabilities. The research scrutinizes their performance against established benchmarks and posits improvements through an LRM-Modulo framework.
Planning Evaluation
- Performance Against Traditional LLMs: The paper first reviews the poor performance of prior LLMs on planning benchmarks. Citing PlanBench evaluations, it notes that these models struggle even with straightforward block-stacking problems when the problems are formulated in machine-readable PDDL (Planning Domain Definition Language); a minimal validator for this domain is sketched after this list.
- Introducing LRMs: The o1 models, engineered to function as approximate reasoners rather than mere text completers, were evaluated on the PlanBench dataset. The results show significant improvements over prior LLMs, particularly in the non-obfuscated Blocksworld domain, where o1 generates correct plans for 97.8% of instances. Performance drops sharply on harder variants, however: on Mystery Blocksworld, an obfuscated version of the same domain, the success rate falls to about 52.8%.
- Evaluation of Unsatisfiable Instances: The o1 models show an emerging capacity to recognize when no valid plan exists, correctly identifying some unsolvable instances; they also misclassify several others, however, indicating that this ability is not yet reliable.
- Efficiency Considerations: The paper acknowledges the steep computational cost of LRM inference. Running o1 is far more resource-intensive than both classical planners such as Fast Downward, which solve these benchmark problems almost instantaneously, and prior LLMs.
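For concreteness, here is a minimal sketch of the kind of external check such evaluations rely on: a Blocksworld plan validator in Python. The state encoding and the four action names follow the standard Blocksworld domain; PlanBench's actual PDDL formulation and validation tooling may differ in detail.

```python
# Minimal Blocksworld plan validator: checks that each action's
# preconditions hold in the current state, then applies its effects.
# A sketch of the external check PlanBench-style evaluation relies on;
# the paper's exact encoding and tooling may differ.

def validate(plan, on, clear, holding=None):
    """on: dict block -> block or 'table'; clear: set of blocks with nothing on top."""
    for act, *args in plan:
        if act == "unstack":
            x, y = args
            assert holding is None and on.get(x) == y and x in clear
            holding, clear = x, (clear - {x}) | {y}
            del on[x]
        elif act == "pickup":
            (x,) = args
            assert holding is None and on.get(x) == "table" and x in clear
            holding, clear = x, clear - {x}
            del on[x]
        elif act == "stack":
            x, y = args
            assert holding == x and y in clear
            on[x], holding = y, None
            clear = (clear - {y}) | {x}
        elif act == "putdown":
            (x,) = args
            assert holding == x
            on[x], holding = "table", None
            clear = clear | {x}
        else:
            raise ValueError(f"unknown action {act}")
    return on, clear

# Example: move A from on top of B to the table, then stack B on A.
final_on, _ = validate(
    [("unstack", "A", "B"), ("putdown", "A"), ("pickup", "B"), ("stack", "B", "A")],
    on={"A": "B", "B": "table"},
    clear={"A"},
)
assert final_on == {"A": "table", "B": "A"}
```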
Scheduling Evaluation
o1 models were tested on benchmarks such as Natural Plan and Travel Planning:
- Natural Plan: o1-mini achieved 94% accuracy on the simpler calendar-scheduling tasks. On more complex scheduling domains such as trip planning, however, performance fell off, suggesting that its strength is confined to selected task types (a toy calendar-scheduling routine is sketched after this list).
- Travel Planning: Results show only incremental improvements over prior models, reinforcing the need for dedicated strategies to unlock the models' full capabilities across diverse scheduling contexts.
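To give a flavor of the calendar-scheduling setting, the sketch below finds the earliest slot that fits a meeting given everyone's busy intervals. The interval representation and working-day bounds are illustrative assumptions; Natural Plan poses these problems in natural language rather than in this structured form.

```python
# Find the earliest common free slot for a meeting, given the combined
# busy intervals of all participants within working hours. An assumed
# simplification of a calendar-scheduling task, not the benchmark's format.

def earliest_slot(busy, duration, day=(9 * 60, 17 * 60)):
    """busy: list of (start, end) in minutes since midnight; duration in minutes."""
    start, day_end = day
    for b_start, b_end in sorted(busy):
        if b_start - start >= duration:   # the gap before this busy block fits
            return (start, start + duration)
        start = max(start, b_end)         # otherwise skip past the busy block
    if day_end - start >= duration:
        return (start, start + duration)
    return None                           # no feasible slot: report unsatisfiable

# Alice is busy 9:00-10:30, Bob 10:00-11:00; a 60-minute meeting fits at 11:00.
slot = earliest_slot([(540, 630), (600, 660)], duration=60)
assert slot == (660, 720)  # 11:00-12:00
```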
The LRM-Modulo Framework
The paper argues convincingly for integrating o1 models within an LRM-Modulo framework, which pairs the model with external verifiers that check the correctness and feasibility of every generated plan. This approach markedly improves performance on the challenging benchmarks and, more importantly, provides soundness guarantees, a property still out of reach for standalone LLM or LRM systems. It also helps contain o1's computational cost by limiting the number of expensive generation rounds.
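Schematically, the framework is a generate-test-critique loop. The sketch below is a minimal rendering of that loop; `llm_propose` and `verify` are hypothetical stand-ins rather than the authors' code, and the back-prompt format is an illustrative assumption.

```python
# Schematic LRM-Modulo loop: the LRM proposes a candidate plan, a sound
# external verifier checks it, and the verifier's critiques are fed back
# as a refined prompt. Function names and the prompt format here are
# illustrative assumptions, not the paper's implementation.

def lrm_modulo(problem, llm_propose, verify, max_rounds=5):
    prompt = problem
    for _ in range(max_rounds):
        plan = llm_propose(prompt)             # one (costly) LRM call
        ok, critiques = verify(problem, plan)  # sound external plan check
        if ok:
            return plan                        # correctness guaranteed by the verifier
        prompt = (                             # back-prompt with the critiques
            f"{problem}\n\nYour last plan failed because:\n"
            + "\n".join(critiques)
            + "\nPropose a corrected plan."
        )
    return None                                # give up; never return an unverified plan

# Toy usage with stub functions: the verifier accepts on the second round.
attempts = iter(["bad plan", "good plan"])
plan = lrm_modulo(
    "stack B on A",
    llm_propose=lambda p: next(attempts),
    verify=lambda prob, pl: (pl == "good plan", ["precondition of stack unmet"]),
)
assert plan == "good plan"
```

The soundness guarantee falls out of the exit condition: a plan is returned only when the external verifier accepts it, so the system's correctness no longer depends on the LRM's own reliability.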
Implications and Future Directions
The advent of LRMs like o1, aimed at tackling more complex reasoning tasks, opens several pathways:
- Speculative Future Enhancements: Inference-time flexibility could become a game-changer for resource management; if users gain control over how much reasoning computation the model spends per query, cost could be traded against solution quality, mitigating current cost barriers.
- Rigorous Benchmarks Needed: As LRMs move into System 2 reasoning territory, the paper underscores the need for new, rigorous benchmarks that evaluate these models holistically, paving the way for LRM-centric metrics that go beyond token-level accuracy.
- Critical Deployment Insights: For LRMs to move from academic curiosity to real-world use, particularly in cost-sensitive or mission-critical domains, substantial gains in both architectural transparency and operational efficiency will be needed.
In essence, OpenAI's o1 represents a promising stride in AI planning, with substantial performance improvements over its predecessors. Yet it remains an early step toward generalized reasoning under practical constraints, and the paper advocates frameworks and methodologies that supply the robustness and assurances required for broader adoption.