Evaluating LLMs on Planning and Reasoning: Insights from PlanBench
The paper "PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change" introduces PlanBench, a benchmark suite designed to systematically evaluate the planning and reasoning capabilities of LLMs. The research addresses a fundamental issue: existing benchmarks cannot distinguish genuine planning capability from mere retrieval of plan-like text seen during training. The authors propose PlanBench to overcome this limitation by providing a diverse set of planning tasks grounded in the structured domain models used by the automated planning community.
The benchmark is motivated by the need to assess whether LLMs genuinely possess planning abilities. PlanBench is built around well-defined domains from the automated planning community, particularly those used in the International Planning Competition (IPC). The framework tests a range of planning-oriented tasks, including plan generation, cost-optimal planning, and plan verification, among others.
PlanBench evaluates LLMs on several core capabilities (a minimal sketch of the plan-checking logic behind these tests follows the list):
- Plan Generation: Evaluates the LLM's ability to generate valid plans to achieve specified goals.
- Cost-Optimal Planning: Assesses whether the LLM can generate the least costly plans.
- Plan Verification: Tests whether the LLM can determine the validity of a given plan.
- Reasoning About Plan Execution: Explores the LLM's capability to predict state outcomes from action sequences.
- Robustness to Goal Reformulation: Investigates whether LLMs can recognize logically equivalent goals that are phrased or ordered differently.
- Plan Reuse and Replanning: Examines the ability to reuse plans or adapt to unexpected changes.
- Plan Generalization: Evaluates the model’s ability to infer patterns and generalize plans to new instances.
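Several of these tests reduce to simulating actions against a world state and checking goal satisfaction. The sketch below is not the official PlanBench harness; it is a toy Blocksworld simulator, with simplified action names and state encoding assumed for illustration, showing how plan verification and reasoning about plan execution can be checked mechanically.

```python
# Toy Blocksworld simulator (illustrative assumption, not the PlanBench code):
# supports checking a candidate plan's validity and predicting the resulting state.
from dataclasses import dataclass, field

@dataclass
class State:
    on: dict = field(default_factory=dict)      # block -> block it rests on, or "table"
    clear: set = field(default_factory=set)     # blocks with nothing on top
    holding: str | None = None                  # block held by the gripper, if any

def apply(state: State, action: tuple) -> State | None:
    """Apply one action if its preconditions hold, else return None."""
    name, *args = action
    on, clear, holding = dict(state.on), set(state.clear), state.holding
    if name == "pickup":                        # pick a clear block up off the table
        (b,) = args
        if holding is None and b in clear and on.get(b) == "table":
            del on[b]; clear.discard(b)
            return State(on, clear, b)
    elif name == "putdown":                     # put the held block on the table
        (b,) = args
        if holding == b:
            on[b] = "table"; clear.add(b)
            return State(on, clear, None)
    elif name == "unstack":                     # lift block b off block c
        b, c = args
        if holding is None and b in clear and on.get(b) == c:
            del on[b]; clear.discard(b); clear.add(c)
            return State(on, clear, b)
    elif name == "stack":                       # place held block b onto clear block c
        b, c = args
        if holding == b and c in clear:
            on[b] = c; clear.discard(c); clear.add(b)
            return State(on, clear, None)
    return None                                 # unknown action or precondition failure

def validate(state: State, plan: list, goal: dict) -> bool:
    """Plan verification: execute the plan step by step, then check the goal."""
    for act in plan:
        state = apply(state, act)
        if state is None:                       # precondition violated -> invalid plan
            return False
    return all(state.on.get(b) == c for b, c in goal.items())

# Example instance: A and B on the table; goal is A stacked on B.
init = State(on={"A": "table", "B": "table"}, clear={"A", "B"})
plan = [("pickup", "A"), ("stack", "A", "B")]
print(validate(init, plan, goal={"A": "B"}))    # True
```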
The implementation of PlanBench is based on familiar domains such as Blocksworld and Logistics, using structured domain descriptions so that instances can be generated systematically and proposed solutions checked automatically. This enables a comparative analysis of performance across different LLMs. The paper reports baseline results for GPT-4 and Instruct-GPT-3 on these domains; both perform poorly, indicating that current LLMs still struggle with even basic planning tasks.
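To give a sense of how structured descriptions turn into LLM inputs, here is a minimal sketch of rendering an instance into a natural-language prompt for the plan-generation task. The wording and field names are illustrative assumptions, not the paper's exact prompt template.

```python
# Minimal sketch (assumed prompt wording): turn a structured Blocksworld
# instance into a natural-language plan-generation prompt for an LLM.
def render_prompt(init_facts: list[str], goal_facts: list[str], actions: list[str]) -> str:
    lines = [
        "I am playing with a set of blocks.",
        "The allowed actions are: " + ", ".join(actions) + ".",
        "Initial conditions: " + ", ".join(init_facts) + ".",
        "My goal is to have: " + ", ".join(goal_facts) + ".",
        "Write a plan, one action per line, that achieves the goal.",
    ]
    return "\n".join(lines)

prompt = render_prompt(
    init_facts=["block A is on the table", "block B is on the table",
                "block A is clear", "block B is clear", "the hand is empty"],
    goal_facts=["block A is on top of block B"],
    actions=["pickup", "putdown", "stack", "unstack"],
)
print(prompt)
```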
These results suggest that current LLMs are not yet equipped to handle nuanced reasoning about actions and change. Their weak performance on even simple planning tasks underscores the need for a benchmark like PlanBench to measure, and to spur, progress in this area.
The quantitative results make the gap concrete: GPT-4 achieves only 34.3% accuracy on the plan generation task. Such metrics underscore the distance between anecdotal claims about LLM capabilities and their measured performance under structured evaluations like PlanBench.
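Accuracy here can be read as the fraction of benchmark instances for which the model's plan both executes without precondition failures and satisfies the goal. A minimal scoring sketch, reusing the toy `validate` function above and a hypothetical `generate_plan` callable standing in for the LLM query and answer parsing:

```python
# Minimal scoring sketch (assumes the toy simulator above; generate_plan is hypothetical).
def plan_generation_accuracy(instances, generate_plan) -> float:
    correct = 0
    for init_state, goal in instances:
        plan = generate_plan(init_state, goal)   # LLM call, parsed into action tuples
        if validate(init_state, plan, goal):     # valid and goal-reaching
            correct += 1
    return correct / len(instances)
```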
In conclusion, PlanBench serves as a crucial step toward a more rigorous and systematic evaluation of LLMs in planning contexts. The framework's extensibility offers opportunities for future research to incorporate additional domains and refine evaluation metrics. This benchmark sets the stage for continued exploration into enhancing LLMs' reasoning abilities, presenting a clear pathway for the development of more capable AI systems in the future.