Exploring and Benchmarking the Planning Capabilities of Large Language Models (2406.13094v2)

Published 18 Jun 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Classical and natural language planning tasks remain a difficult domain for modern LLMs. In this work, we lay the foundations for improving planning capabilities of LLMs. First, we construct a comprehensive benchmark suite encompassing both classical planning benchmarks and natural language scenarios. This suite includes algorithms to methodically generate instances of tasks with varying levels of difficulty, allowing for rigorous and systematic evaluation of LLM performance. Next, we investigate the use of many-shot in-context learning to enhance LLM planning, exploring the relationship between increased context length and improved planning performance. In addition, we demonstrate the positive impact of fine-tuning LLMs on optimal planning paths. We also probe the efficacy of chain-of-thought reasoning methods to improve LLM planning performance. Moreover, we probe the performance of the proposed methods in out-of-distribution scenarios, assessing the ability to generalize to novel and unseen planning challenges. Finally, we investigate the models' failure modes and reveal insights that hold true across different benchmarks.

Authors (9)

Bernd Bohnet
Azade Nova
Aaron T Parisi
Kevin Swersky
Katayoon Goshvadi
Hanjun Dai
Dale Schuurmans
Noah Fiedel
Hanie Sedghi

Citations (3)
