
Exploring and Benchmarking the Planning Capabilities of Large Language Models (2406.13094v2)

Published 18 Jun 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Classical and natural language planning tasks remain a difficult domain for modern LLMs. In this work, we lay the foundations for improving the planning capabilities of LLMs. First, we construct a comprehensive benchmark suite encompassing both classical planning benchmarks and natural language scenarios. This suite includes algorithms to methodically generate task instances with varying levels of difficulty, allowing for rigorous and systematic evaluation of LLM performance. Next, we investigate the use of many-shot in-context learning to enhance LLM planning, exploring the relationship between increased context length and improved planning performance. In addition, we demonstrate the positive impact of fine-tuning LLMs on optimal planning paths. We also probe the efficacy of chain-of-thought reasoning methods for improving LLM planning performance. Moreover, we evaluate the proposed methods in out-of-distribution scenarios, assessing their ability to generalize to novel and unseen planning challenges. Finally, we investigate the models' failure modes and reveal insights that hold true across different benchmarks.

Authors (9)
  1. Bernd Bohnet (21 papers)
  2. Azade Nova (13 papers)
  3. Aaron T Parisi (2 papers)
  4. Kevin Swersky (51 papers)
  5. Katayoon Goshvadi (2 papers)
  6. Hanjun Dai (63 papers)
  7. Dale Schuurmans (112 papers)
  8. Noah Fiedel (22 papers)
  9. Hanie Sedghi (35 papers)
Citations (3)
