
NATURAL PLAN: Benchmarking LLMs on Natural Language Planning (2406.04520v1)

Published 6 Jun 2024 in cs.CL and cs.AI

Abstract: We introduce NATURAL PLAN, a realistic planning benchmark in natural language containing 3 key tasks: Trip Planning, Meeting Planning, and Calendar Scheduling. We focus our evaluation on the planning capabilities of LLMs with full information on the task, by providing outputs from tools such as Google Flights, Google Maps, and Google Calendar as contexts to the models. This eliminates the need for a tool-use environment for evaluating LLMs on Planning. We observe that NATURAL PLAN is a challenging benchmark for state of the art models. For example, in Trip Planning, GPT-4 and Gemini 1.5 Pro could only achieve 31.1% and 34.8% solve rate respectively. We find that model performance drops drastically as the complexity of the problem increases: all models perform below 5% when there are 10 cities, highlighting a significant gap in planning in natural language for SoTA LLMs. We also conduct extensive ablation studies on NATURAL PLAN to further shed light on the (in)effectiveness of approaches such as self-correction, few-shot generalization, and in-context planning with long-contexts on improving LLM planning.

Natural Plan: Benchmarking LLMs on Natural Language Planning

The paper "Natural Plan: Benchmarking LLMs on Natural Language Planning" by Zheng et al., introduces Natural Plan, a novel benchmark devised to evaluate the planning capabilities of state-of-the-art LLMs. Focusing on tasks that demand intricate planning in natural language, the benchmark includes trip planning, meeting scheduling, and calendar scheduling. The evaluation context is furnished with data from established tools like Google Flights, Google Maps, and Google Calendar.

Benchmark Design

Natural Plan features three distinct types of planning tasks:

  1. Trip Planning: This task involves crafting a travel itinerary across multiple cities with specific constraints such as travel dates and meeting appointments. The task complexity is heightened by factors like the number of cities and the need for direct flight connectivity.
  2. Meeting Planning: This task targets scheduling meetings with multiple individuals at various locations, optimizing for constraints like availability and travel time.
  3. Calendar Scheduling: This involves arranging meetings between multiple participants given their existing schedules and work constraints.

The benchmark decouples tool use from reasoning by providing tool outputs as direct input to the models, thus isolating the planning task to be performed solely in natural language.
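
To make this setup concrete, below is a minimal sketch of how a Trip Planning instance might be rendered as a single natural-language prompt with tool outputs inlined. The dataclass fields, template wording, and example data are illustrative assumptions, not the paper's exact instance format.

```python
# Hypothetical sketch: a Trip Planning instance rendered as one prompt, with
# tool output (direct-flight connectivity) inlined so the model never calls
# a tool itself. Field names and template are assumptions, not the paper's.

from dataclasses import dataclass

@dataclass
class TripPlanningInstance:
    cities: list[str]                      # cities to visit
    days_in_city: dict[str, int]           # required stay length per city
    direct_flights: set[tuple[str, str]]   # tool output, e.g. flight search
    total_days: int

def render_prompt(inst: TripPlanningInstance) -> str:
    flights = "; ".join(f"{a} and {b}" for a, b in sorted(inst.direct_flights))
    stays = "; ".join(f"{c} for {d} days" for c, d in inst.days_in_city.items())
    return (
        f"You plan a {inst.total_days}-day trip visiting {len(inst.cities)} "
        f"cities: {', '.join(inst.cities)}. You want to stay in {stays}. "
        f"Direct flights exist between: {flights}. "
        "Find an itinerary that satisfies all constraints, using only direct flights."
    )

example = TripPlanningInstance(
    cities=["Paris", "Rome", "Vienna"],
    days_in_city={"Paris": 3, "Rome": 2, "Vienna": 2},
    direct_flights={("Paris", "Rome"), ("Rome", "Vienna")},
    total_days=7,
)
print(render_prompt(example))
```

Because the flight connectivity is given in full, a wrong itinerary reflects a failure of planning rather than of information retrieval.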

Evaluation

Evaluation of several state-of-the-art LLMs, including GPT-4, GPT-3.5, GPT-4o, and Gemini 1.5 Pro, reveals that even the best-performing models struggle with the complexity of these natural language planning tasks. For instance, GPT-4 and Gemini 1.5 Pro achieve solve rates of only 31.1% and 34.8%, respectively, on Trip Planning. Performance degrades sharply with problem size: with 10 cities, all models fall below 5%.
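
The reported numbers are solve rates, i.e., the fraction of examples whose predicted plan matches the reference plan. Below is a minimal sketch of such a metric, assuming each plan is parsed into a canonical (city, start_day, end_day) sequence; the parsing convention is an assumption, not the paper's harness.

```python
# Minimal solve-rate sketch: exact match of a parsed plan against the gold
# plan. The canonical (city, start_day, end_day) representation is an
# assumption for illustration.

def solve_rate(predictions: list[list[tuple]], golds: list[list[tuple]]) -> float:
    """Fraction of examples whose parsed plan exactly matches the gold plan."""
    assert len(predictions) == len(golds)
    correct = sum(pred == gold for pred, gold in zip(predictions, golds))
    return correct / len(golds)

# Example: one correct itinerary out of two gives a 50% solve rate.
gold = [[("Paris", 1, 3), ("Rome", 3, 4)], [("Vienna", 1, 2)]]
pred = [[("Paris", 1, 3), ("Rome", 3, 4)], [("Rome", 1, 2)]]
print(f"{solve_rate(pred, gold):.1%}")  # 50.0%
```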

Detailed Findings

  1. Constraint Complexity: There is a marked reduction in model performance as task complexity grows. In Trip Planning, performance falls drastically as the number of cities increases; the same trend holds in Meeting Planning and Calendar Scheduling as the number of participants or work days grows.
  2. Few-Shot Generalization: Models generalize better from easy exemplars to hard problems than from hard exemplars to easy problems, indicating that LLMs struggle to exploit complex in-context exemplars.
  3. Self-Correction Mechanisms: Self-correction did not improve performance and in fact reduced accuracy, suggesting that the models are overconfident in assessing and correcting their own errors (a sketch of such a loop follows this list).
  4. Long-Context Planning: In-context planning experiments exploiting long-context capabilities showed promising results, especially for Gemini 1.5 Pro, which improved accuracy up to 40% as context length increased. This underscores the potential of leveraging long sequences of textual data for planning.
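
The self-correction ablation can be pictured as a simple critique-and-revise loop. The sketch below is illustrative only, assuming a generic text-in/text-out callable `llm`; it is a placeholder, not the paper's exact prompting setup.

```python
# Illustrative self-correction loop: the model proposes a plan, is asked to
# critique it, then asked to revise. `llm` is a placeholder for any
# completion callable, not an API from the paper.

def self_correct(llm, task_prompt: str, rounds: int = 1) -> str:
    plan = llm(task_prompt)
    for _ in range(rounds):
        critique = llm(
            f"{task_prompt}\n\nProposed plan:\n{plan}\n\n"
            "List any constraint violations in this plan."
        )
        plan = llm(
            f"{task_prompt}\n\nProposed plan:\n{plan}\n\n"
            f"Identified issues:\n{critique}\n\n"
            "Produce a corrected plan that fixes these issues."
        )
    return plan
```

On Natural Plan, the paper finds that this style of loop reduces accuracy rather than improving it.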

Implications and Future Directions

The results underscore a substantial gap between current LLM capabilities and the requirements for efficient natural language planning. These findings highlight several potential areas for future research and development. Enhancing the ability of LLMs to handle complex constraints and contextual information will be critical. Additionally, improving generalization from complex examples and refining self-correction mechanisms could significantly boost performance.

Moreover, the empirical evidence from the performance of LLMs on long-context planning suggests that further developments in context management and sequence processing could yield significant improvements. Future studies might explore hybrid approaches that integrate classical planning algorithms with on-the-fly natural language understanding provided by LLMs.

In summary, the introduction of Natural Plan provides a highly realistic and challenging benchmark for assessing the planning capabilities of LLMs. The paper's findings reveal significant areas for improvement, thereby guiding future research directions in advancing the state of natural language planning with artificial intelligence. This benchmark will serve as an essential tool for researchers to evaluate and enhance the capabilities of next-generation LLMs in real-world application scenarios.

Authors (11)
  1. Huaixiu Steven Zheng
  2. Swaroop Mishra
  3. Hugh Zhang
  4. Xinyun Chen
  5. Minmin Chen
  6. Azade Nova
  7. Le Hou
  8. Heng-Tze Cheng
  9. Quoc V. Le
  10. Ed H. Chi
  11. Denny Zhou