Natural Plan: Benchmarking LLMs on Natural Language Planning
The paper "Natural Plan: Benchmarking LLMs on Natural Language Planning" by Zheng et al. introduces Natural Plan, a novel benchmark devised to evaluate the planning capabilities of state-of-the-art LLMs. Focusing on tasks that demand intricate planning in natural language, the benchmark covers trip planning, meeting scheduling, and calendar scheduling. Task contexts are built from data drawn from established tools such as Google Flights, Google Maps, and Google Calendar.
Benchmark Design
Natural Plan features three distinct types of planning tasks:
- Trip Planning: This task involves crafting a travel itinerary across multiple cities under constraints such as travel dates and pre-arranged meetings. Complexity grows with the number of cities and the requirement that consecutive legs be connected by direct flights.
- Meeting Planning: This task involves scheduling meetings with multiple individuals at various locations, subject to constraints such as availability and travel time.
- Calendar Scheduling: This involves arranging meetings between multiple participants given their existing schedules and work constraints.
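To make the flavor of these tasks concrete, here is a minimal sketch of the constraint checking behind calendar scheduling. The data representation (minute offsets, half-open busy intervals, fixed working hours) is an illustrative assumption, not the benchmark's actual format:

```python
# Hypothetical sketch of a Calendar Scheduling check.
# Times are minutes from midnight; busy intervals are half-open [start, end).

def slot_is_free(busy, start, end):
    """True if the slot [start, end) overlaps none of the busy intervals."""
    return all(end <= b_start or start >= b_end for b_start, b_end in busy)

def find_meeting_slot(schedules, duration, work_start=9 * 60, work_end=17 * 60):
    """Earliest slot within working hours that is free for every participant."""
    for start in range(work_start, work_end - duration + 1):
        end = start + duration
        if all(slot_is_free(busy, start, end) for busy in schedules.values()):
            return start, end
    return None  # no feasible slot exists

schedules = {
    "Alice": [(9 * 60, 10 * 60), (13 * 60, 14 * 60)],  # busy 9-10 and 13-14
    "Bob":   [(9 * 60, 11 * 60)],                      # busy 9-11
}
print(find_meeting_slot(schedules, 30))  # → (660, 690), i.e. 11:00-11:30
```

The benchmark asks the model to produce such a feasible slot purely from a natural-language description of the schedules; a deterministic checker like this only illustrates what "satisfying the constraints" means.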
The benchmark decouples tool use from reasoning by providing tool outputs as direct input to the models, thus isolating the planning task to be performed solely in natural language.
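For instance, the direct-flight connectivity that Google Flights would normally supply via an API call can instead be rendered as plain text in the prompt. The rendering below is a hypothetical illustration; the benchmark's exact phrasing may differ:

```python
# Hypothetical sketch: serializing tool output (direct-flight pairs) into
# natural language, so the model plans over text rather than calling tools.

def render_flight_context(direct_flights):
    lines = ["Here are the cities that have direct flights between them:"]
    for a, b in direct_flights:
        lines.append(f"- {a} and {b}")
    return "\n".join(lines)

context = render_flight_context([("Paris", "Rome"), ("Rome", "Athens")])
print(context)
```

Because the model receives this text directly, any failure on the task reflects planning ability alone, not tool-invocation skill.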
Evaluation
An evaluation of several state-of-the-art LLMs, including GPT-4, GPT-3.5, GPT-4o, and Gemini 1.5 Pro, reveals that even the best-performing models struggle with the complexity of natural language planning tasks. For instance, GPT-4 and Gemini 1.5 Pro achieved only 31.1% and 34.8% success rates, respectively, on the trip planning task. As the number of cities increased, the performance of all models diminished, dropping below 5% with 10 cities.
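Success on these tasks is binary per instance, so the reported numbers are solve rates over the evaluation set. A simplified sketch (the paper's task-specific correctness checks are more involved than plain string comparison) might look like:

```python
# Hypothetical sketch of a solve-rate metric: each instance counts as solved
# only if the predicted plan passes the correctness check against the gold plan.

def solve_rate(predictions, golds, is_correct=lambda p, g: p.strip() == g.strip()):
    """Fraction of instances whose predicted plan is judged correct."""
    solved = sum(is_correct(p, g) for p, g in zip(predictions, golds))
    return solved / len(golds)

preds = ["Day 1: Paris", "Day 1: Rome"]
golds = ["Day 1: Paris", "Day 1: Athens"]
print(solve_rate(preds, golds))  # → 0.5
```

The `is_correct` hook stands in for whatever task-specific checker verifies that a plan satisfies all stated constraints.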
Detailed Findings
- Constraint Complexity: There's a marked reduction in model performance with increasing task complexity. For example, in trip planning, performance drastically falls as the number of cities increases. The same trend is observed in meeting planning and calendar scheduling tasks with an increasing number of participants or work days.
- Few-Shot Generalization: Models performed better on easy-to-hard generalization than on hard-to-easy generalization, indicating that LLMs have difficulty leveraging complex in-context exemplars.
- Self-Correction Mechanisms: The self-correction approach did not improve performance and, in fact, resulted in a drop in accuracy. This suggests that the models might be overconfident in their error assessment and correction abilities, leading to further errors.
- Long-Context Planning: The in-context planning experiments leveraging long-context capabilities showed promising results, especially for Gemini 1.5 Pro, whose accuracy rose to as much as 40% as context length increased. This underscores the potential of long sequences of in-context exemplars for improving planning.
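The many-shot setup behind the long-context result amounts to concatenating many solved exemplars ahead of the query. The prompt builder below is a hypothetical illustration; the exemplar formatting is an assumption, not the paper's template:

```python
# Hypothetical sketch of a many-shot prompt: solved (task, plan) exemplars
# are concatenated before the unsolved query task.

def build_many_shot_prompt(exemplars, query):
    """Join exemplar task/plan pairs, then append the query with an empty plan."""
    parts = [f"TASK: {task}\nPLAN: {plan}" for task, plan in exemplars]
    parts.append(f"TASK: {query}\nPLAN:")
    return "\n\n".join(parts)

prompt = build_many_shot_prompt(
    [("Schedule a 30-minute meeting for Alice and Bob.", "Meet 11:00-11:30.")],
    "Schedule a 1-hour meeting for Carol and Dan.",
)
```

Longer context windows simply admit more exemplars in `exemplars`, which is the variable the long-context experiments scale up.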
Implications and Future Directions
The results underscore a substantial gap between current LLM capabilities and the requirements for efficient natural language planning. These findings highlight several potential areas for future research and development. Enhancing the ability of LLMs to handle complex constraints and contextual information will be critical. Additionally, improving generalization from complex examples and refining self-correction mechanisms could significantly boost performance.
Moreover, the empirical evidence from the performance of LLMs on long-context planning suggests that further developments in context management and sequence processing could yield significant improvements. Future studies might explore hybrid approaches that integrate classical planning algorithms with on-the-fly natural language understanding provided by LLMs.
In summary, the introduction of Natural Plan provides a highly realistic and challenging benchmark for assessing the planning capabilities of LLMs. The paper's findings reveal significant areas for improvement, thereby guiding future research directions in advancing the state of natural language planning with artificial intelligence. This benchmark will serve as an essential tool for researchers to evaluate and enhance the capabilities of next-generation LLMs in real-world application scenarios.