Google NATURAL PLAN Benchmark
- Google NATURAL PLAN Benchmark is a natural language evaluation framework that tests LLMs’ planning abilities on tasks like trip, meeting, and calendar planning using real-world data.
- The benchmark uses detailed constraints from tools like Google Flights, Maps, and Calendar to simulate realistic scenarios, emphasizing multi-hop reasoning and constraint satisfaction.
- Empirical results reveal substantial performance gaps, with all models performing well below human level, highlighting key limitations in global connectivity and multi-entity scheduling tasks.
Google NATURAL PLAN Benchmark is a fully natural language benchmark specifically designed to rigorously evaluate the planning capabilities of LLMs in scenarios that reflect practical, human-centered planning tasks. Unlike traditional planning benchmarks that rely on synthetic agent environments or formal languages (e.g., PDDL-based IPC tasks), NATURAL PLAN focuses exclusively on end-to-end natural language: both tasks and constraints are described in free-form English, and all relevant contextual data is embedded as static outputs from real-world tools such as Google Flights, Google Maps, or Google Calendar. By eliminating tool-use mechanics and API calls at inference time, NATURAL PLAN isolates the core reasoning and constraint satisfaction challenges inherent in planning, providing a diagnostic probe for current and future generation LLMs (Zheng et al., 6 Jun 2024).
1. Motivations and Conceptual Foundations
NATURAL PLAN is motivated by three primary considerations: realism, diagnostic value, and gap analysis. The benchmark is rooted in practical user scenarios—trip itineraries, meeting chains, and calendar availability—where constraints are presented as static outputs from widely used online tools. This setting aims to approximate real-world planning situations where language alone must mediate multi-step decision-making. By systematically varying factors such as the number of cities, attendees, or scheduling days, NATURAL PLAN enables fine-grained analysis of how the combinatorial structure of constraint complexity modulates LLM planning proficiency. Notably, leading models such as GPT-4 and Gemini 1.5 Pro perform substantially below human level even when supplied with complete, accurate information, revealing a major shortfall in language-based planning and highlighting specific limitations in generalization and multi-constraint reasoning (Zheng et al., 6 Jun 2024).
2. Task Suite and Structure
NATURAL PLAN is composed of three distinct planning task types, each parameterizable in complexity and united by a common evaluation protocol:
- Trip Planning: Given cities, a global duration (days), city-wise stays, date-anchored commitments, and direct flight tables (from Google Flights), the model must construct a feasible, unambiguous day-by-day itinerary using only permitted flights and honoring all requirements. This task primarily probes multi-hop connectivity and temporal constraint satisfaction.
- Meeting Planning: Starting from a specified location and time, the model must sequence 1:1 meetings with friends (each with location/time-window/minimum duration requirements) while incorporating all driving-time constraints (from Google Maps), choosing an ordering that maximizes the number of feasible meetings. This evaluates the model's capacity for temporal scheduling and path optimization under resource and travel constraints.
- Calendar Scheduling: Given a subset of Google Calendar availabilities and a fixed slot length, the model must find a contiguous interval fitting all invited attendees, either by fixing the date and varying attendees or by fixing two attendees and varying the days on offer. This task emphasizes shared constraint intersection and combinatorial search within discrete temporal grids.
In all cases, model prompts contain (a) a natural language description of the planning task and (b) the relevant, static tool output encoding all constraints and options. Inference involves no live API calls or tool executors.
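To make the constraint-intersection structure of the Calendar Scheduling task concrete, the following is a minimal sketch, not the benchmark's own evaluator: given each attendee's busy intervals on a discrete workday grid and a required slot length, find the earliest contiguous interval that is free for everyone. The function name, data structures, workday bounds, and 30-minute grid are illustrative assumptions.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # (start_minute, end_minute), e.g. 9:00 -> 540

def find_common_slot(
    busy: List[List[Interval]],   # one busy-interval list per attendee
    slot_minutes: int,            # required meeting length
    day_start: int = 9 * 60,      # assumed 9:00-17:00 workday grid
    day_end: int = 17 * 60,
) -> Optional[Interval]:
    """Return the earliest interval of `slot_minutes` free for all attendees."""
    def is_free(person_busy: List[Interval], start: int, end: int) -> bool:
        # Free iff the candidate slot overlaps none of the person's busy intervals.
        return all(end <= b_start or start >= b_end for b_start, b_end in person_busy)

    # Scan candidate start times on a 30-minute grid (an assumption; the
    # benchmark's actual granularity may differ).
    for start in range(day_start, day_end - slot_minutes + 1, 30):
        end = start + slot_minutes
        if all(is_free(person, start, end) for person in busy):
            return (start, end)
    return None  # no feasible slot under the given constraints

# Toy usage: three attendees, 60-minute meeting.
busy = [
    [(9 * 60, 10 * 60), (13 * 60, 14 * 60)],     # attendee A
    [(10 * 60, 11 * 60)],                         # attendee B
    [(9 * 60, 9 * 60 + 30), (15 * 60, 16 * 60)],  # attendee C
]
print(find_common_slot(busy, 60))  # -> (660, 720), i.e. 11:00-12:00
```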
3. Evaluation Metrics and Complexity Probes
The principal metric is the exact-match solve rate, defined as

$$\text{SolveRate} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[\hat{p}_i = p_i\right],$$

where $p_i$ is the reference plan for instance $i$, $\hat{p}_i$ is the model's plan parsed from its output, and $N$ is the number of task instances. A solution is counted as correct only if every extracted constraint-relevant element (dates, times, city names, orderings) matches the reference exactly, using a rigid solution template and regular-expression-based parsing. Complexity curves are generated by sweeping a single variable (number of cities, friends, attendees, or days) and observing solve-rate decay. Empirically:
- Trip Planning: Solve rate is profiled as the number of cities grows to 10.
- Meeting Planning: Number of friends varies from 1 to 10.
- Calendar Scheduling: Tracks both attendees (2–7) and days (1–5).
This structure enables plot-based analysis of where and how rapidly model performance deteriorates as combinatorial difficulty rises.
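A minimal sketch of how such an exact-match metric and complexity curve could be computed, assuming predictions and references have already been parsed into canonical strings (the benchmark's own regex templates and field extraction are not reproduced here; all function names are illustrative):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def solve_rate(preds: List[str], refs: List[str]) -> float:
    """Exact-match solve rate: fraction of predictions identical to the reference."""
    assert len(preds) == len(refs)
    correct = sum(p == r for p, r in zip(preds, refs))
    return correct / len(refs)

def complexity_curve(
    examples: List[Tuple[int, str, str]]  # (complexity_level, prediction, reference)
) -> Dict[int, float]:
    """Solve rate bucketed by a single complexity variable (e.g. number of cities)."""
    buckets: Dict[int, List[Tuple[str, str]]] = defaultdict(list)
    for level, pred, ref in examples:
        buckets[level].append((pred, ref))
    return {
        level: solve_rate([p for p, _ in pairs], [r for _, r in pairs])
        for level, pairs in sorted(buckets.items())
    }

# Illustrative usage with toy data (not benchmark results):
examples = [(3, "planA", "planA"), (3, "planB", "planC"), (10, "planX", "planY")]
print(complexity_curve(examples))  # -> {3: 0.5, 10: 0.0}
```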
4. Experimental Protocols and Ablation Analyses
Model evaluation in NATURAL PLAN adheres to a standardized five-shot in-context learning regime, using hand-crafted exemplars closely matched to the task template. All relevant contextual information (flights, maps, calendars) is directly embedded in the prompt, enforcing a full-information setting. Solution extraction requires strict adherence to a format that facilitates reliable parsing.
The benchmark incorporates extensive ablations to dissect the effectiveness of different LLM prompting and adaptation methods:
- Few-shot generalization: Contrasts easy-to-hard (simpler to more complex exemplars in context) vs. hard-to-easy (reverse order) settings. For most models, easy-to-hard ordering yields better generalization across increasing task complexity, though certain mid-range complexities yield slight reversals in trend for GPT-4 and Gemini 1.5 Flash.
- Self-correction: Introduces an auxiliary prompt in which models are explicitly told to critique and repair their initial response. Across the board, this self-correction regime reduces solve rate by 5–10 percentage points, with more capable models (GPT-4, Gemini 1.5 Pro) exhibiting the largest degradation, consistent with overconfidence in self-critique.
- In-context planning with long contexts: Gemini 1.5 Pro demonstrates continued improvement with up to 800 context shots (≈355K tokens), reaching 39.9% on Trip Planning and 50% on Calendar Scheduling. In contrast, GPT-4 and Gemini 1.5 Flash decline in performance beyond approximately 20 context shots. This suggests that larger context windows can, for some architectures, substantially bootstrap planning performance.
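As an illustration of this protocol, the sketch below assembles a few-shot prompt with exemplars ordered easy-to-hard and appends an optional self-correction turn. The exemplar text, the complexity key used for ordering, and the prompt wording are assumptions for illustration, not the benchmark's exact templates.

```python
from typing import List, Tuple

def build_prompt(
    exemplars: List[Tuple[int, str]],  # (complexity, worked example text)
    query: str,
    n_shots: int = 5,
    easy_to_hard: bool = True,
) -> str:
    """Few-shot prompt with exemplars sorted by complexity (ascending = easy-to-hard)."""
    ordered = sorted(exemplars, key=lambda x: x[0], reverse=not easy_to_hard)
    shots = "\n\n".join(text for _, text in ordered[:n_shots])
    return f"{shots}\n\n{query}\nSOLUTION:"

def self_correction_prompt(original_prompt: str, first_answer: str) -> str:
    """Follow-up turn asking the model to critique and repair its initial plan."""
    return (
        f"{original_prompt}\n{first_answer}\n\n"
        "Review the plan above. If any constraint is violated, "
        "explain the error and output a corrected plan in the same format."
    )
```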
5. Empirical Results and Key Findings
Table 1 summarizes solve rates for the principal models across the three core tasks in the standard five-shot setting:
| Task | GPT-3.5 | GPT-4 | GPT-4o | Gemini 1.5 Flash | Gemini 1.5 Pro |
|---|---|---|---|---|---|
| Trip Planning | 7.3 | 31.1 | 3.7 | 25.6 | 34.8 |
| Meeting Planning | 19.1 | 47.0 | 45.2 | 23.9 | 39.1 |
| Calendar Scheduling | 19.9 | 41.2 | 43.7 | 34.3 | 48.9 |
Key observations:
- No model exceeds 50% solve rate on any task, even with perfect constraint visibility.
- Trip Planning is especially challenging, with the top model (Gemini 1.5 Pro) reaching only 34.8%; GPT-4o catastrophically fails at 3.7%.
- Performance drops sharply with increasing complexity: for 10 cities in Trip Planning, all models fall below 5%; for more than eight friends in Meeting Planning, solve rates drop under 10%; Calendar Scheduling demonstrates a more gradual but consistent decline, falling below 20% for seven attendees.
Additional findings from ablations:
- Easy-to-hard few-shot ordering generally outperforms hard-to-easy, but mid-range exceptions exist.
- Self-correction consistently worsens performance, reinforcing limits of naïve prompt-based self-reflection.
- Long-context in-context learning substantially benefits Gemini 1.5 Pro, though only up to a point for other models.
6. Interpretation and Implications
NATURAL PLAN reveals a pronounced gap between the language modeling and planning abilities of current LLMs, even under idealized circumstances reflecting practical, real-world planning use-cases. The benchmark establishes that multi-step, constraint-satisfaction planning in natural language remains an unsolved frontier for LLM research. Particular model weaknesses are evident in handling global connectivity (trip flights) and multi-entity scheduling (large meeting sets). Self-correction—often assumed to be beneficial—can exacerbate errors due to overconfident misdiagnosis, undermining its effectiveness as a general remedy. Long-context learning emerges as a partial solution, but only for architectures and regimes capable of maintaining coherence over hundreds of context examples.
The benchmark suggests several directions for advancing LLM planning abilities:
- Integrating structured constraint-solving modules or hybrid neuro-symbolic approaches.
- Developing curriculum strategies or pretraining specifically designed to emphasize constraint satisfaction mechanisms.
- Utilizing fine-tuned adapters that translate natural language plans into internal symbolic (e.g., PDDL-like) representations.
- Refining self-critique to operate on specific constraint subdomains rather than global re-generation.
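As a toy example of the first direction, a structured checker could validate a proposed itinerary against the stated constraints before the plan is returned, letting the LLM propose and a symbolic module verify. The sketch below checks only direct-flight connectivity, stay lengths, and total duration, under assumed data structures; the benchmark's own day-counting convention (e.g., whether a flight day counts toward both cities) would need to be encoded for a faithful verifier.

```python
from typing import Dict, List, Set, Tuple

def verify_itinerary(
    itinerary: List[Tuple[str, int]],      # ordered (city, days_spent) pairs
    direct_flights: Set[Tuple[str, str]],  # allowed (origin, destination) legs
    required_days: Dict[str, int],         # required stay length per city
    total_days: int,                       # global trip duration
) -> List[str]:
    """Return a list of violated constraints (empty list means the plan passes)."""
    violations: List[str] = []
    # Each consecutive city pair must be connected by a permitted direct flight.
    for (a, _), (b, _) in zip(itinerary, itinerary[1:]):
        if (a, b) not in direct_flights:
            violations.append(f"no direct flight {a} -> {b}")
    # Stay lengths must match the stated requirements.
    for city, days in itinerary:
        if required_days.get(city) != days:
            violations.append(
                f"{city}: planned {days} days, required {required_days.get(city)}"
            )
    # Simplified duration check: assumes each day is counted once in the total.
    if sum(days for _, days in itinerary) != total_days:
        violations.append("total duration does not match the requested trip length")
    return violations
```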
7. Role in LLM Evaluation and Research Trajectory
By eliminating tool-use mechanics and focusing on the core reasoning bottlenecks of multi-step planning, NATURAL PLAN offers a standardized, challenging testbed capable of catalyzing progress in LLM reasoning and planning research. Its structured decomposition of tasks and constraints, paired with comprehensive diagnostic ablations, enables detailed comparison across architectures, prompting regimes, and adaptation strategies. The results underscore the need for new methods that explicitly target multi-hop reasoning, global constraint satisfaction, and robust generalization under increasing combinatorial complexity (Zheng et al., 6 Jun 2024).