PlanBench: LLM Planning Benchmark
- PlanBench is a systematic benchmark suite that assesses LLM planning using formal, PDDL-based tasks.
- It employs automated instance generation and symbolic validation to enable consistent, apples-to-apples model comparisons.
- The suite evaluates capabilities such as plan generation, cost optimization, verification, and robust replanning, offering practical insights into model performance.
PlanBench is a systematic, extensible benchmark suite designed to rigorously evaluate the planning and reasoning capabilities of LLMs and, more recently, large reasoning models (LRMs). It is grounded in the formal planning domains of the automated planning community—especially those used for the International Planning Competition (IPC)—and aims to probe genuine model-based reasoning about actions and change, distinguishing it from mere text pattern recognition or retrieval. PlanBench enables apples-to-apples comparison of model performance through comprehensive task coverage, automated instance generation, and strict validation leveraging symbolic planners and interpreters (Valmeekam et al., 2022, Valmeekam et al., 2024, Xiong et al., 2 May 2025, Malfa et al., 10 Dec 2025, Stein et al., 2023, Ramani et al., 3 Oct 2025).
1. Task Domains and Structure
PlanBench instantiates classical planning domains that have well-defined semantics, bounded complexity, and broad coverage of key planning phenomena. The canonical domains include:
- Blocksworld: A set of blocks placed on a table, manipulated with a single robotic hand. The planning challenge involves stacking, unstacking, and rearranging blocks, with preconditions such as a block being clear before it can be moved.
- Logistics: Transportation of packages across multiple cities using trucks (intra-city) and airplanes (inter-city) under object and location constraints.
- Depots: Combines transport (trucks) and storage (hoists, crates) with complex multi-step dependencies.
- Obfuscated Variants: All identifiers (predicate, action, object) are replaced with random or irrelevant tokens, removing surface cues and forcing reliance on the formal structure of planning representations.
- Mystery Blocksworld: Identical to Blocksworld in structure but with arbitrary string names to further suppress pattern-based retrieval.
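The identifier substitution behind the obfuscated variants is purely mechanical; a minimal sketch (the token scheme and helper name are illustrative, not PlanBench's actual generator):

```python
import random
import string

def obfuscate_pddl(text: str, identifiers: list[str]) -> tuple[str, dict[str, str]]:
    """Replace every domain identifier with a random token, removing surface cues
    while preserving the formal structure of the PDDL fragment."""
    mapping = {}
    for name in identifiers:
        token = "obj_" + "".join(random.choices(string.ascii_lowercase, k=6))
        mapping[name] = token
    # Replace longer names first so one identifier is not clobbered as a
    # substring of another.
    for name in sorted(mapping, key=len, reverse=True):
        text = text.replace(name, mapping[name])
    return text, mapping

domain = "(:action pick-up :parameters (?b) :precondition (clear ?b))"
obfuscated, mapping = obfuscate_pddl(domain, ["pick-up", "clear"])
```

Because only names change, a symbolic planner solves the obfuscated instance exactly as before, which is what isolates pattern-matching from structural reasoning.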
Each domain is specified with an explicit Planning Domain Definition Language (PDDL) fragment, including full action schemas, state representations, and deterministic semantics. Tasks follow the quintuple (S, A, T, s₀, G), where S is the set of symbolic states, A the action set, T : S × A → S the deterministic transition function, s₀ the initial state, and G the goal specification (Xiong et al., 2 May 2025).
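The deterministic transition semantics can be made concrete with a toy fragment of Blocksworld; a minimal sketch (the state encoding and action names are illustrative, not PlanBench's interpreter):

```python
# A state is a frozenset of ground atoms; the transition function maps a
# state and an action deterministically to a successor state (or None if
# the preconditions do not hold).
State = frozenset

def apply(state: State, action: tuple):
    """Transition function T(s, a) for two toy Blocksworld actions."""
    name, *args = action
    if name == "pickup":                      # pick block ?b up from the table
        (b,) = args
        pre = {("clear", b), ("ontable", b), ("handempty",)}
        if not pre <= state:
            return None
        return State(state - pre | {("holding", b)})
    if name == "putdown":                     # put the held block ?b on the table
        (b,) = args
        if ("holding", b) not in state:
            return None
        return State(state - {("holding", b)}
                     | {("clear", b), ("ontable", b), ("handempty",)})
    raise ValueError(f"unknown action {name}")

s0 = State({("clear", "A"), ("ontable", "A"), ("handempty",)})
s1 = apply(s0, ("pickup", "A"))
```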
2. Planning Capabilities and Task Taxonomy
PlanBench covers a rich taxonomy of reasoning and planning competencies, with automated instance generators selecting examples that isolate distinct capabilities:
- Plan Generation: Outputting a valid action sequence achieving the specified goal, starting from the initial state.
- Cost-Optimal Planning: Producing plans that minimize an explicit cost function, typically the number of steps or action-specific weights.
- Plan Verification: Determining whether a proposed plan, expressed in natural language or PDDL, achieves the goal when simulated.
- Reasoning About Plan Execution: Making predictions or answering queries about intermediate states reached after a partial plan execution.
- Robustness to Goal Reformulation: Solving variants of the goal (e.g., shuffled or partial goals).
- Plan Reuse and Prefix Reuse: Generalizing from prefixes of known plans to solve related problem instances.
- Replanning: Generating new valid continuations when the state is altered mid-execution (unexpected changes).
- Plan Generalization: Extracting and reusing inductive action patterns.
Each category is instantiated with hundreds of unique problem instances, systematically varying domain size, plan length, and perturbations (Valmeekam et al., 2022, Valmeekam et al., 2024).
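The systematic variation of instance size can be sketched with a simple random generator; a minimal illustration (not PlanBench's actual generator), producing random initial and goal stackings for Blocksworld:

```python
import random

def random_blocksworld_instance(n_blocks: int, seed: int = 0):
    """Generate a random Blocksworld instance: initial and goal configurations,
    each a list of stacks given bottom-to-top."""
    rng = random.Random(seed)

    def random_stacking(blocks):
        blocks = blocks[:]
        rng.shuffle(blocks)
        stacks = []
        for b in blocks:
            if stacks and rng.random() < 0.5:
                rng.choice(stacks).append(b)   # place on an existing stack
            else:
                stacks.append([b])             # start a new stack on the table
        return stacks

    blocks = [f"b{i}" for i in range(n_blocks)]
    return random_stacking(blocks), random_stacking(blocks)

init, goal = random_blocksworld_instance(4)
```

Sweeping `n_blocks` and the seed yields the hundreds of instances per category mentioned above, with plan length growing as the configurations diverge.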
3. Evaluation Methodology and Metrics
PlanBench employs automated, standardized evaluation using symbolic planners (e.g., Fast Downward) and plan validators (e.g., VAL):
Primary Metrics:
- Success Rate (SR): SR = N_succ / N_total, where N_succ is the number of goal-reaching plans and N_total is the total number of test instances.
- Validity Rate (VR): VR = N_valid / N_total, with N_valid the count of syntactically well-formed plans.
- Correctness Given Validity (CGV): CGV = N_succ / N_valid, defined for N_valid > 0.
- Optimality Rate (OR): OR = N_opt / N_succ, where N_opt is the count of strictly optimal plans among correct ones.
- Average Planning Depth (APD): Mean plan length among successful runs.
- Average Inference Time (AIT): Mean wall-clock time per instance.
- Dollar Cost per 100 Instances (DC100): Based on public API pricing, accounting for hidden reasoning-token consumption in LRMs (Valmeekam et al., 2024).
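The primary metrics above reduce to simple counting over per-instance results; a minimal sketch (the result-record field names are illustrative):

```python
def planbench_metrics(results):
    """Compute SR, VR, CGV, OR, and APD from per-instance result records with
    boolean fields 'valid', 'correct', 'optimal' and an integer 'plan_len'."""
    n_total = len(results)
    n_valid = sum(r["valid"] for r in results)
    n_succ = sum(r["correct"] for r in results)
    n_opt = sum(r["correct"] and r["optimal"] for r in results)
    succ_lens = [r["plan_len"] for r in results if r["correct"]]
    return {
        "SR": n_succ / n_total,
        "VR": n_valid / n_total,
        "CGV": n_succ / n_valid if n_valid else 0.0,
        "OR": n_opt / n_succ if n_succ else 0.0,
        "APD": sum(succ_lens) / len(succ_lens) if succ_lens else 0.0,
    }

m = planbench_metrics([
    {"valid": True, "correct": True, "optimal": True, "plan_len": 4},
    {"valid": True, "correct": False, "optimal": False, "plan_len": 9},
    {"valid": False, "correct": False, "optimal": False, "plan_len": 0},
])
```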
For plan verification tasks, standard classification metrics are used:
- Accuracy, Precision, Recall, F1 (computed on valid/invalid plan classification, omitting instances labeled "Unknown") (Ramani et al., 3 Oct 2025).
Plans are parsed from LLM output, translated into action sequences, and rigorously checked for both syntactic validity and goal fulfillment in a symbolic interpreter.
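The validation step amounts to simulating the parsed plan under the domain's precondition/effect schemas; a minimal sketch (a toy interpreter and a single illustrative action schema, not VAL itself):

```python
def validate_plan(initial, goal, plan, schemas):
    """Simulate a parsed plan symbolically; return True iff every action's
    preconditions hold when it fires and the final state satisfies the goal."""
    state = set(initial)
    for action, args in plan:
        pre, add, delete = schemas[action](*args)
        if not pre <= state:
            return False            # precondition violated: plan is invalid
        state = (state - delete) | add
    return goal <= state            # goal fulfillment check

# Illustrative schema: unstack block b from x onto the table.
def unstack_to_table(b, x):
    pre = {("on", b, x), ("clear", b)}
    add = {("ontable", b), ("clear", x)}
    delete = {("on", b, x)}
    return pre, add, delete

ok = validate_plan(
    initial={("on", "A", "B"), ("clear", "A"), ("ontable", "B")},
    goal={("ontable", "A"), ("ontable", "B")},
    plan=[("unstack_to_table", ("A", "B"))],
    schemas={"unstack_to_table": unstack_to_table},
)
```

Syntactic validity (does the text parse into known actions with the right arity?) is checked before this simulation; the simulation itself settles semantic correctness.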
4. Prompt Generation and Experimental Protocols
PlanBench supports fair and systematic prompting through both automated and manual pipelines:
- Automated NL Prompt Generation: Tools such as AutoPlanBench convert PDDL domains and problems into natural language prompts via LLMs, generating predicate and action templates, domain summaries, initial/goal state descriptions, and optional few-shot rationales in several styles (Basic, Chain-of-Thought, Act, ReAct) (Stein et al., 2023). This approach enables large-scale, model-agnostic evaluation with no manual leak of domain knowledge.
- Evaluation Modes:
- Zero-shot and One-shot Prompting: Explicit enumeration of action schemas, preconditions, and effects (in natural or obfuscated language), followed by the instance; occasionally, a worked example is included (one-shot).
- No Fine-Tuning: All reported results use out-of-the-box model APIs without domain-specific adaptation, except for frameworks explicitly designed for integration (SymPlanner, Agentic LLM Orchestration) (Xiong et al., 2 May 2025, Malfa et al., 10 Dec 2025).
- Model Families Tested: GPT-4, GPT-4 Turbo, GPT-4o, Claude 3.5/3, LLaMA 3 405B/70B, Gemini Pro, OpenAI o1-preview and o1-mini (LRMs), with classical planners as baselines.
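A zero- or one-shot prompt of the shape described above can be assembled mechanically from the domain description; a minimal sketch (the wording and structure are illustrative, not AutoPlanBench's actual templates):

```python
def build_prompt(actions, initial, goal, example=None):
    """Assemble a planning prompt from natural-language action descriptions,
    an initial-state description, and a goal description."""
    parts = ["You are given the following actions:"]
    for name, (pre, eff) in actions.items():
        parts.append(f"- {name}: preconditions: {pre}; effects: {eff}")
    if example is not None:            # one-shot: include a worked example
        parts.append("Worked example:\n" + example)
    parts.append(f"Initial state: {initial}")
    parts.append(f"Goal: {goal}")
    parts.append("Produce a plan as a numbered list of actions.")
    return "\n".join(parts)

prompt = build_prompt(
    actions={"pickup b": ("b is clear and on the table; the hand is empty",
                          "the hand holds b")},
    initial="block A is on the table and clear; the hand is empty",
    goal="the hand holds block A",
)
```

Running the same template over obfuscated PDDL yields the obfuscated-prompt condition with no other changes to the protocol.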
5. Key Findings and Quantitative Results
Empirical studies using PlanBench reveal the limits and strengths of contemporary LLMs and show that LRMs behave as a qualitatively distinct class.
Aggregate Success Rates on 600 Blocksworld Instances (Valmeekam et al., 2024):

| Model            | Zero-Shot SR | One-Shot SR |
|------------------|-------------:|------------:|
| LLaMA 3 405B     |       62.6 % |      47.3 % |
| GPT-4            |       34.6 % |      34.3 % |
| GPT-4o           |       35.5 % |      28.3 % |
| o1-preview (LRM) |       97.8 % |           — |
| Fast Downward    |        100 % |           — |
Obfuscated (Mystery) Blocksworld Performance:
- All LLMs: SR ≤ 4.3 %; o1-preview: 52.8 %; Fast Downward: 100 %
Plan Verification (Simplified PlanBench Dataset, GPT-5, One-shot) (Ramani et al., 3 Oct 2025):
- Accuracy: 95.89 %
- F1: 96.30 % (Unknown rate: 6.87 %)
Agentic and Symbolic Hybrid Approaches:
- SymPlanner achieves 54.2 % aggregate success on Blocksworld (GPT-4.1), outperforming all classical LLM approaches as horizon increases, but still lagging classical solvers (Xiong et al., 2 May 2025).
- Agentic LLM orchestrator pipelines reach average accuracy of 55 % on PlanBench's diverse domains, with validated plans exceeding 80 % in Logistics/Depots (Malfa et al., 10 Dec 2025).
LLMs remain brittle in long-horizon, type-rich, or obfuscated settings, often hallucinating action schemas or misordering steps. The LRM o1 achieves near-saturation on standard Blocksworld but suffers on problems requiring ≥20 steps and cannot reliably recognize unsolvable cases. Plan verification using LLM-generated formal models attains high syntactic but lower semantic fidelity (Valmeekam et al., 2024, Ramani et al., 3 Oct 2025).
6. Analysis, Limitations, and Hybrid Directions
Error Taxonomy and Failure Modes:
- Incomplete plans (omission or truncation),
- Hallucinated actions (absent in the schema),
- Catastrophic sensitivity to symbol obfuscation (surface-pattern memorization over reasoning),
- Poor unsolvability detection,
- Rapid degradation as plan depth increases.
Hybrid Architectures and Validation Loops:
- Integrating explicit symbolic environments, iterative correction mechanisms, or verifier feedback (e.g., SymPlanner, agentic LLM pipelines) yields substantial improvements in plan accuracy and robustness, particularly on longer or trickier instances (Xiong et al., 2 May 2025, Malfa et al., 10 Dec 2025).
- Maximum performance is bounded by the soundness and coverage of the PDDL translation/validation pipeline and the semantic match between LLM-generated plans and formal goal definitions.
- Black-box LRMs (such as o1) are significantly more costly per inference and offer limited controllability or introspection compared to open-architecture hybrids or classical planners.
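The validator-feedback pattern shared by these hybrid systems can be sketched generically; a minimal illustration (the `propose` and `validate` callables stand in for an LLM and a symbolic validator):

```python
def plan_with_feedback(propose, validate, max_rounds: int = 5):
    """Iterative correction loop: request a plan, check it symbolically, and
    feed the error message back until a plan validates or the budget runs out."""
    feedback = None
    for _ in range(max_rounds):
        plan = propose(feedback)
        ok, error = validate(plan)
        if ok:
            return plan
        feedback = error        # returned to the proposer on the next round
    return None

# Toy stand-ins: the "model" fixes its plan after seeing the error message.
attempts = iter([["stack A B"], ["pickup A", "stack A B"]])
plan = plan_with_feedback(
    propose=lambda fb: next(attempts),
    validate=lambda p: (p[0] == "pickup A", "first action must pick up A"),
)
```

Because the validator is sound, any plan this loop returns is guaranteed correct; the open question the benchmarks probe is how many rounds the proposer needs, and whether it converges at all on long horizons.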
7. Extensibility and Future Directions
PlanBench is designed for rapid expansion and fair comparison across novel planning domains or formalisms:
- Domain Extensibility: Adding new PDDL models, problem generators, and NL translators suffices for new domains (numeric fluents, temporal constraints, hierarchical tasks) (Valmeekam et al., 2022, Stein et al., 2023).
- Partial-credit Metrics: Plans may be scored by fraction of valid intermediate actions or covered subgoals, beyond binary success/failure.
- Verifiable Reasoning Traces: Benchmarks are beginning to demand auditable, formally checkable chain-of-thought traces, especially for safety-critical or high-assurance applications.
- Adaptive Inference and Curriculum Learning: Controlling plan-horizon budgets and learning instance selection will be increasingly relevant as benchmarks scale.
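A partial-credit score of the kind suggested above could be defined over valid plan prefixes and satisfied subgoals; a minimal sketch (the equal weighting is one plausible choice, not a PlanBench standard):

```python
def partial_credit(plan, initial, goal, step):
    """Score a plan by the fraction of its prefix that executes validly plus
    the fraction of goal atoms satisfied by the state that prefix reaches.

    `step(state, action)` returns the successor state, or None if the action
    is inapplicable.
    """
    state, executed = set(initial), 0
    for action in plan:
        nxt = step(state, action)
        if nxt is None:
            break                     # stop at the first invalid action
        state, executed = nxt, executed + 1
    prefix_score = executed / len(plan) if plan else 0.0
    goal_score = len(goal & state) / len(goal) if goal else 1.0
    return 0.5 * prefix_score + 0.5 * goal_score

# Toy domain: each action adds one atom; "bad" is never applicable.
step = lambda s, a: None if a == "bad" else s | {a}
score = partial_credit(["p", "q", "bad", "r"],
                       initial=set(), goal={"p", "q", "r"}, step=step)
```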
PlanBench continues to serve as a yardstick for progress in genuine model-based reasoning in LLMs, quantifying persistent gaps between neural sequence models and classical symbolic planners, and shaping the development and evaluation of hybrid reasoning systems (Valmeekam et al., 2022, Valmeekam et al., 2024, Xiong et al., 2 May 2025, Malfa et al., 10 Dec 2025, Stein et al., 2023, Ramani et al., 3 Oct 2025).