PlanBench: Automated Planning Evaluation
- PlanBench is an extensible benchmark designed to rigorously evaluate LLMs and LRMs across classical planning, reasoning about change, and plan verification tasks using PDDL formulations.
- It employs dual encodings with natural-language templates and canonical PDDL files, ensuring mechanical validation and reproducibility across diverse planning domains.
- Its comprehensive test taxonomy spans plan generation, cost optimization, execution prediction, and generalization, providing clear metrics for model performance and robustness.
PlanBench is an extensible benchmark designed for the rigorous evaluation of LLMs and large reasoning models (LRMs) on classical automated planning, reasoning about change, and plan verification. Developed in the tradition of the International Planning Competition (IPC), PlanBench grounds all tasks in precise Planning Domain Definition Language (PDDL) domains, enabling mechanical validation of outputs and encompassing diverse task types, representation styles, and domain distributions. It systematically combines natural-language templates, symbolic encodings, and ground-truth solutions, providing a reproducible suite tailored for zero-shot, one-shot, and few-shot evaluation protocols (Valmeekam et al., 2022).
1. Formal Structure and Problem Specification
PlanBench formulates each planning instance as a classical deterministic STRIPS-style planning problem $P = \langle S, A, T, s_0, G \rangle$, where:
- $S$: reachable state space; each state $s \in S$ is a set of ground fluents.
- $A$: finite set of actions, each $a \in A$ equipped with a precondition set $\mathrm{pre}(a)$ and effects $\mathrm{add}(a)$, $\mathrm{del}(a)$.
- $T$: deterministic transition function, $T(s,a) = (s \setminus \mathrm{del}(a)) \cup \mathrm{add}(a)$, defined whenever $\mathrm{pre}(a) \subseteq s$.
- $s_0$: initial state.
- $G$: goal condition (a conjunction of ground fluents).
A plan $\pi = \langle a_1, \ldots, a_n \rangle$ is valid if the composed transition $T(\ldots T(T(s_0, a_1), a_2) \ldots, a_n)$ is defined and the resulting state satisfies $G$ (Valmeekam et al., 20 Sep 2024).
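The transition semantics above can be sketched directly in code. The following snippet is illustrative only (the fluent and action names are hypothetical, and this is not PlanBench's internal representation): it applies a plan action by action and checks goal satisfaction.

```python
from typing import FrozenSet, Iterable, Optional, Tuple

Fluent = str
State = FrozenSet[Fluent]

class Action:
    """A ground STRIPS action: precondition set, add effects, delete effects."""
    def __init__(self, name: str, pre: Iterable[Fluent],
                 add: Iterable[Fluent], delete: Iterable[Fluent]):
        self.name = name
        self.pre = frozenset(pre)
        self.add = frozenset(add)
        self.delete = frozenset(delete)

def apply(state: State, action: Action) -> Optional[State]:
    """T(s, a): defined only when pre(a) is a subset of s; otherwise None."""
    if not action.pre <= state:
        return None
    return (state - action.delete) | action.add

def is_valid_plan(s0: State, plan: Tuple[Action, ...], goal: FrozenSet[Fluent]) -> bool:
    """A plan is valid iff every transition is defined and the final state satisfies G."""
    state = s0
    for a in plan:
        nxt = apply(state, a)
        if nxt is None:       # precondition violated: the plan is invalid
            return False
        state = nxt
    return goal <= state

# Hypothetical two-block Blocksworld fragment.
s0 = frozenset({"on-table a", "on-table b", "clear a", "clear b", "hand-empty"})
pickup_a = Action("pickup a",
                  pre={"on-table a", "clear a", "hand-empty"},
                  add={"holding a"},
                  delete={"on-table a", "clear a", "hand-empty"})
stack_a_b = Action("stack a b",
                   pre={"holding a", "clear b"},
                   add={"on a b", "clear a", "hand-empty"},
                   delete={"holding a", "clear b"})
print(is_valid_plan(s0, (pickup_a, stack_a_b), frozenset({"on a b"})))  # -> True
```

In the actual evaluation pipeline, this kind of check (with full PDDL semantics) is delegated to external validators such as VAL.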
2. Domains, Instance Types, and Representations
The dataset encompasses five principal domains:
- Blocksworld (600 instances),
- Mystery Blocksworld (600),
- Depots (500),
- Logistics (285),
- OD_Logistics (285).
Each instance is dual-encoded:
- Natural-language template: Controlled English listing actions, preconditions, effects, initial facts, and the goal. Designed to minimize ambiguity while testing linguistic generalization and reasoning.
- PDDL encoding: Canonical PDDL files for domains and problems; all actions are primitive operators without hierarchical constructs.
Additionally, "Obfuscated Mystery" tasks are included, in which predicate and object names are replaced with arbitrary tokens to evaluate robustness when surface-level symbol grounding is removed (Valmeekam et al., 2022, Puerta-Merino et al., 22 Nov 2025, Valmeekam et al., 20 Sep 2024). Specific Blocksworld subsets introduce:
- Extended Blocksworld (110 hard instances, 6–20 blocks, plans up to 40 steps),
- Unsolvable Blocksworld (100 instances with contradictory goals).
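To make the dual encoding concrete, the sketch below renders the same hypothetical Blocksworld fragment as a controlled-English prompt and as PDDL problem text; the wording, templates, and names are invented for illustration and do not reproduce PlanBench's actual prompt templates or file layout.

```python
# Illustrative rendering of one instance fragment in both encodings.
# The verbalization templates and PDDL names below are hypothetical, not PlanBench's.

initial_facts = ["(on-table a)", "(on-table b)", "(clear a)", "(clear b)", "(handempty)"]
goal_facts = ["(on a b)"]

def to_english(fact: str) -> str:
    """Tiny, hand-written verbalization for a few Blocksworld predicates."""
    pred, *args = fact.strip("()").split()
    templates = {
        "on-table": "block {0} is on the table",
        "clear": "block {0} is clear",
        "handempty": "the hand is empty",
        "on": "block {0} is on top of block {1}",
    }
    return templates[pred].format(*args)

nl_prompt = (
    "As initial conditions, "
    + ", ".join(to_english(f) for f in initial_facts)
    + ". Your goal is to reach a state where "
    + " and ".join(to_english(f) for f in goal_facts) + "."
)

pddl_problem = "\n".join([
    "(define (problem bw-demo) (:domain blocksworld)",
    "  (:objects a b)",
    "  (:init " + " ".join(initial_facts) + ")",
    "  (:goal (and " + " ".join(goal_facts) + ")))",
])

print(nl_prompt)
print(pddl_problem)
```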
3. Task Taxonomy and Curriculum
PlanBench's curriculum spans eight test types (each with automated prompt, completion, parse, and validation logic):
- Plan Generation (zero/one/few-shot, cost-unconstrained),
- Cost-Optimal Plan Generation,
- Plan Verification (validity of a proposed plan),
- Plan Execution Prediction (state after plan execution),
- Replanning (after exogenous change mid-plan),
- Plan Reuse (adapting/pruning plans for new goals),
- Plan Generalization (refactoring based on operator or object changes),
- Robustness to goal reformulation or domain obfuscation.
The auxiliary test sets include additional Blocksworld instances targeted at plan generalization, explicitly crafted to prevent memorization or surface-level pattern matching by LLMs (Valmeekam et al., 2022).
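One way to realize the per-test-type automation (prompt construction, response parsing, validation) is a small registry keyed by test type. The following skeleton is a hypothetical sketch, not PlanBench's actual harness; the instance field names and plan markers are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class TestType:
    """One test type: how to prompt a model, parse its completion, and score the answer."""
    build_prompt: Callable[[dict], str]          # instance -> prompt text
    parse_response: Callable[[str], List[str]]   # raw completion -> structured answer
    validate: Callable[[dict, List[str]], bool]  # (instance, answer) -> pass/fail

def plan_generation_prompt(instance: dict) -> str:
    # 'domain_description' / 'problem_description' are illustrative field names.
    return f"{instance['domain_description']}\n{instance['problem_description']}\nProvide a plan."

def parse_plan(completion: str) -> List[str]:
    # Naive parser: treat each non-empty, non-bracketed line as one ground action.
    return [ln.strip() for ln in completion.splitlines()
            if ln.strip() and not ln.strip().startswith("[")]

def check_with_validator(instance: dict, plan: List[str]) -> bool:
    """Placeholder: a real harness would hand the plan to an external validator such as VAL."""
    return bool(plan)

REGISTRY: Dict[str, TestType] = {
    "plan_generation": TestType(plan_generation_prompt, parse_plan, check_with_validator),
    # "cost_optimal_generation", "plan_verification", "execution_prediction",
    # "replanning", "plan_reuse", "generalization", "obfuscation" would follow the same shape.
}
```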
4. Evaluation Metrics and Protocols
PlanBench mandates strict mechanical validation of all plans and responses. Key metrics include:
- Success Rate (SR) for plan generation: $\mathrm{SR} = \frac{\#\,\text{valid plans}}{\#\,\text{instances}}$.
- Plan Cost: $C(\pi) = \sum_{i=1}^{n} c(a_i)$, i.e., plan length under unit action costs.
- Optimality Ratio (OR): $\mathrm{OR} = C(\pi)\,/\,C(\pi^{*})$,
where $\pi^{*}$ is the planner-computed optimal plan.
For unsolvable tasks, the True Negative Rate (TNR) and False Negative Rate (FNR) are reported. For plan verification and reasoning tasks, the metric is the proportion (%) of answers exactly matching the validator's ground truth.
Cost and efficiency metrics (average runtime, average monetary cost per 100 instances) are tracked in recent studies (Valmeekam et al., 20 Sep 2024).
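A minimal sketch of how these aggregate metrics could be computed from per-instance results is shown below; the result fields and the exact TNR/FNR conventions are assumptions for illustration, not PlanBench's reporting code.

```python
from typing import List, Optional, TypedDict

class Result(TypedDict):
    solvable: bool                   # ground truth: does any plan exist?
    model_says_unsolvable: bool      # did the model declare the instance unsolvable?
    plan_valid: bool                 # mechanically validated (e.g., via VAL)
    plan_cost: Optional[int]         # cost/length of the model's plan, if valid
    optimal_cost: Optional[int]      # planner-computed optimal cost

def aggregate(results: List[Result]) -> dict:
    solvable = [r for r in results if r["solvable"]]
    unsolvable = [r for r in results if not r["solvable"]]

    # Success Rate over solvable instances.
    sr = sum(r["plan_valid"] for r in solvable) / max(len(solvable), 1)

    # Mean Optimality Ratio over valid plans: C(pi) / C(pi*), so 1.0 means optimal.
    ratios = [r["plan_cost"] / r["optimal_cost"]
              for r in solvable if r["plan_valid"] and r["optimal_cost"]]
    mean_or = sum(ratios) / len(ratios) if ratios else None

    # One common convention ("positive" = solvable): TNR over unsolvable instances,
    # FNR = solvable instances wrongly declared unsolvable.
    tnr = sum(r["model_says_unsolvable"] for r in unsolvable) / max(len(unsolvable), 1)
    fnr = sum(r["model_says_unsolvable"] for r in solvable) / max(len(solvable), 1)

    return {"success_rate": sr, "mean_optimality_ratio": mean_or, "TNR": tnr, "FNR": fnr}
```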
Experiments typically use zero-temperature sampling and direct translation of LLM outputs to formal plans, validated via VAL and Fast-Downward (or, for formal verification, NuSMV with LTL model checking) (Ramani et al., 3 Oct 2025).
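For the mechanical validation step, a harness typically shells out to VAL. The sketch below assumes a `Validate` binary on the PATH that takes domain, problem, and plan files as positional arguments (the usual VAL invocation); the success-string check may need adjusting to the specific VAL build.

```python
import subprocess

def validate_with_val(domain_pddl: str, problem_pddl: str, plan_file: str) -> bool:
    """Run the external VAL plan validator on a domain/problem/plan triple.

    Assumes VAL's 'Validate' executable is installed and on the PATH.
    """
    proc = subprocess.run(
        ["Validate", domain_pddl, problem_pddl, plan_file],
        capture_output=True, text=True,
    )
    # Many VAL builds print a "Plan valid" line on success; adjust if your build differs.
    return proc.returncode == 0 and "Plan valid" in proc.stdout
```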
5. Annotation Formats, Tooling, and Extensibility
Annotations consist of natural-language prompts, PDDL domain and problem files, and, optionally, reference (ground-truth) plans. The annotation schema is standardized for each domain (a sketch of such a record follows the list):
- Initial state: set of fluents (e.g. “red block on table”).
- Goal: conjunctive set of fluents.
- Actions: complete lifted schema (name, parameters, precondition, effect).
- Instance metadata: domain, identifier, reference solution.
- For plan verification: binary valid/invalid label (possibly with error/failure type annotations).
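A hypothetical record capturing these fields might look as follows; the names are illustrative, not PlanBench's actual schema keys.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PlanBenchInstance:
    """Illustrative annotation record; names are hypothetical, not PlanBench's actual keys."""
    domain: str                                  # e.g. "blocksworld"
    instance_id: str
    nl_prompt: str                               # controlled-English description of the task
    domain_pddl: str                             # PDDL domain file (path or text)
    problem_pddl: str                            # PDDL problem file (path or text)
    init_fluents: List[str] = field(default_factory=list)
    goal_fluents: List[str] = field(default_factory=list)
    reference_plan: Optional[List[str]] = None   # ground-truth plan, if provided
    candidate_plan: Optional[List[str]] = None   # for plan-verification items
    is_valid: Optional[bool] = None              # binary valid/invalid label (error type optional)
```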
Tooling is fully open-source; scripts auto-generate prompts, execute model queries, parse outputs, and validate plans. Researchers are encouraged to extend PlanBench via a modular directory structure, contributing new domains, templates, and validation scripts (Valmeekam et al., 2022, Valmeekam et al., 20 Sep 2024).
6. Applications and Recent Benchmarking Findings
Recent studies have focused on several application modalities:
- Structured Planning with Symbolic Models: SymPlanner exploits PlanBench to ground LLM policies in symbolic state transition systems, leveraging iterative correction and contrastive ranking mechanisms to boost plan validity and diversity (Xiong et al., 2 May 2025).
- HTN Modeling: L2HP applies LLMs to hierarchical planning over PlanBench, reporting parsing success of around 36% in classical PDDL mode and notably low validity (~1%) in hierarchical HDDL mode, highlighting the benchmark's inadequacy for structured hierarchical-planning tasks and the need for new hierarchically annotated datasets (Puerta-Merino et al., 22 Nov 2025).
- Plan Verification via Formal Methods: Natural-language plans are automatically translated into Kripke structures and LTL formulas; GPT-5 achieves F1 = 96.3% on the plan verification subset, producing near-perfect syntactic translations but retaining semantic errors. Model checking is central to bridging LLM output and formal guarantees (Ramani et al., 3 Oct 2025).
- Evaluating Large Reasoning Models: OpenAI’s o1 (Strawberry) LRM demonstrates a quantum improvement on the PlanBench Blocksworld subsets, yet does not saturate benchmark accuracy; cost and runtime metrics are increasingly reported alongside accuracy (Valmeekam et al., 20 Sep 2024).
7. Challenges, Limitations, and Extensions
PlanBench enforces strict mechanical validation, but presents several open challenges:
- Ambiguity in Natural Language: Some instances risk underspecification without formal ground-truth anchoring; parallel PDDL encoding mitigates this, but not perfectly (Xiong et al., 2 May 2025).
- Scalability and Difficulty: Increasing plan length (beyond 12–16 steps) or object cardinality leads to cascading error rates for LLMs and LRMs.
- Hierarchical Planning: PlanBench lacks HTN-specific ground-truth, so extension to hierarchical domains requires new annotation paradigms (Puerta-Merino et al., 22 Nov 2025).
- Efficient Evaluation: Original PlanBench did not measure cost-efficiency; recent work has introduced cost-per-instance and runtime measurements for model selection (Valmeekam et al., 20 Sep 2024).
- Extensibility Protocols: Researchers should supply both zero/one-shot templates and PDDL encodings per new domain, as well as hard instances for scaling analysis.
PlanBench remains the canonical benchmark for assessing the planning and reasoning capabilities of language- and reasoning-based AI models—providing a reproducible, extensible foundation for current and future research in structured plan generation, verification, and formal evaluation (Valmeekam et al., 2022, Valmeekam et al., 20 Sep 2024, Puerta-Merino et al., 22 Nov 2025, Xiong et al., 2 May 2025, Ramani et al., 3 Oct 2025).