Plan-then-Execute LLM Agents
- Plan-then-Execute systems are architectures that explicitly decouple high-level planning from execution, using structured plans such as linear lists, hierarchies, or DAGs.
- They employ multi-stage workflows with supervised fine-tuning, preference optimization, and synthetic data generation to ensure high-quality plan formation.
- This paradigm enhances control-flow integrity, auditability, and failure recovery, making it ideal for complex, long-horizon, and high-assurance applications.
A plan-then-execute (PTE) paradigm in LLM agents refers to architectures that decouple high-level planning—producing an explicit, global or hierarchical plan—from the downstream execution of that plan, which is typically handled by a separate module or subagent. This separation is motivated by the need for interpretability, modularity, controllable reasoning, and reliability, and by the limitations of end-to-end or interleaved planning/acting protocols (e.g., ReAct) in complex, long-horizon, or high-assurance environments.
1. Formal Foundations and Core Architecture
The canonical PTE agent leverages a two-stage workflow: first, a planner module creates an explicit plan—a sequence (or DAG) of high-level steps or subgoals—based on a user goal or instruction; then, an executor module realizes each step, typically by grounding it to environment actions, tool calls, or subagent invocations.
Let $g$ denote the (possibly natural-language) user instruction. The planner emits a plan $P = (p_1, p_2, \dots, p_n)$ of high-level steps.
For execution, an agent $\pi$ is conditioned on both $g$ and $P$, producing action sequences $a_t \sim \pi(\cdot \mid g, P, h_t^{a}, h_t^{o})$, where $h_t^{a}$ and $h_t^{o}$ are the histories of actions and observations up to step $t$. The plan may be represented as a strictly linear list (Xiong et al., 4 Mar 2025), a hierarchical structure (Chen et al., 23 Apr 2025), a DAG (Zhang et al., 12 Mar 2026), a program (“blueprint”) (Qiu et al., 1 Aug 2025), or a segmentable subgoal set suitable for parallel/expert worker assignment (Amayuelas et al., 2 Apr 2025, Toda et al., 11 Jan 2026). The execution phase iterates through, or schedules, these steps while updating state and potentially relaying feedback for re-planning.
This explicit decoupling stands in contrast to interleaved (e.g., ReAct (Huang et al., 2024)) or monolithic LLM-only agent architectures, which conflate reasoning and acting on a per-timestep basis.
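The two-stage workflow above can be sketched as a minimal loop. The `planner` and `executor` callables here are hypothetical stand-ins for LLM invocations, not any specific framework's API; the environment feedback is likewise a placeholder.

```python
from dataclasses import dataclass, field

@dataclass
class PTEAgent:
    """Minimal plan-then-execute loop: plan once, then ground each step.

    `planner` and `executor` are hypothetical callables standing in for
    LLM invocations; they are illustrative assumptions, not a paper's API.
    """
    planner: callable   # instruction -> list of high-level step strings
    executor: callable  # (instruction, plan, step, history) -> action string
    history: list = field(default_factory=list)

    def run(self, instruction: str) -> list:
        plan = self.planner(instruction)          # stage 1: explicit global plan
        for step in plan:                         # stage 2: ground each step
            action = self.executor(instruction, plan, step, self.history)
            observation = f"executed:{action}"    # placeholder environment feedback
            self.history.append((action, observation))
        return self.history

# Toy planner/executor to exercise the control flow.
toy_planner = lambda g: [f"step-{i}" for i in range(1, 4)]
toy_executor = lambda g, plan, step, hist: f"act({step})"

agent = PTEAgent(planner=toy_planner, executor=toy_executor)
trace = agent.run("book a flight")
```

Note that the plan is fixed before the loop begins; a ReAct-style agent would instead decide the next action inside the loop, conditioned on the latest observation.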
2. Plan Representation: Abstractions and Modality
Plan representations in PTE architectures are flexible but always explicit. Common forms include:
- Abstract NL step lists: Each step is a natural-language description of a subgoal or action, often omitting environment-specific details to enhance generalization (Xiong et al., 4 Mar 2025, Erdogan et al., 12 Mar 2025).
- Structured JSON or code blueprints: Plans may manifest as structured JSON objects, programs (functions in Python or task DSLs), or DAGs of subtasks (Qiu et al., 1 Aug 2025, Zhang et al., 12 Mar 2026, Amayuelas et al., 2 Apr 2025).
- Symbolic plans (PDDL): Hierarchical, symbolic representations where steps are parameterized (e.g., as actions with pre- and postconditions) provide strong modularity and support for formal verification (Aghzal et al., 15 Mar 2026).
Plans may be static (fixed before any execution) or support continuous refinement (replanned after each step or failure) (Chen et al., 23 Apr 2025, Erdogan et al., 12 Mar 2025). Advanced frameworks allow plans to specify agent, tool, or subskill invocation, and may encode dependencies (e.g., DAG structure for parallel execution) (Zhang et al., 12 Mar 2026, Toda et al., 11 Jan 2026).
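A structured JSON plan with DAG dependencies might look like the following sketch. The field names (`id`, `action`, `deps`) are illustrative assumptions; the validator simply derives a dependency-respecting execution order, which is the minimum an executor needs before scheduling steps.

```python
import json

# A hypothetical JSON plan with DAG dependencies, in the spirit of the
# structured representations above (field names are illustrative only).
plan_json = """
{
  "goal": "summarize market trends",
  "steps": [
    {"id": "s1", "action": "search", "args": {"query": "EV market 2024"}, "deps": []},
    {"id": "s2", "action": "search", "args": {"query": "battery costs"}, "deps": []},
    {"id": "s3", "action": "write", "args": {"topic": "synthesis"}, "deps": ["s1", "s2"]}
  ]
}
"""

def validate_plan(plan: dict) -> list:
    """Return step ids in a dependency-respecting (topological) order."""
    steps = {s["id"]: set(s["deps"]) for s in plan["steps"]}
    order = []
    while steps:
        ready = [sid for sid, deps in steps.items() if not deps]
        if not ready:
            raise ValueError("cycle or missing dependency in plan")
        for sid in ready:
            order.append(sid)
            del steps[sid]
        for deps in steps.values():
            deps.difference_update(ready)
    return order

plan = json.loads(plan_json)
order = validate_plan(plan)   # s1 and s2 are independent; s3 must run last
```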
3. Planning, Optimization, and Data Regimes
PTE agents require not only robust planning modules, but also learning frameworks that ensure high-quality, generalizable plans.
- Supervised Finetuning: Planners are initially trained on pairs $(g, P)$ of instructions and reference plans using a cross-entropy loss (Erdogan et al., 12 Mar 2025).
- Preference Optimization: Meta plans are further refined using feedback from rollouts and environment rewards. Direct Preference Optimization (DPO) or similar preference-learning objectives optimize the planner to prefer plans yielding higher empirical rewards (Xiong et al., 4 Mar 2025).
- Synthetic Data Generation: To address data scarcity—especially of high-quality plans—researchers devise pipelines that (a) collect trajectories via demonstrators, (b) annotate with LLM or programmatic plan labels, and (c) perform augmentation and targeted expansion (Erdogan et al., 12 Mar 2025).
- Feedback Loops: Environmental or LLM-based verification signals inform meta-plan improvement and dynamic re-planning (Chen et al., 23 Apr 2025, Zhang et al., 12 Mar 2026).
Optimization objectives thus blend supervised learning, preference learning, and closed-loop fine-tuning leveraging both human-annotated and synthetic datasets.
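As an illustration of the preference-optimization step, the standard DPO objective applied to a (preferred, rejected) plan pair can be computed as below. The log-likelihoods are toy numbers, not real model outputs, and the pairing of plans by empirical reward is assumed rather than taken from any specific pipeline.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one (preferred, rejected) plan pair.

    logp_*     : policy log-likelihoods of the winning/losing plan
    ref_logp_* : same quantities under the frozen reference planner
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Toy numbers: the policy already separates winner from loser slightly
# relative to the reference; a wider separation lowers the loss.
loss = dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.0)
```

Minimizing this loss pushes the planner to assign relatively more probability to plans whose rollouts earned higher environment reward, without drifting arbitrarily far from the reference planner.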
4. Execution, Conditioned Reasoning, and Failure Recovery
Execution modules consume explicit plans and ground them into actions. Execution protocols include:
- Conditioned prompting: The plan is inserted into the reasoning context for each action, with empirical results showing performance is sensitive to placement (e.g., instruction block vs. internal thought) (Xiong et al., 4 Mar 2025).
- Structured execution: In deterministic or secure settings, plans are codified as source code or a blueprint and executed stepwise by an engine that blocks on each atomic call; LLM invocations, tool calls, and conditionals are interleaved as dictated by the static plan, never by the LLM at runtime (Qiu et al., 1 Aug 2025).
- Hierarchical skill dispatch: High-level plan steps are mapped to specialized skill modules (searching, coding, writing, etc.), with each skill having a designated executor and interface (Chen et al., 23 Apr 2025).
- Multi-agent and parallel models: Plans may specify assignment of tasks to multiple worker agents, supporting efficient, event-triggered concurrent execution (Amayuelas et al., 2 Apr 2025, Toda et al., 11 Jan 2026, Zhang et al., 12 Mar 2026).
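Hierarchical skill dispatch can be sketched as a registry that routes each plan step to a specialized executor. The skill names and interfaces here are illustrative assumptions, not the modules of any cited system.

```python
# Hypothetical skill registry: each high-level step type maps to a
# specialized executor, per the hierarchical-dispatch pattern above.
SKILLS = {
    "search": lambda args: f"results for {args['query']}",
    "code":   lambda args: f"ran snippet: {args['snippet']}",
    "write":  lambda args: f"draft about {args['topic']}",
}

def dispatch(step: dict) -> str:
    """Route one plan step to its registered skill executor."""
    skill = SKILLS.get(step["action"])
    if skill is None:
        raise KeyError(f"no executor registered for skill {step['action']!r}")
    return skill(step["args"])

out = dispatch({"action": "search", "args": {"query": "case law"}})
```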
Failure recovery is intrinsic to robust PTE. When off-nominal outcomes (errors, unexpected observations) arise:
- Agents may trigger local re-planning (spot correction of suffix steps or subgoals), global re-plan (entire plan regeneration from new state), or adaptive assignment of remedial steps (Chen et al., 23 Apr 2025, Aghzal et al., 15 Mar 2026, Wang et al., 2024).
- LLM-based or deterministic verification modules judge postconditions and propagate learned failures back to the planner (Aghzal et al., 15 Mar 2026).
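The local re-planning policy above can be sketched as a verification-gated loop. The `execute`, `verify`, and `replan_suffix` callables are hypothetical hooks, not a particular paper's interface; the point is that only the unexecuted suffix is regenerated while the completed prefix is kept.

```python
def execute_with_recovery(plan, execute, verify, replan_suffix, max_retries=1):
    """Step through a plan; when a step's postcondition fails, re-plan only
    the remaining suffix (local recovery), up to `max_retries` times.
    All callables are hypothetical hooks sketching the policy above."""
    done, i, retries = [], 0, 0
    while i < len(plan):
        result = execute(plan[i])
        if verify(plan[i], result):            # deterministic/LLM postcondition check
            done.append(result)
            i += 1
        elif retries < max_retries:
            plan = plan[:i] + replan_suffix(plan[i:], result)
            retries += 1                       # spot-correct the suffix, keep prefix
        else:
            raise RuntimeError(f"step {plan[i]!r} failed after re-planning")
    return done

# Toy hooks: the "bad" step fails verification and gets rewritten to "fixed".
execute = lambda s: f"out:{s}"
verify = lambda s, r: s != "bad"
replan_suffix = lambda suffix, result: ["fixed" if s == "bad" else s for s in suffix]

trace = execute_with_recovery(["a", "bad", "c"], execute, verify, replan_suffix)
```

A global re-plan would instead replace the whole list from the current environment state; the trade-off is recovery scope versus the cost of regenerating and re-validating the entire plan.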
5. Security, Interpretability, and Control
PTE patterns provide several notable architectural advantages:
- Control-flow integrity: Fixing the plan up-front prevents tool outputs or environmental feedback from injecting unanticipated actions. This hardens against prompt-injection and non-local vulnerabilities (Rosario et al., 10 Sep 2025).
- Least-privilege enforcement: By associating tools with plan steps, executors can be dynamically provisioned for minimal access, and sandboxed as needed (e.g., per step Docker containers) (Rosario et al., 10 Sep 2025).
- Determinism and procedural fidelity: Encoding plans as source-code or blueprints guarantees procedural adherence, with all stochasticity confined to controlled LLM submodule invocations (Qiu et al., 1 Aug 2025).
- Human-in-the-loop gating: PTE simplifies HITL verification—humans can approve the plan before execution or per critical step (Rosario et al., 10 Sep 2025).
- Auditability: Traceable plan structures and execution logs facilitate failure analysis and make the reasons for specific actions or outcomes inspectable (Toda et al., 11 Jan 2026).
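Least-privilege enforcement can be sketched as per-step tool provisioning: each step declares the tools it needs, and its executor is given access to only that subset. The allowlist policy and tool names below are illustrative assumptions.

```python
# Illustrative least-privilege policy: a step's executor is provisioned
# with only the tools that step declares, never the full registry.
AVAILABLE_TOOLS = {"web_search", "file_read", "file_write", "shell"}

def provision(step: dict) -> set:
    """Derive the minimal tool set for one plan step."""
    requested = set(step.get("tools", []))
    unknown = requested - AVAILABLE_TOOLS
    if unknown:
        raise PermissionError(f"step requests unregistered tools: {sorted(unknown)}")
    return requested  # the executor sees only this subset

def guarded_call(allowed: set, tool: str, payload: str) -> str:
    """Refuse any tool invocation outside the provisioned subset."""
    if tool not in allowed:
        raise PermissionError(f"tool {tool!r} not provisioned for this step")
    return f"{tool}({payload})"

allowed = provision({"id": "s1", "tools": ["web_search"]})
ok = guarded_call(allowed, "web_search", "query")
```

Because the plan is fixed before execution, this provisioning can be computed once per step up front, which is what makes per-step sandboxing (e.g., per-step containers) practical.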
6. Empirical Results and Quantitative Comparison
Empirical studies demonstrate significant benefits for PTE architectures:
| Agent/Framework | Success Rate / Key Gains | Benchmark | Notes |
|---|---|---|---|
| MPO (Xiong et al., 4 Mar 2025) | +18.3 pts (zero-shot), +5% SOTA margin | ScienceWorld, ALFWorld | Strong generalization, reduced wasted actions |
| Plan-and-Act (Erdogan et al., 12 Mar 2025) | Static: +6–10 pp; Dynamic: +16–20 pp | WebArena-Lite | Synthetic data, dynamic re-planning vital |
| Source Code Agent (Qiu et al., 1 Aug 2025) | +10.1 pp pass^1 over baseline | tau-bench | Up to –82% token/tool cost |
| GoalAct (Chen et al., 23 Apr 2025) | +12.22 pp on average | LegalAgentBench | Ablations: planning, searching, coding all critical |
| Planner, Multi-agent (Amayuelas et al., 2 Apr 2025) | 5.53 vs 2.28 efficiency ratio | CuisineWorld | Improved agent utilization, cost efficiency |
| CHASE (Toda et al., 11 Jan 2026) | 98.4% recall, 0.08% FPR | PyPI malware (3k pkgs) | LLM-coordinated multi-agent with verification |
| VMAO (Zhang et al., 12 Mar 2026) | 3.1→4.2 completeness (+35%) | Market research (25 queries) | Orchestration-level verification, iterative plan |
Plan-then-Execute yields (a) higher final task and subgoal success, (b) reduced execution overhead and error rate, and (c) greater generalization—particularly evident for mid-scale models and multistep or multi-agent tasks (Xiong et al., 4 Mar 2025, Erdogan et al., 12 Mar 2025, Qiu et al., 1 Aug 2025, Amayuelas et al., 2 Apr 2025, Zhang et al., 12 Mar 2026).
7. Limitations, Open Challenges, and Future Directions
Open challenges remain across several dimensions:
- Hallucination and grounding: High-level planners still hallucinate infeasible subgoals or objects; symbolic or structured planning formats (PDDL, code) mitigate, but do not eliminate, these errors (Wei et al., 16 Feb 2025, Aghzal et al., 15 Mar 2026).
- Perceptual and low-level bottlenecks: Even with perfect plans, execution struggles with UI grounding, selector identification, DOM ambiguity, or tool parameterization (Aghzal et al., 15 Mar 2026).
- Context- and domain-adaptation: Robust generalization and adaptation to new tasks require dynamic memory mechanisms, curriculum-based plan dataset expansion, and possibly hybrid neural-symbolic memory modules (Erdogan et al., 12 Mar 2025, Hu et al., 10 Feb 2026).
- Cost and latency: While PTE improves marginal cost per execution, long plans coupled with environment feedback still impose steep call and inference costs (Rosario et al., 10 Sep 2025, Hu et al., 10 Feb 2026).
- Human trust in daily assistants: Naïve user involvement in PTE agents does not reliably calibrate trust or improve outcomes; adaptive HITL involvement and transparency about model uncertainty remain areas of active investigation (He et al., 3 Feb 2025).
- Evaluation: New process-level and trajectory metrics—beyond end-to-end success—are needed to reveal misalignments in plan, execution, and recovery stages (Shahnovsky et al., 13 Mar 2026, Aghzal et al., 15 Mar 2026).
Future research directions include hierarchical decomposition protocols, principled re-planning strategies, uncertainty-driven human-in-the-loop controls, and scalable multi-agent orchestration in dynamic environments.
Key references: (Xiong et al., 4 Mar 2025, Erdogan et al., 12 Mar 2025, Qiu et al., 1 Aug 2025, Chen et al., 23 Apr 2025, Amayuelas et al., 2 Apr 2025, Toda et al., 11 Jan 2026, Huang et al., 2024, Wei et al., 16 Feb 2025, Rosario et al., 10 Sep 2025, Shahnovsky et al., 13 Mar 2026, Castrillo et al., 10 Oct 2025, Hu et al., 10 Feb 2026, Aghzal et al., 15 Mar 2026, Zhang et al., 12 Mar 2026, He et al., 3 Feb 2025, Wang et al., 2024).