Dynamic Planning for LLM Agents

Updated 5 September 2025
  • Dynamic planning for LLM agents is a paradigm that enables systems to decide when to generate or update plans based on context and computational costs.
  • It employs integrated decision, planning, and acting policies refined through supervised fine-tuning and reinforcement learning for optimal long-horizon performance.
  • Empirical studies show that adaptive planning frequency enhances task success and human-in-the-loop controllability while reducing computational inefficiency.

Dynamic planning for LLM agents refers to computational approaches that enable LLM-driven systems to generate, update, and adapt plans for sequential decision-making in response to evolving environmental contexts, feedback, and internal reasoning signals. Distinct from static planning paradigms, dynamic planning endows agentic systems with flexible control over when and how to allocate computational resources for planning, how to replan in the face of uncertainty or failures, and how to integrate multi-modal, external, or experiential information into the ongoing plan execution cycle. Recent research formalizes dynamic planning as an interplay between planning frequency, computational cost, and task reward, aiming to optimize both agent behavior and resource expenditure in complex, long-horizon environments (Paglieri et al., 3 Sep 2025).

1. Formalization and Key Principles

The dynamic planning paradigm is defined by LLM agents’ ability to decide when to plan, how to plan, and how much effort to allocate toward planning at each decision point. The core technical framework decomposes agent behavior into three joint policy modules:

  • The decision policy $\phi_\theta$ determines, at each timestep, whether to generate a new plan, expressed as a binary decision variable $d_t \in \{0,1\}$.
  • The planning policy $\psi_\theta$ specifies the generation of a new plan $p_t$ conditional on the current context $c_t$ (state, history), the previous plan $p_{t-1}$, and the flag $d_t$.
  • The acting policy $\pi_\theta$ produces an atomic action $a_t$, conditioned on the current plan $p_t$ and context $c_t$.

The plan update equation is given by:

$$p_t = d_t \cdot \psi_\theta(p_t \mid c_t, p_{t-1}) + (1 - d_t) \cdot p_{t-1}$$
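Read operationally, the gated update corresponds to a single decide, (optionally) plan, then act step. The sketch below is a minimal rendering of that loop; `decision_policy`, `planning_policy`, and `acting_policy` are assumed callables standing in for the LLM-backed modules, not interfaces defined in the paper.

```python
# Minimal sketch of one dynamic-planning timestep. The callables
# decision_policy, planning_policy, and acting_policy are hypothetical
# stand-ins for the LLM-backed modules phi_theta, psi_theta, pi_theta.

def agent_step(context, prev_plan, decision_policy, planning_policy, acting_policy):
    """Run one decide -> (maybe) plan -> act cycle with the gated plan update."""
    d_t = decision_policy(context, prev_plan)       # binary: replan (1) or not (0)
    if d_t == 1:
        plan = planning_policy(context, prev_plan)  # p_t ~ psi_theta(. | c_t, p_{t-1})
    else:
        plan = prev_plan                            # persist the previous plan
    action = acting_policy(context, plan)           # a_t ~ pi_theta(. | c_t, p_t)
    return action, plan, d_t
```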

The agent’s planning advantage is defined as

$$A_{plan}(c_t) = \mathbb{E}_{p_t \sim \psi_\theta(\cdot \mid c_t, d_t=1)}\left[V^{\pi_\theta}(c_t, p_t) - V^{\pi_\theta}(c_t, p_{t-1})\right]$$

where $V^{\pi_\theta}(c_t, p)$ is the expected cumulative reward obtained by following plan $p$ from context $c_t$.
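One way to make this quantity concrete is a Monte Carlo estimate that compares rollout returns under freshly sampled plans against the current plan. In the sketch below, `sample_plan` and `rollout_return` are hypothetical helpers, not an estimator prescribed by the paper.

```python
def estimate_planning_advantage(context, prev_plan, sample_plan, rollout_return,
                                num_samples=8):
    """Monte Carlo estimate of A_plan(c_t): expected value gain from replanning.

    sample_plan(context, prev_plan) -> candidate new plan (hypothetical helper)
    rollout_return(context, plan)   -> estimate of V^{pi_theta}(c_t, plan)
    """
    baseline = rollout_return(context, prev_plan)
    gains = [rollout_return(context, sample_plan(context, prev_plan)) - baseline
             for _ in range(num_samples)]
    return sum(gains) / len(gains)
```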

The explicit planning cost $C_{plan}$ accounts for additional token generation, inference latency, and behavioral instability when replanning:

$$C_{plan} = C_{tokens} + C_{latency} + C_{noise}$$

The integrated RL objective to learn dynamic planning becomes:

$$\theta^* = \arg\max_{\theta} \, \mathbb{E}_{\tau \sim \theta} \left[ \sum_{t=0}^{H} \gamma^t \left( R_{task}(s_t, a_t) - d_t \cdot C_{plan,t} \right) \right]$$

This encourages planning only when the expected gain outweighs the computational and behavioral cost (Paglieri et al., 3 Sep 2025).
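The sketch below spells out this bookkeeping: a weighted cost decomposition and a discounted return that charges the planning cost only at replanning steps. The cost weights and function names are illustrative assumptions, not values from the paper.

```python
def planning_cost(n_tokens, latency_s, noise_penalty,
                  w_tokens=0.001, w_latency=0.1, w_noise=1.0):
    """Illustrative weighted decomposition of C_plan; the weights are placeholders."""
    return w_tokens * n_tokens + w_latency * latency_s + w_noise * noise_penalty


def shaped_return(task_rewards, plan_flags, plan_costs, gamma=0.99):
    """Discounted return with the planning cost charged only at steps where d_t = 1."""
    return sum(gamma ** t * (r - d * c)
               for t, (r, d, c) in enumerate(zip(task_rewards, plan_flags, plan_costs)))
```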

2. Training Paradigms and Algorithms

Dynamic planning agents are typically trained using a two-stage pipeline:

(a) Supervised Fine-Tuning (SFT)

LLMs are first primed with diverse demonstration data containing interleaved natural language plans and actions. Demonstration traces include explicit planning intervals (e.g., “<plan>...</plan> [Action]”). This phase teaches the model to recognize when planning is semantically appropriate and how to generate plans that condition future actions (Paglieri et al., 3 Sep 2025).
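A demonstration trace might be serialized as a single token stream that interleaves plan spans and actions, roughly as follows; the delimiters and field names here are a hypothetical schematic, not the paper's exact format.

```python
def format_demo_trace(steps):
    """Serialize interleaved (plan, action) steps into one SFT training string.

    steps: list of dicts such as {"plan": "collect wood, then craft a table",
           "action": "move_north"}; "plan" is None at steps where the
           demonstrator kept the previous plan.
    """
    chunks = []
    for step in steps:
        if step.get("plan"):
            chunks.append(f"<plan>{step['plan']}</plan>")
        chunks.append(f"[Action] {step['action']}")
    return "\n".join(chunks)


demo = format_demo_trace([
    {"plan": "collect wood, then craft a table", "action": "move_north"},
    {"plan": None, "action": "chop_tree"},
])
```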

(b) Reinforcement Learning (RL)

After SFT, RL (typically PPO or similar) is used to refine the dynamic planning policy in a long-horizon, stochastic environment. The objective incorporates both the task reward and the cost of invoking the planner, explicitly teaching agents to optimize the “Goldilocks frequency”—planning only when beneficial. Empirically, RL-tuned dynamic planners demonstrate higher task success rates, more concise and effective plans, and sample efficiency gains (Paglieri et al., 3 Sep 2025).
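As a deliberately simplified illustration of this stage, the toy sketch below trains only a small logistic decision head with a REINFORCE update on the shaped reward; the actual method fine-tunes the full LLM policy with PPO, and every quantity in the sketch (features, planning benefit, cost, learning rate) is invented for illustration.

```python
import numpy as np

# Toy illustration only: a logistic decision head d_t ~ Bernoulli(sigmoid(w . c_t))
# trained with plain REINFORCE on the shaped reward R_task - d_t * C_plan.
# Contexts, the benefit of planning, the cost, and the learning rate are synthetic.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
dim, plan_cost, lr = 4, 0.3, 0.1
w = np.zeros(dim)

for _ in range(500):
    context = rng.normal(size=dim)
    gain_from_plan = 1.0 if context[0] > 0.5 else 0.0   # planning only helps sometimes
    p = sigmoid(w @ context)
    d = float(rng.random() < p)                          # sample d_t
    reward = (gain_from_plan - plan_cost) * d            # shaped reward for this step
    grad_log_prob = (d - p) * context                    # grad_w log Bernoulli(d | p)
    w += lr * reward * grad_log_prob                     # REINFORCE update
```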

3. Empirical Properties and Performance Tradeoffs

Performance and Resource Efficiency: Experimental studies in the Crafter environment—a procedurally generated, open-ended grid-world—show that “always planning” (as in ReAct (Paglieri et al., 3 Sep 2025)) is substantially less efficient and can degrade performance due to instability and computational overhead. Conversely, “never planning” limits performance by forgoing explicit future reasoning. Dynamic planners, trained to allocate planning adaptively, achieve superior sample efficiency and higher rates of complex achievement completion (e.g., obtaining rare objects) while using less total computation.

Planning Frequency: Zero-shot analysis reveals a non-monotonic relation between fixed planning interval and performance: both “overthinking” (excessive planning) and “underplanning” are suboptimal. RL-trained agents converge toward an intermediate, state-conditional replanning frequency, thereby maximizing planning advantage and return.

Controllability and Human-in-the-Loop Steering: Beyond autonomous adaptation, agents fine-tuned under the dynamic planning framework can be steered online via injection of human-written plans. In controlled experiments, agents guided by external plans accomplish complex objectives not achieved autonomously, showing the framework’s support for external controllability—critical in real-world deployments (Paglieri et al., 3 Sep 2025).

4. Relations to Prior Agentic Architectures

The dynamic planning formalism generalizes and unifies several earlier practices:

  • ReAct-style prompt planning: ReAct (Paglieri et al., 3 Sep 2025, Huang et al., 5 Feb 2024) plans before every action. Dynamic planning encompasses such behavior as the “always-plan” mode but demonstrates that selective planning yields superior outcomes.
  • Reflection/Memory-Augmented Planning: Integration with memory or reflection modules (e.g., “self-refine,” retrieval of past plans) is compatible with the dynamic planning principle and can further enhance long-horizon reasoning, but the key advancement is learning when (and when not) to invoke additional planning logic (Huang et al., 5 Feb 2024).
  • External Guidance: The ability to integrate explicit, high-level strategies from external sources aligns with the “steering” capability observed in dynamic planners.

5. Limitations, Challenges, and Open Questions

Instability Under Excessive Planning: Frequent replanning induces behavioral noise, backtracking, and computational bloat. Training must penalize instability and encourage plan persistence when appropriate.

Planning Cost Estimation: Explicitly modeling and balancing token, latency, and behavioral costs is nontrivial. Choosing cost weights is task-dependent and requires empirical calibration.

Scalability and Generalization: While demonstrated in controlled benchmarks (e.g., Crafter), the approach must be validated in more open-ended, real-world settings (e.g., dynamic web environments, robotics, or multi-agent tasks).

Connection to Hierarchical and Multi-Agent Planning: Current dynamic planning frameworks assume a single agent. Extending to hierarchical, multi-level, or multi-agent settings—where global and local planning interact—remains an open research direction.

6. Implications for Agentic System Design

Dynamic planning for LLM agents provides a theoretical and empirical solution to the challenge of allocating reasoning compute in long-horizon, partially observable environments. Its formalism enables agents to:

  • adapt planning computation to task complexity and environmental volatility,
  • reduce both resource expenditure and behavioral variance,
  • support human-in-the-loop guidance for controllability,
  • and flexibly integrate with broader agentic architectures such as reflection, memory, and external symbolic planners.

Emerging research suggests that the “Goldilocks” principle—planning neither too frequently nor too rarely—should anchor best practices for the design and training of scalable, real-world LLM agents (Paglieri et al., 3 Sep 2025).
