Multi-turn Agentic Planning
- Multi-turn agentic planning is a framework where agents use sequential decision making in partially observed, tool-rich environments.
- It leverages formal models like POMDPs and multi-turn blueprints to structure action selection, simulation, and iterative verification.
- Recent approaches use reinforcement learning, memory modules, and modular pipelines to optimize long-horizon planning and improve benchmark results.
Multi-turn agentic planning refers to the set of principles, architectures, and learning methodologies by which artificial agents (typically LLMs or their derivatives) manage complex, goal-directed tasks by interacting over multiple turns within partially observed, tool-rich environments. This paradigm is central to contemporary research in tool-augmented dialogue, interactive task completion, and the development of robust, feedback-driven agent behaviors across diverse domains.
1. Formal Modeling and Task Blueprints
Multi-turn agentic planning is fundamentally modeled as a partially observable Markov decision process (POMDP) or Markov decision process (MDP), in which, at each interaction turn, an agent observes a partial state (typically, the dialogue history and output of external tools), selects an action (often a tool call or a natural language utterance), and receives (possibly sparse) feedback. Formally, the agent’s planning problem is posed as the tuple
$$(\mathcal{U}, \mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{R})$$
where:
- $\mathcal{U}$: user intents;
- $\mathcal{S}$: environment state;
- $\mathcal{A}$: agent actions (including tool calls and responses);
- $\mathcal{O}$: observations;
- $\mathcal{T}$: transition dynamics;
- $\mathcal{R}$: reward function.
Planning proceeds over a finite (or occasionally open-ended) horizon of turns, with the agent constructing policies of the form $\pi(a_t \mid h_t)$ given the dialogue history $h_t$. Advanced data generation methods such as APIGen-MT introduce the concept of a “multi-turn task blueprint”, capturing high-level user intent, a sequence of verifiable ground-truth tool calls, and expected task outputs (Prabhakar et al., 4 Apr 2025).
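The interaction loop described above can be sketched in a few lines of Python. This is a minimal illustration, not code from any cited system: `ToyEnv`, `policy`, and `rollout` are hypothetical names, and the counter task is an assumption chosen only to show a hidden state, partial observations, a history-conditioned policy, and a sparse terminal reward.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (action, observation) pairs, i.e. h_t


@dataclass
class ToyEnv:
    """Hypothetical environment: a hidden counter must reach `target`."""
    target: int = 3
    state: int = 0  # hidden environment state s_t, never shown to the agent

    def step(self, action: str) -> Tuple[str, float, bool]:
        # Transition dynamics: only the "increment" tool changes the state.
        if action == "increment":
            self.state += 1
        done = self.state >= self.target
        # Observation: a partial view (tool status only, not the counter).
        observation = "ok" if action == "increment" else "noop"
        # Reward: sparse, paid only on the turn that completes the task.
        reward = 1.0 if done else 0.0
        return observation, reward, done


def policy(intent: str, history: History) -> str:
    """Stand-in for pi(a_t | h_t): sees only the intent and the history."""
    return "increment" if intent == "count_up" else "respond"


def rollout(env: ToyEnv, intent: str,
            pi: Callable[[str, History], str],
            max_turns: int = 10) -> Tuple[History, float]:
    """Run one episode over a finite horizon and return (history, return)."""
    history: History = []
    total = 0.0
    for _ in range(max_turns):
        action = pi(intent, history)
        observation, reward, done = env.step(action)
        history.append((action, observation))
        total += reward
        if done:
            break
    return history, total


history, total = rollout(ToyEnv(), "count_up", policy)
print(len(history), total)  # 3 turns, return 1.0
```

In a real agent the policy would be an LLM conditioned on the serialized history, and the environment would wrap actual tools; the control flow, however, keeps this shape.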
2. Agentic Planning Frameworks and Pipelines
A variety of agentic planning frameworks have been proposed to address data quality, environment feedback, and verification in multi-turn regimes:
- Blueprint Generation and Human-Agent Simulation: APIGen-MT employs a two-phase approach: (1) blueprint generation via a committee-driven, feedback-looped LLM pipeline, and (2) simulation of realistic agent–user interplay, with human prompts sampled and critiqued turn-wise, leading to verified, high-diversity multi-turn trajectories. This process includes a rigorous review-committee mechanism that assigns a committee alignment score to each blueprint and drives iterative blueprint refinement (Prabhakar et al., 4 Apr 2025).
- Propose→Execute→Verify→Refine Loops: In domains like multi-turn Text-to-SQL, MTSQL-R1 frames planning as an MDP in which the agent cycles through proposing SQL, executing it, verifying the result (both execution correctness and coherence with dialogue memory), and refining the query until a final, verified solution is committed. Programmatic policy objectives in this domain combine outcome-level and process-level rewards to facilitate credit assignment and semantic robustness (Guo et al., 12 Oct 2025).
- Modular Agentic Systems: AgentFlow decomposes the multi-turn loop into planner, executor, verifier, and generator modules, each operating in-the-flow with shared memory. Trajectory-level outcome rewards are broadcast to all turns, yielding tractable credit assignment and robust planner optimization under long-horizon scenarios (Li et al., 7 Oct 2025).
- Multi-Turn Data Generation: Non-autoregressive iterative generation frameworks (e.g., ToolACE-MT) improve sample efficiency for trajectory construction by alternating between coarse initialization, iterative complexity and reasonability refinement, and rigorous verification. This contrasts with costly auto-regressive simulator rollouts (Zeng et al., 18 Aug 2025).
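As a rough illustration of the propose, execute, verify, refine cycle from the list above, the following sketch runs candidate SQL against an in-memory SQLite database. It is not MTSQL-R1's implementation: `solve`, `propose`, and `verify` are hypothetical names, the proposer simply replays a fixed list of drafts in place of a model, and verification is reduced to comparing rows with an expected answer.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple


def execute(conn: sqlite3.Connection, sql: str):
    """Run the query; database errors become feedback rather than exceptions."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)


def solve(conn: sqlite3.Connection,
          propose: Callable[[Optional[str]], str],
          verify: Callable[[list], bool],
          max_rounds: int = 3) -> Tuple[Optional[str], List[Tuple[str, str]]]:
    """Loop until a proposal executes and passes verification, or rounds run out."""
    feedback: Optional[str] = None
    trace: List[Tuple[str, str]] = []
    for _ in range(max_rounds):
        sql = propose(feedback)             # propose (conditioned on feedback)
        rows, error = execute(conn, sql)    # execute
        if error is None and verify(rows):  # verify
            trace.append((sql, "verified"))
            return sql, trace               # commit the verified answer
        feedback = error or "result failed verification"
        trace.append((sql, feedback))       # refine on the next round
    return None, trace


# Toy database and a scripted proposer (both are assumptions for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("Ada", 36), ("Bob", 25)])

drafts = iter([
    "SELECT nme FROM users WHERE age > 30",   # misspelled column, will error
    "SELECT name FROM users WHERE age > 30",  # corrected draft
])


def propose(feedback: Optional[str]) -> str:
    return next(drafts)


def verify(rows: list) -> bool:
    return rows == [("Ada",)]


final_sql, trace = solve(conn, propose, verify)
for sql, status in trace:
    print(status, "<-", sql)
```

The trace records every rejected draft together with the feedback that triggered refinement, which is the kind of per-step signal a process-level reward could be attached to.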
3. Challenges in Long-Horizon Multi-Turn Planning
Multi-turn planning is challenged by delayed and sparse rewards, credit assignment over extended horizons, error propagation, and the need for accurate state tracking and memory management:
- Sparse Terminal Reward and Credit Assignment: Many domains dispense rewards only at episode end, necessitating sophisticated strategies to assign credit to decisions made throughout the trajectory. Recent methods such as SLEA-RL introduce step-level advantage estimation and step-conditioned retrieval of episodic experiences, leveraging dynamic libraries of strategies and failure cases to guide policy updates (Wang et al., 18 Mar 2026).
- Segmental and Hindsight Credit Assignment: HISR proposes segment-level process reward models, where trajectories are decomposed into sub-goal segments. Segment rewards are modulated by importance scores derived from hindsight models, constructed by contrasting policy- and hindsight-conditioned likelihoods for each action, ensuring credit is focused on actions that matter most post-hoc (Lu et al., 19 Mar 2026).
- Scalability and Realism in Benchmarking: Benchmarks such as COMPASS and TravelBench are designed to expose the trade-offs in constraint satisfaction, preference optimization, and plan coordination across complex multi-service domains, with controlled tool ecosystems and dynamic user simulators enforcing rigorous, realistic evaluation protocols (Qin et al., 8 Oct 2025, Cheng et al., 27 Dec 2025).
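To make the idea of hindsight-modulated segment credit more concrete, here is a small sketch. It is an illustrative assumption, not the HISR formulation: importance is taken as a softmax over the gap between hindsight-conditioned and policy-conditioned log-likelihoods within one segment, and the segment reward is split according to those weights. The function names and the example numbers are hypothetical.

```python
import math
from typing import List


def hindsight_importance(policy_logps: List[float],
                         hindsight_logps: List[float]) -> List[float]:
    """Softmax over per-action log-likelihood gaps (hindsight minus policy).

    An action that looks much more likely once the outcome is known
    receives a larger share of the importance mass.
    """
    gaps = [h - p for p, h in zip(policy_logps, hindsight_logps)]
    peak = max(gaps)  # subtract the max for numerical stability
    exps = [math.exp(g - peak) for g in gaps]
    total = sum(exps)
    return [e / total for e in exps]


def assign_segment_credit(segment_reward: float,
                          policy_logps: List[float],
                          hindsight_logps: List[float]) -> List[float]:
    """Split one segment-level reward across the actions in that segment."""
    weights = hindsight_importance(policy_logps, hindsight_logps)
    return [segment_reward * w for w in weights]


# Three actions in one sub-goal segment; only the second becomes noticeably
# more likely in hindsight, so it should receive most of the credit.
policy_logps = [-1.0, -2.0, -1.0]
hindsight_logps = [-1.0, -0.5, -1.0]
credits = assign_segment_credit(2.0, policy_logps, hindsight_logps)
print([round(c, 3) for c in credits])
```

Because the weights sum to one, the per-action credits always add up to the segment reward, so the scheme redistributes credit without changing its total.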
4. Policy Optimization Strategies
Policy optimization in multi-turn agentic planning leverages advanced reinforcement learning (RL) methodologies:
- Turn-Level PPO and Tree-Based Exploration: Token-level RL approaches struggle with heterogeneous transition steps and unstable advantage estimation. Turn-PPO reframes the sequence as a turn-level MDP, stabilizing value estimation, enabling lower-variance advantage estimation, and improving update granularity (Li et al., 18 Dec 2025). AT2PO further exploits a turn-based tree structure, with entropy-guided exploration and turn-wise reward back-propagation to balance exploration and credit assignment (Zong et al., 8 Jan 2026).
- Group-Relative and Cross-Task Normalization: Methods such as GRPO and task advantage normalization normalize reward signals within trajectory groups or tasks, stabilizing policy gradients and facilitating scalable, multi-task, multi-turn training in frameworks such as AgentRL (Zhang et al., 5 Oct 2025).
- Single-Turn RL for Multi-Turn Generalization: Reformulating multi-turn tasks as single-turn reasoning problems (using dense, verifiable rewards from expert trajectories) allows efficient policy optimization that provably improves multi-turn success within the minimal number of turns and generalizes to subtasks, as demonstrated both theoretically and empirically (Hu et al., 24 Sep 2025).
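The group-relative normalization mentioned above can be illustrated with a short sketch. It assumes the common formulation in which each trajectory's reward is standardized against the other trajectories sampled for the same task, and it assumes the resulting scalar is simply copied to every turn of the trajectory. The function names are hypothetical, and details such as the choice of population standard deviation and the `eps` constant vary between implementations.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of rollouts for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_turns(advantage: float, num_turns: int) -> List[float]:
    """Give every turn of a trajectory the same trajectory-level advantage."""
    return [advantage] * num_turns


# Four rollouts of the same task with binary terminal rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])  # approximately [1, -1, -1, 1]

# The first rollout took three turns; each turn shares its advantage.
print(broadcast_to_turns(advantages[0], num_turns=3))
```

When every rollout in a group receives the same reward, the advantages collapse to zero and the group contributes no gradient, which is one reason these methods depend on tasks of intermediate difficulty.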
5. Structured Representations, Memory, and Tool Use
Effective multi-turn agentic planners rely on explicit representations of action dependencies, real-time memory, and coordination of tool invocations:
- Explicit DAG Planning: OrchDAG models multi-turn tool orchestration as plan directed acyclic graphs, enabling dense structural rewards via Graph Edit Distance and explicit credit over parallel or dependent tool calls (Lu et al., 28 Oct 2025).
- Caching, Memory, and Replanning: Systems such as T1-Agent deploy integrated short/long-term caching mechanisms and dynamic replanning strategies, where tool call reuse vs. recomputation decisions optimize latency and enforce inter-tool dependencies under multi-domain settings (Chakraborty et al., 22 May 2025).
- Experience Libraries for Planning: SLEA-RL and related approaches maintain evolving libraries of successful and failed trajectories, structuring retrieval by clustered environment observations at each turn to render experience-based planning both efficient and adaptive (Wang et al., 18 Mar 2026).
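A plan DAG of tool calls, as in the first item above, can be sketched with Python's standard `graphlib` module. The tool names and both helper functions are hypothetical, and `edge_edit_cost` is only a crude stand-in for Graph Edit Distance: it counts differing dependency edges and ignores node insertions, deletions, and relabeling.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple

Plan = Dict[str, List[str]]  # tool call -> tool calls it depends on


def parallel_batches(plan: Plan) -> List[List[str]]:
    """Group tool calls into batches whose members can run concurrently."""
    sorter = TopologicalSorter(plan)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    batches: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all calls with satisfied deps
        batches.append(ready)
        sorter.done(*ready)
    return batches


def dependency_edges(plan: Plan) -> Set[Tuple[str, str]]:
    """Flatten the plan into (prerequisite, dependent) edges."""
    return {(dep, node) for node, deps in plan.items() for dep in deps}


def edge_edit_cost(predicted: Plan, reference: Plan) -> int:
    """Number of dependency edges present in exactly one of the two plans."""
    return len(dependency_edges(predicted) ^ dependency_edges(reference))


reference_plan: Plan = {
    "search_flights": [],
    "search_hotels": [],
    "book_trip": ["search_flights", "search_hotels"],
    "send_receipt": ["book_trip"],
}
# A predicted plan that forgets the hotel dependency of the booking step.
predicted_plan: Plan = {
    "search_flights": [],
    "search_hotels": [],
    "book_trip": ["search_flights"],
    "send_receipt": ["book_trip"],
}

print(parallel_batches(reference_plan))
# [['search_flights', 'search_hotels'], ['book_trip'], ['send_receipt']]
print(edge_edit_cost(predicted_plan, reference_plan))  # 1
```

A structural reward could then be defined as a decreasing function of this cost, so that a plan with one missing dependency scores higher than a plan with the wrong overall shape.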
6. Quantitative Results, Model Scaling, and Limitations
Empirical results consistently demonstrate that:
- Purpose-built small to midsize models trained on structured, verifiable multi-turn data sometimes outperform much larger, generic LLMs on multi-turn, tool-augmented planning tasks, likely due to better data absorption, improved calibration, and reduced overfitting (Prabhakar et al., 4 Apr 2025).
- State-of-the-art agentic pipelines yield substantial accuracy improvements on challenging multi-turn benchmarks such as BFCL v3, τ-bench, and ACEBench. For instance, xLAM-2-8b-fc-r achieves 69.25% on multi-turn (BFCL v3), surpassing GPT-4o-FC at 41% (Prabhakar et al., 4 Apr 2025).
- RL algorithms exploiting turn-, segment-, or step-level granularity consistently outperform naive or token-level approaches, especially as the planning horizon and action complexity increase (Li et al., 18 Dec 2025, Zong et al., 8 Jan 2026, Lu et al., 19 Mar 2026, Wang et al., 18 Mar 2026).
- Benchmarks reveal persistent gaps in optimality and plan coordination, especially for open-source models tasked with multi-service temporal and budget constraint satisfaction (Qin et al., 8 Oct 2025, Cheng et al., 27 Dec 2025).
Remaining limitations involve credit assignment in truly long-horizon or compositional tasks, reliance on high-quality blueprint or expert data, limitations of simulated user models, and imperfect state tracking in highly dynamic, partially observable settings.
7. Outlook and Future Research Directions
Key future directions highlighted in the literature include:
- Integration of end-to-end differentiable planners and “learned oracles” for internalized, robust long-horizon reasoning, as opposed to externally provided hints or blueprints (Rakhsha et al., 23 Jan 2026).
- Extension of reward modeling toward dense, process-aligned, and segment/hindsight-modulated strategies to accelerate learning and minimize reward propagation delay (Lu et al., 19 Mar 2026).
- Fine-grained memory architectures and retrieval-based policy augmentation, leveraging step-level observation clustering and experience distillation (Wang et al., 18 Mar 2026).
- Expansion of toolsets and environment simulators to encompass broader real-world domains, multi-agent collaboration, and real-time interaction in non-synthetic environments (Cheng et al., 27 Dec 2025, Qin et al., 8 Oct 2025).
- Methodological innovation in hierarchical, modular, and agentic system design, supporting both global plan sketching and distributed, concurrent action optimization.
Collectively, these advances define multi-turn agentic planning as a convergent field at the intersection of LLM-based reasoning, RL, and structured sequential decision making, with a research trajectory grounded in meticulous benchmarking, algorithmic development, and deployment-focused system architecture.