Multi-Turn Tool-Augmented Itinerary Planning
- Multi-turn tool-augmented itinerary planning is a method that iteratively constructs adaptive travel plans using interactive dialogues and external tool outputs.
- It employs constrained optimization, context retention, and real-time tool orchestration to refine itineraries in response to evolving user needs.
- Key methodologies include retrieval-augmented generation, graph-structured coordination, and reinforcement learning for efficient and robust planning.
Multi-turn tool-augmented itinerary planning encompasses a class of AI systems and methodologies that iteratively construct, revise, and optimize travel itineraries through conversational interaction, leveraging a dynamic ecosystem of external tools (e.g., booking APIs, route planners, knowledge bases). Unlike static, single-turn approaches, these systems maintain and adapt state across multiple dialogue rounds, incorporating both evolving user intent and real-time tool outputs. The field centers on context retention, constraint satisfaction, preference optimization, robustness to environmental dynamics, and high-quality user interaction under the constraints of finite memory and external tool variability.
1. Formal Problem Definition and Computational Frameworks
The central computational problem is the synthesis of an itinerary—comprising actions such as bookings, visits, and route decisions—across multiple dialogue turns, in the presence of external tool APIs and evolving user constraints.
Constrained Optimization Perspective
Recent research conceptualizes the task as constrained optimization over potential itineraries, where hard constraints (cost, time, logistics, legal requirements) must be satisfied, and soft user preferences (amenity quality, activity diversity, etc.) are aggregated into a utility for maximization (Qin et al., 8 Oct 2025). The agent must solve a constrained maximization problem over feasible itineraries.
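In a generic form (the notation below is an illustrative reconstruction consistent with the description above, not necessarily the cited papers' exact formulation), soft preferences are weighted into a scalar utility while hard constraints are enforced strictly:

```latex
\max_{I \in \mathcal{I}} \; U(I) = \sum_{k} w_k \, u_k(I)
\qquad \text{s.t.} \qquad c_j(I) \le b_j \quad \forall j \in \mathcal{C}_{\mathrm{hard}}
```

Here $I$ ranges over candidate itineraries $\mathcal{I}$, the $u_k$ are soft-preference utilities with weights $w_k$, and each hard-constraint function $c_j$ (cost, time, logistics, legal requirements) must stay within its budget $b_j$.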
Tool invocation is modeled as sequence generation or action selection over a discrete or hybrid action space defined by the system's tool inventory (Soni et al., 5 Jun 2025, Cheng et al., 27 Dec 2025). Systems often support both atomic and composite tool calls, where composite actions expand the planning space and can be dynamically blocked or cost-perturbed, requiring real-time replanning (Liu et al., 4 Nov 2025).
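The cost-aware selection over atomic and composite tool-call sequences can be sketched as follows. This is a minimal illustration, not the benchmark's implementation; the tool names, cost table, and `replan` helper are assumptions.

```python
def plan_cost(plan, cost_table):
    # Total cost of a tool-call sequence under the current cost table;
    # blocked actions carry infinite cost, making the plan infeasible.
    return sum(cost_table.get(action, float("inf")) for action in plan)

def replan(candidate_plans, cost_table, executed=()):
    # Keep only plans consistent with the steps already executed, then
    # pick the one whose remaining suffix is cheapest under updated costs.
    feasible = [p for p in candidate_plans
                if p[:len(executed)] == tuple(executed)]
    if not feasible:
        return None
    return min(feasible,
               key=lambda p: plan_cost(p[len(executed):], cost_table))

# The composite call bundles two atomic steps; initially it is cheaper.
plans = [("search_hotel", "book_hotel"), ("search_and_book_hotel",)]
costs = {"search_hotel": 2, "book_hotel": 3, "search_and_book_hotel": 4}
best = replan(plans, costs)                      # composite wins (4 < 5)
costs["search_and_book_hotel"] = float("inf")    # blocking event
best_after = replan(plans, costs)                # atomic sequence now wins
```

When the environment blocks or cost-perturbs an action mid-episode, the same `replan` call is re-issued from the current execution state, mirroring the adaptive recomputation described above.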
State, Action, and Observation Modeling
At each turn t, the system state includes the current dialogue context, available tools and their costs, existing data artifacts (sub-results, bookings), and user constraints (Liu et al., 4 Nov 2025). Actions map to concrete tool invocations or final output steps. Observations correspond to the structured tool responses, which may include lists of POIs, query results, weather forecasts, or booking confirmations (Ning et al., 26 Sep 2025, Li et al., 8 Dec 2025, Hu et al., 31 Dec 2025).
2. Multi-Turn Dialogue and Tool-Orchestration Strategies
Dialogue Management
Multi-turn system architectures maintain persistent dialogue state, incrementally elicit missing parameters (dates, departure cities, budgets), and track unsatisfied constraints throughout the process (Cheng et al., 27 Dec 2025). Explicit clarification is prioritized in early turns to enable feasible plan synthesis and to minimize downstream rework.
Tool Integration and Invocation
Modern systems employ one of the following orchestration mechanisms:
- Retrieval-Augmented Generation (RAG) with Dynamic Context Tuning (DCT): Maintains an attention-based embedding cache of past user intents and tool invocations; uses LoRA-adapted retrievers for selecting contextually relevant APIs at each turn; applies context compression (e.g., BiLSTM-CRF with controlled summarization) to enable long-horizon planning within LLM context limitations (Soni et al., 5 Jun 2025).
- Graph-Structured and Memory-Augmented Agents: Construct a tool dependency graph (SIT-Graph), annotating edges with state-conditional episodic summaries and procedural weights; decision gates fuse procedural regularities and episodic recall to drive tool selection, supporting both context-aware branching and routine execution (Li et al., 8 Dec 2025).
- Multi-Agent Coordination: Decomposes user interaction, information retrieval, recommendation, route optimization, and plan refinement into separate agents sharing state via a graph structure, with agents subscribing to subgraph changes to support event-driven, modular dialogue and plan update (Liu et al., 16 May 2025).
- Reinforcement Learning (RL) for Policy Optimization: Trains LLM policies using hierarchical reward models (e.g., trajectory- and turn-level verifiers) to autonomously determine when and how to use tools based on long-term outcome maximization, with explicit replay of difficult cases to achieve robustness (Ning et al., 26 Sep 2025, Hu et al., 31 Dec 2025).
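The decision gate used by graph-structured agents—fusing procedural regularities with episodic recall—can be sketched as a weighted score over outgoing edges of the tool graph. The mixing weight `alpha`, the edge schema, and the cosine-based episodic match are illustrative assumptions, not the SIT-Graph paper's exact rule.

```python
def _cos(u, v):
    # Plain cosine similarity over two equal-length vectors.
    num = sum(a * b for a, b in zip(u, v))
    den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return num / den if den else 0.0

def gate_score(edge, query_embedding, alpha=0.7):
    # Fuse a procedural prior (how often this tool transition fired)
    # with episodic recall (similarity of the current state to stored
    # episode summaries annotated on the edge).
    episodic = max(
        (_cos(query_embedding, e) for e in edge["episodes"]), default=0.0
    )
    return alpha * edge["procedural_weight"] + (1 - alpha) * episodic

def select_next_tool(edges, query_embedding):
    # Pick the outgoing edge (next tool) with the highest fused score.
    return max(edges, key=lambda e: gate_score(e, query_embedding))["tool"]
```

With a high `alpha` the agent follows routine tool sequences; episodic recall takes over when the current context strongly matches a stored episode, supporting the context-aware branching described above.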
3. Core Algorithms: Planning, Adaptation, and Context Management
Planning and Plan Revision
Contemporary systems decompose their algorithmic pipelines into stages such as:
- Caching and Retrieval: Key-value attention caches store embeddings and metadata from previous dialogue turns, supporting relevant context extraction via recency- and similarity-weighted attention (Soni et al., 5 Jun 2025).
- Domain-Adaptive Retrieval: LoRA-layered retrievers rapidly adapt tool selection to shifting preferences or environments without retraining base parameters (Soni et al., 5 Jun 2025).
- Context Compression: BiLSTM-CRF-based sequence taggers identify and compress key spans (tools, entities, temporals), enforcing a budgeted context length for LLM prompts; greedy selection maximizes critical content (Soni et al., 5 Jun 2025).
- Automated Planning and Search: Hybrid frameworks such as TRIP-PAL convert LLM-extracted information and user constraints into formal planning problems (e.g., PDDL), guaranteeing constraint satisfaction and optimality via off-the-shelf solvers (Fast Downward A* + LM-cut) (Rosa et al., 2024).
- Cost-Optimal Sequence Search: Agents enumerate feasible atomic/composite tool-call sequences, explicitly compute total cost using up-to-date cost tables, and select cost-minimizing plans; upon environmental change (blocking events), plans are adaptively recomputed from the current execution state (Liu et al., 4 Nov 2025).
- RL Policy Optimization: Agentic RL architectures maximize final task reward using PPO- or GRPO-style objectives, augmented with failure replay to overcome sparse reward regimes (Ning et al., 26 Sep 2025, Hu et al., 31 Dec 2025).
Algorithmic Skeleton (DCT example):
```
for each turn t:
    q_t = encode(user_utterance)
    c_t = attention_cache_retrieve(q_t)
    if context_length_exceeds:
        compressed = context_compress(full_context)
    tools = LoRA_retrieve_tools(q_t, c_t)
    llm_response = LLM(prompt_with_context, tools)
    if llm_response indicates_tool_call:
        tool_result = invoke_tool
    update_cache(q_t + llm_response)
```
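The caching-and-retrieval step of the skeleton can be made concrete as recency- and similarity-weighted ranking over cached entries. The exponential decay rate, entry schema, and scoring rule below are illustrative assumptions, not the DCT paper's exact formulation.

```python
import math

def _cos(u, v):
    # Cosine similarity between a query and a cached embedding.
    num = sum(a * b for a, b in zip(u, v))
    den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return num / den if den else 0.0

def cache_retrieve(cache, query_vec, current_turn, top_k=3, decay=0.1):
    # Rank cached entries (each storing an embedding and the turn it was
    # written) by similarity to the query, discounted exponentially by
    # age in turns, and return the top-k most relevant context items.
    def score(entry):
        recency = math.exp(-decay * (current_turn - entry["turn"]))
        return _cos(query_vec, entry["embedding"]) * recency
    return sorted(cache, key=score, reverse=True)[:top_k]
```

Under this scheme, two equally similar entries are broken by recency, so the retrieved context tracks the most recent expression of a shifting user intent.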
4. Tool and Data Ecosystem
Tool Taxonomy and APIs
State-of-the-art systems expose a battery of domain-specific APIs, generally structured to allow chained, parameterizable invocations. Representative tools include:
| Category | Tools/Functions | Outputs |
|---|---|---|
| Accommodation | search_hotel, get_hotel_details, book_hotel | HotelList, Confirmation |
| Transportation | search_flights, get_flight_details, book_flight | FlightList, Confirmation |
| POI/Attraction | map_search_places, map_search_ranking_list | POI Lists |
| Weather | weather_current_conditions, weather_forecast_days | Weather summaries |
| Utilities | notebook, calendar, recommend_<type> | e.g., package IDs |
Datasets span real and synthetic travel queries, cached tool outputs, multiday weather patterns, and structured booking constraints. Benchmarks such as COMPASS and TravelBench provide scalable, reproducible evaluation testbeds with hundreds to thousands of dialogue scenarios and deterministic sandbox execution (Qin et al., 8 Oct 2025, Cheng et al., 27 Dec 2025).
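The chained, parameterizable invocation pattern over the tabulated accommodation APIs can be sketched as follows. The tool names come from the table above, but the signatures, return shapes, and stub data are assumptions standing in for a real sandbox.

```python
# Stub tools standing in for the sandboxed search_hotel /
# get_hotel_details / book_hotel APIs; shapes are illustrative.
def search_hotel(city, checkin, nights):
    return [{"hotel_id": "H1", "name": "Sample Inn", "price": 120}]

def get_hotel_details(hotel_id):
    return {"hotel_id": hotel_id, "amenities": ["wifi"], "rating": 4.2}

def book_hotel(hotel_id, checkin, nights):
    return {"confirmation": f"CONF-{hotel_id}", "status": "booked"}

def plan_accommodation(city, checkin, nights, max_price):
    # Chained invocation: search -> details -> book, with a hard budget
    # constraint and a soft rating preference enforced between calls.
    for hotel in search_hotel(city, checkin, nights):
        if hotel["price"] <= max_price:
            details = get_hotel_details(hotel["hotel_id"])
            if details["rating"] >= 4.0:   # soft preference threshold
                return book_hotel(hotel["hotel_id"], checkin, nights)
    return None
```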
5. Evaluation Protocols and Empirical Findings
Metrics
Multi-turn itinerary planning systems are assessed using a range of metrics, including:
- Planning Accuracy: AST-Match ≥ 90%, Step-Edit-Distance ≤ 1.2, Itinerary Coverage F1 (Soni et al., 5 Jun 2025, Cheng et al., 27 Dec 2025).
- Retrieval and Tool Selection: Recall@5 ≥ 82%, NDCG@5 ≥ 0.75 (Soni et al., 5 Jun 2025).
- Constraint Satisfaction Rate (CSR): Proportion of user- or system-imposed constraints met in final itinerary (Cheng et al., 27 Dec 2025).
- Cost-Gap and Edit Distance: Exact match ratio, normalized edit distance in cost-aware planning (Liu et al., 4 Nov 2025).
- Hallucination Rate: Tools mentioned but not executed, with strong frameworks achieving ≤0.6% per turn (Soni et al., 5 Jun 2025).
- Efficiency: Context compression yields average ≤1,100 tokens, average latency ≤400 ms/turn (Soni et al., 5 Jun 2025).
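Of the metrics above, the Constraint Satisfaction Rate is straightforward to compute once constraints are represented as predicates over the final itinerary. The itinerary fields and example checks below are hypothetical, chosen only to illustrate the calculation.

```python
def constraint_satisfaction_rate(itinerary, constraints):
    # Constraints are boolean predicates over the final itinerary;
    # CSR is the fraction of them that hold.
    met = sum(1 for check in constraints if check(itinerary))
    return met / len(constraints) if constraints else 1.0

# Hypothetical itinerary and constraints for illustration.
itinerary = {"total_cost": 950, "days": 4, "pois": ["museum", "park"]}
checks = [
    lambda it: it["total_cost"] <= 1000,   # budget (hard, met)
    lambda it: it["days"] == 4,            # duration (hard, met)
    lambda it: "beach" in it["pois"],      # preference (unmet)
]
```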
Benchmarks and Comparative Results
- DCT shows a 14% plan-accuracy improvement and a 37% hallucination reduction over static RAG (AST-Match rises from 78% to 91% in ablation studies; 0.6% hallucination rate) (Soni et al., 5 Jun 2025).
- On TravelBench, leading models such as STAgent achieve 66.6% raw multi-turn score, outperforming strong baselines by 7+ points, and retain high generalization across non-travel domains (Hu et al., 31 Dec 2025, Cheng et al., 27 Dec 2025).
- CostBench reveals substantial cost-sensitivity deficiencies; even advanced models fail to maintain optimality under dynamic event perturbation, with GPT-5's EMR dropping from 95.5% (static) to ≈35% (cost-change scenario) (Liu et al., 4 Nov 2025).
- Key behavioral gaps identified include 'acceptable-optimal gap' (agents meet constraints but do not maximize utility) and 'plan-coordination gap' (degraded performance in multi-entity, multi-constraint settings) (Qin et al., 8 Oct 2025).
6. Representative Architectures and System Implementations
Dynamic Context Tuning (DCT) (Soni et al., 5 Jun 2025)
DCT orchestrates itinerary planning via:
- GTR-T5-XL encoder for utterance embedding
- Attention-based recency- and similarity-weighted context cache
- LoRA retriever for domain- and preference-adaptive tool selection
- BiLSTM-CRF context compression and summarization (enforcing critical entity and parameter retention)
- LLM-based plan step and tool-call generation with API execution and feedback loop
Multi-Agent Graph-Structured Systems (Liu et al., 16 May 2025)
Vaiage employs dedicated LLM agents for conversation state, information retrieval, recommendation, routing, and plan refinement, coordinated via a central TravelGraph. Each agent monitors the graph for relevant updates and operates asynchronously, yielding robust adaptation to user constraints and tool outputs.
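The subscribe-to-subgraph-changes coordination described for Vaiage amounts to a publish/subscribe mechanism over shared graph nodes. The class and method names below are hypothetical; this is a minimal synchronous sketch, not the actual TravelGraph implementation.

```python
from collections import defaultdict

class TravelGraphBus:
    # Minimal publish/subscribe core: agents register interest in graph
    # nodes and are notified whenever those nodes are updated.
    def __init__(self):
        self.nodes = {}
        self.subscribers = defaultdict(list)

    def subscribe(self, node_key, callback):
        self.subscribers[node_key].append(callback)

    def update(self, node_key, value):
        self.nodes[node_key] = value
        for callback in self.subscribers[node_key]:
            callback(node_key, value)

# Example: a route-optimization agent reacts when the POI list changes.
bus = TravelGraphBus()
events = []
bus.subscribe("poi_list", lambda k, v: events.append(("reroute", v)))
bus.update("poi_list", ["temple", "market"])
```

In a full system each agent would run asynchronously and write its own outputs back into the graph, triggering downstream agents in turn; the event-driven core is the same.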
RL-Based End-to-End Agents (Ning et al., 26 Sep 2025, Hu et al., 31 Dec 2025)
DeepTravel and STAgent leverage PPO/GSPO objectives and hierarchical verifiers to induce tool selection and plan synthesis policies, enabling small LLMs to outperform larger off-the-shelf models via environment-intensive, experience-replay-backed RL.
7. Open Challenges and Research Frontiers
Primary challenges in multi-turn tool-augmented itinerary planning include:
- Robust cross-tool coordination: Performance sharply declines as plan complexity increases (hotel+flight+permit) and constraints/transitions become more entangled (Qin et al., 8 Oct 2025, Cheng et al., 27 Dec 2025).
- Economic optimality and adaptation: Resource- and cost-aware planning remains unsolved under dynamic or adversarial conditions (Liu et al., 4 Nov 2025).
- State representation: Continual balance between rich episodic memory and procedural efficiency is addressed by hybrid memory architectures such as SIT-Graph (Li et al., 8 Dec 2025).
- Long-Context Truncation: Compression and summarization must preserve essential constraints, parameters, and plan-specific entities, as context windows are strictly bounded (Soni et al., 5 Jun 2025).
- Scalability: Effective training and curation require needle-in-haystack data processing, retaining roughly 1 in 10,000 training samples to surface diverse, high-difficulty cases (Hu et al., 31 Dec 2025).
A coherent trend is the fusion of LLM-based reasoning with symbolic, structured, or RL-backed policy components, grounded in tool-mediated, reproducible benchmarks and empirical ablation analysis (Rosa et al., 2024, Cheng et al., 27 Dec 2025). Research directions aim at narrowing the acceptable-optimal gap, enhancing plan provenance and explainability, and achieving robust, economically rational itinerary planning at scale.