TourPlanner: Multi-Day Itinerary Generation

Updated 4 July 2026
  • TourPlanner is a travel-planning framework that produces complete multi-day itineraries from a single natural language query.
  • It integrates personalized recall (PReSO), multi-agent consensus reasoning (CCoT), and an RL-based refinement stage to manage hard and soft constraints.
  • Evaluations show enhanced feasibility and rationality, demonstrating the benefits of structured search-space construction and staged optimization in itinerary planning.

TourPlanner is a travel-planning framework for generating a complete multi-day itinerary from a single natural-language user query. It is formulated as single-turn itinerary generation: given a query $Q$ containing explicit requirements such as origin, destination, dates, duration, budget, and interest hints, the system produces a full itinerary $I$ including transportation, accommodation, and day-by-day activities. The framework combines three stages: Personalized Recall and Spatial Optimization (PReSO), Competitive consensus Chain-of-Thought (CCoT), and a reinforcement-learning refinement stage with a sigmoid gate. Candidate retrieval, plan-space exploration, and constraint-sensitive refinement are thus handled separately but within one end-to-end pipeline (Wang et al., 8 Jan 2026).

1. Task definition and problem setting

TourPlanner addresses realistic travel planning over a large grounded action space rather than free-form destination recommendation. In the reported benchmark setting, the planner operates in the TripTailor sandbox, which contains 40 major Chinese cities and inventories of over 28,000 train schedules, 15,000 flight routes, 5,622 attractions, 89,000 hotels, and 422,000 restaurants (Wang et al., 8 Jan 2026). The input is a single user query $Q$; the output is a structured itinerary $I$ covering accommodations, transportation, restaurants, and attractions across multiple days.

The paper identifies three bottlenecks. First, candidate point-of-interest pruning must preserve a high recall rate. Second, a single reasoning path limits exploration in a combinatorial solution space. Third, simultaneously optimizing hard constraints and soft constraints is difficult because dense soft signals can dominate sparse but essential feasibility conditions (Wang et al., 8 Jan 2026). In this formulation, hard constraints include using only valid database entities, respecting opening hours, and avoiding duplicate attractions, while soft constraints include budget reasonableness, route efficiency, and preference alignment.

This problem setting sits between earlier tourist agenda optimization and recent LLM-driven itinerary generation. Earlier planning-and-scheduling work modeled tourist routes as feasible plans over visits, durations, opening hours, and travel times, with soft penalties over POI utility, travel burden, visit count, and temporal occupation (Ibáñez-Ruiz et al., 2017). More recent hybrid systems have treated travel planning as oversubscription planning, using LLMs for extraction and automated planners for validity and optimization guarantees (Rosa et al., 2024). TourPlanner instead remains LLM-centric throughout the pipeline, but grounds the process in a structured sandbox and separates retrieval, multi-path reasoning, and RL-based refinement (Wang et al., 8 Jan 2026).

2. System architecture

The framework is organized as a staged pipeline. A user query first enters PReSO, which constructs a spatially aware candidate set of attractions and nearby hotels and restaurants. That output becomes the context for CCoT, which instantiates 4–6 specialized planning agents, generates parallel daily proposals, scores them through a diversity-weighted consensus rule, and fuses the top-$k$ proposals into a day-level consensus plan. The full itinerary is then passed to a reinforcement-learning refiner that edits the plan under a hard-constraint-first reward design (Wang et al., 8 Jan 2026).

| Component | Role | Key mechanisms |
| --- | --- | --- |
| PReSO | Candidate construction | explicit and implicit preference extraction, three-branch POI recall, DBSCAN clustering |
| CCoT | Multi-path reasoning | 4–6 specialized agents, peer review, diversity-weighted consensus, top-$k$ fusion |
| RL refinement | Post-hoc repair and improvement | hard/soft reward decomposition, sigmoid gate, GSPO optimization |

PReSO produces a compact “given information” package enriched with cluster labels. CCoT operates day by day, updating prior commitments so previously selected attractions are not reused. The RL stage then acts as a validator-fixer: it minimally edits the itinerary, replacing noncompliant items, preferring same-cluster substitutions, and tightening temporal structure while preserving diversity and source integrity (Wang et al., 8 Jan 2026).

A common reduction of the framework to “prompt engineering” is therefore inaccurate. The paper presents TourPlanner as a retrieval-and-pruning layer, a structured multi-agent search layer, and a reward-driven repair layer, each addressing a different failure mode in itinerary generation (Wang et al., 8 Jan 2026).

3. PReSO: personalized recall and spatial optimization

PReSO begins with user profile construction. It extracts explicit fields such as departure city, destination city, departure and return day or time, duration, budget, other requirements, and restaurant type. It then performs LLM-based demand inference for two latent budget-sensitive preferences: hotel cost class $[\text{Luxury}, \text{Upscale}, \text{Midscale}, \text{Economy}]$ and meal cost range $[\text{min cost}, \text{max cost}]$ (Wang et al., 8 Jan 2026). The inferred hotel budget per night is computed as

$$\frac{\text{Budget} \times 0.55}{N},$$

with $N = \text{travel days} - 1$; a meal budget per day is computed alongside it.

The method then selects the highest hotel class whose minimum price fits the inferred per-night budget (Wang et al., 8 Jan 2026).
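The per-night hotel budget rule and the class selection step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names and the minimum prices in `MIN_PRICE_BY_CLASS` are hypothetical placeholders, and only the 0.55 share and $N = \text{travel days} - 1$ come from the text above.

```python
# Hypothetical minimum nightly prices per class (illustrative values only,
# not taken from the paper), ordered from the highest class to the lowest.
MIN_PRICE_BY_CLASS = [
    ("Luxury", 1200.0),
    ("Upscale", 600.0),
    ("Midscale", 300.0),
    ("Economy", 100.0),
]


def hotel_budget_per_night(total_budget: float, travel_days: int) -> float:
    """Return Budget * 0.55 / N, where N = travel_days - 1 is the number of nights."""
    nights = travel_days - 1
    if nights <= 0:
        raise ValueError("a trip needs at least one overnight stay")
    return total_budget * 0.55 / nights


def pick_hotel_class(per_night: float, min_prices=MIN_PRICE_BY_CLASS):
    """Return the highest class whose minimum price fits, or None if none fits."""
    for name, min_price in min_prices:
        if min_price <= per_night:
            return name
    return None
```

With a total budget of 4000 over 3 travel days, the per-night budget is 1100, which under the placeholder price table selects the Upscale class.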

Candidate attraction recall uses three sources. The first is semantic similarity recall, where an embedding model retrieves attractions relevant to the query and expands keywords with synonyms. The second is canonical landmark recall, where attractions rated 4A or above are ranked by popularity and user ratings. The third is LLM-supplemented recall, where an LLM proposes additional attractions aligned with user preferences that may not surface through simple semantic or landmark retrieval (Wang et al., 8 Jan 2026). The appendix reports both a semantic similarity recall number and a total POI recall budget.
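Merging the three recall branches under a total recall budget might look like the sketch below. The function name, the branch priority order (semantic first, then landmarks, then LLM suggestions), and the simple truncation rule are assumptions made for illustration; the source does not specify how the branches are merged.

```python
def merge_recall(semantic, landmarks, llm_suggested, total_budget):
    """Merge three recall branches into one de-duplicated candidate list.

    Each argument is a list of attraction names, already ranked within its
    branch. Branch priority and truncation are illustrative assumptions.
    """
    merged, seen = [], set()
    for branch in (semantic, landmarks, llm_suggested):
        for name in branch:
            if name not in seen:
                seen.add(name)
                merged.append(name)
    # Keep at most `total_budget` candidates overall.
    return merged[:total_budget]
```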

Because these recalled attractions may be geographically scattered, PReSO applies DBSCAN to attraction coordinates, using adaptive $\varepsilon$-neighborhood adjustment, then uses cluster centroids as anchors for hotel and restaurant retrieval. The paper reports DBSCAN hyperparameters for minimum samples and epsilon, together with a minimum cluster number tied to trip duration (Wang et al., 8 Jan 2026). Cluster labels are attached to attractions, restaurants, and hotels and passed downstream, turning a flat city-wide candidate pool into neighborhood-structured planning context.
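The clustering step can be illustrated with a small pure-Python DBSCAN over latitude and longitude. This is a sketch under stated assumptions: the adaptive rule used here (try candidate epsilon values in order until enough clusters appear) is a stand-in for the paper's adjustment scheme, and all function names and parameter values are hypothetical.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def dbscan(points, eps_km, min_samples):
    """Plain DBSCAN. Returns one label per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j in range(len(points))
                if haversine_km(points[i], points[j]) <= eps_km]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_samples:
                queue.extend(reach)  # core point: keep expanding
    return labels


def adaptive_dbscan(points, eps_candidates_km, min_samples, min_clusters):
    """Assumed adaptive rule: use the first epsilon that yields enough clusters."""
    labels = []
    for eps in eps_candidates_km:
        labels = dbscan(points, eps, min_samples)
        if len(set(labels) - {-1}) >= min_clusters:
            break
    return labels


def cluster_centroids(points, labels):
    """Mean (lat, lon) per cluster, usable as anchors for nearby retrieval."""
    sums = {}
    for (lat, lon), lab in zip(points, labels):
        if lab == -1:
            continue
        s = sums.setdefault(lab, [0.0, 0.0, 0])
        s[0] += lat
        s[1] += lon
        s[2] += 1
    return {lab: (s[0] / s[2], s[1] / s[2]) for lab, s in sums.items()}
```

Averaging latitude and longitude is adequate for city-scale clusters; a production system would more likely rely on an established library implementation of DBSCAN.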

The empirical effect is measured as candidate recall of ground-truth itinerary elements. With GPT-4o, PReSO achieves 42.26 recall versus 27.83 for the TripTailor workflow, a gain of 14.43 points, and the figure reports consistent gains across all tested backbones (Wang et al., 8 Jan 2026). This suggests that, in this framework, retrieval quality is not merely a front-end convenience but a direct determinant of downstream plan quality.

4. CCoT: competitive consensus over multiple reasoning paths

CCoT is TourPlanner’s mechanism for feasible-space exploration. For a query $Q$, the system initializes a static set of specialized agents, where each agent has an identity, an objective, and ranked priorities (Wang et al., 8 Jan 2026). The prompt design requires measurable objectives such as minimizing average leg distance or keeping total cost within budget. The default configuration uses 4–6 agents; fewer agents reduce coverage of competing objectives, while more agents show diminishing returns (Wang et al., 8 Jan 2026).

Planning is iterative across days. For each day, a general expert agent creates a base route skeleton. Each specialized agent then refines that skeleton into its own daily proposal, conditioned on the current given information and the consensus plan for prior days (Wang et al., 8 Jan 2026). Attractions, restaurants, and accommodations already used in earlier consensus days are removed from the context before the next day is planned, which enforces cross-day diversity.
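The day-by-day loop with removal of already-used entities can be sketched as follows. The `plan_day` callable stands in for the whole skeleton, proposal, and consensus procedure; it and the data layout are hypothetical, chosen only to show how the context shrinks between days.

```python
def plan_trip(candidates, num_days, plan_day):
    """Plan one day at a time, hiding entities used on earlier days.

    candidates: dict mapping entity name -> category (hypothetical layout).
    plan_day:   callable(context, prior_plans) -> list of chosen names;
                a placeholder for the per-day consensus procedure.
    """
    used = set()
    plans = []
    for _ in range(num_days):
        # Only entities not yet committed are visible to the planner.
        context = {n: c for n, c in candidates.items() if n not in used}
        chosen = plan_day(context, plans)
        plans.append(chosen)
        used.update(chosen)
    return plans
```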

Proposal arbitration proceeds in three phases. First, proposal diversity weighting embeds each proposal as a vector and computes the cosine-similarity matrix over all proposals. Each proposal's average similarity to the others yields a raw diversity weight, and the raw weights are normalized across proposals (Wang et al., 8 Jan 2026). Proposals that are more distinct receive larger weight.
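The diversity-weighting phase can be sketched with plain Python lists as embedding vectors. The cosine similarity and the average over the other proposals follow the description above; the raw weight `1 - average similarity` is an assumption made for illustration, chosen only because it gives more distinct proposals a larger weight. Function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def diversity_weights(vectors):
    """Normalized diversity weights, one per proposal embedding."""
    n = len(vectors)
    if n < 2:
        return [1.0] * n
    avg_sim = [
        sum(cosine(vectors[i], vectors[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    # ASSUMPTION: raw weight decreases linearly with average similarity.
    raw = [1.0 - s for s in avg_sim]
    total = sum(raw)
    if total <= 0:
        return [1.0 / n] * n  # all proposals identical: fall back to uniform
    return [r / total for r in raw]
```

For three proposals where two are identical and the third is orthogonal to both, the orthogonal proposal receives half of the total weight and the two duplicates share the rest.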

Second, every agent reviews every proposal, assigning a score and a natural-language critique according to its own objective, ranked priorities, and feasibility judgment (Wang et al., 8 Jan 2026). Third, each proposal receives an aggregated consensus score. The top-$k$ proposals are selected, and an LLM fuses them into the day’s consensus plan while preserving geographic continuity and respecting explicit hard constraints (Wang et al., 8 Jan 2026).
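Score aggregation and proposal selection can be sketched as below. The aggregation rule (diversity weight multiplied by the mean peer-review score) is an assumption for illustration, as are the function names; the LLM fusion step is outside the scope of the sketch.

```python
def consensus_scores(review_scores, weights):
    """Aggregate peer-review scores into one consensus score per proposal.

    review_scores[i][j] is the score agent i assigns to proposal j.
    weights[j] is the normalized diversity weight of proposal j.
    ASSUMPTION: consensus score = diversity weight * mean peer score.
    """
    n_agents = len(review_scores)
    scores = []
    for j, weight in enumerate(weights):
        mean_score = sum(review_scores[i][j] for i in range(n_agents)) / n_agents
        scores.append(weight * mean_score)
    return scores


def top_k(scores, k):
    """Indices of the k highest-scoring proposals, best first."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]
```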

The ablation study supports the value of this mechanism. Using 4–6 agents yields the best balance; using 3 agents hurts macro rationality and final pass rate, using 10 agents shows diminishing returns, and directly combining proposals without CCoT reduces macro rationality to 84.9 and final pass rate to 47.8 (Wang et al., 8 Jan 2026). In this sense, CCoT is not only an ensemble but an explicit arbitration protocol over specialized itinerary hypotheses.

5. Constraint-gated reinforcement learning

The final refinement stage is framed as sequence-level RL over itinerary edits. The policy receives the query and the consensus itinerary and produces a refined itinerary. The paper does not define a symbolic environment transition system over itinerary states; operationally, the policy acts through autoregressive sequence generation and is trained with Group Sequence Policy Optimization (GSPO) (Wang et al., 8 Jan 2026).

The reward is decomposed into a hard part and a soft part, with the sigmoid gate controlling how strongly the soft part contributes. Two hyperparameters are reported for this formulation (Wang et al., 8 Jan 2026). The role of the gate is to suppress soft-objective optimization until hard-constraint satisfaction is sufficiently high.
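The gating idea can be illustrated with a small numeric sketch. The additive form `r_hard + gate * r_soft`, the choice of a steepness and a threshold as the two gate parameters, and their default values are all assumptions made for illustration; they are not the paper's reported formula or settings.

```python
import math


def sigmoid_gate(r_hard, steepness, threshold):
    """Sigmoid gate in (0, 1): near 0 below the threshold, near 1 above it."""
    return 1.0 / (1.0 + math.exp(-steepness * (r_hard - threshold)))


def gated_reward(r_hard, r_soft, steepness=10.0, threshold=0.8):
    """ASSUMED form: hard reward plus gate-scaled soft reward.

    Default steepness and threshold are hypothetical placeholder values.
    """
    return r_hard + sigmoid_gate(r_hard, steepness, threshold) * r_soft
```

Under these placeholder settings, a plan with hard reward 0.2 gets almost no credit for a perfect soft score, while a plan with hard reward 1.0 receives most of it, which is the hard-constraint-first behavior the gate is meant to produce.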

The hard reward covers feasibility and rationality. Feasibility checks that all entities exist in the sandbox and that required details are complete. Rationality uses indicators that test restaurant uniqueness, attraction uniqueness, valid attraction duration, and validity of visit times against opening hours (Wang et al., 8 Jan 2026).
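The feasibility and rationality checks named above can be sketched as boolean indicators over a list of visits. The record layout, the lookup tables for opening hours and duration bounds, and the function names are hypothetical; how the paper combines the indicators into a scalar reward is not assumed here.

```python
def feasibility_indicators(visits, sandbox_names, required_fields):
    """Return (entities_exist, details_complete) as booleans."""
    entities_exist = all(v.get("name") in sandbox_names for v in visits)
    details_complete = all(
        v.get(field) is not None for v in visits for field in required_fields
    )
    return entities_exist, details_complete


def rationality_indicators(visits, opening_hours, duration_bounds):
    """Return the four rationality checks as a dict of booleans.

    visits: dicts with "name", "kind", "start", "end" (hours as floats).
    opening_hours, duration_bounds: name -> (low, high); hypothetical tables.
    """
    restaurants = [v["name"] for v in visits if v["kind"] == "restaurant"]
    attractions = [v for v in visits if v["kind"] == "attraction"]
    names = [v["name"] for v in attractions]
    return {
        "restaurants_unique": len(restaurants) == len(set(restaurants)),
        "attractions_unique": len(names) == len(set(names)),
        "durations_valid": all(
            duration_bounds[v["name"]][0]
            <= v["end"] - v["start"]
            <= duration_bounds[v["name"]][1]
            for v in attractions
        ),
        "within_opening_hours": all(
            opening_hours[v["name"]][0] <= v["start"]
            and v["end"] <= opening_hours[v["name"]][1]
            for v in attractions
        ),
    }
```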

The soft reward covers a budget score, a route score, and a preference-alignment score. The budget score relates the generated plan cost to the budget, and the route score is accompanied by a definition of the average route distance of a plan. All of these are sequence-level itinerary scores rather than per-step rewards (Wang et al., 8 Jan 2026).
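One way to compute an average route distance for a whole plan is sketched below. It assumes stops are given as planar coordinates in kilometers and that the average is taken over consecutive legs within each day; both the representation and the averaging rule are assumptions for illustration, as is the ratio helper.

```python
import math


def average_route_distance(days):
    """Mean leg length over all consecutive stops within each day.

    days: list of days, each a list of (x_km, y_km) stops in visit order.
    ASSUMPTION: legs never span two days.
    """
    total, legs = 0.0, 0
    for stops in days:
        for a, b in zip(stops, stops[1:]):
            total += math.dist(a, b)
            legs += 1
    return total / legs if legs else 0.0


def route_distance_ratio(plan_days, reference_days):
    """Plan average distance relative to a reference plan (hypothetical helper)."""
    return average_route_distance(plan_days) / average_route_distance(reference_days)
```

The function returns one score for the full itinerary, which matches the sequence-level framing of the rewards.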

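GSPO is optimized with the clipped sequence-level objective, written here in the standard GSPO notation:

$$J_{\text{GSPO}}(\theta) = \mathbb{E}_{x,\ \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\!\left( s_i(\theta)\, \hat{A}_i,\ \operatorname{clip}\!\left(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right) \hat{A}_i \right) \right],$$

where the group-based advantage is

$$\hat{A}_i = \frac{R(x, y_i) - \operatorname{mean}\!\left(\{R(x, y_j)\}_{j=1}^{G}\right)}{\operatorname{std}\!\left(\{R(x, y_j)\}_{j=1}^{G}\right)}.$$

The appendix further gives the sequence-level importance ratio

$$s_i(\theta) = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)} \right)^{1/|y_i|} = \exp\!\left( \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \log \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})} \right).$$

Here $x$ is the policy input (the query and the consensus itinerary), $y_1, \dots, y_G$ are the $G$ sampled refined itineraries, and $R$ is the sequence-level reward.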
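The GSPO quantities named above (group-based advantage, sequence-level importance ratio, clipped objective) can be computed numerically as in the sketch below, which follows the standard GSPO formulation. The small `eps` added to the standard deviation is an implementation convenience rather than part of the definition, and the function names and clip range in the usage are illustrative.

```python
import math


def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def sequence_ratio(new_logps, old_logps):
    """Length-normalized sequence-level importance ratio.

    new_logps / old_logps: per-token log-probabilities of the same response
    under the current and the old policy.
    """
    diff = sum(n - o for n, o in zip(new_logps, old_logps))
    return math.exp(diff / len(new_logps))


def gspo_objective(ratios, advantages, clip_eps):
    """Mean over the group of the clipped surrogate terms."""
    terms = []
    for s, a in zip(ratios, advantages):
        clipped = min(max(s, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(s * a, clipped * a))
    return sum(terms) / len(terms)
```

Because the ratio is defined per sequence rather than per token, one clipping decision applies to the whole refined itinerary, which fits rewards that are themselves itinerary-level scores.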

The training setup uses Qwen2.5-3B-Instruct for reward-model fine-tuning and 32 NVIDIA H800 GPUs for RL fine-tuning, with global batch size 32, mini-batch size 4, 8 responses per prompt, and 3 epochs; a learning rate is also reported (Wang et al., 8 Jan 2026).

The ablation shows why the gate matters. Constraint-Gated RL achieves a final pass rate of 56.1 and a final surpassing rate of 30.2. Vanilla RL improves the route ratio to 1.91 but drops the final pass rate to 47.5 and macro rationality to 67.9 (Wang et al., 8 Jan 2026). The intended interpretation is explicit in the paper: without gating, dense soft signals are easier to optimize than sparse hard constraints.

6. Evaluation, significance, and relation to adjacent systems

TourPlanner’s main evaluation metrics are Feasibility Pass Rate, Rationality Pass Rate, Average Route Distance Ratio, Final Pass Rate, and Final Surpassing Rate (Wang et al., 8 Jan 2026). The final system reports feasibility 100.0 micro / 100.0 macro, rationality 97.1 micro / 88.7 macro, average route distance ratio 2.15, final pass rate 56.1, and final surpassing rate 30.2 (Wang et al., 8 Jan 2026). TourPlanner without RL already achieves near-saturated feasibility and strong rationality, but the RL stage further improves final pass and surpassing rate.
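The micro and macro figures can be read through the following sketch. It assumes the common convention in travel-planning benchmarks that a micro rate is the share of individual constraint checks passed across all plans, while a macro rate is the share of plans that pass every check; the paper's exact definitions are not restated in this text, so treat both the convention and the function name as assumptions.

```python
def pass_rates(results):
    """Return (micro, macro) pass rates in percent.

    results: one list of booleans per plan, one boolean per constraint check.
    ASSUMED convention: micro counts checks, macro counts fully passing plans.
    """
    total_checks = sum(len(plan) for plan in results)
    passed_checks = sum(sum(plan) for plan in results)
    micro = 100.0 * passed_checks / total_checks
    macro = 100.0 * sum(all(plan) for plan in results) / len(results)
    return micro, macro
```

Under this reading, a gap such as 97.1 micro versus 88.7 macro means that most individual checks pass, while a smaller share of plans pass all of them at once.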

These results are reported against Direct Planning, ReAct Planning, and the TripTailor workflow, and the gains are described as backbone-agnostic across GPT-4o, Qwen3-235B-A22B-Instruct, Qwen3-30B-A3B-Thinking, and DeepSeek-R1 (Wang et al., 8 Jan 2026). This suggests that the contribution is primarily architectural: retrieval quality, diversified reasoning, and staged optimization matter independently of the underlying frontier model.

Within the broader literature, TourPlanner occupies a distinct position. It differs from tourist-agenda planners that explicitly optimize soft penalties over utility, travel, visit count, and occupation using PDDL or CSP solvers (Ibáñez-Ruiz et al., 2017). It also differs from hybrid planner architectures such as TRIP-PAL, which use LLMs for travel information extraction but rely on automated planners for validity and optimality in single-day oversubscription planning (Rosa et al., 2024). Relative to multi-agent LLM systems such as Vaiage, which use a graph-structured multi-agent framework with tool grounding and report an average score of 8.5 out of 10, TourPlanner’s distinctive contribution is the diversity-weighted competitive consensus mechanism plus the sigmoid-gated RL stage (Liu et al., 16 May 2025). Relative to itinerary-editing work such as iTIMO, which formalizes REPLACE, ADD, and DELETE operations for modification rather than generation, TourPlanner remains a single-turn itinerary generator rather than an incremental editor (Huang et al., 15 Jan 2026).

The limitations stated in the paper are correspondingly specific. End-to-end RL over the whole CCoT process is described as difficult because the reasoning process is complex and iterative across days. Reward modeling is also described as limited; better user-aligned reward models are identified as a route to improving the surpassing rate (Wang et al., 8 Jan 2026). A plausible implication is that TourPlanner’s current gains come more from search-space construction and staged control than from a fully resolved model of user utility.

In that sense, TourPlanner can be understood as a synthesis of three ideas that have often been separated in the travel-planning literature: high-recall grounded retrieval, explicit multi-path reasoning, and hard-constraint-first optimization. Its significance lies less in any one module than in the fact that these modules are coupled tightly enough to raise both feasibility and user-preference alignment in a large, structured itinerary benchmark (Wang et al., 8 Jan 2026).
