TourPlanner: Multi-Day Itinerary Generation

Updated 4 July 2026

TourPlanner is a travel-planning framework that produces complete multi-day itineraries from a single natural language query.
It integrates personalized recall (PReSO), multi-agent consensus reasoning (CCoT), and an RL-based refinement stage to manage hard and soft constraints.
Evaluations show enhanced feasibility and rationality, demonstrating the benefits of structured search-space construction and staged optimization in itinerary planning.

TourPlanner is a travel-planning framework for generating a complete multi-day itinerary from a single natural-language user query. It is formulated as single-turn itinerary generation: given a query $Q$ containing explicit requirements such as origin, destination, dates, duration, budget, and interest hints, the system produces a full itinerary $I$ including transportation, accommodation, and day-by-day activities. The framework combines three stages: Personalized Recall and Spatial Optimization (PReSO), Competitive consensus Chain-of-Thought (CCoT), and a reinforcement-learning refinement stage with a sigmoid gate. Candidate retrieval, plan-space exploration, and constraint-sensitive refinement are thus handled by separate stages within a single end-to-end pipeline (Wang et al., 8 Jan 2026).

1. Task definition and problem setting

TourPlanner addresses realistic travel planning over a large grounded action space rather than free-form destination recommendation. In the reported benchmark setting, the planner operates in the TripTailor sandbox, which contains 40 major Chinese cities and inventories of over 28,000 train schedules, 15,000 flight routes, 5,622 attractions, 89,000 hotels, and 422,000 restaurants (Wang et al., 8 Jan 2026). The input is a single user query $Q$; the output is a structured itinerary $I$ covering accommodations, transportation, restaurants, and attractions across multiple days.

The paper identifies three bottlenecks. First, candidate point-of-interest pruning must preserve a high recall rate. Second, a single reasoning path limits exploration in a combinatorial solution space. Third, simultaneously optimizing hard constraints and soft constraints is difficult because dense soft signals can dominate sparse but essential feasibility conditions (Wang et al., 8 Jan 2026). In this formulation, hard constraints include using only valid database entities, respecting opening hours, and avoiding duplicate attractions, while soft constraints include budget reasonableness, route efficiency, and preference alignment.

This problem setting sits between earlier tourist agenda optimization and recent LLM-driven itinerary generation. Earlier planning-and-scheduling work modeled tourist routes as feasible plans over visits, durations, opening hours, and travel times, with soft penalties over POI utility, travel burden, visit count, and temporal occupation (Ibáñez-Ruiz et al., 2017). More recent hybrid systems have treated travel planning as oversubscription planning, using LLMs for extraction and automated planners for validity and optimization guarantees (Rosa et al., 2024). TourPlanner instead remains LLM-centric throughout the pipeline, but grounds the process in a structured sandbox and separates retrieval, multi-path reasoning, and RL-based refinement (Wang et al., 8 Jan 2026).

2. System architecture

The framework is organized as a staged pipeline. A user query first enters PReSO, which constructs a spatially aware candidate set of attractions and nearby hotels and restaurants. That output becomes the context for CCoT, which instantiates 4–6 specialized planning agents, generates parallel daily proposals, scores them through a diversity-weighted consensus rule, and fuses the top-$k$ proposals into a day-level consensus plan. The full itinerary is then passed to a reinforcement-learning refiner that edits the plan under a hard-constraint-first reward design (Wang et al., 8 Jan 2026).

Component	Role	Key mechanisms
PReSO	Candidate construction	explicit and implicit preference extraction, three-branch POI recall, DBSCAN clustering
CCoT	Multi-path reasoning	4–6 specialized agents, peer review, diversity-weighted consensus, top-$k$ fusion
RL refinement	Post-hoc repair and improvement	hard/soft reward decomposition, sigmoid gate, GSPO optimization

PReSO produces a compact “given information” package enriched with cluster labels. CCoT operates day by day, updating prior commitments so previously selected attractions are not reused. The RL stage then acts as a validator-fixer: it minimally edits the itinerary, replacing noncompliant items, preferring same-cluster substitutions, and tightening temporal structure while preserving diversity and source integrity (Wang et al., 8 Jan 2026).

A common reduction of the framework to “prompt engineering” is therefore inaccurate. The paper presents TourPlanner as a retrieval-and-pruning layer, a structured multi-agent search layer, and a reward-driven repair layer, each addressing a different failure mode in itinerary generation (Wang et al., 8 Jan 2026).

3. PReSO: personalized recall and spatial optimization

PReSO begins with user profile construction. It extracts explicit fields such as departure city, destination city, departure and return day or time, duration, budget, other requirements, and restaurant type. It then performs LLM-based demand inference for two latent budget-sensitive preferences: hotel cost class $[\text{Luxury}, \text{Upscale}, \text{Midscale}, \text{Economy}]$ and meal cost range $[\text{min cost}, \text{max cost}]$ (Wang et al., 8 Jan 2026). The inferred hotel budget per night is computed as

$\frac{\text{Budget} \times 0.55}{N},$

with $N = \text{travel days} - 1$, and meal budget per day as

$I$ 0

The method then selects the highest hotel class whose minimum price fits the inferred per-night budget (Wang et al., 8 Jan 2026).
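The hotel-budget rule above can be sketched in a few lines. The per-night formula follows the text; the class price floors in `ASSUMED_MIN_PRICE` and the function names are hypothetical placeholders, since the source does not list the prices used in the sandbox.

```python
from typing import Dict, Optional

# Hotel classes ordered from most to least expensive, as listed in the text.
HOTEL_CLASSES = ["Luxury", "Upscale", "Midscale", "Economy"]

# ASSUMPTION: illustrative minimum nightly prices per class (not from the paper).
ASSUMED_MIN_PRICE: Dict[str, float] = {
    "Luxury": 1200.0,
    "Upscale": 700.0,
    "Midscale": 350.0,
    "Economy": 150.0,
}


def hotel_budget_per_night(total_budget: float, travel_days: int) -> float:
    """Budget * 0.55 / N, with N = travel_days - 1 (the number of nights)."""
    nights = travel_days - 1
    if nights <= 0:
        raise ValueError("need at least two travel days to have a hotel night")
    return total_budget * 0.55 / nights


def pick_hotel_class(per_night: float,
                     min_price: Dict[str, float] = ASSUMED_MIN_PRICE) -> Optional[str]:
    """Return the highest class whose minimum price fits the per-night budget."""
    for hotel_class in HOTEL_CLASSES:
        if min_price[hotel_class] <= per_night:
            return hotel_class
    return None  # no class fits the inferred budget


per_night = hotel_budget_per_night(total_budget=6000, travel_days=4)
chosen = pick_hotel_class(per_night)
```

With a budget of 6000 over four travel days, the per-night allowance is about 1100, so under the assumed price floors the sketch picks the Upscale class rather than Luxury.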

Candidate attraction recall uses three sources. The first is semantic similarity recall, where an embedding model retrieves attractions relevant to the query and expands keywords with synonyms. The second is canonical landmark recall, where attractions rated 4A or above are ranked by popularity and user ratings. The third is LLM-supplemented recall, where an LLM proposes additional attractions aligned with user preferences that may not surface through simple semantic or landmark retrieval (Wang et al., 8 Jan 2026). The appendix reports a semantic similarity recall number of $I$ 1 and a total POI recall budget of $I$ 2.
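One way to picture the three-branch recall is as an ordered, de-duplicated union capped at a total budget. The sketch below is illustrative only: the function name `merge_recall`, the branch priority order, and the example budget are assumptions, not details taken from the paper.

```python
from typing import Iterable, List


def merge_recall(semantic: Iterable[str],
                 landmarks: Iterable[str],
                 llm_suggested: Iterable[str],
                 total_budget: int) -> List[str]:
    """Union the three recall branches in order, dropping duplicates.

    ASSUMPTION: the priority order (semantic, landmark, LLM-supplemented)
    and the hard cap are chosen for illustration only.
    """
    merged: List[str] = []
    seen = set()
    for branch in (semantic, landmarks, llm_suggested):
        for name in branch:
            if name in seen:
                continue  # already recalled by an earlier branch
            if len(merged) >= total_budget:
                return merged  # recall budget exhausted
            seen.add(name)
            merged.append(name)
    return merged


candidates = merge_recall(["A", "B"], ["B", "C"], ["D"], total_budget=3)
```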

Because these recalled attractions may be geographically scattered, PReSO applies DBSCAN to attraction coordinates, using adaptive $\epsilon$-neighborhood adjustment, then uses cluster centroids as anchors for hotel and restaurant retrieval. The reported DBSCAN hyperparameters are minimum samples $I$ 4, epsilon $I$ 5, and minimum cluster number $I$ 6 duration (Wang et al., 8 Jan 2026). Cluster labels are attached to attractions, restaurants, and hotels and passed downstream, turning a flat city-wide candidate pool into neighborhood-structured planning context.
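The clustering step can be illustrated with a small, dependency-free DBSCAN. This is a sketch under stated assumptions rather than the paper's implementation: coordinates are treated as planar, the adaptation rule (shrink the radius until enough clusters appear) is a guess at what "adaptive" could mean, and the names `adaptive_dbscan`, `shrink`, and `max_rounds` are hypothetical.

```python
import math
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]
NOISE = -1


def dbscan(points: List[Point], eps: float, min_samples: int) -> List[int]:
    """Plain DBSCAN; returns one cluster label per point (-1 means noise)."""
    n = len(points)
    # Neighborhoods include the point itself, as in the usual definition.
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels: List[Optional[int]] = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return [label if label is not None else NOISE for label in labels]


def adaptive_dbscan(points: List[Point], eps: float, min_samples: int,
                    min_clusters: int, shrink: float = 0.8,
                    max_rounds: int = 20) -> Tuple[List[int], float]:
    """ASSUMED adaptation rule: shrink eps until min_clusters clusters exist."""
    labels = dbscan(points, eps, min_samples)
    for _ in range(max_rounds):
        if len(set(labels) - {NOISE}) >= min_clusters:
            break
        eps *= shrink
        labels = dbscan(points, eps, min_samples)
    return labels, eps


def centroids(points: List[Point], labels: List[int]) -> Dict[int, Point]:
    """Mean coordinate per cluster, usable as hotel/restaurant search anchors."""
    sums: Dict[int, Tuple[float, float, int]] = {}
    for (x, y), label in zip(points, labels):
        if label == NOISE:
            continue
        sx, sy, count = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, count + 1)
    return {label: (sx / c, sy / c) for label, (sx, sy, c) in sums.items()}


pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, final_eps = adaptive_dbscan(pts, eps=20.0, min_samples=2, min_clusters=2)
anchors = centroids(pts, labels)
```

In the toy data, the initial radius merges both groups into one cluster; two shrink steps separate them, and each centroid can then serve as an anchor for nearby hotels and restaurants. Real latitude and longitude values would need a geodesic distance or a projection instead of `math.dist`.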

The empirical effect is measured as candidate recall of ground-truth itinerary elements. With GPT-4o, PReSO achieves 42.26 recall versus 27.83 for the TripTailor workflow, a gain of 14.43 points, and the figure reports consistent gains across all tested backbones (Wang et al., 8 Jan 2026). This suggests that, in this framework, retrieval quality is not merely a front-end convenience but a direct determinant of downstream plan quality.

4. CCoT: competitive consensus over multiple reasoning paths

CCoT is TourPlanner’s mechanism for feasible-space exploration. For a query $Q$, the system initializes a static set of $n$ specialized agents,

$\mathcal{A} = \{a_1, a_2, \dots, a_n\},$

where each agent $a_i = (\mathrm{id}_i, o_i, p_i)$ has an identity $\mathrm{id}_i$, an objective $o_i$, and ranked priorities $p_i$ (Wang et al., 8 Jan 2026). The prompt design requires measurable objectives such as minimizing average leg distance or keeping total cost within budget. The default configuration uses 4–6 agents; fewer agents reduce coverage of competing objectives, while more agents show diminishing returns (Wang et al., 8 Jan 2026).

Planning is iterative across days. For day $Q$ 4, a general expert agent creates a base route skeleton $Q$ 5. Each specialized agent then refines that skeleton into its own daily proposal: $Q$ 6 Here $Q$ 7 is the current given information and $Q$ 8 is the consensus plan for prior days (Wang et al., 8 Jan 2026). Attractions, restaurants, and accommodations already used in earlier consensus days are removed from the context before planning day $Q$ 9, which enforces cross-day diversity.
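The cross-day diversity rule amounts to filtering the candidate context against everything already committed. A minimal sketch, assuming candidates and daily plans are stored as dictionaries from category name to entity names (a hypothetical representation, not the paper's data format):

```python
from typing import Dict, List


def prune_context(candidates: Dict[str, List[str]],
                  prior_days: List[Dict[str, List[str]]]) -> Dict[str, List[str]]:
    """Drop every entity that already appears in an earlier consensus day."""
    used = {name
            for day in prior_days
            for names in day.values()
            for name in names}
    return {category: [name for name in names if name not in used]
            for category, names in candidates.items()}


context = {
    "attractions": ["Museum", "Park", "Tower"],
    "restaurants": ["Noodle House", "Tea Garden"],
}
day_one = {"attractions": ["Museum"], "restaurants": ["Noodle House"]}
day_two_context = prune_context(context, [day_one])
```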

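Proposal arbitration proceeds in three phases. First, proposal diversity weighting uses embedded proposal vectors $\mathbf{e}_1, \dots, \mathbf{e}_n$ and a cosine-similarity matrix

$S_{ij} = \frac{\mathbf{e}_i^{\top} \mathbf{e}_j}{\lVert \mathbf{e}_i \rVert \, \lVert \mathbf{e}_j \rVert}.$

The average similarity of proposal $i$ to the others is

$\bar{s}_i = \frac{1}{n - 1} \sum_{j \neq i} S_{ij},$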

the raw diversity weight is

$I$ 4

and the normalized weight is

$I$ 5

with $I$ 6 (Wang et al., 8 Jan 2026). Proposals that are more distinct receive larger weight.
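The diversity-weighting phase can be sketched as follows. Cosine similarity and the average over the other proposals follow the description above; the raw weight `1 - average similarity` and the sum-to-one normalization are assumptions chosen so that more distinct proposals receive larger weight, and may differ from the paper's exact formulas.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def diversity_weights(embeddings: List[Sequence[float]]) -> List[float]:
    """Weight each proposal by how dissimilar it is from the others."""
    n = len(embeddings)
    if n < 2:
        return [1.0] * n
    avg_sim = [
        sum(cosine(embeddings[i], embeddings[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    raw = [1.0 - s for s in avg_sim]  # ASSUMPTION: raw weight = 1 - avg similarity
    total = sum(raw)
    if total <= 0:
        return [1.0 / n] * n  # all proposals identical: fall back to uniform
    return [r / total for r in raw]  # ASSUMPTION: normalize to sum to one


weights = diversity_weights([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

In this toy case the first two proposals are identical and the third is orthogonal to both, so the third receives half of the total weight.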

Second, every agent reviews every proposal, assigning a score $I$ 7 and a natural-language critique $I$ 8 according to its own objective, ranked priorities, and feasibility judgment (Wang et al., 8 Jan 2026). Third, proposal $I$ 9 receives an aggregated consensus score: $k$ 0 The top- $k$ 1 proposals are selected, with default $k$ 2, and an LLM fuses them into the day’s consensus plan $k$ 3 while preserving geographic continuity and respecting explicit hard constraints (Wang et al., 8 Jan 2026).
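The review and selection phases can be sketched as below. The aggregation rule (diversity weight times the mean peer-review score) is an assumption about how the consensus score could be formed, and the value of `k` in the example is arbitrary rather than the paper's default.

```python
from typing import List


def consensus_scores(weights: List[float],
                     review_scores: List[List[float]]) -> List[float]:
    """review_scores[i][j] is the score agent i assigns to proposal j.

    ASSUMPTION: consensus score = diversity weight * mean review score.
    """
    n_reviewers = len(review_scores)
    return [
        w * sum(row[j] for row in review_scores) / n_reviewers
        for j, w in enumerate(weights)
    ]


def select_top_k(scores: List[float], k: int) -> List[int]:
    """Indices of the k highest-scoring proposals (ties keep original order)."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


weights = [0.25, 0.25, 0.5]
reviews = [
    [8.0, 6.0, 7.0],  # scores given by agent 0
    [6.0, 8.0, 7.0],  # scores given by agent 1
    [7.0, 7.0, 7.0],  # scores given by agent 2
]
scores = consensus_scores(weights, reviews)
winners = select_top_k(scores, k=2)
```

The selected proposals would then be handed to an LLM for fusion into the day's consensus plan; that step is not modeled here.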

The ablation study supports the value of this mechanism. Using 4–6 agents yields the best balance; using 3 agents hurts macro rationality and final pass rate, using 10 agents shows diminishing returns, and directly combining proposals without CCoT reduces macro rationality to 84.9 and final pass rate to 47.8 (Wang et al., 8 Jan 2026). In this sense, CCoT is not only an ensemble but an explicit arbitration protocol over specialized itinerary hypotheses.

5. Constraint-gated reinforcement learning

The final refinement stage is framed as sequence-level RL over itinerary edits. The policy receives the query and the consensus itinerary and produces a refined itinerary. The paper does not define a symbolic environment transition system over itinerary states; operationally, the policy acts through autoregressive sequence generation and is trained with Group Sequence Policy Optimization (GSPO) (Wang et al., 8 Jan 2026).

The reward is decomposed into hard and soft parts: $k$ 4 where

$k$ 5

The reported hyperparameters are $k$ 6 and $k$ 7 (Wang et al., 8 Jan 2026). The role of the gate is to suppress soft-objective optimization until hard-constraint satisfaction is sufficiently high.
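A minimal sketch of how a sigmoid gate can hold back the soft reward until the hard reward is high. The additive form, the parameter names `steepness` and `threshold`, and their default values are assumptions for illustration; they are not the paper's reported formula or hyperparameters.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_reward(r_hard: float, r_soft: float,
                 steepness: float = 10.0, threshold: float = 0.8) -> float:
    """ASSUMED form: R = R_hard + sigmoid(steepness * (R_hard - threshold)) * R_soft.

    Both rewards are assumed to lie in [0, 1]. When r_hard is well below the
    threshold the gate is close to 0 and the soft reward contributes almost
    nothing; once r_hard exceeds it, the soft reward is passed through.
    """
    gate = sigmoid(steepness * (r_hard - threshold))
    return r_hard + gate * r_soft


low = gated_reward(r_hard=0.0, r_soft=1.0)    # soft reward almost fully suppressed
mid = gated_reward(r_hard=0.8, r_soft=1.0)    # gate is exactly one half
high = gated_reward(r_hard=1.0, r_soft=1.0)   # soft reward mostly passed through
```

Under such a gate, a policy gains little by improving route or budget scores while feasibility or rationality checks still fail, which matches the hard-constraint-first intent described above.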

The hard reward is

$k$ 8

Feasibility is

$k$ 9

where $k$ 0 checks that all entities exist in the sandbox and $k$ 1 checks that required details are complete. Rationality is

$k$ 2

where the indicators test restaurant uniqueness, attraction uniqueness, valid attraction duration, and validity of visit times against opening hours (Wang et al., 8 Jan 2026).
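The hard-constraint indicators can be sketched as simple boolean checks over a structured itinerary. The `Visit` record, the duration bounds, and the aggregation into a single score (fraction of checks passed) are hypothetical choices for illustration; the paper's exact feasibility and rationality formulas are not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Visit:
    name: str
    kind: str      # "attraction" or "restaurant"
    start: float   # visit start, hours since midnight
    end: float     # visit end, hours since midnight
    opens: float   # venue opening time
    closes: float  # venue closing time


def hard_checks(visits: List[Visit], sandbox: Set[str],
                min_hours: float = 0.5, max_hours: float = 8.0) -> Dict[str, bool]:
    """Feasibility and rationality indicators named in the text."""
    attractions = [v for v in visits if v.kind == "attraction"]
    restaurants = [v for v in visits if v.kind == "restaurant"]
    return {
        "entities_in_sandbox": all(v.name in sandbox for v in visits),
        "details_complete": all(bool(v.name) and v.kind in ("attraction", "restaurant")
                                for v in visits),
        "restaurants_unique": len({v.name for v in restaurants}) == len(restaurants),
        "attractions_unique": len({v.name for v in attractions}) == len(attractions),
        # ASSUMPTION: a "valid" duration lies between min_hours and max_hours.
        "durations_valid": all(min_hours <= v.end - v.start <= max_hours
                               for v in attractions),
        "within_opening_hours": all(v.opens <= v.start and v.end <= v.closes
                                    for v in visits),
    }


def hard_reward(checks: Dict[str, bool]) -> float:
    """ASSUMED aggregation: fraction of indicators that pass."""
    return sum(checks.values()) / len(checks)


sandbox = {"Museum", "Park", "Noodle House"}
plan = [
    Visit("Museum", "attraction", 9, 11, 8, 17),
    Visit("Noodle House", "restaurant", 12, 13, 11, 21),
    Visit("Park", "attraction", 14, 16, 6, 22),
]
good = hard_checks(plan, sandbox)
bad = hard_checks(plan + [Visit("Museum", "attraction", 18, 19, 8, 17)], sandbox)
```

The second plan repeats the museum after closing time, so it fails both the attraction-uniqueness and the opening-hours indicators.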

The soft reward is

$k$ 3

The budget score is

$k$ 4

where $k$ 5 is generated plan cost and $k$ 6 is budget. The route score is

$k$ 7

and the preference-alignment score is

$k$ 8

The average route distance of a plan is defined as

$k$ 9

All of these are sequence-level itinerary scores rather than per-step rewards (Wang et al., 8 Jan 2026).
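As a rough illustration of a sequence-level route score, the sketch below computes the mean great-circle distance between consecutive stops of a plan. Treating average route distance as a mean over consecutive legs, the haversine metric, and the function names are assumptions for illustration, not the paper's definition.

```python
import math
from typing import List, Tuple

LatLon = Tuple[float, float]  # (latitude, longitude) in degrees
EARTH_RADIUS_KM = 6371.0


def haversine_km(a: LatLon, b: LatLon) -> float:
    """Great-circle distance between two points on a spherical Earth."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def average_leg_distance(stops: List[LatLon]) -> float:
    """Mean distance over consecutive stops; 0.0 if there is no leg."""
    if len(stops) < 2:
        return 0.0
    legs = [haversine_km(p, q) for p, q in zip(stops, stops[1:])]
    return sum(legs) / len(legs)


one_degree = haversine_km((0.0, 0.0), (0.0, 1.0))
avg = average_leg_distance([(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)])
```

One score is produced per itinerary, which is consistent with the sequence-level reward design described above.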

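GSPO is optimized with

$\mathcal{J}_{\text{GSPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\left( s_i(\theta)\, \hat{A}_i,\ \operatorname{clip}\left(s_i(\theta), 1 - \varepsilon, 1 + \varepsilon\right) \hat{A}_i \right) \right],$

where the group-based advantage is

$\hat{A}_i = \frac{r(x, y_i) - \operatorname{mean}\left(\{r(x, y_j)\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{r(x, y_j)\}_{j=1}^{G}\right)}.$

The appendix further gives the sequence-level importance ratio

$s_i(\theta) = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)} \right)^{1/|y_i|} = \exp\left( \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \log \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})} \right).$

In the standard GSPO notation used here, $x$ is the input prompt, $y_1, \dots, y_G$ are the $G$ responses sampled for that prompt, $r$ is the sequence-level reward, and $\varepsilon$ is the clipping range.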
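The group-normalized advantage, the length-normalized sequence ratio, and the clipped objective used by GSPO can be sketched numerically as follows. The code works on plain lists of rewards and per-token log-probabilities; the population standard deviation, the small `eps` stabilizer, and the clip range in the example are assumptions for illustration, not the paper's training settings.

```python
import math
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each response's reward by its group's mean and std."""
    mu = mean(rewards)
    sd = pstdev(rewards)  # ASSUMPTION: population std; eps avoids division by zero
    return [(r - mu) / (sd + eps) for r in rewards]


def sequence_ratio(logp_new: List[float], logp_old: List[float]) -> float:
    """Length-normalized importance ratio from per-token log-probabilities."""
    length = len(logp_new)
    return math.exp(sum(a - b for a, b in zip(logp_new, logp_old)) / length)


def gspo_objective(ratios: List[float], advantages: List[float],
                   clip_eps: float = 0.2) -> float:
    """Clipped surrogate averaged over the group (to be maximized)."""
    terms = []
    for s, a in zip(ratios, advantages):
        clipped = min(max(s, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(s * a, clipped * a))
    return sum(terms) / len(terms)


# A group of 8 responses to one prompt, each scored at the sequence level.
rewards = [0.2, 0.9, 0.5, 0.5, 0.1, 0.8, 0.4, 0.6]
advantages = group_advantages(rewards)
ratio = sequence_ratio([-1.0, -1.2, -0.8], [-1.1, -1.2, -0.9])
```

Because the ratio is computed once per response, clipping acts on whole itineraries rather than on individual tokens, which fits the sequence-level rewards described in this section.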

The training setup uses Qwen2.5-3B-Instruct for reward-model fine-tuning and 32 NVIDIA H800 GPUs for RL fine-tuning, with learning rate $[\text{Luxury}, \text{Upscale}, \text{Midscale}, \text{Economy}]$ 3, global batch size 32, mini-batch size 4, 8 responses per prompt, and 3 epochs (Wang et al., 8 Jan 2026).

The ablation shows why the gate matters. Constraint-gated RL reaches a final pass rate of 56.1 and a final surpassing rate of 30.2. Vanilla RL improves the route distance ratio to 1.91 but lowers macro rationality to 67.9 and the final pass rate to 47.5 (Wang et al., 8 Jan 2026). The paper states the intended interpretation explicitly: without gating, dense soft signals are easier to optimize than sparse hard constraints, so the policy gains route efficiency at the cost of hard-constraint satisfaction.

6. Evaluation, significance, and relation to adjacent systems

TourPlanner’s main evaluation metrics are Feasibility Pass Rate, Rationality Pass Rate, Average Route Distance Ratio, Final Pass Rate, and Final Surpassing Rate (Wang et al., 8 Jan 2026). The final system reports feasibility 100.0 micro / 100.0 macro, rationality 97.1 micro / 88.7 macro, average route distance ratio 2.15, final pass rate 56.1, and final surpassing rate 30.2 (Wang et al., 8 Jan 2026). TourPlanner without RL already achieves near-saturated feasibility and strong rationality, but the RL stage further improves final pass and surpassing rate.

These results are reported against Direct Planning, ReAct Planning, and the TripTailor workflow, and the gains are described as backbone-agnostic across GPT-4o, Qwen3-235B-A22B-Instruct, Qwen3-30B-A3B-Thinking, and DeepSeek-R1 (Wang et al., 8 Jan 2026). This suggests that the contribution is primarily architectural: retrieval quality, diversified reasoning, and staged optimization matter independently of the underlying frontier model.

Within the broader literature, TourPlanner occupies a distinct position. It differs from tourist-agenda planners that explicitly optimize soft penalties over utility, travel, visit count, and occupation using PDDL or CSP solvers (Ibáñez-Ruiz et al., 2017). It also differs from hybrid planner architectures such as TRIP-PAL, which use LLMs for travel information extraction but rely on automated planners for validity and optimality in single-day oversubscription planning (Rosa et al., 2024). Relative to multi-agent LLM systems such as Vaiage, which use a graph-structured multi-agent framework with tool grounding and report an average score of 8.5 out of 10, TourPlanner’s distinctive contribution is the diversity-weighted competitive consensus mechanism plus the sigmoid-gated RL stage (Liu et al., 16 May 2025). Relative to itinerary-editing work such as iTIMO, which formalizes REPLACE, ADD, and DELETE operations for modification rather than generation, TourPlanner remains a single-turn itinerary generator rather than an incremental editor (Huang et al., 15 Jan 2026).

The limitations stated in the paper are correspondingly specific. End-to-end RL over the whole CCoT process is described as difficult because the reasoning process is complex and iterative across days. Reward modeling is also described as limited; better user-aligned reward models are identified as a route to improving the surpassing rate (Wang et al., 8 Jan 2026). A plausible implication is that TourPlanner’s current gains come more from search-space construction and staged control than from a fully resolved model of user utility.

In that sense, TourPlanner can be understood as a synthesis of three ideas that have often been separated in the travel-planning literature: high-recall grounded retrieval, explicit multi-path reasoning, and hard-constraint-first optimization. Its significance lies less in any one module than in the fact that these modules are coupled tightly enough to raise both feasibility and user-preference alignment in a large, structured itinerary benchmark (Wang et al., 8 Jan 2026).