
TripTailor Benchmark for Personalized Travel Planning

  • TripTailor Benchmark is a large-scale evaluation framework using nearly 4,000 real itineraries and over 500,000 POIs across China to assess personalized travel planning agents.
  • It employs practical metrics including route efficiency, feasibility, and user personalization, addressing limitations of synthetic or narrow-domain datasets.
  • The framework introduces workflow decomposition and integrates geographic reasoning to simulate human travel planning in realistic scenarios.

TripTailor is a large-scale, real-world benchmark for evaluating personalized travel planning agents, addressing the deficiencies of prior benchmarks that relied on simulated data, narrow geographical coverage, or constraint-only metrics. By providing nearly 4,000 authentic itineraries and over 500,000 real-world points of interest (POIs) across China, TripTailor enables the rigorous, multifaceted assessment of itinerary generation and recommendation models, particularly those powered by LLMs (Shen et al., 2 Aug 2025).

1. Motivations and Benchmark Objectives

Existing travel planning benchmarks such as TravelPlanner (ICML ’24) and ChinaTravel are based either on synthetic itineraries or are limited to a few cities, typically failing to capture the diversity, complexity, and scale of real-world travel scenarios. Previous datasets—usually restricted to around 10 cities with approximately 1,200 POIs each—focus mainly on hard constraint satisfaction (e.g., meeting time budgets, avoiding double-bookings), neglecting dimensions such as plan coherence, route efficiency, and fine-grained personalization.

TripTailor is designed with three primary objectives:

  1. To provide a genuinely large-scale, real-world dataset of POIs and human-written itineraries, capturing both intra- and intercity travel complexity.
  2. To define an evaluation framework that unifies feasibility (hard constraints), rationality (route optimization and temporal plausibility), and personalization (user preference alignment).
  3. To establish a “workflow decomposition” baseline, simulating the multi-step reasoning processes of human travel planners.

This approach directly addresses known shortcomings in both scale and evaluative scope, positioning TripTailor as a benchmark for pushing travel planning agents toward genuinely human-level planning ability.

2. Dataset Construction and Characteristics

TripTailor’s dataset is constructed from a comprehensive selection of sources and includes highly granular POI metadata, human-authored itineraries, and systematically processed user queries.

Data sources and processing:

  • POI metadata (coordinates, ratings, prices, opening hours, recommended durations, and text summaries) is scraped from official travel platforms and Amap, covering 40 top-visited Chinese cities.
  • Real-world itineraries are harvested from popular online travel agencies, restricted to well-documented, self-guided plans with complete schedule, transportation, and accommodation details.
  • Each itinerary is transformed into a first-person user query via LLM rewriting, which systematically hides specific entity names but preserves user preferences, temporal constraints, and budget specifications.
  • Both itinerary and query undergo multi-stage quality control: automated slot validation and manual review to eliminate hallucinations, slot gaps, and obvious scheduling artifacts.
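
As a rough illustration of the automated slot-validation stage, the sketch below checks that every day of a processed itinerary still carries the expected slots (attractions, meals, accommodation, transport) and that no out-of-database POI slipped in. The field names, required slots, and rules are illustrative assumptions, not the benchmark's released validation code.

```python
# Minimal sketch of an automated slot-validation pass over one itinerary.
# Field names and required slots are illustrative assumptions.
REQUIRED_DAY_SLOTS = {"attractions", "meals", "accommodation", "transport"}

def validate_itinerary(itinerary: dict, known_pois: set) -> list[str]:
    """Return human-readable issues; an empty list means the plan passes this
    automated stage and proceeds to manual review."""
    issues = []
    for i, day in enumerate(itinerary.get("days", []), start=1):
        missing = REQUIRED_DAY_SLOTS - day.keys()
        if missing:
            issues.append(f"day {i}: missing slots {sorted(missing)}")
        # "Within-Sandbox"-style check: every named POI must exist in the database.
        for poi in day.get("attractions", []):
            if poi not in known_pois:
                issues.append(f"day {i}: unknown POI '{poi}' (possible hallucination)")
    return issues

# Toy itinerary: day 1 is complete, day 2 is missing its transport slot.
sample = {
    "days": [
        {"attractions": ["The Bund"], "meals": ["lunch", "dinner"],
         "accommodation": "hotel", "transport": "metro"},
        {"attractions": ["Yu Garden"], "meals": ["lunch", "dinner"],
         "accommodation": "hotel"},
    ]
}
print(validate_itinerary(sample, known_pois={"The Bund", "Yu Garden"}))
# -> ["day 2: missing slots ['transport']"]
```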

Data scale and composition:

| Item Type | Training Samples | Test Samples | Total Records | Notable Statistics |
|---|---|---|---|---|
| Itineraries | 3,145 | 703 | – | Test set: 354 "Easy" (2–3 days), 349 "Hard" (4–7 days) |
| Attractions | – | – | 5,622 | All rated ≥4A or appearing in verified plans |
| Restaurants | – | – | 422,120 | Local specialty and international cuisine |
| Hotels | – | – | 89,224 | Economy to luxury |
| Flights | – | – | 15,110 | Major intercity links |
| Trains | – | – | 28,832 | Major intercity/high-speed routes |

Distribution by itinerary length in the test set: 2-day (120), 3-day (234), 4-day (196), 5-day (116), 6-day (29), 7-day (8). Hard queries involve substantially more intercity transport and average 5.2 days, with 4–6 attractions and 2 meals per day.

3. Benchmark Tasks and Evaluation Methodologies

TripTailor formalizes three primary task types:

  • Itinerary Generation: Given a user query and limited “sandbox” knowledge access (e.g., POI/transport APIs), generate a complete daily plan specifying attractions, accommodations, transport, and meals (a hypothetical plan schema is sketched after this list).
  • Itinerary Recommendation: Retrieve or rank previously curated itineraries corresponding to new user queries.
  • Comparative Evaluation: Assess whether an LLM-generated itinerary matches or surpasses a human-drafted plan.
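
To make the expected output of the generation task concrete, the following is a minimal sketch of one plausible shape for a generated day plan. The class and field names are hypothetical and simply mirror the slots listed above; they are not a schema published with the benchmark.

```python
from dataclasses import dataclass, field

@dataclass
class Visit:
    poi_name: str
    start: str          # e.g. "09:30"
    duration_min: int   # planned visit length

@dataclass
class DayPlan:
    date: str
    attractions: list[Visit] = field(default_factory=list)
    meals: list[str] = field(default_factory=list)   # restaurant names
    accommodation: str = ""                          # hotel for the night
    transport: str = ""                              # e.g. intercity train leg

# Hypothetical first day of a Hangzhou trip.
day1 = DayPlan(
    date="2025-05-01",
    attractions=[Visit("West Lake", "09:30", 180), Visit("Lingyin Temple", "14:00", 120)],
    meals=["Lou Wai Lou", "Grandma's Kitchen"],
    accommodation="Lakeside Hotel",
    transport="metro + walking",
)
print(day1.attractions[0].poi_name, day1.accommodation)
```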

Central evaluation metrics include:

  1. Constraint Satisfaction Score:

$$S_{\mathrm{cons}}(I) = 1 - \frac{\#\,\text{violated\_constraints}(I)}{\#\,\text{total\_constraints}}$$

Constraints comprise “Within-Sandbox” (no hallucinated POIs), “Complete Information” (no missing meals/transport), deadline adherence, and resource limits.

  2. Itinerary Utility/Quality:

$$U(I,u) = \sum_{p\in I} w_p(u)\,q_p - c(I)$$

Here, $q_p$ is the quality of POI $p$, $w_p(u)$ a user preference weight, and $c(I)$ the total cost of itinerary $I$.

  3. Route Efficiency:

$$D_{\mathrm{avg}} = \frac{1}{n_d}\sum_{k=1}^{n_d}\left(\frac{1}{M_k-1}\sum_{j=1}^{M_k-1} d^{\,k}_{j,j+1}\right)$$

$n_d$: number of days; $M_k$: number of POIs visited on day $k$; $d^{\,k}_{j,j+1}$: straight-line distance between consecutive POIs on day $k$. Evaluations use the ratio $D_{\mathrm{avg}}(\text{LLM}) / D_{\mathrm{avg}}(\text{real})$; a minimal sketch of computing all three metrics is given below.
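
The following is a minimal, self-contained sketch of how the three scores above could be computed for a candidate itinerary. The per-day data layout (POIs with ratings, coordinates, and user preference weights) and the haversine distance helper are assumptions made for illustration; they do not reproduce TripTailor's actual evaluation code.

```python
import math

def haversine_km(a, b):
    """Straight-line (great-circle) distance in km between (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def constraint_score(violated: int, total: int) -> float:
    """S_cons(I) = 1 - #violated / #total."""
    return 1.0 - violated / total

def utility(days, weights, total_cost):
    """U(I, u) = sum over POIs of w_p(u) * q_p, minus total cost c(I)."""
    gain = sum(weights.get(p["name"], 1.0) * p["rating"] for day in days for p in day)
    return gain - total_cost

def avg_daily_distance(days):
    """D_avg: mean over days of the mean distance between consecutive POIs."""
    per_day = []
    for day in days:
        if len(day) < 2:
            continue  # a single-POI day has no inter-POI legs
        legs = [haversine_km(day[j]["coord"], day[j + 1]["coord"])
                for j in range(len(day) - 1)]
        per_day.append(sum(legs) / len(legs))
    return sum(per_day) / len(per_day) if per_day else 0.0

# Hypothetical two-day plan: each POI carries a name, a quality rating, and coordinates.
plan = [
    [{"name": "West Lake", "rating": 4.8, "coord": (30.246, 120.151)},
     {"name": "Lingyin Temple", "rating": 4.6, "coord": (30.241, 120.097)}],
    [{"name": "Hefang Street", "rating": 4.3, "coord": (30.237, 120.170)},
     {"name": "Leifeng Pagoda", "rating": 4.5, "coord": (30.232, 120.149)}],
]
user_weights = {"West Lake": 1.2}   # this user cares more about scenic lakes

print(constraint_score(violated=1, total=8))        # 0.875
print(utility(plan, user_weights, total_cost=3.0))  # preference-weighted quality minus cost
d_ratio = avg_daily_distance(plan) / 7.3            # vs. the real plans' ~7.3 km/day
print(round(d_ratio, 2))                            # D_avg(LLM) / D_avg(real)
```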

The evaluation protocol employs strong LLM baselines across multiple planning paradigms—direct, chain-of-thought, ReAct, Reflexion, and workflow decomposition—and comparative assessment via both LLM and learned reward models.

4. Experimental Results and Analytical Findings

Benchmark experiments show a substantive performance gap between current LLMs and human planners:

  • Overall Pass and Surpassing Rates: Even with access to full POI lists, GPT-4o achieves a “Final Pass Rate” of only 21.5%, with fewer than 10% of outputs judged to surpass human plans in personalization.
  • Model Performance: Direct-DeepSeek-V3 achieves 14.4% pass/7.8% surpass; o1-mini achieves 18.3%/9.4%.
  • Feasibility: GPT-4o produces within-sandbox plans 96.6% of the time and satisfies “Complete Information” 100% of the time, but rationality submetrics remain weak (e.g., 44% for correct meal pricing, 64% for plausible visit durations).
  • Personalization: Only 12.9% of GPT-4o’s plans are rated as “surpassing” the human plan by the LLM judge, and 22.3% by the learned reward model.
  • Route Efficiency: Real plans have a mean $D_{\mathrm{avg}}$ of 7.3 km/day, whereas GPT-4o’s plans average 17.1 km/day (more than twice the per-day travel distance).
  • Score Distributions: 80% of real plans score ≥4/5 for personalization; only ≈20% of o1-mini outputs achieve this threshold.

These results underscore the inherent difficulty of end-to-end itinerary generation under realistic, multi-faceted evaluation criteria.

5. Identified Challenges

Analysis of model outputs and submetric failures surfaces several key challenges:

  1. Feasibility Versus Quality: Satisfying hard constraints does not guarantee high-quality or user-satisfying itineraries; spatial coherence, temporal plausibility, and the avoidance of excessive travel are critical yet frequently unmet.
  2. Spatial Reasoning Deficits: LLMs struggle to optimize POI sequencing, often leading to unnecessary travel distances and impractical day plans.
  3. Personalization Granularity: Surface-level personalization (e.g., matching cuisine category) is achievable, but deeper alignment with user pace, theme, or latent preferences is rarely attained.
  4. Hallucination and Category Errors: Reasoning-centric models (e.g., o1-mini) may hallucinate destinations or misclassify transportation options, undermining reliability.

6. Methodological Innovations and Future Directions

TripTailor’s framework recommends the integration of specialized geographic reasoning algorithms (such as graph-based TSP solvers) within LLM-driven planning pipelines to address route optimization. Richer user models—potentially using latent variable inference—are necessary to encode and optimize for comfort, adventure, and thematic coherence. Evaluation procedures are likely to expand toward multi-turn dialogue, temporal constraint validation, and global applicability, with explicit suggestions to port TripTailor’s approach to other regions and to incorporate multi-modal information (e.g., images, maps).
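
As a small illustration of the kind of geographic reasoning module this suggests, the sketch below reorders one day's POIs with a greedy nearest-neighbor heuristic to shorten the route. It is a cheap stand-in for a proper graph-based TSP solver, and the POI list and coordinates are hypothetical rather than drawn from TripTailor.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def reorder_day(pois, start_coord):
    """Greedy nearest-neighbor ordering of one day's POIs from a start point
    (e.g., the hotel); a heuristic stand-in for an exact TSP solver."""
    remaining, route, current = list(pois), [], start_coord
    while remaining:
        nxt = min(remaining, key=lambda p: haversine_km(current, p[1]))
        remaining.remove(nxt)
        route.append(nxt)
        current = nxt[1]
    return route

def route_length_km(route, start_coord):
    stops = [start_coord] + [coord for _, coord in route]
    return sum(haversine_km(stops[i], stops[i + 1]) for i in range(len(stops) - 1))

# Hypothetical day plan in an LLM-proposed (spatially inefficient) order.
hotel = (39.9042, 116.4074)
day = [
    ("Summer Palace",    (39.9999, 116.2755)),
    ("Tiananmen Square", (39.9055, 116.3976)),
    ("Olympic Park",     (40.0031, 116.3883)),
    ("Forbidden City",   (39.9163, 116.3972)),
]

reordered = reorder_day(day, hotel)
print([name for name, _ in reordered])
print(round(route_length_km(day, hotel), 1), "->", round(route_length_km(reordered, hotel), 1), "km")
```

Dropping such a reordering step into an LLM planning pipeline, after POI selection but before the schedule is written out, directly targets the route-efficiency gap reported in Section 4.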

Comparison with TripCraft ("TripCraft: A Benchmark for Spatio-Temporally Fine Grained Travel Planning" (Chaudhuri et al., 27 Feb 2025)) highlights several benchmarking trends:

  • Shift from semi-synthetic to grounded, real-world data and constraint modeling.
  • Movement beyond binary feasibility toward continuous, explainable evaluation metrics (e.g., for meal timing, attraction duration/type, persona fit).
  • Emergence of “parameter-informed” prompting, using statistical distributions to guide LLM output toward empirical human behaviors.
  • Need for balancing stricter numerical objectives with hard/commonsense feasibility.

A plausible implication is that future benchmarks will increasingly embed spatio-temporal, economic, and persona distributions into both data and evaluation layers, as exemplified by TripCraft’s pipeline, to further expose weaknesses and guide model improvement.

7. Significance and Benchmark Impact

TripTailor establishes a new standard for real-world, large-scale, and personalization-aware evaluation in travel planning. By integrating detailed POI metadata, authentic itineraries, and a systematic, multi-dimensional scoring apparatus, the benchmark reveals critical weaknesses in contemporary LLM agents—particularly in spatial reasoning and deep personalization. The TripTailor dataset, experimental protocol, and baseline results collectively serve as a foundation for the development of next-generation travel planners capable of matching or exceeding human-level itinerary construction, and provide a rigorous testbed for academic and industrial research in sequential decision-making under complex, personalized constraints (Shen et al., 2 Aug 2025).

References

  • Shen et al. (2 Aug 2025). TripTailor Benchmark for Personalized Travel Planning.
  • Chaudhuri et al. (27 Feb 2025). TripCraft: A Benchmark for Spatio-Temporally Fine Grained Travel Planning.
