TravelBench: Benchmark for Travel AI Evaluation

Updated 3 January 2026
  • TravelBench is a benchmark suite that evaluates travel-domain AI through multi-turn dialog, tool orchestration, and unsolvability detection.
  • It employs controlled tool environments and rubric-based scoring to ensure deterministic evaluations and reproducible results.
  • The task suite spans multi-turn, single-turn, and unsolvable requests, targeting realistic itinerary planning; the TravelBench name also attaches to separate benchmarks for low-resource travel-domain NLP and GPS travel-mode detection.

TravelBench is the name of several distinct benchmarks targeting different aspects of travel-domain AI: multi-turn, tool-augmented itinerary planning with LLMs; low-resource travel-domain NLP; and GPS trajectory travel-mode detection. Most notably, "TravelBench" refers to the benchmark introduced by Cheng et al. (2025) for spatio-temporal, tool-augmented travel planning with LLM agents, which has become the reference testbed for this capability class. Other uses include the TravelBench GPS trajectory benchmark for travel-mode detection (Chen et al., 2021) and domain-specific low-resource NLP evaluation (Billa et al., 3 Oct 2025). The overview below focuses on the tool-augmented LLM agent planning paradigm, covering benchmark structure, evaluation methodology, and domain significance.

1. Motivation and Evaluation Scope

TravelBench was developed to address the limitations of early LLM-centric travel planning benchmarks, which suffered from static (single-turn) scenarios, lack of multi-turn dialog, restricted domain breadth, and absence of deterministic tool environments. Real-world travel planning constitutes a challenging evaluation domain due to its requirement for multi-step reasoning, dynamic and incomplete user preferences, diverse external constraints (budgets, schedules, weather), and the orchestration of external tool APIs. This makes it especially suitable for stress-testing agentic model capabilities in planning, tool use, interactive clarification, and infeasibility detection (Cheng et al., 27 Dec 2025).

Compared to precursors such as TravelPlanner, TripScore, ChinaTravel, and Flex-TravelPlanner, TravelBench is explicitly designed to (1) support multi-turn dialog planning; (2) assess interactive and one-shot planning; (3) test unsolvability detection; (4) control tool outputs via a sandbox/caching framework; and (5) evaluate both open- and closed-source models in a stable, reproducible setting (Cheng et al., 27 Dec 2025, Hu et al., 31 Dec 2025).

2. Benchmark Task Design and Tool Environment

TravelBench comprises three distinct evaluation subsets (a hypothetical instance schema is sketched after this list):

  • Multi-turn planning: The agent must conduct a dialog to elicit overlooked constraints, issue a sequence of tool calls (maps, weather, transport, web search), and incrementally revise the plan.
  • Single-turn planning: The agent responds to a compound request in a single exchange, decomposing the problem and orchestrating tool calls to produce the itinerary.
  • Unsolvable requests: The benchmark includes queries that cannot be satisfied by the tool suite or dataset (e.g., physically impossible, missing context, unimplemented tool functionality). Correct refusal or clarification is the only valid response.
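
For concreteness, the following is a minimal sketch of what a task instance might carry across these three subsets; the class and field names (TravelBenchInstance, hidden_constraints, unsolvable_reason) are illustrative assumptions, not the published schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TravelBenchInstance:
    """Hypothetical task instance; field names are illustrative, not the official schema."""
    instance_id: str
    subset: str                 # "single_turn" | "multi_turn" | "unsolvable"
    query: str                  # the user's initial travel request
    hidden_constraints: list[str] = field(default_factory=list)  # surfaced via dialog (multi-turn)
    unsolvable_reason: Optional[str] = None  # set only for unsolvable instances

example = TravelBenchInstance(
    instance_id="mt-0042",
    subset="multi_turn",
    query="Plan a 3-day Hangzhou trip; I dislike crowded spots.",
    hidden_constraints=["budget under 2000 CNY", "no early-morning departures"],
)
```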

The environment exposes a suite of 10 canonical travel-domain tools, including POI search (map_search_places), route computation (map_compute_routes), POI ranking, flight/train queries, weather APIs, and web search. All tool signatures are formally specified; inputs and outputs are controlled for determinism using a tool-trace cache with ICL-based simulation for cache misses (Cheng et al., 27 Dec 2025, Hu et al., 31 Dec 2025).
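
The sketch below shows one way such a deterministic tool layer can work, assuming a hash-keyed trace store with an LLM-based simulator invoked once per cache miss; the class and method names are assumptions, not the authors' implementation.

```python
import hashlib
import json

class ToolTraceCache:
    """Sketch of a deterministic tool layer: replay cached traces, fall back
    to an in-context-learning simulator on a cache miss. Names and layout
    are assumptions, not the published implementation."""

    def __init__(self, traces: dict[str, dict], simulator):
        self.traces = traces          # key -> recorded tool output
        self.simulator = simulator    # callable: (tool_name, args) -> simulated output

    @staticmethod
    def _key(tool_name: str, args: dict) -> str:
        # Canonical JSON so identical calls always hash to the same key.
        payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def call(self, tool_name: str, args: dict) -> dict:
        key = self._key(tool_name, args)
        if key not in self.traces:
            # Cache miss: simulate once (e.g., an LLM prompted with real traces),
            # then pin the result so every later run sees the same output.
            self.traces[key] = self.simulator(tool_name, args)
        return self.traces[key]

# Usage: cache.call("map_search_places", {"city": "Hangzhou", "theme": "tea houses"})
```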

Experimental reproducibility is enforced by isolating agent runs from external API randomness, with deterministic temperature settings for both agent and user-simulator. During evaluation, tool misuse and argument errors (missing fields, type mismatches) are penalized via rubric-based error rates.

3. Evaluation Metrics, Formalism, and Score Aggregation

TravelBench employs a rubric-based, multi-tiered evaluation protocol. For each subtask:

Unsolvable accuracy

For each unsolvable instance $j$:

$$y_j = \begin{cases} 1, & \text{if the agent's first response contains } [\text{Unsolved}] \\ 0, & \text{otherwise} \end{cases}$$

Unsolvable accuracy:

$$S_{\mathrm{unsolved}} = \frac{1}{N_{\mathrm{unsolved}}} \sum_{j=1}^{N_{\mathrm{unsolved}}} y_j \times 100$$
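
In code, the metric reduces to a first-response substring check; the snippet below is a direct transcription of the formula, with the [Unsolved] marker taken literally from it.

```python
def unsolvable_accuracy(first_responses: list[str]) -> float:
    """S_unsolved: percentage of unsolvable instances whose *first* agent
    response contains the [Unsolved] marker (per the formula above)."""
    hits = sum(1 for r in first_responses if "[Unsolved]" in r)
    return 100.0 * hits / len(first_responses)
```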

Single-turn and Multi-turn rubric scores

Rubric dimensions (reasoning, summarization, presentation; plus user_interaction for multi-turn) are each rated $r_i \in \{1,\dots,5\}$:

$$\bar r = \begin{cases} \frac{1}{3}\sum_{i=1}^{3} r_i, & \text{single-turn} \\ \frac{1}{4}\sum_{i=1}^{4} r_i, & \text{multi-turn} \end{cases}$$

Normalized to $[0,100]$:

$$S_{\mathrm{single}} = \frac{\bar r - 1}{4}\times 100, \quad S_{\mathrm{multi}} = \frac{\bar r - 1}{4}\times 100$$
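
A direct transcription of the rubric normalization, assuming ratings arrive as a plain list (3 entries for single-turn, 4 for multi-turn):

```python
def rubric_score(ratings: list[int]) -> float:
    """Mean 1-5 rubric rating normalized to [0, 100].
    Pass 3 ratings for single-turn, 4 (incl. user_interaction) for multi-turn."""
    r_bar = sum(ratings) / len(ratings)
    return (r_bar - 1) / 4 * 100
```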

Penalty terms

Tool-call error penalty:

$$w_1 = 1 - \frac{N_{\mathrm{err}}}{N_{\mathrm{all}}}, \qquad 0 \leq w_1 \leq 1$$

Meta-judge calibration:

$$w_2 = \frac{s}{5}$$

Final penalized score:

$$S_t^{\mathrm{pen}} = S_t\, w_1\, w_2$$

Aggregate score:

$$S_{\mathrm{avg}} = \frac{S_{\mathrm{single}}^{\mathrm{pen}} + S_{\mathrm{multi}}^{\mathrm{pen}} + S_{\mathrm{unsolved}}}{3}$$
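
Putting the penalty and aggregation formulas together (a sketch; meta_judge_s stands for the meta-judge's 1-5 score $s$):

```python
def penalized_score(s_t: float, n_err: int, n_all: int, meta_judge_s: float) -> float:
    """Apply the tool-call error penalty w1 = 1 - N_err/N_all and the
    meta-judge calibration w2 = s/5 to a rubric score S_t."""
    w1 = 1 - n_err / n_all
    w2 = meta_judge_s / 5
    return s_t * w1 * w2

def aggregate(s_single_pen: float, s_multi_pen: float, s_unsolved: float) -> float:
    """S_avg: unweighted mean of the penalized rubric scores and unsolvable accuracy."""
    return (s_single_pen + s_multi_pen + s_unsolved) / 3
```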

This multi-layer rubric yields an end-to-end quantitative evaluation of model reasoning, tool reliability, user interaction, and infeasibility detection (Cheng et al., 27 Dec 2025, Hu et al., 31 Dec 2025).

4. Dataset Construction and Task Taxonomy

The core TravelBench dataset comprises ≈4,000 real user queries sourced from Alibaba’s travel platform and filtered by dual-LM solvability annotation (GPT-5.1, Qwen3-235B). The taxonomy spans:

  • POI discovery (regional, thematic search)
  • Dynamic information (closing times, current conditions)
  • Rules and policies (e.g., extra-baggage, visa)
  • Iterative itinerary planning
  • Application/task-level interactions

Queries are divided into 500-instance single-turn and multi-turn subsets, and 103 unsolvable instances, reflecting realistic distributions of travel complexity. Multi-turn dialogues require simulated/real user feedback for constraint elicitation.
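
The dual-LM solvability annotation can be pictured as a simple agreement rule between two annotator models; the function below is an illustrative sketch, and the handling of disagreements (routing to review) is an assumption.

```python
def solvability_filter(queries, annotate_a, annotate_b):
    """Keep queries on which two independent LM annotators agree.
    annotate_a / annotate_b: callables mapping a query to "solvable" or
    "unsolvable" (e.g., two different models, per the text). The agreement
    rule is an assumption; disputed items would go to further review."""
    kept, disputed = [], []
    for q in queries:
        label_a, label_b = annotate_a(q), annotate_b(q)
        if label_a == label_b:
            kept.append((q, label_a))
        else:
            disputed.append((q, label_a, label_b))  # e.g., route to human review
    return kept, disputed
```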

5. Results, Baselines, and Failure Analysis

Benchmarking on TravelBench reveals a persistent gap between state-of-the-art LLMs and required planning robustness:

  • The strongest closed-source model evaluated (GPT-5.1) achieves an overall rubric-penalized score of 68.89, while the best open-source model (Qwen3-30B-A3B-Thinking) scores in the 51.39–61.80 range (Cheng et al., 27 Dec 2025, Hu et al., 31 Dec 2025).
  • Multi-turn planning consistently lags single-turn due to context drift, constraint forgetfulness, and tool-error propagation.
  • Tool-call penalties are substantial (10–20% of all tasks), with frequent failure cases tied to argument schema mismatches and long planning horizons.
  • Unsolvable accuracy is highest for closed-source LLMs (e.g., Qwen-Plus at 85.11%), indicating better infeasibility recognition.

Qualitative analysis identifies iterative itinerary refinement (using weather forecasts, route validation), weather-aware POI routing, and proactive clarifications as indicative of higher-performing agentic behavior. Over/under-questioning, plan drift, and incorrect refusals represent salient failure modes.

6. Significance, Extensions, and Integration with Other Efforts

TravelBench is recognized as a canonical evaluation for LLM agentic planning, underpinning technical reports such as the AMAP Agentic Planning report (STAgent), which uses TravelBench as a core validation bed for spatio-temporal planning agents. STAgent, trained on hierarchical curation and SFT-guided RL, achieves state-of-the-art performance, outperforming much larger base models via aggressive data filtering and cascaded RL (Hu et al., 31 Dec 2025).

TravelBench’s formal evaluation and controlled tool environment have informed subsequent benchmarks and evaluation pipelines:

  • TripScore: TravelBench can be extended via TripScore’s unified scalar reward (aggregating hard constraints, commonsense, soft itinerary quality, and personal preference satisfaction) for direct RL-based agent optimization (Qu et al., 10 Oct 2025).
  • Flex-TravelPlanner: Introduces dynamic multi-turn constraint adaptation and explicit constraint hierarchy evaluation, revealing that single-turn success does not predict multi-turn robustness (Oh et al., 5 Jun 2025).
  • TripCraft and Travel-Sim: Provide compositional, continuous, and agent-based simulation metrics for itinerary realism and coherence, suggesting that TravelBench could integrate simulation-based metrics for consistency over long-horizon plans (Chaudhuri et al., 27 Feb 2025, Yang et al., 14 Jun 2025).
  • SynthTRIPs: Synthetic query pipelines can supplement TravelBench with persona- and sustainability-aware queries for evaluating conversational recommenders (Banerjee et al., 12 Apr 2025).
  • Low-Resource NLP: TravelBench is also used for benchmarking LLMs on low-resource travel-domain NLP, revealing that domain adaptation is essential for robust application (Billa et al., 3 Oct 2025).

A plausible implication is that future versions of TravelBench may unify symbolic constraint languages (cf. ChinaTravel’s DSL), multi-modal context, and agent-based simulation to extend the assessment of LLM planners beyond tool orchestration toward real-world and user-centric robustness (Shao et al., 2024).

7. Public Resources, Licensing, and Reproducibility

TravelBench is distributed as a controlled static environment, with publicly available code and benchmark data (where permitted by corporate data sources). The tool suite and scripting interface specification are available, enabling third-party model evaluation and RL training pipelines. Data and code are open-sourced under CC BY 4.0 or similar licenses where possible; benchmarking scripts, tool traces, and evaluation pipelines can be accessed as described in the relevant papers (Cheng et al., 27 Dec 2025, Hu et al., 31 Dec 2025).

Extending TravelBench is explicitly encouraged: recommended directions include augmenting tool sets (hotel and ride-hailing APIs), constructing more sophisticated user simulators, experimenting with reinforcement-learning reward models, and integrating internationalization and multi-modal capabilities.


References: Cheng et al., 27 Dec 2025; Hu et al., 31 Dec 2025; Qu et al., 10 Oct 2025; Oh et al., 5 Jun 2025; Chaudhuri et al., 27 Feb 2025; Banerjee et al., 12 Apr 2025; Yang et al., 14 Jun 2025; Chen et al., 2021; Billa et al., 3 Oct 2025; Shao et al., 2024.
