
TRIP-Bench: Long-Horizon Planning Benchmark

Updated 9 February 2026
  • TRIP-Bench is a long-horizon benchmark framework designed to evaluate interactive agents in complex, multi-city, multi-day travel planning scenarios.
  • It introduces a large-scale, hierarchically structured dataset and automated evaluation tools to rigorously assess constraint adherence, planning soundness, and dialogue fidelity.
  • The framework incorporates the GTPO online reinforcement learning algorithm, which employs turn-wise reward differencing and normalization to align agent actions with extended task goals.

TRIP-Bench is a benchmark framework for evaluating long-horizon interactive agents in complex, real-world task settings, with a primary use case in multi-city, multi-day travel planning. Developed to address the persistent shortcomings of prior benchmarks—namely, limited dialogue depth, inadequate constraint modeling, and non-realistic user interaction—TRIP-Bench provides a rigorously curated suite of evaluation scenarios, tools, and metrics intended to drive advances in robust, tool-augmented agent architectures. It introduces both a large-scale, hierarchically structured dataset drawn from real-world inventories and an automated evaluation system for scoring constraint adherence and task completion fidelity. TRIP-Bench further proposes a novel online reinforcement learning algorithm, GTPO, with specialized normalization and reward differencing, for training agents over extended interactions. The framework enables systematic assessment and development of LLM-based agents capable of navigating the combinatorial complexity of real-world, long-horizon decision-making and user interaction (Shen et al., 2 Feb 2026).

1. Conceptual Motivation and Benchmarking Gaps

TRIP-Bench was designed to address several key deficiencies in existing evaluation paradigms for LLM-based planning agents:

  • Lack of Long-Horizon Task Structure: Previous benchmarks such as TravelPlanner and TripTailor focus almost exclusively on single-turn tool use or short, scripted exchanges, rarely exceeding three tool calls per scenario and rarely requiring global consistency across multiple user-agent exchanges.
  • Inadequate Constraint and Multi-Tool Reasoning: Most prior datasets fail to enforce global constraints across a sequence of actions (e.g., cumulative budget, spatiotemporal compositionality, POI diversity) and do not require sophisticated orchestration of multiple specialized tools.
  • Static and Unsophisticated User Models: Existing benchmarks with “multi-turn” structure often rely on pre-scripted, non-adaptive user simulators or fixed intent flows, lacking ambiguity, clarification, iterative plan revision, and real dialogue complexities.

TRIP-Bench specifically targets evaluation of agents on: (a) long-horizon completion under evolving preferences, (b) global constraint adherence, (c) multi-tool orchestration, and (d) adaptive handling of realistic user behaviors such as clarifications, intent shifts, rollbacks, and itinerary merges (Shen et al., 2 Feb 2026).

2. Dataset Construction, Tool Protocols, and Scenario Design

TRIP-Bench leverages a real-world corpus covering 40+ cities, a catalog of over 6,000 attractions, 80,000 hotels, 400,000 restaurants, and more than 1 million bookable products, all formatted and cleaned for consistency and agent-friendly JSON output. Dialogue scenarios span 2–7+ days and up to three cities per task, supporting a broad range of travel styles.
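The paper does not specify the record schema; as a purely illustrative sketch, a cleaned, agent-friendly JSON entry for one attraction might look like the following (all field names are assumptions):

```python
import json

# Hypothetical POI record; every field name here is an illustrative
# assumption, not the benchmark's actual schema.
attraction = {
    "id": "attr_000123",
    "city": "Kyoto",
    "name": "Example Temple",
    "category": "attraction",
    "coordinates": {"lat": 35.00, "lon": 135.77},
    "open_hours": {"mon-sun": "09:00-17:00"},
    "ticket_price": {"amount": 600, "currency": "JPY"},
}
print(json.dumps(attraction, indent=2))
```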

Scenario Requirements and Tooling

  • Travel Requirements: The framework synthesizes over 40 requirement categories (budget, distance, cuisine, timing, activity type), each with >80 paraphrased NL expressions, controlled by generator $G(e)$ and validator $V(e)$ interfaces for constraint mutation and precision checking.
  • Tool Suite: 18 curated APIs are provided for all principal subtasks—transport (flight, train), hotel, attraction, and restaurant search/details/coordinates, plus general utilities (route estimation, temporal logic, geospatial queries). These APIs support rich field filtering, sorting, pagination, and retrieval of product-level details necessary for granular itinerary assembly.
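A minimal sketch of what the generator/validator pairing described above could look like for a single budget requirement. The class and function names are hypothetical illustrations, not the benchmark's actual interfaces:

```python
from dataclasses import dataclass

@dataclass
class BudgetConstraint:
    """One requirement instance e with a natural-language surface form."""
    limit: float
    text: str

def generate_budget_constraint(limit: float) -> BudgetConstraint:
    """G(e)-style generator: instantiate a requirement with one paraphrase."""
    return BudgetConstraint(
        limit=limit,
        text=f"Keep the total trip cost under {limit:.0f} USD.",
    )

def validate_budget(constraint: BudgetConstraint,
                    itinerary_costs: list[float]) -> bool:
    """V(e)-style validator: cumulative spend must not exceed the limit."""
    return sum(itinerary_costs) <= constraint.limit

c = generate_budget_constraint(2000)
print(validate_budget(c, [650.0, 480.0, 300.0]))  # True: 1430 <= 2000
print(validate_budget(c, [1200.0, 950.0]))        # False: 2150 > 2000
```

A real validator would also need to resolve product prices via the tool APIs; this sketch only shows the constraint-checking shape.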

Dialogue Structure and Difficulty Stratification

Splits are stratified by difficulty:

Tier     | Days / Cities | Constraints | User Behaviors
Easy     | 2–5 / 2       | 2–6         | add, modify, rollback, issue pointing
Medium   | 3–7 / 2–3     | 7–10        | exploratory, clarification, correction
Hard-LIT | 3–10 / 2–3    | 11–14       | >10–15 turns, small incremental updates
Hard-FIT | 3–10 / 2–3    | 11–14       | infeasibility/feasibility transitions, rollbacks
Hard-AIS | 3–10 / 2–3    | 11–14       | ambiguous intent, style shifts, clarifications
Hard-PMR | 3–10 / 2–3    | 11–14       | plan merges/switches

Dialogues extend to 15 user turns, 150+ tool calls, and context windows of up to 200,000 tokens in the hardest cases (Shen et al., 2 Feb 2026).

3. Automated Evaluation Metrics and Framework

Constraint Taxonomy and Aggregate Metrics

Constraints are divided into three classes:

  • Basic Feasibility: Structural soundness, POI existence, daily completeness.
  • Planning Soundness: Spatiotemporal coherence (no overlapping POIs, spatial logic, route realism), experience diversity, and product details.
  • User Constraints: Specified in natural language and formalized via the $V(e,i)$ validators.

For each dialogue:

  • $F_{\mathrm{feas}}$: Number of basic-feasibility violations
  • $F_{\mathrm{sound}}$: Number of planning-soundness violations
  • $F_{\mathrm{user}}$: Number of user-constraint violations

Overall success (strict): $\mathbb{1}[(F_{\mathrm{feas}}=0) \wedge (F_{\mathrm{sound}}=0) \wedge (F_{\mathrm{user}}=0)]$

Overall success (loose): $\mathbb{1}[(F_{\mathrm{feas}}=0) \wedge (F_{\mathrm{sound}}\leq2) \wedge (F_{\mathrm{user}}\leq1)]$

Success and constraint satisfaction rates are defined as

$\mathrm{SuccessRate} = \frac{\#\text{successful dialogues}}{\#\text{total dialogues}}$

$\mathrm{ConstraintSat} = \frac{\#\text{satisfied constraints}}{\#\text{requested constraints}}$

Automated rule-based validators operate at both turn and dialogue level, with tool call masking for infeasible or out-of-budget queries.
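The strict and loose success indicators and the aggregate success rate above can be sketched directly in code; the per-dialogue tuple representation is an assumption for illustration:

```python
# Strict/loose success indicators and SuccessRate, as defined above.

def strict_success(f_feas: int, f_sound: int, f_user: int) -> bool:
    # 1[(F_feas = 0) and (F_sound = 0) and (F_user = 0)]
    return f_feas == 0 and f_sound == 0 and f_user == 0

def loose_success(f_feas: int, f_sound: int, f_user: int) -> bool:
    # 1[(F_feas = 0) and (F_sound <= 2) and (F_user <= 1)]
    return f_feas == 0 and f_sound <= 2 and f_user <= 1

def success_rate(dialogues, criterion=strict_success) -> float:
    # SuccessRate = #successful dialogues / #total dialogues
    return sum(criterion(*d) for d in dialogues) / len(dialogues)

# Each tuple is (F_feas, F_sound, F_user) for one dialogue.
dialogues = [(0, 0, 0), (0, 2, 1), (1, 0, 0), (0, 3, 0)]
print(success_rate(dialogues, strict_success))  # 0.25
print(success_rate(dialogues, loose_success))   # 0.5
```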

4. Experimental Results, Failure Modes, and Insights

Zero-Shot and Reasoning-Enabled Performance

Empirical evaluation demonstrates major performance bottlenecks for current advanced models:

  • On the easy split, even the best models achieve ≤50% (loose) and ≤31% (strict) success.
  • On the hard subsets, strict success collapses (14% for LIT, 0% for FIT, 0% for AIS, ≤10% for PMR).
  • Reasoning-enabled approaches (“thinking”/CoT) improve loose-criterion success by 10–30 percentage points and strict by 5–20, but do not fundamentally close the gap: hard splits remain at or below 14% strict.

Dominant Failure Modes

The most significant sources of failure are:

  • Global Constraint Violations: Violation of budget, timing, or globally persistent requirements across multiple turns.
  • Tool Call and Coordination Errors: Incorrect filter usage, missing product-level details, and planning inconsistencies in composite queries.
  • Ambiguity and User Modeling Gaps: Difficulty with under-specified or shifting user instructions (notably in AIS and PMR subsets).

5. GTPO: Online Multi-Turn Reinforcement Learning for Long-Horizon Agents

Motivation

Conventional static supervised fine-tuning (SFT) and single-turn RL are insufficient to address distributional shift over extended, interactive agent-user sessions. GTPO (Grouped Turn-wise PPO) is proposed to explicitly address these obstacles.

Reward Formulation and Training Objective

At each turn $t$ in trajectory $k$:

  • Raw reward: $r^{(k)}_{t,\mathrm{raw}} = \mathbb{I}^{(k,t)}_{\mathrm{feas}}\,\frac{1}{|\mathcal{I}_t|}\sum_{i\in\mathcal{I}_t}c^{(k)}_{t,i}$
  • Global-instruction normalization (per constraint $i$), computed over the instruction instance:

$\hat{c}^{(k)}_{t,i} = \frac{c^{(k)}_{t,i} - \mu_i^{(k)}}{\sigma_i^{(k)}+\epsilon}$

  • Turn-wise reward differencing: $\Delta r^{(k)}_t = r^{(k)}_t - \begin{cases} r^{(k)}_{t-1}, & \text{if feasible} \\ \max_{k'} r^{(k')}_{t-1}, & \text{otherwise} \end{cases}$
  • Turn-level normalization yields the final advantage $A^{(k)}_t$.
  • GTPO PPO-style objective:

$J_{\mathrm{GTPO}}(\theta) = \mathbb{E}\bigg[\frac{1}{K} \sum_{k=1}^{K} \frac{1}{T_k} \sum_{t=1}^{T_k} \sum_{j=1}^{L_{k,t}} \min\left(\rho_{t,j}^{(k)}A_t^{(k)},\,\mathrm{clip}(\rho_{t,j}^{(k)},1-\epsilon,1+\epsilon)A_t^{(k)}\right) - \beta\, D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\bigg]$
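A minimal sketch of the reward pipeline above (raw reward, group normalization, turn-wise differencing), assuming NumPy arrays over $K$ rollouts and $T$ turns; the array shapes, the binary feasibility encoding, and the first-turn zero baseline are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def raw_rewards(feasible, constraint_scores):
    """r_t = I_feas * mean_i c_{t,i}. Shapes: (K, T) and (K, T, I)."""
    return feasible * constraint_scores.mean(axis=-1)

def normalize_constraints(c, eps=1e-6):
    """Global-instruction normalization of c_{t,i} across the group axis."""
    mu = c.mean(axis=0, keepdims=True)
    sigma = c.std(axis=0, keepdims=True)
    return (c - mu) / (sigma + eps)

def turnwise_diff(r, feasible):
    """Δr_t = r_t - r_{t-1} if feasible, else r_t - max_k' r^(k')_{t-1}."""
    K, T = r.shape
    delta = np.zeros_like(r)
    delta[:, 0] = r[:, 0]  # assumption: first turn uses a zero baseline
    for t in range(1, T):
        baseline = np.where(feasible[:, t] > 0,
                            r[:, t - 1],          # own previous reward
                            r[:, t - 1].max())    # group max if infeasible
        delta[:, t] = r[:, t] - baseline
    return delta

feas = np.ones((2, 3))                # 2 rollouts, 3 turns, all feasible
c = np.random.rand(2, 3, 4)           # 4 constraint scores per turn
c_hat = normalize_constraints(c)      # feeds the advantage computation
r = raw_rewards(feas, c)
adv = turnwise_diff(r, feas)          # turn-level normalization then gives A_t
```

Turn-level normalization of `adv` (mean/std across the group, as for `c_hat`) would produce the final advantage $A^{(k)}_t$ plugged into the clipped objective.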

Training and Application

Qwen2.5-32B-Instruct was cold-started with 3,000 SFT trajectories and Toucan data, followed by RL with 7,040 on-policy rollouts. Hyperparameters included 8-way sampling, learning rate $10^{-6}$, batch size 32, clipping 0.2, and KL-coefficient 0.05.
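The reported training setup can be captured as a plain configuration; the key names below are illustrative, not taken from the paper's codebase:

```python
# Reported GTPO training setup (values from the paper; key names assumed).
gtpo_config = {
    "base_model": "Qwen2.5-32B-Instruct",
    "sft_trajectories": 3_000,   # cold-start SFT (plus Toucan data)
    "rl_rollouts": 7_040,        # on-policy rollouts
    "group_size": 8,             # 8-way sampling per instruction
    "learning_rate": 1e-6,
    "batch_size": 32,
    "clip_epsilon": 0.2,         # PPO-style clipping range
    "kl_coefficient": 0.05,      # beta for the KL penalty
}
print(gtpo_config["learning_rate"])  # 1e-06
```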

6. Comparative Results and Broader Implications

Model                   | Easy-Loose | Easy-Strict | Mid-Loose | Mid-Strict
Qwen2.5-32B (base)      | 0%         | 0%          | 0%        | 0%
+SFT                    | 32%        | 3%          | 5%        | 0%
+GTPO (full)            | 49%        | 21%         | 40%       | 5%
Gemini-3-Pro (thinking) | 42%        | 11%         | 16%       | 0%
  • GTPO delivers significant gains over both SFT (+17/+18 pp on Easy) and Gemini-3-Pro (+7/+10 pp on Easy splits).
  • Reward shaping at turn and global levels, combined with on-policy simulation, is crucial for aligning local agent actions with long-horizon goals.
  • Dynamic user simulation mitigates covariate shift, ensuring robustness in unseen trajectories.

A plausible implication is that scalable, tool-augmented, multi-turn agent training protocols—combined with scenario-driven evaluations such as those enabled by TRIP-Bench—will be fundamental for progress in real-world, LLM-based planning systems.

7. Significance and Future Directions

TRIP-Bench establishes a high-fidelity, dialog-centric benchmark for evaluating the capabilities and limitations of interactive, tool-augmented, long-horizon agents. Performance ceilings observed even for state-of-the-art LLMs underscore the necessity of further architectural, dataset, and algorithmic innovation. The separation of success metrics into loose and strict regimes, systematic taxonomy of failure modes, and robust tooling for dialogue-scale validation, position TRIP-Bench as a reference testbed. The introduction and demonstrated efficacy of GTPO supports the importance of online, context-normalized RL in agent alignment. This suggests future advances may depend critically on richer user modeling, improved compositional constraint reasoning, and sample-efficient, interaction-centric training paradigms (Shen et al., 2 Feb 2026).
