
TRIP-Bench: Long-Horizon Planning Benchmark

Updated 9 February 2026
  • TRIP-Bench is a long-horizon benchmark framework designed to evaluate interactive agents in complex, multi-city, multi-day travel planning scenarios.
  • It introduces a large-scale, hierarchically structured dataset and automated evaluation tools to rigorously assess constraint adherence, planning soundness, and dialogue fidelity.
  • The framework incorporates the GTPO online reinforcement learning algorithm, which employs turn-wise reward differencing and normalization to align agent actions with extended task goals.

TRIP-Bench is a benchmark framework for evaluating long-horizon interactive agents in complex, real-world task settings, with a primary use case in multi-city, multi-day travel planning. Developed to address the persistent shortcomings of prior benchmarks—namely, limited dialogue depth, inadequate constraint modeling, and non-realistic user interaction—TRIP-Bench provides a rigorously curated suite of evaluation scenarios, tools, and metrics intended to drive advances in robust, tool-augmented agent architectures. It introduces both a large-scale, hierarchically structured dataset drawn from real-world inventories and an automated evaluation system for scoring constraint adherence and task completion fidelity. TRIP-Bench further proposes a novel online reinforcement learning algorithm, GTPO, with specialized normalization and reward differencing, for training agents over extended interactions. The framework enables systematic assessment and development of LLM-based agents capable of navigating the combinatorial complexity of real-world, long-horizon decision-making and user interaction (Shen et al., 2 Feb 2026).

1. Conceptual Motivation and Benchmarking Gaps

TRIP-Bench was designed to address several key deficiencies in existing evaluation paradigms for LLM-based planning agents:

  • Lack of Long-Horizon Task Structure: Previous benchmarks such as TravelPlanner and TripTailor focus almost exclusively on single-turn tool use or short, scripted exchanges, rarely exceeding three tool calls per scenario and rarely requiring global consistency across multiple user-agent exchanges.
  • Inadequate Constraint and Multi-Tool Reasoning: Most prior datasets fail to enforce global constraints across a sequence of actions (e.g., cumulative budget, spatiotemporal compositionality, POI diversity) and do not require sophisticated orchestration of multiple specialized tools.
  • Static and Unsophisticated User Models: Existing benchmarks with “multi-turn” structure often rely on pre-scripted, non-adaptive user simulators or fixed intent flows, lacking ambiguity, clarification, iterative plan revision, and real dialogue complexities.

TRIP-Bench specifically targets evaluation of agents on: (a) long-horizon completion under evolving preferences, (b) global constraint adherence, (c) multi-tool orchestration, and (d) adaptive handling of realistic user behaviors such as clarifications, intent shifts, rollbacks, and itinerary merges (Shen et al., 2 Feb 2026).

2. Dataset Construction, Tool Protocols, and Scenario Design

TRIP-Bench leverages a real-world corpus covering 40+ cities, a catalog of over 6,000 attractions, 80,000 hotels, 400,000 restaurants, and more than 1 million bookable products, all formatted and cleaned for consistency and agent-friendly JSON output. Dialogue scenarios span 2–7+ days and up to three cities per task, supporting a broad range of travel styles.
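The paper does not specify the record schema; as a purely illustrative sketch, a cleaned, agent-friendly JSON entry for one attraction might look like the following (all field names are assumptions):

```python
import json

# Hypothetical POI record; every field name here is an illustrative
# assumption, not the benchmark's actual schema.
attraction = {
    "id": "attr_000123",
    "city": "Kyoto",
    "name": "Example Temple",
    "category": "attraction",
    "coordinates": {"lat": 35.00, "lon": 135.77},
    "open_hours": {"mon-sun": "09:00-17:00"},
    "ticket_price": {"amount": 600, "currency": "JPY"},
}
print(json.dumps(attraction, indent=2))
```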

Scenario Requirements and Tooling

  • Travel Requirements: The framework synthesizes over 40 requirement categories (budget, distance, cuisine, timing, activity type), each with >80 paraphrased NL expressions, controlled by generator $G(e)$ and validator $V(e)$ interfaces for constraint mutation and precision checking.
  • Tool Suite: 18 curated APIs are provided for all principal subtasks—transport (flight, train), hotel, attraction, and restaurant search/details/coordinates, plus general utilities (route estimation, temporal logic, geospatial queries). These APIs support rich field filtering, sorting, pagination, and retrieval of product-level details necessary for granular itinerary assembly.
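A minimal sketch of what the generator/validator pairing described above could look like for a single budget requirement. The class and function names are hypothetical illustrations, not the benchmark's actual interfaces:

```python
from dataclasses import dataclass

@dataclass
class BudgetConstraint:
    """One requirement instance e with a natural-language surface form."""
    limit: float
    text: str

def generate_budget_constraint(limit: float) -> BudgetConstraint:
    """G(e)-style generator: instantiate a requirement with one paraphrase."""
    return BudgetConstraint(
        limit=limit,
        text=f"Keep the total trip cost under {limit:.0f} USD.",
    )

def validate_budget(constraint: BudgetConstraint,
                    itinerary_costs: list[float]) -> bool:
    """V(e)-style validator: cumulative spend must not exceed the limit."""
    return sum(itinerary_costs) <= constraint.limit

c = generate_budget_constraint(2000)
print(validate_budget(c, [650.0, 480.0, 300.0]))  # True: 1430 <= 2000
print(validate_budget(c, [1200.0, 950.0]))        # False: 2150 > 2000
```

A real validator would also need to resolve product prices via the tool APIs; this sketch only shows the constraint-checking shape.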

Dialogue Structure and Difficulty Stratification

Splits are stratified by difficulty:

Tier     | Days / Cities | Constraints | User Behaviors
Easy     | 2–5 / 2       | 2–6         | add, modify, rollback, issue pointing
Medium   | 3–7 / 2–3     | 7–10        | exploratory, clarification, correction
Hard-LIT | 3–10 / 2–3    | 11–14       | >10–15 turns, small incremental updates
Hard-FIT | 3–10 / 2–3    | 11–14       | infeasibility/feasibility transitions, rollbacks
Hard-AIS | 3–10 / 2–3    | 11–14       | ambiguous intent, style shifts, clarifications
Hard-PMR | 3–10 / 2–3    | 11–14       | plan merges/switches

Dialogues extend to 15 user turns, 150+ tool calls, and context windows of up to 200,000 tokens in the hardest cases (Shen et al., 2 Feb 2026).

3. Automated Evaluation Metrics and Framework

Constraint Taxonomy and Aggregate Metrics

Constraints are divided into three classes:

  • Basic Feasibility: Structural soundness, POI existence, daily completeness.
  • Planning Soundness: Spatiotemporal coherence (no overlapping POIs, spatial logic, route realism), experience diversity, and product details.
  • User Constraints: Specified in natural language and formalized via the $V(e,i)$ validators.

For each dialogue:

  • $F_{\mathrm{feas}}$: Number of basic-feasibility violations
  • $F_{\mathrm{sound}}$: Number of planning-soundness violations
  • $F_{\mathrm{user}}$: Number of user-constraint violations

Overall success (strict): $\mathbb{1}[(F_{\mathrm{feas}}=0) \wedge (F_{\mathrm{sound}}=0) \wedge (F_{\mathrm{user}}=0)]$

Overall success (loose): $\mathbb{1}[(F_{\mathrm{feas}}=0) \wedge (F_{\mathrm{sound}}\leq2) \wedge (F_{\mathrm{user}}\leq1)]$

Success and constraint satisfaction rates are defined as

$\mathrm{SuccessRate} = \frac{\#\text{successful dialogues}}{\#\text{total dialogues}}$

$\mathrm{ConstraintSat} = \frac{\#\text{satisfied constraints}}{\#\text{requested constraints}}$

Automated rule-based validators operate at both turn and dialogue level, with tool call masking for infeasible or out-of-budget queries.
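The strict and loose success indicators and the aggregate success rate above can be sketched directly in code; the per-dialogue tuple representation is an assumption for illustration:

```python
# Strict/loose success indicators and SuccessRate, as defined above.

def strict_success(f_feas: int, f_sound: int, f_user: int) -> bool:
    # 1[(F_feas = 0) and (F_sound = 0) and (F_user = 0)]
    return f_feas == 0 and f_sound == 0 and f_user == 0

def loose_success(f_feas: int, f_sound: int, f_user: int) -> bool:
    # 1[(F_feas = 0) and (F_sound <= 2) and (F_user <= 1)]
    return f_feas == 0 and f_sound <= 2 and f_user <= 1

def success_rate(dialogues, criterion=strict_success) -> float:
    # SuccessRate = #successful dialogues / #total dialogues
    return sum(criterion(*d) for d in dialogues) / len(dialogues)

# Each tuple is (F_feas, F_sound, F_user) for one dialogue.
dialogues = [(0, 0, 0), (0, 2, 1), (1, 0, 0), (0, 3, 0)]
print(success_rate(dialogues, strict_success))  # 0.25
print(success_rate(dialogues, loose_success))   # 0.5
```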

4. Experimental Results, Failure Modes, and Insights

Zero-Shot and Reasoning-Enabled Performance

Empirical evaluation demonstrates major performance bottlenecks for current advanced models:

  • On the easy split, even the best models achieve ≤50% (loose) and ≤31% (strict) success.
  • On the hard subsets, strict success collapses (14% for LIT, 0% for FIT, 0% for AIS, ≤10% for PMR).
  • Reasoning-enabled approaches (“thinking”/CoT) improve loose-criterion success by 10–30 percentage points and strict by 5–20, but do not fundamentally close the gap: hard splits remain at or below 14% strict.

Dominant Failure Modes

The most significant sources of failure are:

  • Global Constraint Violations: Violation of budget, timing, or globally persistent requirements across multiple turns.
  • Tool Call and Coordination Errors: Incorrect filter usage, missing product-level details, and planning inconsistencies in composite queries.
  • Ambiguity and User Modeling Gaps: Difficulty with under-specified or shifting user instructions (notably in AIS and PMR subsets).

5. GTPO: Online Multi-Turn Reinforcement Learning for Long-Horizon Agents

Motivation

Conventional static supervised fine-tuning (SFT) and single-turn RL are insufficient to address distributional shift over extended, interactive agent-user sessions. GTPO (Grouped Turn-wise PPO) is proposed to explicitly address these obstacles.

Reward Formulation and Training Objective

At each turn $t$ in trajectory $k$:

  • Raw reward: $r^{(k)}_{t,\mathrm{raw}} = \mathbb{I}^{(k,t)}_{\mathrm{feas}}\,\frac{1}{|\mathcal{I}_t|}\sum_{i\in\mathcal{I}_t}c^{(k)}_{t,i}$
  • Global-instruction normalization (per constraint $i$), computed over the instruction instance:

$\hat{c}^{(k)}_{t,i} = \frac{c^{(k)}_{t,i} - \mu_i^{(k)}}{\sigma_i^{(k)}+\epsilon}$

  • Turn-wise reward differencing: $\Delta r^{(k)}_t = r^{(k)}_t - \begin{cases} r^{(k)}_{t-1}, & \text{if feasible} \\ \max_{k'} r^{(k')}_{t-1}, & \text{otherwise} \end{cases}$
  • Turn-level normalization yields the final advantage $A^{(k)}_t$.
  • GTPO PPO-style objective:

$J_{\mathrm{GTPO}}(\theta) = \mathbb{E}\bigg[\frac{1}{K} \sum_{k=1}^{K} \frac{1}{T_k} \sum_{t=1}^{T_k} \sum_{j=1}^{L_{k,t}} \min\left(\rho_{t,j}^{(k)}A_t^{(k)},\,\mathrm{clip}(\rho_{t,j}^{(k)},1-\epsilon,1+\epsilon)A_t^{(k)}\right) - \beta\, D_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})\bigg]$
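A minimal sketch of the reward pipeline above (raw reward, group normalization, turn-wise differencing), assuming NumPy arrays over $K$ rollouts and $T$ turns; the array shapes, the binary feasibility encoding, and the first-turn zero baseline are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def raw_rewards(feasible, constraint_scores):
    """r_t = I_feas * mean_i c_{t,i}. Shapes: (K, T) and (K, T, I)."""
    return feasible * constraint_scores.mean(axis=-1)

def normalize_constraints(c, eps=1e-6):
    """Global-instruction normalization of c_{t,i} across the group axis."""
    mu = c.mean(axis=0, keepdims=True)
    sigma = c.std(axis=0, keepdims=True)
    return (c - mu) / (sigma + eps)

def turnwise_diff(r, feasible):
    """Δr_t = r_t - r_{t-1} if feasible, else r_t - max_k' r^(k')_{t-1}."""
    K, T = r.shape
    delta = np.zeros_like(r)
    delta[:, 0] = r[:, 0]  # assumption: first turn uses a zero baseline
    for t in range(1, T):
        baseline = np.where(feasible[:, t] > 0,
                            r[:, t - 1],          # own previous reward
                            r[:, t - 1].max())    # group max if infeasible
        delta[:, t] = r[:, t] - baseline
    return delta

feas = np.ones((2, 3))                # 2 rollouts, 3 turns, all feasible
c = np.random.rand(2, 3, 4)           # 4 constraint scores per turn
c_hat = normalize_constraints(c)      # feeds the advantage computation
r = raw_rewards(feas, c)
adv = turnwise_diff(r, feas)          # turn-level normalization then gives A_t
```

Turn-level normalization of `adv` (mean/std across the group, as for `c_hat`) would produce the final advantage $A^{(k)}_t$ plugged into the clipped objective.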

Training and Application

Qwen2.5-32B-Instruct was cold-started with 3,000 SFT trajectories and Toucan data, followed by RL with 7,040 on-policy rollouts. Hyperparameters included 8-way sampling, learning rate $10^{-6}$, batch size 32, clipping 0.2, and KL-coefficient 0.05.
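The reported training setup can be captured as a plain configuration; the key names below are illustrative, not taken from the paper's codebase:

```python
# Reported GTPO training setup (values from the paper; key names assumed).
gtpo_config = {
    "base_model": "Qwen2.5-32B-Instruct",
    "sft_trajectories": 3_000,   # cold-start SFT (plus Toucan data)
    "rl_rollouts": 7_040,        # on-policy rollouts
    "group_size": 8,             # 8-way sampling per instruction
    "learning_rate": 1e-6,
    "batch_size": 32,
    "clip_epsilon": 0.2,         # PPO-style clipping range
    "kl_coefficient": 0.05,      # beta for the KL penalty
}
print(gtpo_config["learning_rate"])  # 1e-06
```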

6. Comparative Results and Broader Implications

Model                   | Easy-Loose | Easy-Strict | Mid-Loose | Mid-Strict
Qwen2.5-32B (base)      | 0%         | 0%          | 0%        | 0%
+SFT                    | 32%        | 3%          | 5%        | 0%
+GTPO (full)            | 49%        | 21%         | 40%       | 5%
Gemini-3-Pro (thinking) | 42%        | 11%         | 16%       | 0%
  • GTPO delivers significant gains over both SFT (+17/+18 pp on Easy) and Gemini-3-Pro (+7/+10 pp on Easy splits).
  • Reward shaping at turn and global levels, combined with on-policy simulation, is crucial for aligning local agent actions with long-horizon goals.
  • Dynamic user simulation mitigates covariate shift, ensuring robustness in unseen trajectories.

A plausible implication is that scalable, tool-augmented, multi-turn agent training protocols—combined with scenario-driven evaluations such as those enabled by TRIP-Bench—will be fundamental for progress in real-world, LLM-based planning systems.

7. Significance and Future Directions

TRIP-Bench establishes a high-fidelity, dialog-centric benchmark for evaluating the capabilities and limitations of interactive, tool-augmented, long-horizon agents. Performance ceilings observed even for state-of-the-art LLMs underscore the necessity of further architectural, dataset, and algorithmic innovation. The separation of success metrics into loose and strict regimes, systematic taxonomy of failure modes, and robust tooling for dialogue-scale validation, position TRIP-Bench as a reference testbed. The introduction and demonstrated efficacy of GTPO supports the importance of online, context-normalized RL in agent alignment. This suggests future advances may depend critically on richer user modeling, improved compositional constraint reasoning, and sample-efficient, interaction-centric training paradigms (Shen et al., 2 Feb 2026).
