TrajBench: Unified Trajectory Benchmark Suite

Updated 3 July 2026

TrajBench is a unified framework standardizing trajectory benchmarks by integrating diverse datasets, protocols, and evaluation metrics.
It employs modular architectures, pipelined data processing, and standardized APIs for reproducible and fair model evaluation.
Key tasks include trajectory forecasting, generation, anomaly detection, and language-grounded understanding across robotics and urban mobility.

TrajBench is a consolidated term used within the research community to denote a unified suite of benchmarks, datasets, protocols, and evaluation frameworks for trajectory-centric machine learning. The term encompasses a broad class of benchmarks covering trajectory forecasting, generation, anomaly detection, language-grounded trajectory understanding, and agentic process supervision, spanning robotics, transportation, and LLM tool use. TrajBench integrates methodological developments from trajectory prediction, urban mobility synthesis, crowd navigation, procedural agent auditing, and multimodal alignment, each instantiated in specialized benchmarking frameworks adhering to rigorous, reproducible evaluation standards.

1. Origin, Definition, and Scope

The concept of TrajBench emerged from the need to standardize evaluation of trajectory-related models by eliminating inconsistencies in data preprocessing, scenario partitioning, and metric computation. Major efforts such as STEP (Structured Training and Evaluation Platform) and CityTrajBench explicitly brand their standardized protocols as "TrajBench" within the autonomous vehicle, urban mobility, and multi-agent systems domains (Schumann et al., 18 Sep 2025, Zhu et al., 1 Jun 2026). The term has also proliferated into specialized domains including LLM-based agent supervision (trajectory anomaly detection), tool-use diagnostics, and language-grounded trajectory understanding (Liu et al., 6 Feb 2026, He et al., 6 Oct 2025, Li et al., 11 May 2026).

TrajBench benchmarks are characterized by:

Modular architectures allowing dataset, model, and metric plug-ins
Unified data representations and scenario sampling schemes
Strictly reproducible and transparent experimental protocols

2. Structure of Modern TrajBench Suites

The core structure underlying leading TrajBench frameworks such as STEP and CityTrajBench employs a pipelined architecture with modules for data ingestion, preprocessing/normalization, scenario extraction, model adaptation, training, prediction, and multi-level evaluation (Schumann et al., 18 Sep 2025, Zhu et al., 1 Jun 2026). Typical components include:

Module	Functionality	Example Implementation
Dataset	Unified data loader and transformer	Accepts heterogeneous formats (JSON, ROS)
Scenario	Extraction of (past, future) pairs or trip segments	Configurable temporal horizons
Model	Interface standardization (input/output signatures)	Supports stochastic and joint models
Metric	ADE/FDE, distributional, geometric, process, and agentic metrics	Batch and aggregate levels
Perturbation	Robustness/attack scenario generator	Adversarial or random perturbations
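The module layout in the table can be illustrated with a minimal plug-in pipeline. This is only a sketch under stated assumptions: every name below (`Scenario`, `extract_scenarios`, `ConstantVelocityModel`, `mean_displacement`, `run_pipeline`) is hypothetical and is not taken from STEP or CityTrajBench.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol, Sequence, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Scenario:
    """One (past, future) pair cut from a longer track."""
    past: List[Point]
    future: List[Point]


class Model(Protocol):
    """Hypothetical standardized signature: past points in, future points out."""
    def predict(self, past: Sequence[Point], horizon: int) -> List[Point]: ...


def extract_scenarios(track: Sequence[Point], n_past: int, n_future: int) -> List[Scenario]:
    """Slide a window over a track and emit every complete (past, future) pair."""
    window = n_past + n_future
    return [
        Scenario(list(track[i:i + n_past]), list(track[i + n_past:i + window]))
        for i in range(len(track) - window + 1)
    ]


class ConstantVelocityModel:
    """Baseline plug-in that repeats the last observed displacement."""
    def predict(self, past: Sequence[Point], horizon: int) -> List[Point]:
        (x0, y0), (x1, y1) = past[-2], past[-1]
        dx, dy = x1 - x0, y1 - y0
        return [(x1 + dx * k, y1 + dy * k) for k in range(1, horizon + 1)]


Metric = Callable[[Sequence[Point], Sequence[Point]], float]


def mean_displacement(pred: Sequence[Point], truth: Sequence[Point]) -> float:
    """Average Euclidean distance between matched predicted and true points."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(truth)


def run_pipeline(tracks, model: Model, metrics: Dict[str, Metric],
                 n_past: int = 2, n_future: int = 2) -> Dict[str, float]:
    """Extract scenarios, predict with the model, and average each metric."""
    scenarios = [s for t in tracks for s in extract_scenarios(t, n_past, n_future)]
    if not scenarios:
        raise ValueError("no scenarios could be extracted")
    totals = {name: 0.0 for name in metrics}
    for s in scenarios:
        pred = model.predict(s.past, len(s.future))
        for name, fn in metrics.items():
            totals[name] += fn(pred, s.future)
    return {name: total / len(scenarios) for name, total in totals.items()}
```

Because the model and metrics are passed in as arguments, swapping either one leaves the scenario extraction and aggregation untouched, which is the property the modular design aims for.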

CityTrajBench, for instance, mandates fixed split rules (70/15/15 by trip), trajectory normalization (e.g., to length $L=200$ by interpolation/truncation), and shared post-processing for all model outputs (e.g., road-graph projection), to eliminate protocol artifacts (Zhu et al., 1 Jun 2026). STEP formalizes standardized APIs and plugin interfaces (DL, DT, MB, EC, EF) for extensibility (Schumann et al., 18 Sep 2025).
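A fixed-length normalization and a trip-level 70/15/15 split might look like the sketch below. It rests on assumptions that the source does not spell out: shorter trips are upsampled by linear interpolation over the point index, longer trips are truncated to their first $L$ points, and the three parts are read as train/validation/test. The function names and the seed are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def normalize_length(points: Sequence[Point], length: int = 200) -> List[Point]:
    """Return exactly `length` points: truncate long trips, interpolate short ones."""
    if len(points) < 2 or length < 2:
        raise ValueError("need at least two points and length >= 2")
    if len(points) >= length:
        return list(points[:length])  # assumed truncation rule: keep the prefix
    last = len(points) - 1
    out = []
    for k in range(length):
        t = k * last / (length - 1)   # fractional index into the original trip
        i = min(int(t), last - 1)     # left neighbour, clamped at the final segment
        frac = t - i
        (x0, y0), (x1, y1) = points[i], points[i + 1]
        out.append((x0 + frac * (x1 - x0), y0 + frac * (y1 - y0)))
    return out


def split_by_trip(trip_ids: Sequence[str], seed: int = 0) -> Dict[str, List[str]]:
    """Shuffle unique trip IDs deterministically and cut them 70/15/15."""
    ids = sorted(set(trip_ids))       # sort first so the result ignores input order
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 70 // 100
    n_val = len(ids) * 15 // 100
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }
```

Splitting on trip IDs rather than on individual points keeps every point of a trip in the same partition, so no trip leaks across partitions.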

3. Benchmark Tasks and Evaluation Protocols

TrajBench encompasses a wide array of tasks:

Trajectory Forecasting: Multi-agent position prediction given egocentric or BEV inputs, as in JRDB-Traj (crowd navigation) or EgoTraj-Bench (ego-view noise) (Saadatnejad et al., 2023, Liu et al., 1 Oct 2025).
City-Scale Trajectory Generation: Unconditional/conditional generation of realistic city-scale taxi or EV route traces, with evaluation on spatial, geometric, and agent-level metrics (Zhu et al., 1 Jun 2026).
Travel Mode Detection: Supervised classification of transport modality from GPS sequence features (e.g., walking vs. bicycling), requiring open, labeled datasets (Chen et al., 2021).
Language-Grounded Tasks: Alignment between urban trajectories and natural-language intents/queries/captions (instruction-conditioned generation, retrieval, captioning) (Li et al., 11 May 2026).
Agentic Tool-Use Process Evaluation: Stepwise tracking of LLM-based agent tool calls, evaluating selection, argument correctness, and order under complex trajectory plans (He et al., 6 Oct 2025).
Procedural Anomaly Detection: Fine-grained detection and localization of trajectory anomalies for agent rollback and trustworthy supervision (Liu et al., 6 Feb 2026).
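To make the tool-use criteria above (selection, argument correctness, order) concrete, the following sketch scores a predicted tool-call sequence against a reference. The definitions are simplified illustrations chosen for this example, not the metrics defined by any cited benchmark, and all function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

ToolCall = Tuple[str, Dict[str, object]]  # (tool name, arguments)


def exact_match(pred: Sequence[ToolCall], ref: Sequence[ToolCall]) -> bool:
    """True only if names, arguments, and order all agree."""
    return list(pred) == list(ref)


def tool_inclusion(pred: Sequence[ToolCall], ref: Sequence[ToolCall]) -> float:
    """Fraction of reference tool names that appear anywhere in the prediction."""
    pred_names = {name for name, _ in pred}
    return sum(1 for name, _ in ref if name in pred_names) / len(ref)


def argument_accuracy(pred: Sequence[ToolCall], ref: Sequence[ToolCall]) -> float:
    """Fraction of reference calls reproduced with identical name and arguments."""
    return sum(1 for call in ref if call in pred) / len(ref)


def order_satisfied(pred: Sequence[ToolCall], ref: Sequence[ToolCall]) -> bool:
    """True if the reference tool names occur in the prediction as a subsequence."""
    remaining = iter(name for name, _ in pred)
    # `in` consumes the iterator, so each match must come after the previous one.
    return all(name in remaining for name, _ in ref)
```

Reporting these scores separately shows whether a failure comes from choosing the wrong tool, filling in the wrong arguments, or calling tools in the wrong order.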

Evaluation protocols are systematically shared, specifying scenario parameters (input/output horizons, discretization), splits, and preprocessing. Benchmarks report diverse metrics including displacement errors (ADE, FDE), geometric similarity (DTW, Fréchet), distributional fidelity (Jensen-Shannon divergence, JSD), conditional OD statistics, process-step exact matches (JEM), and trajectory-level diagnostic scores.
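The displacement and geometric metrics named above have standard textbook forms, sketched here for 2D points. The function names are illustrative, and individual benchmarks may differ in details such as units, aggregation over agents, or the number of sampled modes.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def ade(pred: Sequence[Point], truth: Sequence[Point]) -> float:
    """Average displacement error: mean Euclidean distance over all time steps."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(truth)


def fde(pred: Sequence[Point], truth: Sequence[Point]) -> float:
    """Final displacement error: Euclidean distance at the last time step."""
    return math.dist(pred[-1], truth[-1])


def min_ade(samples: Sequence[Sequence[Point]], truth: Sequence[Point]) -> float:
    """Best-of-M ADE over a set of sampled predictions."""
    return min(ade(s, truth) for s in samples)


def dtw(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Dynamic time warping cost between two trajectories of any lengths."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            # Extend the cheapest of: skip in a, skip in b, or advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

ADE and FDE require predictions aligned step by step with the ground truth, whereas DTW tolerates different sampling rates, which is one reason generation benchmarks report both kinds of metric.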

4. Canonical Datasets and Model Families

TrajBench frameworks integrate a diverse set of real-world datasets and model families. For forecasting, crowd navigation datasets (JRDB-Traj) and multimodal sensor streams are standard (Saadatnejad et al., 2023). City-scale generation benchmarks include Chengdu Taxi, Porto Taxi, and Shanghai EV, each standardized for map, temporal, and feature representation (Zhu et al., 1 Jun 2026). Supported model families include:

Statistical (Markov chain, region transitions)
VAE-based (TrajVAE)
GAN-based (TrajGAN)
Diffusion-based (DiffTraj, DiffRNTraj)
Flow-matching (TrajFlow)
Language-grounded hybrid retrieval+LLM (TrajAnchor, TrajRap, TrajFuse) (Li et al., 11 May 2026)
LLM tool-calling agents (TRAJECT-Bench)
Process anomaly verifiers (TrajAD)

Adherence to unified adapters ensures fair cross-model comparison; ablation and cross-protocol studies confirm the trade-offs among model expressivity, inference cost, and metric performance.

5. Metrics and Multi-Level Evaluation Principles

TrajBench mandates multi-level, multi-faceted evaluation to avoid overfitting to narrow performance targets:

Macro-Level: Grid-cell density JSD, OD trip distributions, and spatial coverage (PatternScore, DensityError) (Zhu et al., 1 Jun 2026)
Micro/Trajectory-Level: DTW, Fréchet, Jaccard overlap, EFE (end-to-end forecast error), collision/miss rates, and minimum-mode statistics (minADE/M)
Conditional/Agentic: Conditional OD fidelity, dependency/order satisfaction for tool-use, usage precision and inclusion, trajectory anomaly JEM (He et al., 6 Oct 2025, Liu et al., 6 Feb 2026)
Process Metrics: Binary classification (precision/recall/F₁), step-level anomaly localization, runtime re-verification protocols (Liu et al., 6 Feb 2026)
Language Alignment: Destination hit/matching, Recall@K, MRR, POI recall, groundedness (BLEU, METEOR, ROUGE-L, BERTScore F1) (Li et al., 11 May 2026)
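Two of the metric families listed above can be sketched briefly: a grid-cell density JSD for the macro level, and Recall@K and MRR for language-grounded retrieval. The sketch assumes a uniform square grid with a hypothetical `cell_size`, a base-2 logarithm (so the JSD lies in [0, 1]), and exactly one relevant item per query. None of these choices is confirmed by the cited benchmarks.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Sequence, Tuple

Point = Tuple[float, float]
Cell = Tuple[int, int]


def grid_density(points: Sequence[Point], cell_size: float) -> Dict[Cell, float]:
    """Normalized histogram of points over square grid cells."""
    counts = Counter((math.floor(x / cell_size), math.floor(y / cell_size))
                     for x, y in points)
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}


def jsd(p: Dict[Cell, float], q: Dict[Cell, float]) -> float:
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    total = 0.0
    for cell in set(p) | set(q):
        pk, qk = p.get(cell, 0.0), q.get(cell, 0.0)
        mk = 0.5 * (pk + qk)
        if pk > 0:
            total += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            total += 0.5 * qk * math.log2(qk / mk)
    return total


def recall_at_k(ranked: Sequence[Sequence[Hashable]],
                relevant: Sequence[Hashable], k: int) -> float:
    """Share of queries whose single relevant item is ranked in the top k."""
    hits = sum(1 for r, rel in zip(ranked, relevant) if rel in r[:k])
    return hits / len(relevant)


def mrr(ranked: Sequence[Sequence[Hashable]], relevant: Sequence[Hashable]) -> float:
    """Mean reciprocal rank; a query with no hit contributes zero."""
    total = 0.0
    for r, rel in zip(ranked, relevant):
        if rel in r:
            total += 1.0 / (list(r).index(rel) + 1)
    return total / len(relevant)
```

A JSD of 0 means the generated and real trajectories occupy grid cells in identical proportions, and a value of 1 means they never share a cell.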

Evaluation scripts and utilities are distributed alongside splits and code, and per-seed variability is commonly reported (e.g., ADE=0.91±0.02 m) (Schumann et al., 18 Sep 2025).
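Per-seed variability in the format quoted above can be produced with a few lines. The helper name and the input values are hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption, since the source does not say which is reported.

```python
import statistics
from typing import Sequence


def summarize_seeds(values: Sequence[float], unit: str = "m", digits: int = 2) -> str:
    """Format per-seed results as 'mean±std unit' using the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return f"{mean:.{digits}f}±{std:.{digits}f} {unit}"


# Hypothetical ADE values from three training seeds (not results from any paper).
print("ADE=" + summarize_seeds([0.89, 0.91, 0.93]))
```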

6. Extending TrajBench: Robustness, Fairness, and Future Tasks

TrajBench frameworks explicitly assess robustness under distribution shift, adversarial perturbation, and sensor noise (Schumann et al., 18 Sep 2025, Liu et al., 1 Oct 2025). For example, adversarial attacks increase minADE in joint-agent trajectory forecasting by up to +18%, while adversarial training recovers 10–15% resilience (Schumann et al., 18 Sep 2025). Process-level anomaly benchmarks stress the necessity of fine-grained, step-localized verification for trustworthy agent deployment (Liu et al., 6 Feb 2026). A plausible implication is that future TrajBench iterations will extend to multimodal and interactive agentic scenarios (branching tool-graphs, continuous anomaly mining) and further standardize fair scenario sampling, dynamic retrieval protocols, and context-efficient inference (He et al., 6 Oct 2025, Zhu et al., 1 Jun 2026).
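A robustness check of the kind described above can be sketched as follows. The sketch adds random Gaussian noise to the observed past, which is not an adversarial attack and not the procedure of the cited works, and it uses a constant-velocity predictor as a stand-in model. All names, the noise level `sigma`, and the seed are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def constant_velocity(past: Sequence[Point], horizon: int) -> List[Point]:
    """Stand-in predictor: extrapolate the last observed displacement."""
    (x0, y0), (x1, y1) = past[-2], past[-1]
    dx, dy = x1 - x0, y1 - y0
    return [(x1 + dx * k, y1 + dy * k) for k in range(1, horizon + 1)]


def ade(pred: Sequence[Point], truth: Sequence[Point]) -> float:
    """Mean Euclidean distance over all predicted time steps."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(truth)


def perturb(past: Sequence[Point], sigma: float, rng: random.Random) -> List[Point]:
    """Add independent zero-mean Gaussian noise to every observed coordinate."""
    return [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma)) for x, y in past]


def relative_degradation(clean_err: float, perturbed_err: float) -> float:
    """Relative error increase, e.g. 0.18 corresponds to +18%."""
    if clean_err == 0:
        raise ValueError("clean error is zero; relative change is undefined")
    return (perturbed_err - clean_err) / clean_err


def robustness_report(scenarios, sigma: float, seed: int = 0):
    """Return (clean ADE, perturbed ADE, relative degradation) averaged over scenarios."""
    rng = random.Random(seed)
    clean = perturbed = 0.0
    for past, future in scenarios:
        clean += ade(constant_velocity(past, len(future)), future)
        noisy = perturb(past, sigma, rng)
        perturbed += ade(constant_velocity(noisy, len(future)), future)
    clean /= len(scenarios)
    perturbed /= len(scenarios)
    return clean, perturbed, relative_degradation(clean, perturbed)
```

Reporting the relative change rather than the absolute error makes degradation comparable across models with different clean baselines.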

7. Significance and Best Practices

TrajBench’s unification of data, evaluation, and implementation protocols facilitates fair comparison, reproducible research, and cumulative progress across multi-agent mobility, robotics, and LLM-based autonomy. Best practices recommended across TrajBench frameworks include: