Trajectory-Level Quality Metrics Overview

Updated 28 March 2026
  • Trajectory-level quality metrics are quantitative measures that assess entire sequences of states, emphasizing fidelity, utility, and realism.
  • They encompass classical displacement methods, planning-aware metrics, and set-based evaluations to capture temporal dynamics and safety-critical errors.
  • These metrics are crucial for applications in robotics, computer vision, and tracking, providing actionable insights into prediction accuracy and downstream impacts.

A trajectory-level quality metric is any quantitative measure that assesses the fidelity, utility, or realism of entire sequences of states (trajectories), as opposed to framewise or pointwise comparisons. Across robotics, computer vision, motion synthesis, tracking, and reinforcement learning, these metrics are fundamental in comparing, optimizing, and interpreting models whose outputs are temporally extended behaviors. This article surveys the key classes of trajectory-level quality metrics, their theoretical basis, computational forms, and roles in contemporary research.

1. Classical and Task-Agnostic Trajectory Metrics

Historically, the most widely used trajectory-level metrics in prediction and tracking are displacement- or likelihood-based, which are generally "task-agnostic."

  • Average Displacement Error (ADE): Measures mean Euclidean distance between predicted and true trajectory positions over time:

\mathrm{ADE} = \frac{1}{T} \sum_{t=1}^T \|\hat{s}(t) - s(t)\|_2

  • Final Displacement Error (FDE): Distance at the final time step:

\mathrm{FDE} = \|\hat{s}(T) - s(T)\|_2

  • Negative Log-Likelihood (NLL): For probabilistic predictors, the negative log-likelihood of the ground-truth trajectory under the predicted distribution:

\mathrm{NLL} = -\log p(s_{1:T})

  • Best-of-K variants: When the model predicts K samples (e.g., in multimodal forecasting), one often reports the minimum ADE/FDE over the K samples, i.e., minADE/minFDE (Ivanovic et al., 2021); see the code sketch below.
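
For concreteness, here is a minimal NumPy sketch of these displacement metrics; array shapes and function names are illustrative, not taken from any particular library.

```python
import numpy as np

def ade(pred, gt):
    """Average Displacement Error: mean L2 distance over all timesteps.
    pred, gt: arrays of shape (T, 2)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def fde(pred, gt):
    """Final Displacement Error: L2 distance at the last timestep."""
    return float(np.linalg.norm(pred[-1] - gt[-1]))

def min_ade_fde(preds, gt):
    """Best-of-K variants: preds has shape (K, T, 2); report the minimum
    ADE and FDE over the K sampled trajectories."""
    ades = [ade(p, gt) for p in preds]
    fdes = [fde(p, gt) for p in preds]
    return min(ades), min(fdes)
```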

These metrics are straightforward but limited. They treat all displacement errors identically, ignore task context, cannot distinguish "harmless" from "critical" mistakes, and are blind to how errors might amplify in closed-loop planning. For example, two predictions that score identically by ADE may pose vastly different collision risks to an autonomous agent.

2. Task-Aware and Planning-Informed Metrics

To overcome the deficiencies of task-agnostic metrics, recent work introduces task- and planning-aware evaluations.

Planning-Informed Metric (PI-Metric):

Defined by reweighting standard metrics by each trajectory's planning sensitivity—the impact of a prediction error on a downstream planner's cost. The core procedure is:

  • Learn a cost function c(s(t), u(t), \hat{s}(t+1:T)) = \theta^T \phi(s(t), u(t), \hat{s}(t+1:T)), with weights \theta fitted by inverse optimal control, and features \phi encoding quantities like goal progress, control effort, and collision risk.
  • Compute the gradient norm g = \|\partial c / \partial \hat{s}\|_1 for each predicted trajectory, quantifying sensitivity.
  • Reweight a standard metric, e.g.,

\mathrm{PI\text{-}Metric} = \frac{1}{|\mathcal{A}|} \sum_{a \in \mathcal{A}} f(a, g) \cdot \mathrm{Metric}(\hat{s}_a, s_a)

where f(a, g) increases the weight on planning-critical agents/trajectories.

Empirical examples show PI-Metric penalizes dangerous prediction errors (e.g., those causing potential collisions) far more than benign ones, even when raw ADE/FDE is identical (Ivanovic et al., 2021).
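
As a rough illustration of the reweighting step, the sketch below assumes the per-agent sensitivities g have already been computed from the learned cost's gradient and uses a hypothetical linear weighting f(a, g) = 1 + alpha * g; this is a sketch of the idea, not necessarily the exact form used by Ivanovic et al.

```python
import numpy as np

def pi_metric(preds, gts, sensitivities, base_metric, alpha=1.0):
    """Planning-informed reweighting of a base trajectory metric.

    preds, gts: lists of (T, 2) arrays, one per agent.
    sensitivities: per-agent gradient norms g = ||dc/ds_hat||_1 from the
        learned planner cost (computed elsewhere).
    base_metric: e.g. ADE or FDE, called as base_metric(pred, gt).
    The weighting f(a, g) = 1 + alpha * g is a hypothetical choice."""
    scores = []
    for pred, gt, g in zip(preds, gts, sensitivities):
        weight = 1.0 + alpha * g        # up-weight planning-critical agents
        scores.append(weight * base_metric(pred, gt))
    return float(np.mean(scores))
```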

Scenario-Driven Multi-Criteria Metrics:

A more recent approach evaluates both accuracy and diversity—reflecting the need for diverse plausible predictions in interactive or ambiguous scenarios:

  • Error (E_error): Mean per-mode, per-timestep displacement error, generalizing ADE to multimodal outputs.
  • Diversity (GMM-Area Diversity, GAD): The average spatial spread of a Gaussian Mixture fitted to predicted endpoints over time, quantifying how broadly the model covers the space of plausible futures.
  • Scenario Criticality (P_c): A scalar learned by a GCN-LSTM network that encodes context criticality (e.g., intersection vs. highway), automatically adjusting the mix between accuracy and diversity.

The final score is:

\mathrm{Score} = P_c \cdot \mathrm{GAD} + (1-P_c) \cdot E_{\mathrm{error}}

Critically, this adaptively emphasizes accuracy or diversity depending on the scenario (Da et al., 13 Dec 2025).
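
A minimal sketch of the combined score follows, assuming the criticality P_c and the GMM-area diversity GAD have been computed upstream; only the E_error term is implemented here, and the shapes are illustrative.

```python
import numpy as np

def e_error(preds, gt):
    """Mean per-mode, per-timestep displacement error.
    preds: (K, T, 2) multimodal predictions; gt: (T, 2) ground truth."""
    return float(np.linalg.norm(preds - gt[None], axis=-1).mean())

def scenario_score(p_c, gad, preds, gt):
    """Criticality-weighted mix of diversity (GAD) and accuracy (E_error).
    p_c: learned scenario criticality in [0, 1].
    gad: GMM-area diversity, assumed computed from a mixture fitted to
    predicted endpoints (not implemented here)."""
    assert 0.0 <= p_c <= 1.0
    return p_c * gad + (1.0 - p_c) * e_error(preds, gt)
```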

3. Topological and Semantics-Aware Metrics

Trajectory quality must sometimes be evaluated against environmental constraints or semantic structures. These include:

  • Lane Miss Rate (LMR):

Standard Euclidean miss rate fails to penalize predictions that end on the wrong road or lane, even if spatially close. LMR assigns endpoints to lane centerlines and computes the along-lane shortest-path distance. A miss is counted if this lane distance exceeds a velocity-dependent threshold:

\mathrm{LMR}_{@1} = \frac{1}{N}\sum_{i=1}^N m^i_1,\quad \mathrm{LMR}_{@k} = \frac{1}{N}\sum_{i=1}^N \Bigl(\prod_{j=1}^k m^i_j\Bigr)

where m^i_j \in \{0,1\} is the miss indicator of the j-th prediction for sample i, so LMR_{@k} counts a miss only when all k predictions miss.

LMR thus distinguishes topologically wrong predictions (e.g., driving on a parallel road) from tolerable longitudinal drift along the intended path (Schmidt et al., 2023).
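
A small sketch of the miss-counting step, assuming the along-lane distances have already been obtained by map matching; the velocity-dependent threshold rule below is a hypothetical stand-in for the one in Schmidt et al.

```python
import numpy as np

def lane_miss_rate(lane_dists, velocities, k=1, base_thresh=2.0, horizon=1.0):
    """Lane Miss Rate over the top-k predictions.

    lane_dists: (N, K) along-lane shortest-path distances between predicted
        and ground-truth endpoints, from map matching (computed elsewhere).
    velocities: (N,) agent speeds, used for a velocity-dependent threshold;
        the linear rule below is an illustrative placeholder.
    A sample counts as a miss at k only if all of its k predictions miss."""
    thresh = base_thresh + velocities * horizon            # (N,)
    misses = lane_dists[:, :k] > thresh[:, None]           # (N, k) booleans
    return float(np.all(misses, axis=1).mean())
```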

  • Trajectory Distribution Metrics (AMD, AMV):

In multimodal prediction, metrics that only consider the best match (ADE/FDE) ignore the distribution. AMD (Average Mahalanobis Distance) and AMV (Average Maximum Eigenvalue) respectively quantify the distance from the ground-truth to the center of the cloud (robustness) and the spread (uncertainty) of predicted trajectories:

\mathrm{AMD} = \frac{1}{N T_p} \sum_{n,t} \sqrt{(p_t^n - \mu_t^n)^\top \Sigma_t^{-1} (p_t^n - \mu_t^n)},\quad \mathrm{AMV} = \frac{1}{N T_p} \sum_{n,t} \lambda_{\max}(\Sigma_t^n)

(Mohamed et al., 2022)
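
A minimal sketch of both quantities from sampled trajectories; for simplicity a single Gaussian is fitted per agent and timestep from the samples, whereas the original work estimates the predicted density differently.

```python
import numpy as np

def amd_amv(samples, gt):
    """AMD / AMV from sampled predictions.

    samples: (S, N, T, 2) predicted samples for N agents over T timesteps.
    gt: (N, T, 2) ground-truth trajectories."""
    mu = samples.mean(axis=0)                                   # (N, T, 2)
    diffs = samples - mu[None]                                  # (S, N, T, 2)
    # Per agent/timestep 2x2 covariance of the predicted cloud.
    cov = np.einsum('snti,sntj->ntij', diffs, diffs) / samples.shape[0]
    cov += 1e-6 * np.eye(2)                                     # regularize
    err = gt - mu                                               # (N, T, 2)
    maha = np.sqrt(np.einsum('nti,ntij,ntj->nt', err, np.linalg.inv(cov), err))
    amd = float(maha.mean())                                    # robustness
    amv = float(np.linalg.eigvalsh(cov)[..., -1].mean())        # spread
    return amd, amv
```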

4. Trajectory Set and Structural Metrics

Trajectory-level quality can also be formulated as a metric on sets of trajectories, encompassing issues of localization, missed/false detections, and track identity switches.

  • Multi-Dimensional Assignment Metrics:

Metrics such as the "Trajectory GOSPA" and its probabilistic generalization penalize for each time step:

  • Localization errors for properly detected targets.
  • Missed and false targets (unassigned).
  • Track switches, with user-set penalties.

Given suitable cost decomposition and assignment constraints, these metrics are true metrics on the space of finite trajectory sets. Efficient linear programming relaxations exist and are polynomial-time computable (García-Fernández et al., 2016, Xia et al., 18 Jun 2025, García-Fernández et al., 2021).

  • Practical, Mathematically Consistent Set Metrics:

The metric D_{\mathrm{comp}} minimizes, over fractional assignments, the sum of instantaneous matching costs and a regularization on assignment changes (identity switches), tuning the trade-off between distance and switch penalty:

D_{\mathrm{comp}}(A,B) = \min_{W(t)\in \mathcal{P}} \sum_t \mathrm{tr}[W(t)^\top D^{AB}(t)] + \beta \sum_{t=1}^{T-1} \|W(t+1) - W(t)\|_1

This yields a Pareto frontier of localization vs. switching cost (Bento et al., 2016).
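
The exact metric is computed by a linear program over fractional assignment matrices W(t); the sketch below is only a crude frame-by-frame illustration of the same trade-off, matching trajectories per time step with the Hungarian algorithm and penalizing changed assignments, so it is not the exact D_comp value.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def dcomp_greedy(A, B, beta=1.0):
    """Crude frame-by-frame illustration of the D_comp trade-off.

    A, B: arrays of shape (T, n, 2) and (T, m, 2) holding two sets of
    trajectories. Returns summed matching cost plus beta times the number
    of assignment changes (a proxy for the switch-regularization term)."""
    total, switches, prev = 0.0, 0, None
    for t in range(A.shape[0]):
        # Pairwise distance matrix D^{AB}(t) at time t.
        D = np.linalg.norm(A[t][:, None] - B[t][None, :], axis=-1)
        rows, cols = linear_sum_assignment(D)
        total += D[rows, cols].sum()
        if prev is not None:
            switches += int((cols != prev).sum())
        prev = cols
    return total + beta * switches
```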

5. No-Reference and Perceptually-Aligned Metrics

Not all environments provide ground-truth for comparison. No-reference and perceptually-validated metrics provide alternative means to quantify trajectory quality.

  • Mutually Orthogonal Metric (MOM):

Assesses trajectory quality by evaluating the variance of map points on mutually orthogonal planar surfaces from registered point cloud data, exploiting the empirical correlation with relative pose error (RPE) in mapping (Kornilova et al., 2021).
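
A toy sketch of the underlying idea, assuming planar segments have already been extracted from the registered point cloud; the plane fitting and scoring details below are illustrative rather than the paper's exact procedure.

```python
import numpy as np

def mom_score(plane_point_sets):
    """MOM-style no-reference score: mean variance of registered map points
    along the normal of each (mutually orthogonal) planar surface.
    plane_point_sets: list of (M, 3) arrays, one per extracted plane."""
    variances = []
    for pts in plane_point_sets:
        centered = pts - pts.mean(axis=0)
        # The smallest-singular-value direction approximates the plane normal.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        normal = vt[-1]
        variances.append(np.var(centered @ normal))
    return float(np.mean(variances))   # lower = flatter planes = better map
```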

  • Self-Quality Evaluation (SQE):

For multi-object tracking, SQE fits a two-component Gaussian mixture model over intra-trajectory feature distances, identifying fragmented or false tracks in an unsupervised fashion. The resulting statistics (number of tracks, mean track length, and outlier flags) quantitatively reflect tracking performance without ground truth (Huang et al., 2020).
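
A toy sketch of the core mechanism, assuming intra-track appearance-feature distances are available; the two-component mixture and the purity threshold are illustrative choices, not the paper's exact statistics.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def flag_suspect_tracks(track_feature_dists, purity=0.5):
    """Unsupervised track-quality flags in the spirit of SQE.

    track_feature_dists: list of 1D arrays, one per track, holding pairwise
    appearance-feature distances within that track (computed elsewhere)."""
    all_d = np.concatenate(track_feature_dists).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(all_d)
    bad = int(np.argmax(gmm.means_.ravel()))       # high-distance component
    flags = []
    for d in track_feature_dists:
        frac_bad = (gmm.predict(d.reshape(-1, 1)) == bad).mean()
        flags.append(bool(frac_bad > purity))      # likely fragmented/false
    return flags
```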

  • Perceptually-Validated Crowd Trajectory Metric (QF):

QF combines 21 features (individual, interactional, global) chosen for perceptual relevance and aggregates their deviation from real-data statistics via a weighted sum, with weights learned to maximize agreement with human realism judgments (Daniel et al., 2021).

6. Domain-Specific and Application-Driven Metrics

Complex domains may require tailored metrics:

  • Handwriting Trajectory Recovery:

AIoU (Adaptive Intersection over Union) addresses stroke width discrepancies, and LDTW (Length-Independent Dynamic Time Warping) normalizes alignment error by path length, yielding simultaneously glyph- and order-aware performance assessment (Chen et al., 2022).

  • Reinforcement Learning Trajectory Importance:

Per-state "importance" scores (combining Q-value gaps and value-based goal affinity) allow ranking and explaining agent trajectories in long-horizon control (F et al., 7 Dec 2025).
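
A hedged sketch of one such importance score; the exact combination of Q-value gap and goal affinity in the cited work may differ from the placeholder below.

```python
import numpy as np

def state_importance(q_values, state_value, goal_value, w=0.5):
    """Per-state importance as a mix of the Q-value gap (how much the choice
    of action matters) and a value-based goal affinity; the weighting and the
    affinity term are illustrative assumptions."""
    q = np.sort(np.asarray(q_values, dtype=float))[::-1]
    q_gap = q[0] - q[1]                       # best vs. second-best action
    affinity = state_value / max(goal_value, 1e-8)
    return w * q_gap + (1.0 - w) * affinity

def rank_trajectories(per_state_scores):
    """Rank trajectories by their summed per-state importance scores."""
    totals = [sum(scores) for scores in per_state_scores]
    return sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)
```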

  • Tool-Use and Agentic Evaluation:

Metrics for LLM tool-use include set-based tool selection correctness, argument accuracy, dependency satisfaction, and call order satisfaction, providing fine-grained scoring of trajectory-level decision quality in agentic pipelines (He et al., 6 Oct 2025).
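
A minimal sketch of two such trajectory-level scores; the scoring rules here (Jaccard overlap for tool selection, pairwise order constraints) are illustrative simplifications, not the benchmark's exact definitions.

```python
def tool_selection_score(pred_tools, gold_tools):
    """Set-based tool selection correctness via Jaccard overlap
    (one simple choice; the benchmark's scoring may differ)."""
    p, g = set(pred_tools), set(gold_tools)
    return len(p & g) / len(p | g) if p | g else 1.0

def order_satisfaction(pred_calls, gold_order):
    """Fraction of required (earlier, later) call-order constraints that the
    predicted call sequence respects; the constraint format is illustrative."""
    idx = {c: i for i, c in enumerate(pred_calls)}
    ok = [a in idx and b in idx and idx[a] < idx[b] for a, b in gold_order]
    return sum(ok) / len(ok) if ok else 1.0
```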

  • Video and Motion Synthesis:

Trajectory autoencoding (e.g., TRAJAN) reconstructs point tracks from video, producing motion-centric distances that correlate with human perception of temporal consistency and localize errors in time and space (Allen et al., 30 Apr 2025). For free-viewpoint video, EM-VQM uses elastic metrics on motion trajectories and temporal-structure features for full-reference video quality assessment (Ling et al., 2019).

7. Practical Considerations and Outlook

Metric selection is application-dependent:

  • Planning and safety-critical systems require metrics sensitive to downstream consequences and environmental semantics (e.g., PI-Metric, LMR, AMD/AMV).
  • Tracking and video analysis benefit from set-metrics and structural penalties that capture detection, association, and identity robustness under noise.
  • Perceptual realism in crowds or generated videos relies on perceptually validated or autoencoder-based trajectory statistics.
  • Test-time, no-GT contexts necessitate unsupervised or no-reference surrogates (MOM, SQE).

Scalability, interpretability, and differentiability are active concerns. Many modern approaches support polynomial-time (often LP-based) computation and can be integrated directly into training or optimization pipelines. The field continues to shift from rigid, generic error metrics toward adaptive, context-aware metrics that better reflect operational priorities and human judgment.


References:

  • "Rethinking Trajectory Forecasting Evaluation" (Ivanovic et al., 2021)
  • "Measuring What Matters: Scenario-Driven Evaluation for Trajectory Predictors in Autonomous Driving" (Da et al., 13 Dec 2025)
  • "SQE: a Self Quality Evaluation Metric for Parameters Optimization in Multi-Object Tracking" (Huang et al., 2020)
  • "LMR: Lane Distance-Based Metric for Trajectory Prediction" (Schmidt et al., 2023)
  • "Know your Trajectory – Trustworthy Reinforcement Learning deployment through Importance-Based Trajectory Analysis" (F et al., 7 Dec 2025)
  • "Direct Motion Models for Assessing Generated Videos" (Allen et al., 30 Apr 2025)
  • "A metric on the space of finite sets of trajectories for evaluation of multi-target tracking algorithms" (García-Fernández et al., 2016)
  • "Probabilistic Trajectory GOSPA: A Metric for Uncertainty-Aware Multi-Object Tracking Performance Evaluation" (Xia et al., 18 Jun 2025)
  • "Camera Trajectory Generation: A Comprehensive Survey of Methods, Metrics, and Future Directions" (Dehghanian et al., 1 Jun 2025)
  • "Be your own Benchmark: No-Reference Trajectory Metric on Registered Point Clouds" (Kornilova et al., 2021)
  • "A metric for sets of trajectories that is practical and mathematically consistent" (Bento et al., 2016)
  • "A time-weighted metric for sets of trajectories to assess multi-object tracking algorithms" (García-Fernández et al., 2021)
  • "TRAJECT-Bench: A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use" (He et al., 6 Oct 2025)
  • "Complex Handwriting Trajectory Recovery: Evaluation Metrics and Algorithm" (Chen et al., 2022)
  • "What's Wrong with the Absolute Trajectory Error?" (Lee et al., 2022)
  • "Social-Implicit: Rethinking Trajectory Prediction Evaluation and The Effectiveness of Implicit Maximum Likelihood Estimation" (Mohamed et al., 2022)
  • "A Perceptually-Validated Metric for Crowd Trajectory Quality Evaluation" (Daniel et al., 2021)
  • "Quality Assessment of Free-viewpoint Videos by Quantifying the Elastic Changes of Multi-Scale Motion Trajectories" (Ling et al., 2019)