Motion-Aware Encoder for Trajectory Forecasting

Updated 1 December 2025
  • Motion-Aware Encoder is a neural module that constructs scene-level intention priors and refined representations via clustered motion modes.
  • It employs bounded scaled additive attention and context-aware transformations to align agent intentions and improve prediction robustness.
  • The design integrates motion mode clustering with global context aggregation, enhancing trajectory forecasting accuracy and system interpretability.

A Motion-Aware Encoder (MAE) is a neural module that constructs scene-level intention priors and context-refined representations by exploiting clustered trajectory modes and global context aggregation, primarily for multimodal trajectory prediction without relying on external HD maps. It is a foundational component in recent map-free transformer architectures for vehicle motion forecasting, enabling explicit reasoning about agent intentions and uncertainty purely from spatiotemporal trajectory data and interactions. The MAE advances prediction robustness, intention alignment, and interpretability by integrating bounded scaled additive attention, mode conditioning, and context-aware transformations over learned trajectory banks (Chen et al., 24 Nov 2025).

1. Architectural Rationale and Core Design

The principal objective of a Motion-Aware Encoder is to address a limitation of pairwise attention mechanisms in trajectory models, which tend to over-amplify straight-line patterns and suppress transition and turning behaviors, thereby misaligning intention inference. Instead of using HD map features, the MAE establishes a global context by aggregating multiple trajectory hypotheses (motion modes) that are pre-computed via unsupervised clustering (typically k-means) over observed vehicle futures. The encoder's design ensures that both scene-level motion tendencies and agent-specific hypotheses are exploited for downstream reasoning.
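
A minimal sketch of how such a motion-mode bank could be pre-computed, assuming future trajectory segments are available as fixed-length arrays and using scikit-learn's k-means (the number of modes and the absence of normalization are illustrative assumptions, not values from the paper):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_mode_bank(future_trajs: np.ndarray, num_modes: int = 6) -> np.ndarray:
    """Cluster observed future trajectories into M motion-mode prototypes.

    future_trajs: array of shape (N, T_pre, 2) holding N future segments.
    Returns an array of shape (num_modes, T_pre, 2) of cluster centroids.
    """
    n, t_pre, _ = future_trajs.shape
    # Flatten each trajectory into a single feature vector for clustering.
    flat = future_trajs.reshape(n, t_pre * 2)
    km = KMeans(n_clusters=num_modes, n_init=10, random_state=0).fit(flat)
    # Each centroid is reshaped back into a prototype trajectory (motion mode).
    return km.cluster_centers_.reshape(num_modes, t_pre, 2)
```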

During inference, the MAE processes the target agent's observed trajectory and constructs $M$ mode-embedded tokens by concatenating each motion mode prototype with the observed history. This enables the encoder to represent a diverse set of plausible futures conditioned on situational evidence.

2. Mode-Embedded Input Construction

Given the target agent's observed trajectory $X_\mathrm{obs} \in \mathbb{R}^{T_\mathrm{obs} \times 2}$ and the $k$-th motion mode $M_k \in \mathbb{R}^{T_\mathrm{pre} \times 2}$, the Motion-Aware Encoder forms mode-embedded tokens as:

$$S_k = [X_\mathrm{obs}; M_k] \in \mathbb{R}^{(T_\mathrm{obs} + T_\mathrm{pre}) \times 2}$$

These are subsequently flattened and projected:

$$E_k = W E'_k \in \mathbb{R}^{d_\mathrm{model}}$$

where $E'_k$ is the flattened $S_k$ and $W$ is a learnable weight matrix mapping inputs into a shared embedding space. This procedure generates a bank of context-sensitive mode embeddings $\{E_k\}_{k=1}^M$.
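
A hedged PyTorch sketch of this token construction, assuming the projection $W$ is a single linear layer and that the mode prototypes come from a pre-computed bank (dimensions are illustrative):

```python
import torch
import torch.nn as nn

class ModeEmbedding(nn.Module):
    """Builds mode-embedded tokens E_k = W * flatten([X_obs; M_k])."""

    def __init__(self, t_obs: int, t_pre: int, d_model: int = 128):
        super().__init__()
        self.proj = nn.Linear((t_obs + t_pre) * 2, d_model)  # learnable W

    def forward(self, x_obs: torch.Tensor, modes: torch.Tensor) -> torch.Tensor:
        # x_obs: (T_obs, 2) observed history; modes: (M, T_pre, 2) prototypes.
        m = modes.shape[0]
        # Concatenate the shared history with each mode along the time axis.
        s = torch.cat([x_obs.unsqueeze(0).expand(m, -1, -1), modes], dim=1)
        # Flatten each S_k and project into the shared embedding space.
        return self.proj(s.reshape(m, -1))  # (M, d_model)
```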

3. Bounded Scaled Additive Aggregation for Global Context

The aggregation mechanism central to the MAE employs bounded additive attention to obtain a scene-level global context $G$, which serves as an "intention prior." For each mode embedding $E_k$:

\begin{align*}
u_k &= \tanh(W_q^g E_k + W_k^g E_k) \in \mathbb{R}^{d_k} \\
a_k &= \text{softmax}_k\!\left(\frac{u_k}{\sqrt{d_k}}\right) \\
G &= \sum_{k=1}^{M} a_k \cdot V^g E_k
\end{align*}

Here, $W_q^g, W_k^g, V^g \in \mathbb{R}^{d_\mathrm{model} \times d_k}$ are learned projections and $d_k = d_\mathrm{model}/h$, where $h$ is the number of attention heads. The $\tanh$ nonlinearity bounds the attention scores, which is critical to mitigating mode suppression and stabilizing training.
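
A minimal single-head PyTorch sketch of this bounded additive aggregation, following the formulas above; the parameter shapes and the omission of the multi-head split are assumptions made for readability:

```python
import torch
import torch.nn as nn

class BoundedAdditiveAggregation(nn.Module):
    """Computes the global context G from mode embeddings via bounded attention."""

    def __init__(self, d_model: int = 128, d_k: int = 128):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_k, bias=False)
        self.w_k = nn.Linear(d_model, d_k, bias=False)
        self.v = nn.Linear(d_model, d_k, bias=False)
        self.d_k = d_k

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        # e: (M, d_model) bank of mode embeddings.
        u = torch.tanh(self.w_q(e) + self.w_k(e))      # bounded scores, (M, d_k)
        a = torch.softmax(u / self.d_k ** 0.5, dim=0)  # normalize across the M modes
        return (a * self.v(e)).sum(dim=0)              # global context G, (d_k,)
```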

4. Context-Aware Mode Transformation

After global aggregation, each $E_k$ is refined via context-aware transformations that condition the query and key spaces on the global prior $G$:

\begin{align*}
q'_k &= W_q E_k + G \\
k'_k &= W_k E_k + G \\
s_k &= \tanh(q'_k + k'_k) \cdot (W_q E_k) / \sqrt{d_k}
\end{align*}

The final context-aligned mode embedding is given by:

$$c_k = \text{softmax}(s_k) \cdot [W_v E_k] + \text{softmax}(s_k) \cdot G$$

By stacking two MAE layers and concatenating their outputs across multiple attention heads, the encoder generates a set $\{c_k\}$ that is globally and locally intention-aligned (Chen et al., 24 Nov 2025).
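
A hedged single-head sketch of the context-aware transformation, mirroring the equations above; how the softmax is applied across feature dimensions is an assumption drawn from the notation:

```python
import torch
import torch.nn as nn

class ContextAwareModeTransform(nn.Module):
    """Refines each mode embedding E_k using the global intention prior G."""

    def __init__(self, d_model: int = 128, d_k: int = 128):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_k, bias=False)
        self.w_k = nn.Linear(d_model, d_k, bias=False)
        self.w_v = nn.Linear(d_model, d_k, bias=False)
        self.d_k = d_k

    def forward(self, e: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
        # e: (M, d_model) mode embeddings; g: (d_k,) global context.
        q = self.w_q(e) + g                                  # context-conditioned queries
        k = self.w_k(e) + g                                  # context-conditioned keys
        s = torch.tanh(q + k) * self.w_q(e) / self.d_k ** 0.5
        attn = torch.softmax(s, dim=-1)                      # per-mode feature weighting
        # Mix the value projection and the global prior with the same weights.
        return attn * self.w_v(e) + attn * g                 # context-aligned c_k, (M, d_k)
```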

5. Application Contexts and Extension Potential

Current deployments of Motion-Aware Encoders center on map-free multimodal trajectory prediction for autonomous vehicles. In GContextFormer, the MAE is followed by a Hierarchical Interaction Decoder (HID) that applies dual-pathway cross-attention (standard and context-enhanced), conditioned on neighboring agents and regulated by a gating mechanism.
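
Purely as an illustration of the gated dual-pathway idea, the following sketch fuses a standard and a context-enhanced cross-attention pathway with a learned sigmoid gate; the class name, layer choices, and conditioning details are assumptions and not the paper's HID implementation:

```python
import torch
import torch.nn as nn

class GatedDualPathwayFusion(nn.Module):
    """Fuses a standard and a context-enhanced attention pathway via a learned gate."""

    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.std_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ctx_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, queries, neighbors, global_ctx):
        # queries: (B, M, d); neighbors: (B, N, d); global_ctx: (B, 1, d).
        std_out, _ = self.std_attn(queries, neighbors, neighbors)
        # Context-enhanced pathway: neighbor features shifted by the global prior.
        ctx_out, _ = self.ctx_attn(queries, neighbors + global_ctx, neighbors + global_ctx)
        g = self.gate(torch.cat([std_out, ctx_out], dim=-1))  # per-feature gate in (0, 1)
        return g * std_out + (1.0 - g) * ctx_out
```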

The modular MAE architecture maintains extensibility toward:

  • Hierarchical temporal stacking for long-horizon forecasting
  • Learning richer motion-mode banks via end-to-end optimization (in lieu of clustering)
  • Adding auxiliary context channels for semantic or map cues
  • Facilitating multi-agent cooperative interaction modeling by stacking additional context-aware decoder layers

6. Empirical Validation and Performance Analysis

Experimental results on eight drone-captured highway-ramp scenarios (TOD-VT dataset) demonstrate that GContextFormer, incorporating MAE, outperforms five state-of-the-art map-free baselines with improvements in minimum Average Displacement Error (minADE), minimum Final Displacement Error (minFDE), Miss Rate (MR@2m, MR@3m), and risk metrics (CVaR80%). For instance, on aggregated scenarios: minADE reduced from 0.69m (TUTR) to 0.63m; minFDE from 1.50m to 1.25m; Miss Rate drops reached up to 26.6% (MR@2m) and 29.3% (MR@3m); CVaR80% was reduced from 3.95m to 3.38m (Chen et al., 24 Nov 2025).
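
For reference, a small NumPy sketch of how minADE, minFDE, and a miss indicator are commonly computed from $K$ candidate trajectories per agent; the paper's exact evaluation protocol (thresholds, aggregation across agents and scenarios) is assumed here:

```python
import numpy as np

def min_ade_fde_mr(preds: np.ndarray, gt: np.ndarray, miss_thresh: float = 2.0):
    """preds: (K, T_pre, 2) candidate trajectories; gt: (T_pre, 2) ground truth.

    Returns (minADE, minFDE, missed), where `missed` flags whether even the best
    candidate's final-point error exceeds `miss_thresh` metres (cf. MR@2m).
    """
    dists = np.linalg.norm(preds - gt[None], axis=-1)  # (K, T_pre) per-step errors
    min_ade = dists.mean(axis=1).min()                 # best average displacement
    min_fde = dists[:, -1].min()                       # best final displacement
    return min_ade, min_fde, bool(min_fde > miss_thresh)
```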

Spatially-resolved heatmap analysis shows concentrated gains around ramp curves and lane-change decision points, evidencing the MAE’s robustness in conditions of high motion ambiguity and complex dynamics.

7. Interpretability and Analysis of Motion Reasoning

The explicit computation of mode-context attention and context-aware transformations enables layer-wise interpretability: researchers can visualize which geometric prototypes are weighted at each MAE layer and monitor reasoning pathways throughout the decoding stack. The HID’s neighbor-context priors and gated cross-attention reveal detailed attribution, such as global agent saliency rankings and per-mode neighbor attention modulations. The architecture’s transparency facilitates tracing prediction outcomes to specific latent mode preferences and interaction influences, providing intrinsic explainability for vehicle motion forecasting systems (Chen et al., 24 Nov 2025).


A Motion-Aware Encoder thus serves as an advanced global context constructor and mode conditioner for multimodal, intention-aligned trajectory prediction in map-free domains, combining bounded additive aggregation, context-aware transformation, and robust downstream interaction modeling for state-of-the-art performance and interpretability.
