TENET Temporal Flow Header
- Temporal Flow Header (TFH) is a transformer-based module that enforces backward consistency by reconstructing historical states from predicted future features.
- TFH uses a compact feature-pyramid network to fuse multi-scale temporal patterns, improving accuracy in trajectory predictions for autonomous driving.
- Integrated as a parallel head in TENET, TFH contributes to state-of-the-art results on challenges like Argoverse 2 by ensuring consistent multimodal forecasts.
The Temporal Flow Header (TFH) is an auxiliary architectural component introduced in the TENET (Transformer Encoding Network for Effective Temporal Flow) motion prediction approach for autonomous driving. TFH addresses the challenge of ensuring backward consistency in predicted multimodal trajectories by reconstructing observed historical states from predicted future representations within a transformer-based pipeline. This closed-loop, reverse-prediction mechanism leverages a compact feature-pyramid architecture over future timesteps to enforce that the decoded future encodes sufficient information about the known past, empirically improving trajectory fidelity and multi-modality. TFH forms part of a multi-head output parallel to trajectory regression and scoring heads and directly contributed to TENET's top performance on the Argoverse 2 Motion Forecasting Challenge (Wang et al., 2022).
1. High-Level Architecture and TFH Integration
TENET is structured around a transformer-based encoder-decoder that processes spatiotemporal agent and map tensors for motion prediction. The encoder ingests agent-trajectory tensors and HD-map tensors. Temporal and agent-wise self-attention, together with cross-attention between agents and the map, yield a fused scene representation. The decoder employs learnable trajectory tokens that query this mixed feature tensor, generating a per-mode trajectory feature tensor $\mathbf{F}$ with one feature vector per mode and timestep.
Three parallel output heads are applied to $\mathbf{F}$:
- Regression Header: Outputs predicted future trajectories.
- Score Header: Computes per-mode confidence scores via cross-attention with the map.
- Temporal Flow Header: Reconstructs the observed history from predicted future features, enforcing representational loop closure.
TFH is positioned in parallel with the regression and score heads, each directly connected to the decoder output $\mathbf{F}$, as sketched below.
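A minimal PyTorch sketch of this three-head layout is given below. The module names, default dimensions, and the simplified score and flow heads are illustrative assumptions rather than the released TENET implementation: the score header's map cross-attention is omitted, and the Temporal Flow Header is stubbed with a plain linear layer (the FPN-based version is sketched in Section 3).

```python
import torch
import torch.nn as nn

class TENETOutputHeads(nn.Module):
    """Illustrative three-head output stage applied to decoder features F: [K, T, D]."""

    def __init__(self, d_model=128, t_hist=50, t_fut=60):
        super().__init__()
        self.t_hist = t_hist
        # Regression header: per-mode, per-future-frame 2D coordinates.
        self.reg_head = nn.Linear(d_model, 2)
        # Score header: one confidence logit per mode (map cross-attention omitted here).
        self.score_head = nn.Linear(d_model, 1)
        # Temporal Flow Header stand-in: regresses the T_hist historical states
        # from pooled future features (the temporal-FPN version appears in Section 3).
        self.tf_head = nn.Linear(d_model, t_hist * 5)

    def forward(self, F):                                        # F: [K, T, D]
        fut = F[:, self.t_hist:, :]                              # [K, T_fut, D]
        trajs = self.reg_head(fut)                               # [K, T_fut, 2]
        scores = self.score_head(fut.mean(dim=1)).squeeze(-1)    # [K]
        hist = self.tf_head(fut.mean(dim=1)).view(-1, self.t_hist, 5)  # [K, T_hist, 5]
        return trajs, scores, hist

heads = TENETOutputHeads()
trajs, scores, hist = heads(torch.randn(6, 110, 128))   # K=6 modes, T = 50 + 60 frames
```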
2. Mathematical Formulation of the Temporal Flow Header
The transformer sublayer operations follow the SceneTransformer convention:
- Self-attention: $\mathrm{SelfAttn}(\mathbf{X}) = \mathrm{Attention}(\mathbf{X}W^{Q}, \mathbf{X}W^{K}, \mathbf{X}W^{V})$
- Cross-attention: $\mathrm{CrossAttn}(\mathbf{X}, \mathbf{Y}) = \mathrm{Attention}(\mathbf{X}W^{Q}, \mathbf{Y}W^{K}, \mathbf{Y}W^{V})$
- Scaled dot-product attention: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$, with $Q$, $K$, $V$ the query, key, and value projections and $d_k$ the key dimension.
Decoder output: the learnable trajectory tokens attend to the fused scene features, yielding $\mathbf{F} \in \mathbb{R}^{K \times T \times D}$, with $K$ the number of modes, $T = T_{hist} + T_{fut}$ the total number of timesteps, and $D$ the feature dimension.
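As a concrete reference for the scaled dot-product operation above, a minimal PyTorch sketch is shown below; it is a generic implementation of the formula, not TENET's attention code, which adopts axial/efficient attention from SceneTransformer.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: [..., n_q, d_k], K: [..., n_k, d_k], V: [..., n_k, d_v].
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # [..., n_q, n_k]
    weights = torch.softmax(scores, dim=-1)
    return weights @ V                                    # [..., n_q, d_v]

# Self-attention uses one tensor for queries, keys, and values;
# cross-attention draws keys/values from a second tensor (e.g., map features).
```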
TFH processing steps:
- Future Features: Extract the future-timestep features $\mathbf{X}_f = \mathbf{F}[:, T_{hist}{:}, :] \in \mathbb{R}^{K \times T_{fut} \times D}$.
- Temporal Feature Pyramid: Construct a 1D feature pyramid over the temporal dimension. At each scale $\ell = 1, \dots, L$, compute a temporally downsampled lateral feature $\mathbf{C}_\ell = \mathrm{Conv}\big(\mathrm{Down}_{2^{\ell-1}}(\mathbf{X}_f)\big)$, then merge top-down starting from the coarsest level: $\mathbf{P}_L = \mathbf{C}_L$ and $\mathbf{P}_\ell = \mathbf{C}_\ell + \mathrm{Up}_{\times 2}(\mathbf{P}_{\ell+1})$ (a worked dimension example follows this list).
- Feature Fusion: Fuse the pyramid outputs (by concatenation or summation) and align them to the $T_{hist}$ historical frames, yielding $\mathbf{F}_{fpn} \in \mathbb{R}^{K \times T_{hist} \times D}$.
- History Reconstruction: $\hat{\mathbf{H}} = \mathrm{MLP}_{tf}(\mathbf{F}_{fpn}) \in \mathbb{R}^{K \times T_{hist} \times 5}$, where 5 is the dimensionality of the per-timestep historical state vector being reconstructed.
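As a concrete illustration of the temporal resolutions involved, assume the standard Argoverse 2 setting of $T_{hist} = 50$ observed frames and $T_{fut} = 60$ future frames at 10 Hz, and a hypothetical pyramid depth of $L = 3$ (the depth is not specified here); the lateral features then cover progressively coarser temporal grids:

$$
\mathbf{C}_1 \in \mathbb{R}^{K \times 60 \times D}, \qquad
\mathbf{C}_2 \in \mathbb{R}^{K \times 30 \times D}, \qquad
\mathbf{C}_3 \in \mathbb{R}^{K \times 15 \times D},
$$

and after the top-down merge, the fused output is aligned (e.g., by interpolation) to the $T_{hist} = 50$ historical frames to give $\mathbf{F}_{fpn} \in \mathbb{R}^{K \times 50 \times D}$.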
The TFH loss is mid-level (intermediate) MSE supervision on the reconstructed history, $\mathcal{L}_{tf} = \mathrm{MSE}(\hat{\mathbf{H}}, \mathbf{H}_{gt})$. The full multi-task loss combines the three heads, $\mathcal{L} = \mathcal{L}_{reg} + \beta_1\,\mathcal{L}_{score} + \beta_2\,\mathcal{L}_{tf}$, where $\mathcal{L}_{reg}$ is a GMM negative log-likelihood, $\mathcal{L}_{score}$ is a max-margin confidence loss, and $\beta_1$, $\beta_2$ are scalar weights (the pseudocode below uses $\beta_2$ for the TFH term).
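A minimal sketch of combining these terms, assuming the regression NLL and max-margin score loss are computed elsewhere and using hypothetical default values for the weights:

```python
import torch.nn.functional as F

def tenet_loss(traj_nll, score_margin_loss, hist_pred, hist_gt, beta1=1.0, beta2=1.0):
    """Combine regression NLL, max-margin score loss, and the TFH MSE term.

    traj_nll, score_margin_loss: precomputed scalar losses from the other heads.
    hist_pred, hist_gt: reconstructed and ground-truth history, each [K, T_hist, 5]
        (the ground truth is broadcast across the K modes).
    beta1, beta2: illustrative weights, not values from the paper.
    """
    l_tf = F.mse_loss(hist_pred, hist_gt)   # backward-consistency term
    return traj_nll + beta1 * score_margin_loss + beta2 * l_tf, l_tf
```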
3. Implementation Strategy and Pseudocode
TFH implementation involves slicing the predicted future features, constructing a feature pyramid across the temporal dimension, fusing the pyramid levels and aligning them to the history length, and applying an MLP for history reconstruction:
```
x_f = x_pt[:, T_hist:, :]                      # future-only slice: [K, T_fut, D]
P[L] = Conv1x1(downsample(x_f, 2 ** (L - 1)))  # coarsest pyramid level
for ell in reversed(1..L-1):                   # top-down pathway
    lateral = Conv1x1(downsample(x_f, 2 ** (ell - 1)))
    P[ell] = lateral + Upsample(P[ell+1], scale_factor=2)
F_fpn = fuse_levels(P[1], ..., P[L])           # e.g. sum or concat + linear, aligned to T_hist
h_pred = MLP_tf(F_fpn)                         # reconstructed history: [K, T_hist, 5]
L_tf = MSE(h_pred, h_gt)
Total_Loss += beta2 * L_tf
```
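Below is a self-contained PyTorch sketch of a Temporal Flow Header along these lines. It is an illustrative reconstruction rather than the released TENET code: the pyramid depth, pooling and interpolation choices, summation-based fusion, and default dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalFlowHeader(nn.Module):
    """Reconstructs T_hist historical states from future-timestep features."""

    def __init__(self, d_model=128, t_hist=50, state_dim=5, levels=3):
        super().__init__()
        self.t_hist, self.levels = t_hist, levels
        # One pointwise (1x1) lateral convolution per pyramid level.
        self.laterals = nn.ModuleList(
            nn.Conv1d(d_model, d_model, kernel_size=1) for _ in range(levels))
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, state_dim))

    def forward(self, fut_feats):                      # [K, T_fut, D]
        x = fut_feats.transpose(1, 2)                  # [K, D, T_fut] for Conv1d
        # Bottom-up: temporally downsample by 2 per level.
        feats = [x]
        for _ in range(1, self.levels):
            feats.append(F.avg_pool1d(feats[-1], kernel_size=2, ceil_mode=True))
        # Top-down: start at the coarsest level, merge laterals with upsampling.
        p = self.laterals[-1](feats[-1])
        for ell in reversed(range(self.levels - 1)):
            lateral = self.laterals[ell](feats[ell])
            p = lateral + F.interpolate(p, size=lateral.shape[-1], mode="nearest")
        # Align the fused features to the history length, then decode per-frame states.
        p = F.interpolate(p, size=self.t_hist, mode="nearest")   # [K, D, T_hist]
        return self.mlp(p.transpose(1, 2))             # [K, T_hist, state_dim]

# Example: reconstruct 50 history frames from 60 future-frame features for K=6 modes.
tfh = TemporalFlowHeader()
hist_rec = tfh(torch.randn(6, 60, 128))                # -> [6, 50, 5]
```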
Key implementation configurations:
- Agent feature dimension of $64$ and map feature dimension of $256$ at test time, with different dimensions used during training.
- Axial or efficient attention adopted from SceneTransformer.
- Shallow 2-layer MLPs for all output heads.
- FPN temporal convolutions paired with nearest-neighbor temporal upsampling.
- Training over 200 epochs with Adam, using staged learning rates.
Ensemble inference involves running multiple independently trained models, clustering the pooled predicted trajectories with K-means, and averaging within each cluster.
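A minimal sketch of this ensembling step is shown below, assuming the pooled trajectories are clustered by their endpoints with scikit-learn's KMeans and averaged within each cluster; the endpoint criterion and the cluster count of 6 are assumptions, not details from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_ensemble(trajectories, k=6):
    """Reduce pooled multi-model predictions to k representative trajectories.

    trajectories: [M, T_fut, 2] array pooling every mode from every ensemble member.
    Returns: [k, T_fut, 2] cluster-mean trajectories.
    """
    endpoints = trajectories[:, -1, :]                    # cluster by final position
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(endpoints)
    return np.stack([trajectories[labels == c].mean(axis=0) for c in range(k)])

# Example: 4 models x 6 modes = 24 pooled trajectories, 60 future frames each.
ensembled = kmeans_ensemble(np.random.randn(24, 60, 2))   # -> (6, 60, 2)
```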
4. Distinction from Standard Attention Mechanisms
TFH differs fundamentally from standard transformer attention mechanisms:
- Lack of Reverse Consistency in Self-Attention: Default temporal and agent-wise self-attention only propagates historical information forward into mixed representations, with no explicit mechanism to guarantee that predicted futures retain reconstructible information about the observed past.
- Autoencoder-Style Consistency: TFH imposes a backward-consistency objective, requiring history to be reconstructed from predicted future features, analogous to a denoising or sequence autoencoder.
- Multi-scale Temporal Aggregation: The use of a temporal FPN aggregates hierarchical temporal patterns among predicted futures, a capacity not natively present in traditional transformer temporal self-attention.
- Closed-Loop Representation: The reconstructed history forces the model to encode past motion cues into the predicted future, empirically leading to more consistent and sharper multimodal forecasts.
5. Empirical Results and Performance Impact
TFH is an integral component of the TENET configurations behind the reported metrics on the Argoverse 2 test set:
- Baseline with TFH: brier-minFDE@6 = 2.03
- Enhancement via Larger Input-Range: brier-minFDE@6 = 2.01
- Addition of K-means Ensemble: brier-minFDE@6 = 1.90
Typical transformer baselines achieve brier-minFDE in the 2.10–2.20 range; the TENET configurations that include TFH reduce this error, underpinning the first-place leaderboard submission (Wang et al., 2022). Although no isolated ablation of TFH is reported, it is present in every reported configuration, from the baseline through the final ensemble, consistent with its role in the winning system.
6. Design Significance and Practical Considerations
TFH serves as a reverse-prediction, middle-level supervision head that enforces the encoding of past dynamics in predicted futures. Its integration is lightweight, utilizing a bottleneck FPN and a shallow MLP. The design supports consistent multimodal trajectory prediction, scales with the number of predicted modes $K$, and integrates efficiently with transformer-based motion-forecasting frameworks. TFH's architectural paradigm, reconstructing past states from predicted futures, may be applicable beyond autonomous driving to any temporal reasoning domain where bidirectional consistency is advantageous.