
T-Graphormer: Spatiotemporal Forecasting

Updated 4 February 2026
  • T-Graphormer is a Transformer-based model that jointly models spatial and temporal dependencies on graphs using learnable centrality and spatiotemporal positional encodings.
  • It employs a global self-attention mechanism with structured biases to integrate graph structure and time-dependent signals, offering state-of-the-art performance on traffic forecasting benchmarks like PEMS-BAY and METR-LA.
  • The design unifies spatiotemporal modeling while highlighting scalability challenges and potential enhancements through sparse or hybrid attention mechanisms.

T-Graphormer is a Transformer-based model designed to address spatiotemporal forecasting tasks on graphs by modeling spatial and temporal dependencies jointly rather than separately. Drawing from the architectural innovations of Graphormer, T-Graphormer extends these principles for time-dependent signals on static graphs, enabling effective prediction of phenomena such as traffic speeds through global self-attention with minimal inductive bias (Bai et al., 22 Jan 2025, Ying et al., 2021).

1. Architectural Formulation

T-Graphormer operates on a static graph $\mathcal G = (\mathcal V, \mathcal E, W)$ with $N$ nodes and $T'$ historical timesteps. The model forms an input sequence of length $l = T' \times N$ by concatenating node features $x_{t,i} \in \mathbb R^d$ for timesteps $t - T' + 1, \ldots, t$ and nodes $i = 1, \ldots, N$. Raw observations $X_{t,i} \in \mathbb R^C$ undergo a linear projection via $W_0 \in \mathbb R^{C \times d}$.

Each token's final embedded vector incorporates:

  • Centrality encodings (degree-based, time-agnostic),
  • Learned spatiotemporal positional encodings,
  • Pairwise attention spatial bias.

The main body of the model is a stack of $K$ Transformer encoder blocks built with pre-LayerNorm ordering, multi-head self-attention, and feed-forward networks of hidden dimension $d$. Prediction is performed either through a sequence of linear layers ($d \to d/2 \to C$) or by appending causal dilated convolutional layers before the linear projections.
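As a minimal NumPy sketch of the pre-LN block ordering and the linear prediction head described above, with toy dimensions and random matrices standing in for the learned attention and feed-forward sublayers (not the paper's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
l, d, C = 6, 8, 2  # toy token count, model width, output channels

def layer_norm(x, eps=1e-5):
    # Normalise each token vector to zero mean / unit variance
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

# Random stand-ins for the self-attention and feed-forward sublayers
W_attn = rng.standard_normal((d, d)) / np.sqrt(d)
W_ffn = rng.standard_normal((d, d)) / np.sqrt(d)

def pre_ln_block(h):
    # Pre-LN ordering: LayerNorm feeds each sublayer, residual added after
    h = h + layer_norm(h) @ W_attn
    h = h + np.maximum(layer_norm(h) @ W_ffn, 0.0)
    return h

h = rng.standard_normal((l, d))
for _ in range(2):  # K = 2 encoder blocks in this toy stack
    h = pre_ln_block(h)

# Linear prediction head: d -> d/2 -> C
W1 = rng.standard_normal((d, d // 2))
W2 = rng.standard_normal((d // 2, C))
y = (h @ W1) @ W2
```

The point of pre-LN ordering is that the residual stream is never normalised directly, which tends to stabilise training of deep encoder stacks.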

2. Spatiotemporal Encoding Strategy

T-Graphormer explicitly learns a spatiotemporal positional embedding $P \in \mathbb R^{(T'N) \times d}$, so that each input token at location $(t, i)$ receives
$$h^0_{t,i} = x_{t,i} + z^-_{\deg^-(v_i)} + z^+_{\deg^+(v_i)} + p_{t,i}$$
where $z^\pm \in \mathbb R^{D_{\max} \times d}$ are learnable embeddings for in- and out-degree, and $p_{t,i}$ is the learned spatiotemporal positional encoding for node $i$ at time $t$.
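The embedding sum above can be illustrated in NumPy; all dimensions, degrees, and weight tables here are toy, randomly initialised stand-ins for the learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative, not the paper's settings)
N, T_hist, C, d, D_max = 4, 3, 2, 8, 5  # nodes, history steps, channels, width, max degree

# Raw observations X_{t,i} in R^C, projected to R^d via W0
X = rng.standard_normal((T_hist, N, C))
W0 = rng.standard_normal((C, d))
x = X @ W0  # token features x_{t,i} in R^d

# Learnable tables (randomly initialised here): degree embeddings z^-, z^+
# indexed by in-/out-degree, and spatiotemporal positions P in R^{(T'N) x d}
z_in = rng.standard_normal((D_max, d))
z_out = rng.standard_normal((D_max, d))
P = rng.standard_normal((T_hist * N, d))

in_deg = np.array([1, 2, 1, 3])   # hypothetical node degrees
out_deg = np.array([2, 1, 1, 3])

# h^0_{t,i} = x_{t,i} + z^-_{deg^-(v_i)} + z^+_{deg^+(v_i)} + p_{t,i}
h0 = x + z_in[in_deg] + z_out[out_deg] + P.reshape(T_hist, N, d)
```

Note that the degree embeddings are time-agnostic (broadcast over the time axis), while the positional table $P$ gives every (time, node) pair its own vector.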

Centrality and positional signals are essential because the vanilla Transformer has no intrinsic awareness of graph structure or temporal ordering. Structural (SPD-based) and temporal (position-based) biases are realized via learnable pair-wise and token-wise components, respectively.

3. Attention Mechanism with Structured Bias

Let $H \in \mathbb R^{l \times d}$ denote token embeddings after encoding. Queries, keys, and values are computed as
$$Q = HW_Q, \quad K = HW_K, \quad V = HW_V$$
where $W_Q, W_K \in \mathbb R^{d \times d_K}$ and $W_V \in \mathbb R^{d \times d_V}$.

The attention score between tokens $(t_1, i)$ and $(t_2, j)$ is
$$A_{(t_1,i),(t_2,j)} = \frac{Q_{t_1, i} K_{t_2, j}^\top}{\sqrt{d_K}} + b_{\phi(i,j)}$$
where $\phi(i, j)$ is the shortest path distance between $v_i$ and $v_j$ on $\mathcal G$, with learnable scalars $b_{\phi(i,j)}$. The final output of multi-head attention is

$$\mathrm{Attention}(H) = \mathrm{softmax}(A)\,V$$

incorporating both spatial and temporal dependencies uniformly via global attention, as opposed to stacking GNN and sequence models.
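A single-head sketch of this SPD-biased attention, assuming an unweighted adjacency for the BFS distances and random matrices in place of learned weights:

```python
import numpy as np

def spd_matrix(adj):
    """All-pairs shortest-path distances via BFS on an unweighted graph."""
    n = len(adj)
    dist = np.full((n, n), np.inf)
    for s in range(n):
        dist[s, s] = 0
        frontier = [s]
        while frontier:
            nxt = []
            for u in frontier:
                for v in np.nonzero(adj[u])[0]:
                    if dist[s, v] == np.inf:
                        dist[s, v] = dist[s, u] + 1
                        nxt.append(v)
            frontier = nxt
    return dist

def biased_attention(H, Wq, Wk, Wv, b, phi):
    """Single-head attention with a scalar bias b[phi] added to each score."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    A = Q @ K.T / np.sqrt(K.shape[-1]) + b[phi]
    A = np.exp(A - A.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)  # row-wise softmax
    return A @ V

rng = np.random.default_rng(1)
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])  # path graph on 3 nodes
node_spd = spd_matrix(adj).astype(int)             # connected graph, so all finite
T_hist, d = 2, 4
phi = np.tile(node_spd, (T_hist, T_hist))          # token-level SPD depends only on node pair
H = rng.standard_normal((T_hist * 3, d))
b = rng.standard_normal(node_spd.max() + 1)        # one learnable scalar per distance
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = biased_attention(H, Wq, Wk, Wv, b, phi)
```

Because the bias is indexed by node pair only, two tokens at the same pair of locations receive the same structural bias regardless of their timesteps, which is exactly what makes the attention "uniform" across space and time.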

4. Training Protocol and Datasets

The optimization target is the mean squared error (MSE) over the forecast horizon $T$ (e.g., 1 hour, $T = 12$ with 5-min increments):
$$\mathcal{L} = \frac{1}{N\,T}\sum_{i=1}^N\sum_{k=1}^T \left(\hat X_{t+k, i} - X_{t+k, i}\right)^2$$
The AdamW optimizer is employed, with $(\beta_1, \beta_2) = (0.9, 0.999)$ and mild weight decay. The architecture is evaluated (batch size 128) on traffic speed forecasting benchmarks: PEMS-BAY (325 sensors, 52,116 samples) and METR-LA (207 sensors, 34,727 samples). Graph adjacency is defined by a thresholded Gaussian kernel on geodesic pairwise distances:
$$W_{i,j} = \begin{cases} \exp\!\bigl[-\mathrm{dist}(v_i, v_j)^2 / \sigma^2\bigr], & \mathrm{dist}(v_i, v_j) \leq \kappa \\ 0, & \text{otherwise} \end{cases}$$
Input features per node include 12-step speed histories and a one-hot encoding for time-of-day, all Z-score normalized.
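The adjacency construction follows directly from the kernel formula; the pairwise distances below are hypothetical:

```python
import numpy as np

def gaussian_kernel_adjacency(dist, sigma, kappa):
    """W_ij = exp(-dist_ij^2 / sigma^2) if dist_ij <= kappa, else 0."""
    W = np.exp(-(dist ** 2) / sigma ** 2)
    W[dist > kappa] = 0.0
    return W

# Hypothetical pairwise distances between three sensors
dist = np.array([[0.0, 1.0, 5.0],
                 [1.0, 0.0, 2.0],
                 [5.0, 2.0, 0.0]])
W = gaussian_kernel_adjacency(dist, sigma=2.0, kappa=3.0)
```

The threshold $\kappa$ sparsifies the graph: distant sensor pairs get zero weight, so the SPD bias only "sees" edges between nearby sensors.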

5. Empirical Performance and Ablation

On 1-hour prediction tasks (horizon 12), T-Graphormer achieves state-of-the-art scores:

| Dataset | MAE | RMSE | MAPE | RMSE $\Delta$ | MAPE $\Delta$ |
|---|---|---|---|---|---|
| PEMS-BAY | 1.76 | 3.78 | 3.91% | -10.0% | -6.5% |
| METR-LA | 2.94 | 5.98 | 7.46% | -14.5% | -22.4% |

($\Delta$: relative reduction compared to the prior state-of-the-art STEP model.)

Ablation studies on METR-LA (horizon 12) demonstrate the importance of encodings:

  • Removing positional encoding increases MAE by 15.2%
  • Removing spatial bias increases MAE by 8.1%
  • Removing both leads to 23.0% higher MAE
  • Removing centrality encoding increases MAE by 4.8%
  • Adding a [CLS] token can reduce MAE by approximately 2–3%

At shorter forecast horizons (3 and 6 steps, i.e., 15- and 30-min predictions), T-Graphormer remains competitive, but STEP retains the edge (Bai et al., 22 Jan 2025).

6. Innovations Beyond Prior Art

T-Graphormer unifies spatiotemporal modeling by using global attention, eliminating the artificial separation of spatial (GNN) and temporal (RNN/Transformer) modules. Only learnable structural and positional biases are imposed, with no need for handcrafted spacetime priors.

Its design fundamentally derives from Graphormer, which established that Transformers, properly augmented with graph-structural encodings (notably via centrality and SPD biases), rival and sometimes surpass message-passing GNNs on large-scale molecular and property-prediction benchmarks (Ying et al., 2021). T-Graphormer maintains this inductively minimal, learnable-bias approach but applies it in a spatiotemporal context.

7. Limitations and Prospective Enhancements

T-Graphormer's computational complexity scales as $\mathcal O(l^2)$, with $l = T'N$, limiting direct scalability to long historical windows or very large graphs. The current framework only accommodates static graphs.
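To make the quadratic cost concrete, a quick back-of-the-envelope with the paper's $T' = 12$ input window:

```python
# Dense-attention cost for the two benchmarks, with T' = 12 history steps
T_hist = 12
for name, N in [("METR-LA", 207), ("PEMS-BAY", 325)]:
    l = T_hist * N    # sequence length fed to global self-attention
    scores = l * l    # entries in one dense attention matrix (per layer, per head)
    print(f"{name}: {l} tokens, {scores:,} attention scores")
```

Already at a few hundred sensors the attention matrix holds millions of entries per layer and head; at $N \sim 10^4$ nodes dense attention becomes intractable, motivating the sparse-attention extensions discussed below.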

Plausible extensions include sparse or factorized attention (e.g., Longformer, Performer, Linformer) to allow tractable inference for graphs with $N \sim 10^4$ nodes, and self-supervised masked autoencoding for pre-training. Further, incorporating advanced centrality measures, multi-scale edge encodings, or hybrid GNN–Transformer stacks may improve robustness and scalability. The core architecture is broadly applicable beyond traffic data, to domains such as weather grid forecasting, video prediction, epidemiology, and load forecasting in power networks (Bai et al., 22 Jan 2025, Ying et al., 2021).
