T-Graphormer: Spatiotemporal Forecasting
- T-Graphormer is a Transformer-based model that jointly models spatial and temporal dependencies on graphs using learnable centrality and spatiotemporal positional encodings.
- It employs a global self-attention mechanism with structured biases to integrate graph structure and time-dependent signals, offering state-of-the-art performance on traffic forecasting benchmarks like PEMS-BAY and METR-LA.
- The design unifies spatiotemporal modeling while highlighting scalability challenges and potential enhancements through sparse or hybrid attention mechanisms.
T-Graphormer is a Transformer-based model designed to address spatiotemporal forecasting tasks on graphs by modeling spatial and temporal dependencies jointly rather than separately. Drawing from the architectural innovations of Graphormer, T-Graphormer extends these principles for time-dependent signals on static graphs, enabling effective prediction of phenomena such as traffic speeds through global self-attention with minimal inductive bias (Bai et al., 22 Jan 2025, Ying et al., 2021).
1. Architectural Formulation
T-Graphormer operates on a static graph $G = (V, E)$ with $N$ nodes and $T$ historical timesteps. The model forms an input sequence of length $L = N \cdot T$ by concatenating node features $x_{v,t}$ for $v \in V$ and $t = 1, \dots, T$. Raw observations undergo a linear projection via $x_{v,t} W_{\text{proj}}$.
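As a toy illustration of this tokenization step (the sizes and the name `W_proj` are placeholders for exposition, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(0)

N, T, C, d = 4, 3, 2, 8              # nodes, timesteps, raw feature dim, embed dim (toy sizes)
X = rng.standard_normal((T, N, C))   # historical observations x_{v,t}
W_proj = rng.standard_normal((C, d)) # linear input projection (illustrative)

# Flatten the (T, N) grid into one token sequence of length L = N * T,
# then project each raw observation into the model dimension.
tokens = X.reshape(T * N, C) @ W_proj
assert tokens.shape == (N * T, d)
```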
Each token's final embedded vector incorporates:
- Centrality encodings (degree-based, time-agnostic),
- Learned spatiotemporal positional encodings,
- Pairwise attention spatial bias.
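The first two token-wise components can be sketched in a few lines of numpy; the tables `z_in`, `z_out`, and `P` stand in for learnable parameters, and the degrees are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, d = 4, 3, 8
max_deg = 5

proj = rng.standard_normal((T, N, d))         # projected observations (stand-in)
z_in = rng.standard_normal((max_deg + 1, d))  # learnable in-degree embedding table
z_out = rng.standard_normal((max_deg + 1, d)) # learnable out-degree embedding table
P = rng.standard_normal((T, N, d))            # learned spatiotemporal positional encoding p_{v,t}

in_deg = np.array([1, 2, 2, 1])               # example degrees on a 4-node graph
out_deg = np.array([2, 1, 1, 2])

# Each token at (t, v) sums its projection, both centrality embeddings, and p_{v,t};
# the pairwise spatial bias enters later, inside attention, not here.
H0 = proj + z_in[in_deg][None, :, :] + z_out[out_deg][None, :, :] + P
assert H0.shape == (T, N, d)
```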
The main body of the model is a stack of Transformer encoder blocks built with pre-LayerNorm ordering, multi-head self-attention, and feed-forward networks of hidden dimension $d_{\text{ff}}$. Prediction is performed either through a sequence of linear layers that map the final token embeddings to the forecast horizon, or by appending causal dilated convolutional layers before the linear projections.
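A minimal single-head numpy sketch of one such pre-LayerNorm block (normalization applied before each sub-layer, with residual connections around attention and the FFN); this is illustrative only, not the paper's implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def pre_ln_block(H, Wq, Wk, Wv, Wo, W1, W2):
    """One pre-LayerNorm encoder block: LN -> attention -> residual,
    then LN -> feed-forward -> residual (single head for brevity)."""
    X = layer_norm(H)
    dk = Wq.shape[1]
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(dk))
    H = H + (A @ (X @ Wv)) @ Wo           # attention sub-layer with residual
    X = layer_norm(H)
    H = H + np.maximum(X @ W1, 0.0) @ W2  # position-wise FFN with residual
    return H

rng = np.random.default_rng(2)
L, d, d_ff = 12, 8, 32                    # L = N*T tokens (toy sizes)
H = rng.standard_normal((L, d))
params = [rng.standard_normal(s) * 0.1 for s in
          [(d, d), (d, d), (d, d), (d, d), (d, d_ff), (d_ff, d)]]
out = pre_ln_block(H, *params)
assert out.shape == (L, d)
```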
2. Spatiotemporal Encoding Strategy
T-Graphormer explicitly learns a spatiotemporal positional embedding $p_{v,t}$, so that each input token at location $(v, t)$ receives:

$$h^{(0)}_{v,t} = x_{v,t} W_{\text{proj}} + z^{-}_{\deg^{-}(v)} + z^{+}_{\deg^{+}(v)} + p_{v,t}$$

where $z^{-}, z^{+}$ are learnable embeddings for in- and out-degree, and $p_{v,t}$ is the learned spatiotemporal positional encoding for node $v$ at time $t$.
Centrality and positional signals are essential as the vanilla Transformer has no intrinsic awareness of graph structure or temporal ordering. Structural (SPD-based) and temporal (position-based) biases are both realized via learnable token-wise and pair-wise components.
3. Attention Mechanism with Structured Bias
Let $H \in \mathbb{R}^{L \times d}$ denote token embeddings after encoding. Queries, keys, and values are computed as:

$$Q = H W_Q, \quad K = H W_K, \quad V = H W_V$$

where $W_Q, W_K \in \mathbb{R}^{d \times d_K}$ and $W_V \in \mathbb{R}^{d \times d_V}$.
The attention score between tokens $i$ and $j$ is:

$$A_{ij} = \frac{Q_i K_j^{\top}}{\sqrt{d_K}} + b_{\phi(v_i, v_j)}$$

where $\phi(v_i, v_j)$ is the shortest path distance between $v_i$ and $v_j$ on $G$, with learnable scalars $b_{\phi}$. The final output of multi-head attention is

$$\mathrm{Attn}(H) = \mathrm{softmax}(A)\, V,$$

incorporating both spatial and temporal dependencies uniformly via global attention, as opposed to stacking GNN and sequence models.
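A toy numpy sketch of SPD-biased attention on a 4-node path graph (single head, one timestep per node for brevity; the table `b` stands in for the learnable per-distance scalars):

```python
import numpy as np
from collections import deque

def shortest_path_distances(adj):
    """All-pairs shortest path distances by BFS on an unweighted graph
    given as an adjacency list."""
    n = len(adj)
    spd = np.full((n, n), n, dtype=int)  # n acts as an "unreachable" sentinel
    for s in range(n):
        spd[s, s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if spd[s, v] > spd[s, u] + 1:
                    spd[s, v] = spd[s, u] + 1
                    q.append(v)
    return spd

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(3)
adj = [[1], [0, 2], [1, 3], [2]]   # path graph on 4 nodes: 0-1-2-3
spd = shortest_path_distances(adj)

n, dk = 4, 8
Q = rng.standard_normal((n, dk))
K = rng.standard_normal((n, dk))
b = rng.standard_normal(n + 1)     # one learnable scalar per SPD value

# Bias each pairwise score by b[spd(i, j)] before the softmax.
A = softmax(Q @ K.T / np.sqrt(dk) + b[spd])
assert np.allclose(A.sum(-1), 1.0)
```

In the full model, tokens are (node, time) pairs and the SPD bias depends only on the node pair, so the same bias is shared across timesteps.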
4. Training Protocol and Datasets
The optimization target is the mean squared error (MSE) over the forecast horizon $T'$ (e.g., 1 hour, with 5-min increments):

$$\mathcal{L} = \frac{1}{N T'} \sum_{t=1}^{T'} \sum_{v \in V} \left( \hat{y}_{v,t} - y_{v,t} \right)^2$$

The AdamW optimizer is employed with mild weight decay. The architecture is evaluated (batch size 128) on traffic speed forecasting benchmarks: PEMS-BAY (325 sensors, 52,116 samples) and METR-LA (207 sensors, 34,727 samples). Graph adjacency is defined by a thresholded Gaussian kernel on geodesic pairwise distances:

$$W_{ij} = \begin{cases} \exp\!\left(-\dfrac{\mathrm{dist}(v_i, v_j)^2}{\sigma^2}\right) & \text{if above a sparsity threshold} \\ 0 & \text{otherwise} \end{cases}$$

Input features per node include 12-step speed histories and a one-hot encoding for time-of-day, all Z-score normalized.
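The adjacency construction can be sketched as follows; the bandwidth convention ($\sigma$ set to the standard deviation of the pairwise distances) follows common practice for these benchmarks, and the threshold value here is illustrative:

```python
import numpy as np

def gaussian_kernel_adjacency(dist, threshold=0.1):
    """W_ij = exp(-dist_ij^2 / sigma^2), zeroed below a sparsity threshold,
    with sigma set to the std of the pairwise distances (common convention)."""
    sigma = dist.std()
    W = np.exp(-(dist ** 2) / (sigma ** 2))
    W[W < threshold] = 0.0
    return W

# Toy pairwise road-network distances between 3 sensors (symmetric, in km).
dist = np.array([[0.0, 1.0, 4.0],
                 [1.0, 0.0, 3.0],
                 [4.0, 3.0, 0.0]])
W = gaussian_kernel_adjacency(dist)
assert np.allclose(np.diag(W), 1.0)  # zero distance -> weight 1
```

Nearby sensors receive large edge weights while distant pairs are pruned to zero, yielding a sparse weighted adjacency.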
5. Empirical Performance and Ablation
On 1-hour prediction tasks (horizon 12), T-Graphormer achieves state-of-the-art scores:
| Dataset | MAE | RMSE | MAPE | ΔRMSE vs. STEP | ΔMAPE vs. STEP |
|---|---|---|---|---|---|
| PEMS-BAY | 1.76 | 3.78 | 3.91% | -10.0% | -6.5% |
| METR-LA | 2.94 | 5.98 | 7.46% | -14.5% | -22.4% |

(Δ columns give the relative reduction compared to the prior state-of-the-art STEP model.)
Ablation studies on METR-LA (horizon 12) demonstrate the importance of encodings:
- Removing positional encoding increases MAE by 15.2%
- Removing spatial bias increases MAE by 8.1%
- Removing both leads to 23.0% higher MAE
- Removing centrality encoding increases MAE by 4.8%
- Adding a [CLS] token can reduce MAE by approximately 2–3%
Shorter forecast horizons (3, 6) show competitive performance with STEP, but STEP remains superior for 15- and 30-min predictions (Bai et al., 22 Jan 2025).
6. Innovations Beyond Prior Art
T-Graphormer unifies spatiotemporal modeling by using global attention, eliminating the artificial separation of spatial (GNN) and temporal (RNN/Transformer) modules. Only learnable structural and positional biases are imposed, with no need for handcrafted spacetime priors.
Its design fundamentally derives from Graphormer, which established that Transformers, properly augmented with graph-structural encodings (notably via centrality and SPD biases), rival and sometimes surpass message-passing GNNs on large-scale molecular and property-prediction benchmarks (Ying et al., 2021). T-Graphormer maintains this inductively minimal, learnable-bias approach but applies it in a spatiotemporal context.
7. Limitations and Prospective Enhancements
T-Graphormer's computational complexity scales as $O(L^2)$, with sequence length $L = N \cdot T$, limiting direct scalability to long historical windows or very large graphs. The current framework only accommodates static graphs.
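Concretely, for PEMS-BAY-sized inputs (325 sensors, 12-step window, per the training setup above), the flattened sequence already produces millions of pairwise attention scores per head and layer:

```python
# Quadratic attention cost over flattened spatiotemporal tokens: L = N * T.
N, T = 325, 12    # PEMS-BAY sensors, 12-step input window
L = N * T
print(L, L * L)   # number of tokens, pairwise scores per head per layer
```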
Plausible extensions include the adoption of sparse or factorized attention (e.g., Longformer, Performer, Linformer) to allow tractable inference on much larger graphs, and exploiting self-supervised masked autoencoding for pre-training. Further, incorporating advanced centrality measures, multi-scale edge encodings, or hybrid GNN–Transformer stacks may improve robustness and scalability. The core architecture is broadly applicable beyond traffic data, to domains such as weather grid forecasting, video prediction, epidemiology, and load forecasting in power networks (Bai et al., 22 Jan 2025, Ying et al., 2021).