STGformer: Spatiotemporal Graph Transformers
- STGformer is a collection of architectures that integrate spatiotemporal graph representations with transformer mechanisms to model complex dependencies.
- It employs efficient linearized attention, high-order graph propagation, and gating strategies to drastically reduce computation while maintaining high accuracy.
- Applications include large-scale traffic forecasting, multi-agent trajectory prediction, and 3D human pose estimation, highlighting its versatile real-world utility.
STGformer refers to a set of distinct architectures that integrate spatiotemporal graph representations and transformer mechanisms for diverse application domains, including large-scale traffic forecasting, multi-agent trajectory prediction, and 3D human pose estimation. While implementations and objectives vary, these models are unified by explicit modeling of spatiotemporal dependencies using graph structures, efficient attention mechanisms, and/or explicit encoding of relational dynamics across space and time. The following sections provide a comprehensive account of the major research contributions labeled STGformer, each anchored in the terminology and results of the foundational literature.
1. Spatiotemporal Graph Transformers for Traffic Forecasting
STGformer, in the traffic forecasting context (Wang et al., 2024), is a hybrid transformer–graph neural architecture designed to address computational bottlenecks in predicting values (e.g., vehicle flow, speed) over large-scale road sensor networks. The core input consists of historical traffic measurements $X \in \mathbb{R}^{N \times T \times C}$, where $N$ is the number of sensors, $T$ is the temporal window size, and $C$ is the input channel dimension. The underlying road network is represented by a graph with adjacency matrix $A \in \mathbb{R}^{N \times N}$.
Architecture overview:
- Data embedding: Raw sensor data is projected into a $d$-dimensional space, enriched with daily and weekly cycle encodings and spatiotemporal positional encodings. All are concatenated to form the embedded representation $X_{\mathrm{emb}}$.
- Graph propagation module: High-order Chebyshev-type propagation with a fixed normalized Laplacian $\tilde{L}$, applied recursively up to order $K$ without learned graph weights, yielding $\{X, \tilde{L}X, \ldots, \tilde{L}^{K}X\}$ for multi-hop aggregation.
- Single-layer STG attention block: Treating the tensor as a flat sequence of length $NT$, standard QKV linear transformations are computed. Linearized attention fuses spatiotemporal dependencies without softmax, thereby avoiding $O((NT)^2)$ complexity.
- Recursive interaction: Attention is iteratively applied and gated across graph propagation orders, with layer-wise gating implemented as 1×1 convolution projections (a minimal sketch of this block follows the list).
- Output: The resulting representations are aggregated (by sum or learned fusion) and decoded into predictions for the forecasting horizon.
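A minimal PyTorch sketch of this block structure, combining linearized attention over the flattened space–time sequence, recursive Laplacian propagation, and per-order gating. The module layout, the ELU-based kernel feature map, and the sigmoid gate are illustrative assumptions, not the published implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_attention(q, k, v):
    """Softmax-free attention: cost is linear in sequence length L because
    the d-by-d summary (k^T v) is formed once and reused for every query."""
    q, k = F.elu(q) + 1, F.elu(k) + 1                # positive kernel feature map
    kv = torch.einsum("bld,ble->bde", k, v)          # (B, d, d) summary
    z = 1.0 / (torch.einsum("bld,bd->bl", q, k.sum(dim=1)) + 1e-6)
    return torch.einsum("bld,bde,bl->ble", q, kv, z)

class STGAttentionBlock(nn.Module):
    """Single-layer spatiotemporal attention applied across K propagation orders."""
    def __init__(self, d_model: int, order: int = 3):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.gates = nn.ModuleList(                  # 1x1-conv-style per-order gates
            nn.Linear(d_model, d_model) for _ in range(order + 1))

    def forward(self, x, lap):
        # x: (B, N, T, d) embedded traffic tensor; lap: (N, N) normalized Laplacian
        b, n, t, d = x.shape
        out, h = torch.zeros_like(x), x
        for gate in self.gates:
            q, k, v = self.qkv(h.reshape(b, n * t, d)).chunk(3, dim=-1)
            att = linear_attention(q, k, v).reshape(b, n, t, d)
            out = out + torch.sigmoid(gate(att)) * att   # gated fusion across orders
            h = torch.einsum("ij,bjtd->bitd", lap, h)    # next hop: h <- L~ h
        return out

# usage: 32 sensors, 12 time steps, 64-dim embedding
block = STGAttentionBlock(64)
out = block(torch.randn(2, 32, 12, 64), torch.eye(32))
```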
2. Mathematical Underpinnings and Efficiency
The mathematical formulation aims for maximal expressiveness (high-order, all-pairs spatiotemporal relationship modeling) at minimal computational expense. By replacing a full multi-layer attention stack with a single-layer, linearized block applied to multi-hop graph features, STGformer shifts memory and FLOP requirements from quadratic to linear in the sequence length $NT$. On the large California road graph benchmark, STGformer executes at 0.13% of the FLOPs of the state-of-the-art STAEformer, achieves a roughly 100× speedup, and reduces GPU memory usage by 99.8% in batch inference.
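To make the savings concrete, linearized attention can be written with a kernel feature map $\phi$ (the exact kernel used by STGformer may differ); reassociating the products avoids ever forming the $NT \times NT$ score matrix:

```latex
\mathrm{Attn}(Q, K, V)_i
  = \frac{\phi(Q_i)^{\top} \sum_{j=1}^{NT} \phi(K_j)\, V_j^{\top}}
         {\phi(Q_i)^{\top} \sum_{j=1}^{NT} \phi(K_j)}
```

Both sums are shared across all queries and cost $O(NT d^2)$ and $O(NT d)$ to form, so the block scales linearly rather than quadratically in $NT$.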
Ablation experiments demonstrate that performance degrades most sharply when the unified spatiotemporal attention is removed (ΔMAE 1.2), followed by the loss of high-order propagation, and then by the removal of the spatial or temporal sub-attention.
3. Experimental Protocols and Results
STGformer was evaluated on the LargeST benchmark (comprising San Diego, Bay Area, and Los Angeles subgraphs) and the established PEMS03/04/07/08 datasets. Training uses z-score normalization, early stopping, a batch size of 64, typical settings for the embedding dimension and propagation order, and Adam-based optimization in PyTorch; a minimal training-setup sketch follows.
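A minimal sketch of the preprocessing and optimization setup described above; the placeholder model, learning rate, patience value, and identity-reconstruction objective are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

# Toy stand-ins shaped like (samples, N sensors, T steps, C channels)
train_x = torch.randn(256, 8, 12, 1)
val_x = torch.randn(64, 8, 12, 1)

# z-score statistics are fit on the training split only, then reused everywhere
mean, std = train_x.mean(), train_x.std()
norm = lambda x: (x - mean) / std

model = torch.nn.Linear(1, 1)                 # placeholder for the full STGformer
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

best, wait, patience = float("inf"), 0, 10    # early stopping on validation loss
for epoch in range(100):
    opt.zero_grad()
    # identity reconstruction stands in for the real forecasting objective (MAE)
    loss = F.l1_loss(model(norm(train_x)), norm(train_x))
    loss.backward()
    opt.step()
    val = F.l1_loss(model(norm(val_x)), norm(val_x)).item()
    if val < best:
        best, wait = val, 0
    else:
        wait += 1
        if wait >= patience:
            break
```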
Key results on the combined LargeST benchmark:
| Method | Params | Avg MAE ↓ | Avg RMSE ↓ | Avg MAPE ↓ | Speedup | GPU Mem ↓ |
|---|---|---|---|---|---|---|
| STAEformer | ~4.7M | 19.97 | 33.53 | 12.01% | 1× (ref) | 100% |
| STGformer | 0.7M | 19.58 | 32.88 | 11.78% | ≈100× | ≈0.2% |
STGformer outperforms the previous best by 2–3% in MAE and achieves comparable or superior results in other error metrics, using orders of magnitude less computation and memory.
4. STGformer for Multi-Agent Trajectory Prediction
A second major strand (Li et al., 2023) employs STGformer for multi-agent trajectory forecasting. Here, the architecture learns a time-varying directed acyclic graph (Socio-Temporal Graph, STG) capturing pairwise influences among agents across time, with explicit latent variables representing "who influences whom, when".
- Socio-Temporal Graph (STG): At each time step $t$, a binary adjacency matrix $A_t$ is computed from latent codes $z_t$, indicating directed connections from past agent positions to each agent's future state.
- Latent-variable generative modeling: At each time step, $z_t$ is drawn from an autoregressive Gaussian prior, and agent positions $x_{t+1}$ from a Gaussian conditioned on the trajectories and latent codes so far. The architecture uses a variational posterior over $z_t$ and optimizes the ELBO, incorporating a sparsity penalty on the learned graph.
- Attention mechanism: Conventional self-attention is masked according to the learned STG adjacency, with logits of query–key pairs not connected in $A_t$ set to $-\infty$ before the softmax (see the sketch after this list).
- Results: STGformer sets the state of the art on the Stanford Drone Dataset and ETH/UCY, with ablations confirming that explicit modeling of the latent interaction graph is critical: without the learned graph, ADE/FDE double or more.
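A minimal sketch of this adjacency-masked attention (function and tensor names are illustrative; the real model learns $A_t$ jointly with the variational machinery):

```python
import torch

def stg_masked_attention(q, k, v, adj):
    """Self-attention in which a binary STG adjacency gates who attends to whom.
    q, k, v: (B, L, d); adj: (B, L, L) with adj[b, i, j] = 1 iff j may influence i."""
    logits = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    logits = logits.masked_fill(adj == 0, float("-inf"))  # sever non-edges pre-softmax
    # rows with no incoming edges would softmax over all -inf, producing NaNs
    weights = torch.softmax(logits, dim=-1).nan_to_num(0.0)
    return weights @ v

# usage: 5 agent states, 16-dim features, random binary influence graph
q = k = v = torch.randn(2, 5, 16)
adj = (torch.rand(2, 5, 5) > 0.5).long()
out = stg_masked_attention(q, k, v, adj)
```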
5. STGformer for 3D Human Pose Estimation
A third line (Liu et al., 2024) adapts STGformer to video-based 3D pose reconstruction. The model addresses the underutilization of body-structure priors and the limited modeling granularity of GCNs in both the spatial (skeletal) and temporal domains.
- Input and embedding: Sequential 2D joint detections are lifted via a shared FC+GELU embedding.
- Stacked blocks: Each block replaces vanilla MHSA with Spatio-Temporal criss-cross Graph (STG) attention, using explicit graph biases for skeleton (spatial) and frame-joint (temporal) relations. Q/K/V projections are split along the channel axis for the spatial vs. temporal computations, aggregated per axis, and concatenated (a minimal sketch follows the list).
- Dual-path Modulated Hop-wise Regular GCN (MHR-GCN): Parallel GCNs operate across spatial and temporal domains, each aggregating multi-hop adjacency information with weight modulation, skip-connected fusion, and residual layer norm. The two paths are fused via elementwise combination before the next block.
- Final layer: A regression head maps the representation to predicted 3D joint positions.
- Results: Achieves new SOTA on Human3.6M (Protocol 1: 40.3 mm MPJPE) and on MPI-INF-3DHP (PCK 98.8%, AUC 84.1%). Ablations confirm both STG attention and hop-wise dual-path GCN are necessary for optimal 3D localization.
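A minimal sketch of the channel-axis split behind the criss-cross STG attention: one half of the channels attends over joints within a frame (with a skeletal bias), the other over frames per joint (with a temporal bias). The shared projection, learnable additive biases, and even split ratio are illustrative assumptions:

```python
import torch
import torch.nn as nn

def biased_attention(x, bias):
    """Softmax attention over the second-to-last axis with an additive graph bias."""
    logits = x @ x.transpose(-2, -1) / x.size(-1) ** 0.5 + bias
    return torch.softmax(logits, dim=-1) @ x

class CrissCrossSTG(nn.Module):
    """Splits channels into a spatial (joint) half and a temporal (frame) half."""
    def __init__(self, d_model, n_joints, n_frames):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
        # learnable biases standing in for the skeleton / frame-joint graph priors
        self.spatial_bias = nn.Parameter(torch.zeros(n_joints, n_joints))
        self.temporal_bias = nn.Parameter(torch.zeros(n_frames, n_frames))

    def forward(self, x):                        # x: (B, T, J, d)
        xs, xt = self.proj(x).chunk(2, dim=-1)   # channel-axis split
        s = biased_attention(xs, self.spatial_bias)   # attend across joints per frame
        t = biased_attention(xt.transpose(1, 2),
                             self.temporal_bias).transpose(1, 2)  # across frames
        return torch.cat([s, t], dim=-1)         # concatenate the two halves

# usage: 27 frames, 17 joints, 64-channel embedding
out = CrissCrossSTG(64, n_joints=17, n_frames=27)(torch.randn(2, 27, 17, 64))
```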
6. Limitations and Prospective Extensions
Across architectures, STGformer instantiations exhibit several limitations:
- The fixed adjacency in the traffic variant cannot accommodate dynamic road networks or real-time graph changes (Wang et al., 2024).
- Single-head linear attention may exhibit underfitting on long sequences.
- 3D pose and multi-agent variants rely on hand-designed inputs (joint detections or trajectories), restricting end-to-end extensibility.
- Computational savings derive from linearized or axis-split attention, whose representational expressiveness may be less than full dense attention in certain regimes.
Future work proposed in the literature includes modeling dynamic or learned graph structure, developing multi-head or factorized STG attention for higher scalability, extending the architecture to other spatiotemporal domains (e.g., epidemiology, weather), and exploring continuous-time hybrid GCN–ODE modules.
7. Impact and Generalization
STGformer architectures advance the state of the art by enabling explicit spatiotemporal relational reasoning at orders of magnitude lower computational cost, with competitive or superior accuracy. Results indicate strong generalization across years (in traffic forecasting, a 13–14% RMSE drop in cross-year tests), and highlight the architectural modularity of STGformer as a general-purpose design for large-scale, graph-structured, time-dependent prediction tasks. In multi-agent and spatiotemporal pose domains, learned graph-topology discovery exposes interpretable social and kinematic patterns, supporting empirical claims of structural locality and relational inference capacity.
For full experimental, ablation, and implementation details of the various STGformer architectures, refer to the primary sources: Wang et al. (2024), Li et al. (2023), and Liu et al. (2024).