Spatiotemporal Over-Squashing in STGNNs
- Spatiotemporal over-squashing is defined as the rapid decay of the spatiotemporal Jacobian in STGNNs, limiting effective information propagation across distant nodes and time steps.
- The analysis demonstrates that spatial and temporal influences factorize independently, with each factor decaying according to its own architectural and topological parameters.
- Empirical evaluations and design strategies, including temporal rewiring and TTS versus TAS paradigms, provide actionable insights for mitigating over-squashing issues.
Spatiotemporal over-squashing is a compound information bottleneck in spatiotemporal graph neural networks (STGNNs), emerging from the interaction of spatial and temporal propagation constraints. In static GNNs, over-squashing refers to the exponential decay of influence between distant nodes, as diagnosed by the spectral norm of the Jacobian with respect to graph distance. In STGNNs, where each node possesses a time series of features, the challenge intensifies: information from both distant nodes and distant time steps can be suppressed, yielding a rapidly decaying spatiotemporal Jacobian. This compound decay creates barriers to long-range interaction in both domains, fundamentally limiting model expressivity and performance across tasks involving dynamic graphs (Marisca et al., 18 Jun 2025).
1. Formalization of Spatiotemporal Over-Squashing
Spatiotemporal over-squashing is defined in terms of the spatiotemporal Jacobian
$$\frac{\partial \mathbf{h}_u^{(L)}(T)}{\partial \mathbf{x}_v(t)},$$
where $\mathbf{x}_v(t)$ denotes the features of node $v$ at time $t$ and $\mathbf{h}_u^{(L)}(T)$ is the representation of node $u$ at time $T$ after $L$ spatiotemporal layers. Over-squashing occurs when, for large spatial distance $d(u,v)$ or temporal separation $T - t$, the spectral norm $\left\|\partial \mathbf{h}_u^{(L)}(T) / \partial \mathbf{x}_v(t)\right\|$ falls below a small $\epsilon > 0$, even if the data-generating process dictates that information from $\mathbf{x}_v(t)$ should significantly influence $\mathbf{h}_u^{(L)}(T)$.
The decay of the Jacobian is bounded by
$$\left\|\frac{\partial \mathbf{h}_u^{(L)}(T)}{\partial \mathbf{x}_v(t)}\right\| \;\le\; c^{L}\,\big[\mathbf{P}\big]_{(u,T),(v,t)},$$
where $c$ collects layer and activation norms and the topology factor $\big[\mathbf{P}\big]_{(u,T),(v,t)}$ decays rapidly with both graph and temporal distance, enforcing a compound bottleneck not present in static GNNs (Marisca et al., 18 Jun 2025).
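The compound nature of the topology factor can be probed with a small numerical sketch. Assuming (for illustration only) that it factorizes into powers of a normalized adjacency matrix and a causal temporal support, influence vanishes both for nodes beyond the spatial hop budget and for time steps beyond the temporal receptive field; all sizes below are illustrative:

```python
import numpy as np

# Path graph on 8 nodes with degree-normalized adjacency (assumed example topology).
n = 8
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
A = A / A.sum(1, keepdims=True)

# Row-normalized causal temporal support for kernel size 2 over 10 time steps.
T_len = 10
M = np.eye(T_len) + np.eye(T_len, k=-1)
M = M / M.sum(1, keepdims=True)

m_s, m_t = 3, 4                               # spatial / temporal layer budgets
spatial = np.linalg.matrix_power(A, m_s)      # spatial part of the topology factor
temporal = np.linalg.matrix_power(M, m_t)     # temporal part of the topology factor

# Compound factor for target node u=0 at the final time step:
# entry (v, t) bounds the influence of node v at time t.
factor = np.outer(spatial[0], temporal[-1])
# Nodes farther than m_s hops (e.g. v=5) and steps farther back than m_t
# (e.g. t <= 4) receive exactly zero influence.
```

Either factor hitting zero annihilates the product, which is the compound bottleneck in miniature: spatial and temporal reachability constraints multiply rather than add.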
2. Theoretical Results: Temporal, Spatial, and Compound Effects
Temporal-Only Case (Temporal Convolutional Networks)
For $L$ layers of 1D causal convolutions (TCNs) with kernel size $K$, under bounded filter norm $\|\mathbf{W}^{(\ell)}\| \le w$ and activation derivative $|\sigma'| \le c$, the sensitivity bound is
$$\left\|\frac{\partial \mathbf{h}^{(L)}(T)}{\partial \mathbf{x}(t)}\right\| \;\le\; (c\,w)^{L}\,\Big[\Big(\textstyle\sum_{k=0}^{K-1}\mathbf{S}^{k}\Big)^{L}\Big]_{T,t},$$
where $\mathbf{S}$ is the Toeplitz backward-shift matrix ($\mathbf{S}_{i,j} = 1$ iff $j = i - 1$).
Significantly, deep TCN stacks exhibit a "temporal sink" effect: the influence on the final state at time $T$ becomes dominated by the earliest-in-time tokens (i.e., largest $T - t$), with recent tokens (smaller $T - t$) contributing exponentially less. Formally, the entry $\big[(\sum_{k=0}^{K-1}\mathbf{S}^{k})^{L}\big]_{T,t}$ counts the number of length-$L$ causal paths from time $t$ to time $T$; this path multiplicity grows with the separation $T - t$ for separations up to roughly half the receptive field, so over typical windows the sensitivity bound is largest for the earliest tokens. This is counter to the typical time-series locality bias, where more recent steps are presumed more influential (Marisca et al., 18 Jun 2025).
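The path-counting argument behind the temporal sink can be checked numerically. A minimal sketch with assumed toy sizes (kernel size 2, depth 8, sequence length 5), where the entry at separation $m$ is the binomial coefficient $\binom{L}{m}$:

```python
import numpy as np

# Toy causal TCN support: kernel size K=2, depth L=8, sequence length 5
# (illustrative sizes; S is the Toeplitz backward-shift matrix from the bound).
T_len, K, L = 5, 2, 8
S = np.eye(T_len, k=-1)                                   # S[i, i-1] = 1
M = sum(np.linalg.matrix_power(S, k) for k in range(K))   # one layer's causal support
ML = np.linalg.matrix_power(M, L)                         # path multiplicities after L layers

last_row = ML[-1]  # influence bound of each input time step on the final output
# For K=2 this row is [C(8,4), C(8,3), C(8,2), C(8,1), C(8,0)] = [70, 56, 28, 8, 1]:
# the earliest token out-influences the most recent one by a factor of 70.
```

The monotone growth toward the earliest token is exactly the sink effect: path multiplicity, not recency, governs the worst-case sensitivity.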
Spatiotemporal Case (Interleaved Spatiotemporal Architectures)
For models with $B$ blocks, each comprising $l_t$ temporal and $l_s$ spatial message-passing layers (e.g., MPTCNs), the sensitivity bound becomes
$$\left\|\frac{\partial \mathbf{h}_u(T)}{\partial \mathbf{x}_v(t)}\right\| \;\le\; c_t^{B l_t}\, c_s^{B l_s}\, \big[\mathbf{A}^{B l_s}\big]_{u,v}\, \Big[\Big(\textstyle\sum_{k=0}^{K-1}\mathbf{S}^{k}\Big)^{B l_t}\Big]_{T,t},$$
with $\mathbf{A}$ the spatial message-passing matrix, $\mathbf{S}$ the temporal backward-shift matrix, $K$ the temporal kernel size, and constants $c_t$, $c_s$ the temporal and spatial layer/operator norms.
The crucial insight is that the bound factorizes into a spatial term $\big[\mathbf{A}^{B l_s}\big]_{u,v}$ (powers of the message-passing matrix $\mathbf{A}$) times a temporal term $\big[\big(\sum_{k}\mathbf{S}^{k}\big)^{B l_t}\big]_{T,t}$ (powers of the causal-shift operator $\mathbf{S}$). Thus, degradation in either space or time can independently squash influence, preventing effective information propagation between distant nodes or across distant time steps.
Moreover, the worst-case bound is insensitive to whether processing is distributed as time-then-space (TTS: $B = 1$) or interleaved time-and-space (TAS: $B > 1$), provided the total layer budgets $m_t = B l_t$ and $m_s = B l_s$ are fixed (Marisca et al., 18 Jun 2025).
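This TTS/TAS equivalence follows from a Kronecker-product identity on the factorized bound: raising one block's compound operator to the $B$-th power yields the same matrix as a single block spending the whole budget. A sketch with assumed toy sizes (ring graph, kernel size 2):

```python
import numpy as np

# Ring graph on 6 nodes with normalized adjacency (assumed example topology).
n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, (i - 1) % n] = A[i, (i + 1) % n] = 0.5

# Causal temporal support over 8 steps, kernel size 2.
T_len = 8
M = np.eye(T_len) + np.eye(T_len, k=-1)

def compound_bound(B, l_s, l_t):
    """Worst-case sensitivity factor after B blocks of (l_t temporal, l_s spatial) layers."""
    block = np.kron(np.linalg.matrix_power(A, l_s), np.linalg.matrix_power(M, l_t))
    return np.linalg.matrix_power(block, B)

tts = compound_bound(B=1, l_s=2, l_t=4)   # time-then-space: one block, full budget
tas = compound_bound(B=2, l_s=1, l_t=2)   # time-and-space: two interleaved blocks
# Same total budgets (m_s=2, m_t=4) give the identical worst-case factor.
```

Only the totals $m_s$ and $m_t$ enter the bound, which is why the paradigms tie in the worst case.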
3. Comparison with Static Over-Squashing
In static GNNs, the over-squashing bottleneck concerns only spatial propagation, with
$$\left\|\frac{\partial \mathbf{h}_u^{(L)}}{\partial \mathbf{x}_v}\right\| \;\le\; c^{L}\,\big[\mathbf{A}^{L}\big]_{u,v},$$
where the exponential decay depends solely on the graph distance $d(u,v)$.
In contrast, STGNNs are affected by both spatial and temporal propagation barriers, with the Jacobian bounded by the product of a spatial factor $\big[\mathbf{A}^{m_s}\big]_{u,v}$ and a temporal factor $\big[\big(\sum_{k}\mathbf{S}^{k}\big)^{m_t}\big]_{T,t}$ (with $m_s$, $m_t$ the total spatial and temporal layer budgets). This creates a scenario where long-range interactions in either space or time can be excessively damped, compounding information loss. Notably, causal convolutions lead to an inversion of expected locality biases in the temporal domain: distant (rather than recent) time steps dominate influence due to cumulative path multiplicities (Marisca et al., 18 Jun 2025).
4. Processing Paradigms and Computational Trade-offs
STGNNs typically adopt one of two paradigms:
- Time-then-Space (TTS): The full temporal history of each node is encoded (typically by a TCN or RNN) into a single vector, followed by one static GNN pass.
- Time-and-Space (TAS): Temporal encoding and spatial message passing are interleaved across layers.
Theoretical analysis demonstrates that, for fixed temporal ($m_t$) and spatial ($m_s$) budgets, the worst-case Jacobian decay is identical for TTS and TAS. Consequently, TTS, which is up to a factor of the temporal window length cheaper in spatial computation (one spatial pass over the encoded node states for TTS versus one per time step for TAS), is not fundamentally at a disadvantage concerning spatiotemporal over-squashing. This equivalence enables principled design choices based on computational constraints rather than presumed superiority in information propagation (Marisca et al., 18 Jun 2025).
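Under a simple, assumed cost model (one operation per edge per spatial layer, applied at every time step that still carries per-step states), the cost gap can be counted directly:

```python
# Assumed cost model: spatial message passing costs one op per edge per layer.
# TTS pools the temporal axis first, so its single GNN stack runs once;
# TAS interleaves, so spatial layers run at every time step.
def spatial_ops(paradigm: str, T: int, m_s: int, num_edges: int) -> int:
    """Count spatial message-passing operations for one forward pass."""
    steps = 1 if paradigm == "TTS" else T
    return steps * m_s * num_edges

# TAS pays a factor-T overhead for the same spatial budget m_s.
assert spatial_ops("TAS", T=12, m_s=2, num_edges=100) \
       == 12 * spatial_ops("TTS", T=12, m_s=2, num_edges=100)
```

Since the worst-case bound above ties the two paradigms, this factor-$T$ saving comes at no cost in worst-case sensitivity.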
5. Empirical Assessment across Synthetic and Real-World Tasks
Synthetic Temporal Experiments
In "CopyFirst" and "CopyLast" tasks on sequences of length $T$, causal TCNs successfully propagate early-token (CopyFirst) information once receptive fields are sufficient, but consistently fail on late-token (CopyLast) targets as depth increases. This confirms the theoretical "sink" effect, whereby early tokens are over-privileged.
Applying row-normalization to the temporal shift matrix ensures each token contributes equally to the output, mitigating this bias. Dilated convolutions flatten influence across recent tokens up to the receptive field size but introduce periodic imbalance due to resets (Marisca et al., 18 Jun 2025).
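The flattening effect of dilation can be illustrated on conv supports alone. In this sketch (illustrative sizes, kernel size 2), a WaveNet-style stack with dilations 1, 2, 4 admits exactly one causal path per offset within its receptive field, whereas an undilated stack of the same depth piles up binomially many paths on intermediate offsets:

```python
import numpy as np

# Supports of kernel-size-2 causal convolutions over 12 steps (assumed toy sizes).
T_len = 12
S = np.eye(T_len, k=-1)

def support(d):
    """Support matrix of a dilated causal conv reaching offsets {0, d}."""
    return np.eye(T_len) + np.linalg.matrix_power(S, d)

dilated = support(1) @ support(2) @ support(4)   # WaveNet-style dilation stack
plain = np.linalg.matrix_power(support(1), 3)    # three undilated layers

# Influence of each input step on the final output:
# dilated[-1] has exactly one path per offset 0..7 (flat within the field),
# while plain[-1] is binomially skewed ([1, 3, 3, 1] over offsets 0..3).
```

At the final output the dilated stack is perfectly flat; the periodic imbalance mentioned above shows up at other output positions, where dilation offsets reset relative to the sequence boundary.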
Spatiotemporal Synthetic Experiments
The "RocketMan" task involves predicting averages of features at specific spatiotemporal offsets on graphs with Ring or Lollipop topologies. Models on the Lollipop, which exhibits known spatial bottlenecks, perform worse than on the Ring, confirming spatial over-squashing. Increasing the temporal kernel size amplifies the temporal sink effect, impairing prediction for recent time steps. TTS and TAS variants exhibit empirically indistinguishable performance, in line with theoretical expectations.
Real-World Forecasting Benchmarks
Experiments on METR-LA, PEMS-BAY, and EngRAD datasets—encompassing traffic and weather forecasting—show that TTS-based MPTCNs match or outperform TAS-based variants in Mean Absolute Error (MAE). The benefit of row-normalization increases with temporal context length (notably for EngRAD), underscoring the exacerbation of over-squashing with longer horizons. Complex architectures such as Graph WaveNet (GWNet), when recast in TTS form, display analogous bottleneck behavior, suggesting a broad relevance of these phenomena (Marisca et al., 18 Jun 2025).
6. Strategies for Mitigation and Model Design
Temporal Graph Rewiring:
- Apply row-normalization to the temporal adjacency matrix to equalize sensitivity across time steps, which is effective for forecasting.
- Use dilated convolutions to vary temporal reach, flattening the Jacobian row distribution. However, periodic resets can reintroduce imbalance.
Spatial Graph Rewiring:
- Utilize standard techniques (e.g., adding edges, curvature-based rewiring) to decrease long-range attenuation in the spatial message-passing matrix powers, operating independently from temporal remedies.
Architectural Choice:
- Prefer TTS models under compute constraints, as they match TAS in bottleneck severity while providing significantly lower cost.
- Adjust total spatial and temporal processing depth flexibly, avoiding excessive stacking of pure temporal or spatial layers without concurrent rewiring.
Monitoring Sensitivity:
- Track sample Jacobians throughout training to identify emergent over-squashing and facilitate targeted intervention with rewiring or normalization procedures.
Spatiotemporal over-squashing is thus rigorously conceptualized as a product of spatial and temporal propagation limits. The formal Jacobian analysis provides foundational guidance for model design, indicates the subtle inversion of temporal locality in standard convolutional architectures, and establishes the computational and expressivity parity of major STGNN processing paradigms (Marisca et al., 18 Jun 2025).