The paper examines the over-smoothing and over-squashing phenomena in Graph Neural Networks (GNNs) through the lens of vanishing gradients, drawing parallels with linear control theory and recurrent models. The authors propose that these issues, which lead to representational collapse and insensitivity to distant nodes, can be better understood and addressed by viewing GNNs as recurrent models.
The key contributions of this work are:
- Introducing GNN-SSM, a GNN model formulated as a state-space model, which provides better control over the Jacobian spectrum.
- Demonstrating a connection between vanishing gradients and over-smoothing, explaining over-smoothing through the spectrum of layer-wise Jacobians and showing that GNN-SSMs can control the rate of smoothing.
- Relating vanishing gradients to over-squashing, arguing that effective mitigation requires both graph rewiring and techniques to prevent vanishing gradients.
The paper starts by highlighting the similarities and differences between sequence models (RNNs) and GNNs. It points out that GNNs, particularly Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs), are highly susceptible to what the authors term "extreme gradient vanishing." This is attributed to the spectral contractive nature of the normalized adjacency matrix, which hinders effective information propagation. Lemma 1 and Theorem 1 formalize this claim by analyzing the spectrum of the layer-wise Jacobian in GCNs, showing that the singular values of the Jacobian are modulated by the squared spectrum of the normalized adjacency. The authors show that for a linear GCN layer $\mathbf{H}^{(k)} = \tilde{\mathbf{A}}\,\mathbf{H}^{(k-1)}\,\mathbf{W}$, where $\tilde{\mathbf{A}}$ has eigenvalues $\{\lambda_1,\dots,\lambda_n\}$ and $\mathbf{W}\mathbf{W}^\top$ has eigenvalues $\{\mu_1,\dots,\mu_{d_k}\}$, the squared singular values of the layer-wise Jacobian $\mathbf{J} = \partial\,\mathrm{vec}(\mathbf{H}^{(k)}) / \partial\,\mathrm{vec}(\mathbf{H}^{(k-1)})$ are given by the set $\{\lambda_i^2 \mu_j \mid i=1,\dots,n,\ j=1,\dots,d_k\}$. Furthermore, if $\mathbf{W} \in \mathbb{R}^{d_{k-1}\times d_k}$ is initialized with i.i.d. $\mathcal{N}(0,\sigma^2)$ entries, then the mean and variance of each squared singular value $\gamma_{i,j}$ are $\mathbb{E}[\gamma_{i,j}] = \lambda_i^2\sigma^2$ and $\mathrm{Var}[\gamma_{i,j}] = \lambda_i^4\sigma^4\, d_{k-1} d_k$, respectively.
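To make the Kronecker structure behind this result concrete, here is a minimal NumPy check (not taken from the paper; the graph, feature width, and weight scale are arbitrary illustrative choices) that the squared singular values of the layer-wise Jacobian of a linear GCN layer, which under a column-major vec is $\mathbf{J} = \mathbf{W}^\top \otimes \tilde{\mathbf{A}}$, coincide with the products $\lambda_i^2\mu_j$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small example: n nodes, linear GCN layer H_k = A_tilde @ H_{k-1} @ W.
n, d_in, d_out = 5, 3, 3  # square W keeps the spectral bookkeeping simple

# Symmetric normalized adjacency with self-loops: A_tilde = D^{-1/2} (A + I) D^{-1/2}
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T + np.eye(n)
d_inv_sqrt = 1.0 / np.sqrt(A.sum(1))
A_tilde = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

W = rng.normal(0.0, 0.5, size=(d_in, d_out))

# Layer-wise Jacobian of vec(H_k) w.r.t. vec(H_{k-1}):
# with column-major vec, vec(A_tilde @ H @ W) = (W.T kron A_tilde) vec(H)
J = np.kron(W.T, A_tilde)

# Squared singular values of J vs. the predicted set {lambda_i^2 * mu_j}
gamma = np.sort(np.linalg.svd(J, compute_uv=False) ** 2)
lam = np.linalg.eigvalsh(A_tilde)      # eigenvalues of the symmetric A_tilde
mu = np.linalg.eigvalsh(W @ W.T)       # eigenvalues of W W^T
predicted = np.sort(np.outer(lam ** 2, mu).ravel())

print(np.allclose(gamma, predicted))   # True: the two spectra match
# spectral radius of A_tilde is 1, so at best norm-preserving; the remaining
# eigenvalues have magnitude at most 1, which is the contraction the paper points to
print("largest |lambda| of A_tilde:", np.abs(lam).max())
```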
To address this, the authors introduce GNN-SSM, inspired by state-space models, to provide direct control over signal propagation dynamics. The GNN layer-to-layer update is rewritten as:
$$\mathbf{H}^{(k+1)} = \boldsymbol{\Lambda}\,\mathbf{H}^{(k)} + \mathbf{B}\, F_\theta\big(\mathbf{H}^{(k)}, k\big)$$
where:
- $\mathbf{H}^{(k+1)}$ is the hidden state at layer $k+1$
- $\boldsymbol{\Lambda}$ is the state transition matrix
- $\mathbf{B}$ is the input matrix
- $F_\theta(\mathbf{H}^{(k)}, k)$ is a time-varying coupling function connecting each node to its neighborhood.
Proposition 1 shows that by appropriately setting the eigenvalues of the memory matrix $\boldsymbol{\Lambda}$, the vectorized Jacobian can be brought to the edge of chaos ($\mathrm{eig}(\boldsymbol{\Lambda}) \approx 1$).
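As a rough illustration of how such an update behaves, the sketch below unrolls the recurrence with the simplest possible choices: $\boldsymbol{\Lambda} = \lambda\mathbf{I}$ and $\mathbf{B} = \beta\mathbf{I}$ with $\lambda$ close to 1, and a plain GCN-style coupling $F_\theta(\mathbf{H},k) = \tanh(\tilde{\mathbf{A}}\mathbf{H}\mathbf{W}^{(k)})$. These choices, the toy graph, and the depth are assumptions made for illustration, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

def gcn_coupling(A_tilde, H, W):
    """Simple neighbourhood coupling F_theta(H, k): one normalized aggregation + tanh."""
    return np.tanh(A_tilde @ H @ W)

def gnn_ssm_forward(A_tilde, H0, weights, lam=0.99, beta=0.1):
    """Unroll H^{(k+1)} = Lambda H^{(k)} + B F_theta(H^{(k)}, k), here with Lambda = lam*I, B = beta*I.

    lam close to 1 keeps the state-transition part of the layer-wise Jacobian near
    the edge of chaos, so the latent signal is neither dissipated nor amplified.
    """
    H = H0
    norms = [np.linalg.norm(H)]
    for W in weights:                          # one coupling weight per layer
        H = lam * H + beta * gcn_coupling(A_tilde, H, W)
        norms.append(np.linalg.norm(H))
    return H, norms

# Hypothetical toy graph, features, and depth (not the paper's experimental setup)
n, d, depth = 20, 8, 64
A = (rng.random((n, n)) < 0.2).astype(float)
A = np.triu(A, 1); A = A + A.T + np.eye(n)
d_inv_sqrt = 1.0 / np.sqrt(A.sum(1))
A_tilde = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

H0 = rng.normal(size=(n, d))
# Coupling weights deliberately scaled so that each plain layer is a contraction (||W||_2 < 1)
weights = [rng.normal(0.0, 0.3 / np.sqrt(d), size=(d, d)) for _ in range(depth)]

_, norms_ssm = gnn_ssm_forward(A_tilde, H0, weights, lam=0.99, beta=0.1)
_, norms_gcn = gnn_ssm_forward(A_tilde, H0, weights, lam=0.0, beta=1.0)  # plain GCN-style stack

print(f"latent norm after {depth} layers  GNN-SSM: {norms_ssm[-1]:.3f}   plain stack: {norms_gcn[-1]:.3e}")
```

The point of the sketch is only that the near-identity $\boldsymbol{\Lambda}$ term acts as a memory path: the purely contractive stack drives the latent norm toward the zero fixed point with depth, while the SSM-style update preserves it, mirroring the latent-norm and Dirichlet-energy behaviour the paper reports for deep models.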
The paper then explores the relationship between vanishing gradients and over-smoothing. The authors describe over-smoothing as a consequence of GNN layers acting as contractions, causing node features to collapse to a zero fixed point. They connect the graph Dirichlet energy $\mathcal{E}(\mathbf{H}) = \mathrm{tr}(\mathbf{H}^\top \boldsymbol{\Delta}\mathbf{H}) = \sum_{(u,v)\in E} \lVert \mathbf{h}_u - \mathbf{h}_v \rVert^2$ to the layer-wise Jacobians. The authors show that $\mathcal{E}(f(\mathbf{H})) \leq 2\lvert E\rvert \prod_{k=1}^K \mathrm{Lip}(f_k)^2\, \lVert\mathbf{H}\rVert^2$, where $\mathrm{Lip}(f_k)$ is the Lipschitz constant of layer $k$ and $\lvert E\rvert$ is the number of edges in graph $G$. This offers a more practical explanation for over-smoothing than the conventional understanding of signal projection into the one-dimensional kernel of the graph Laplacian.
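For readers who want to verify the two forms of the Dirichlet energy used here, the following snippet checks the identity on a random graph. The combinatorial Laplacian $\boldsymbol{\Delta} = \mathbf{D} - \mathbf{A}$ is assumed, since the summary does not specify which Laplacian the paper uses:

```python
import numpy as np

rng = np.random.default_rng(2)

def dirichlet_energy(H, A):
    """E(H) = tr(H^T Delta H) with the combinatorial Laplacian Delta = D - A."""
    Delta = np.diag(A.sum(1)) - A
    return np.trace(H.T @ Delta @ H)

def dirichlet_energy_edges(H, A):
    """Equivalent edge-sum form: sum over undirected edges of ||h_u - h_v||^2."""
    u, v = np.triu_indices_from(A, k=1)
    mask = A[u, v] > 0
    diffs = H[u[mask]] - H[v[mask]]
    return np.sum(diffs ** 2)

# Hypothetical random graph and features, just to confirm the two forms agree
n, d = 12, 4
A = (rng.random((n, n)) < 0.3).astype(float)
A = np.triu(A, 1); A = A + A.T
H = rng.normal(size=(n, d))

print(np.isclose(dirichlet_energy(H, A), dirichlet_energy_edges(H, A)))  # True
```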
The impact of vanishing gradients on over-squashing is also examined. The authors argue that mitigating over-squashing requires addressing both graph connectivity and the model's capacity to avoid vanishing gradients. They emphasize preserving signal strength through non-dissipative model dynamics: increased connectivity (e.g., via rewiring) must be paired with layers that do not dissipate the propagated signal.
The paper supports its theoretical claims with empirical evaluations on datasets such as Cora and the Long-Range Graph Benchmark (LRGB). The experiments analyze the evolution of the Dirichlet energy, latent vector norms, and node classification accuracy as the number of layers increases. The results show that the GNN-SSM model, particularly when combined with k-hop aggregation (kGNN-SSM), achieves state-of-the-art performance on tasks requiring long-range dependency modeling, such as the RingTransfer task. The authors also demonstrate that kGNN-SSM matches or outperforms DRew-Delay on graph property prediction tasks and LRGB datasets, highlighting the benefits of their state-space approach.