
On Vanishing Gradients, Over-Smoothing, and Over-Squashing in GNNs: Bridging Recurrent and Graph Learning (2502.10818v1)

Published 15 Feb 2025 in cs.LG and cs.AI

Abstract: Graph Neural Networks (GNNs) are models that leverage the graph structure to transmit information between nodes, typically through the message-passing operation. While widely successful, this approach is well known to suffer from the over-smoothing and over-squashing phenomena, which result in representational collapse as the number of layers increases and insensitivity to the information contained at distant and poorly connected nodes, respectively. In this paper, we present a unified view of these problems through the lens of vanishing gradients, using ideas from linear control theory for our analysis. We propose an interpretation of GNNs as recurrent models and empirically demonstrate that a simple state-space formulation of a GNN effectively alleviates over-smoothing and over-squashing at no extra trainable parameter cost. Further, we show theoretically and empirically that (i) GNNs are by design prone to extreme gradient vanishing even after a few layers; (ii) Over-smoothing is directly related to the mechanism causing vanishing gradients; (iii) Over-squashing is most easily alleviated by a combination of graph rewiring and vanishing gradient mitigation. We believe our work will help bridge the gap between the recurrent and graph neural network literature and will unlock the design of new deep and performant GNNs.

Authors (8)
  1. Álvaro Arroyo (6 papers)
  2. Alessio Gravina (12 papers)
  3. Benjamin Gutteridge (5 papers)
  4. Federico Barbero (14 papers)
  5. Claudio Gallicchio (34 papers)
  6. Xiaowen Dong (84 papers)
  7. Michael Bronstein (77 papers)
  8. Pierre Vandergheynst (72 papers)

Summary

The paper examines the over-smoothing and over-squashing phenomena in Graph Neural Networks (GNNs) through the lens of vanishing gradients, drawing parallels with linear control theory and recurrent models. The authors propose that these issues, which lead to representational collapse and insensitivity to distant nodes, can be better understood and addressed by viewing GNNs as recurrent models.

The key contributions of this work are:

  • Introducing GNN-SSM, a GNN model formulated as a state-space model, which provides better control over the Jacobian spectrum.
  • Demonstrating a connection between vanishing gradients and over-smoothing, explaining over-smoothing through the spectrum of layer-wise Jacobians and showing that GNN-SSMs can control the rate of smoothing.
  • Relating vanishing gradients to over-squashing, arguing that effective mitigation requires both graph rewiring and techniques to prevent vanishing gradients.

The paper starts by highlighting the similarities and differences between sequence models (RNNs) and GNNs. It points out that GNNs, particularly Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs), are highly susceptible to what the authors term "extreme gradient vanishing." This is attributed to the spectrally contractive nature of the normalized adjacency matrix, which hinders effective information propagation. Lemma 1 and Theorem 1 formalize this claim by analyzing the spectrum of the layer-wise Jacobian in GCNs, showing that the singular values of the Jacobian are modulated by the squared spectrum of the normalized adjacency. Specifically, for a linear GCN layer $\mathbf{H}^{(k)} = \tilde{\mathbf{A}}\,\mathbf{H}^{(k-1)}\,\mathbf{W}$, where $\tilde{\mathbf{A}}$ has eigenvalues $\{\lambda_1,\ldots,\lambda_n\}$ and $\mathbf{W}\mathbf{W}^\top$ has eigenvalues $\{\mu_1,\ldots,\mu_{d_k}\}$, the squared singular values of the layer-wise Jacobian $\mathbf{J} = \partial\,\mathrm{vec}(\mathbf{H}^{(k)}) / \partial\,\mathrm{vec}(\mathbf{H}^{(k-1)})$ are given by the set $\{\lambda_i^2\,\mu_j \mid i=1,\ldots,n,\; j=1,\ldots,d_k\}$. Furthermore, if $\mathbf{W}\in\mathbb{R}^{d_{k-1}\times d_k}$ is initialized with i.i.d. $\mathcal{N}(0,\sigma^2)$ entries, then each squared singular value $\gamma_{i,j}$ has mean $\mathbb{E}[\gamma_{i,j}] = \lambda_i^2\,\sigma^2$ and variance $\mathrm{Var}[\gamma_{i,j}] = \lambda_i^4\,\sigma^4\,\frac{d_k}{d_{k-1}}$.
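
Since this Jacobian is the Kronecker product $\mathbf{W}^\top \otimes \tilde{\mathbf{A}}$, the claimed spectrum can be checked numerically in a few lines. The sketch below (plain NumPy, using a randomly generated graph and a square weight matrix; all names and sizes are illustrative choices, not the paper's code) confirms that the squared singular values of the Jacobian coincide with the products $\lambda_i^2\,\mu_j$.

```python
import numpy as np

# Numerical check of the Jacobian spectrum of a linear GCN layer
# H' = A_norm @ H @ W (random graph and weights; illustrative only).

rng = np.random.default_rng(0)
n, d = 6, 4                                   # nodes, feature width

# Symmetrically normalized adjacency with self-loops
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T + np.eye(n)
d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(1)))
A_norm = d_inv_sqrt @ A @ d_inv_sqrt

W = rng.normal(0.0, 0.5, size=(d, d))         # i.i.d. Gaussian weights

# Jacobian of vec(H') w.r.t. vec(H) is the Kronecker product W^T (x) A_norm
J = np.kron(W.T, A_norm)
squared_svals = np.sort(np.linalg.svd(J, compute_uv=False) ** 2)

# Predicted spectrum: {lambda_i^2 * mu_j}, lambda_i = eig(A_norm), mu_j = eig(W W^T)
lam = np.linalg.eigvalsh(A_norm)
mu = np.linalg.eigvalsh(W @ W.T)
predicted = np.sort(np.outer(lam ** 2, mu).ravel())

print(np.allclose(squared_svals, predicted))  # True
# Because |lambda_i| <= 1 for the normalized adjacency, the lambda_i^2 factors
# shrink the spectrum layer after layer, which is the source of the vanishing.
```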

To address this, the authors introduce GNN-SSM, inspired by state-space models, to provide direct control over signal propagation dynamics. The GNN layer-to-layer update is rewritten as:

$$\mathbf{H}^{(k+1)} = \mathbf{\Lambda}\,\mathbf{H}^{(k)} + \mathbf{B}\,\mathbf{F}_{\boldsymbol{\theta}}\bigl(\mathbf{H}^{(k)}, k\bigr)$$

where:

  • $\mathbf{H}^{(k+1)}$ is the hidden state at layer $k+1$
  • $\mathbf{\Lambda}$ is the state transition matrix
  • $\mathbf{B}$ is the input matrix
  • $\mathbf{F}_{\boldsymbol{\theta}}(\mathbf{H}^{(k)}, k)$ is a time-varying coupling function connecting each node to its neighborhood.

Proposition 1 shows that by appropriately setting the eigenvalues of the memory matrix $\mathbf{\Lambda}$, the vectorized layer-wise Jacobian can be brought to the edge of chaos ($\mathrm{eig}(\mathbf{\Lambda}) \approx 1$).
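
A minimal sketch of how such a layer might look, assuming a diagonal state transition $\mathbf{\Lambda}$ and input matrix $\mathbf{B}$ acting channel-wise (kept as fixed, non-trainable buffers so no extra parameters are introduced) and a simple GCN-style coupling for $\mathbf{F}_{\boldsymbol{\theta}}$; this is an illustrative reading of the formulation, not the authors' reference implementation:

```python
import torch
import torch.nn as nn

class GNNSSMLayer(nn.Module):
    """Illustrative state-space style GNN update:
        H^{(k+1)} = Lambda * H^{(k)} + B * F_theta(H^{(k)}, k)
    Lambda and B are assumed diagonal (channel-wise) and non-trainable,
    so the layer adds no parameters beyond the coupling weights.
    """
    def __init__(self, dim, lam=1.0, b=0.1):
        super().__init__()
        # Eigenvalues of Lambda set near 1 keep the layer-wise Jacobian
        # close to the edge of chaos (cf. Proposition 1).
        self.register_buffer("lam", torch.full((dim,), float(lam)))
        self.register_buffer("b", torch.full((dim,), float(b)))
        self.coupling = nn.Linear(dim, dim, bias=False)  # weights of F_theta

    def forward(self, H, A_norm):
        # F_theta: neighborhood aggregation followed by a nonlinearity
        F = torch.tanh(A_norm @ self.coupling(H))
        return self.lam * H + self.b * F

# Hypothetical usage: H has shape (num_nodes, dim) and A_norm is the
# symmetrically normalized (dense) adjacency of the graph.
```

Keeping the eigenvalues of $\mathbf{\Lambda}$ at, or just below, 1, as Proposition 1 suggests, is what prevents the product of layer-wise Jacobians from contracting towards zero as depth grows.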

The paper then explores the relationship between vanishing gradients and over-smoothing. The authors describe over-smoothing as a consequence of GNN layers acting as contractions, causing node features to collapse to a zero fixed point. They connect the graph Dirichlet energy $\mathcal{E}(\mathbf{H}) = \mathrm{tr}\left(\mathbf{H}^\top \boldsymbol{\Delta}\mathbf{H}\right) = \sum_{(u,v) \in E} \lVert \mathbf{h}_u - \mathbf{h}_v \rVert^2$ to the layer-wise Jacobians, showing that $\mathcal{E}(f(\mathbf{H})) \leq 2 \lvert E \rvert \prod_{k=1}^{K} \mathrm{Lip}(f_k)^2 \,\lVert \mathbf{H} \rVert^2$, where $\mathrm{Lip}(f_k)$ is the Lipschitz constant of layer $k$ and $\lvert E \rvert$ is the number of edges in the graph $G$. This offers a more practical explanation for over-smoothing than the conventional one based on the projection of the signal onto the one-dimensional kernel of the graph Laplacian.
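
The equality between the trace form and the edge-sum form of the Dirichlet energy is straightforward to verify. The NumPy sketch below (random graph and features, purely illustrative) computes both and checks that they agree.

```python
import numpy as np

# Dirichlet energy of node features, computed two equivalent ways:
# via the graph Laplacian and via the sum over edges.

rng = np.random.default_rng(1)
n, d = 8, 3
A = (rng.random((n, n)) < 0.3).astype(float)
A = np.triu(A, 1); A = A + A.T                # undirected, no self-loops
L = np.diag(A.sum(1)) - A                     # graph Laplacian (Delta)
H = rng.normal(size=(n, d))                   # node features

energy_trace = np.trace(H.T @ L @ H)

edges = np.argwhere(np.triu(A, 1) > 0)        # each undirected edge counted once
energy_edges = sum(np.sum((H[u] - H[v]) ** 2) for u, v in edges)

print(np.isclose(energy_trace, energy_edges))  # True
```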

The impact of vanishing gradients on over-squashing is also examined. The authors argue that mitigating over-squashing requires addressing both graph connectivity and the model's capacity to avoid vanishing gradients. They emphasize the importance of preserving signal strength through non-dissipative model dynamics, arguing that the increased connectivity provided by graph rewiring is only effective when paired with non-dissipative, gradient-preserving updates.

The paper supports its theoretical claims with empirical evaluations on datasets such as Cora and the Long-Range Graph Benchmark (LRGB). The experiments analyze the evolution of the Dirichlet energy, latent vector norms, and node classification accuracy as the number of layers increases. The results show that the GNN-SSM model, particularly when combined with k-hop aggregation (kGNN-SSM), achieves state-of-the-art performance on tasks requiring long-range dependency modeling, such as the RingTransfer task. The authors also demonstrate that kGNN-SSM matches or outperforms DRew-Delay on graph property prediction tasks and LRGB datasets, highlighting the benefits of their state-space approach.