Graph Wavelet Neural Network (1904.07785v1)
Abstract: We present graph wavelet neural network (GWNN), a novel graph convolutional neural network (CNN), leveraging graph wavelet transform to address the shortcomings of previous spectral graph CNN methods that depend on graph Fourier transform. Different from graph Fourier transform, graph wavelet transform can be obtained via a fast algorithm without requiring matrix eigendecomposition with high computational cost. Moreover, graph wavelets are sparse and localized in vertex domain, offering high efficiency and good interpretability for graph convolution. The proposed GWNN significantly outperforms previous spectral graph CNNs in the task of graph-based semi-supervised classification on three benchmark datasets: Cora, Citeseer and Pubmed.
Summary
- The paper introduces GWNN, replacing the traditional graph Fourier transform with graph wavelet transform to improve computational efficiency and local feature representation.
- It detaches feature transformation from convolution, reducing parameter complexity from O(n·p·q) to O(n+p·q) and mitigating overfitting in semi-supervised scenarios.
- Experiments on Cora, Citeseer, and Pubmed demonstrate that GWNN outperforms existing spectral methods with sparse, interpretable, and scalable graph convolutions.
This paper, "Graph Wavelet Neural Network" (Graph Wavelet Neural Network, 2019), introduces a novel graph convolutional neural network (CNN) model called GWNN, which leverages graph wavelet transform (GWT) instead of the traditional graph Fourier transform (GFT) used in many spectral graph CNNs. The core motivation is to address the limitations of GFT-based methods, such as high computational cost due to eigendecomposition, lack of sparsity, and non-locality of the resulting convolution.
Limitations of Graph Fourier Transform for CNNs
Spectral graph CNNs traditionally define convolution using GFT based on the eigenvectors of the graph Laplacian matrix. While this allows defining filters in the spectral domain, it suffers from several practical drawbacks:
- High Computational Cost: Computing the eigendecomposition of the graph Laplacian requires O(n^3) time, where n is the number of nodes, which is prohibitive for large graphs.
- Inefficiency: The eigenvectors of the Laplacian are generally dense, making the GFT and inverse GFT operations computationally expensive, O(n^2).
- Non-locality: Graph convolution defined via the GFT is not localized in the vertex domain, meaning the influence on a node's signal is not restricted to its immediate neighborhood.
Previous works like ChebyNet (Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering, 2016) and GCN (Semi-Supervised Classification with Graph Convolutional Networks, 2016) addressed the computational cost and induced locality by approximating the spectral filter with polynomial expansions of the Laplacian, avoiding eigendecomposition. However, this approximation limits the flexibility of the filter.
Graph Wavelet Transform in GWNN
GWNN proposes using graph wavelets as a new set of bases for the spectral representation. The graph wavelets ψ_s are defined by applying a scaling matrix G_s to the Laplacian eigenvectors U: ψ_s = U G_s U^⊤. The scaling matrix G_s is diagonal with entries g(sλ_i), where g is a kernel function (e.g., the heat kernel g(sλ_i) = e^{-sλ_i}) and s is a scaling parameter.
Graph wavelet transform offers several advantages for graph convolution:
- Efficiency: The graph wavelets ψ_s and their inverses ψ_s^{-1} can be computed efficiently with fast algorithms, such as the Chebyshev polynomial approximation of Hammond et al. (Wavelets on Graphs via Spectral Graph Theory, 2009), which has computational complexity O(m × |E|), where m is the order of the polynomial and |E| is the number of edges. This avoids the O(n^3) eigendecomposition.
- Sparsity: For typical sparse real-world graphs, the matrices ψ_s and ψ_s^{-1} are sparse. This makes the wavelet transform x̂ = ψ_s^{-1} x and the inverse transform x = ψ_s x̂ much more efficient than the dense matrix-vector multiplications involved in the GFT (O(n × non-zeros(ψ_s^{-1})) vs. O(n^2)). Experiments show that ψ_s^{-1} can be significantly sparser than U^⊤.
- Locality and Interpretability: Graph wavelets are localized in the vertex domain. Each wavelet ψ_{s,i} is centered at node i and represents signal diffusion away from it. This intrinsic locality translates into a localized graph convolution defined via the wavelet transform (Equation 3: x *_G h = ψ_s((ψ_s^{-1} x) ⊙ (ψ_s^{-1} h)); a small numerical sketch follows this list). The locality also yields better interpretability, as shown by analyzing which wavelets are active for different features.
- Flexible Neighborhood: The scaling parameter s adjusts the range of influence of the wavelets, effectively controlling the size of the neighborhood considered in the convolution in a continuous manner.
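To make these definitions concrete, here is a minimal numerical sketch of the exact, eigendecomposition-based construction of the heat-kernel wavelet basis and the wavelet-domain convolution of Equation 3 on a toy graph. Function names (heat_wavelet_basis, wavelet_convolution) are illustrative, and this deliberately uses the O(n^3) route that the paper's fast Chebyshev approximation is designed to avoid.

```python
import numpy as np

def heat_wavelet_basis(A, s=1.0):
    """Exact heat-kernel wavelet basis psi_s = U G_s U^T for a small graph.

    A : (n, n) symmetric adjacency matrix
    s : scaling parameter controlling the neighborhood size
    Returns (psi_s, psi_s_inv); the inverse simply flips the sign of s.
    """
    d = A.sum(axis=1)
    # Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))  # guard against isolated nodes
    L = np.eye(A.shape[0]) - d_inv_sqrt @ A @ d_inv_sqrt
    lam, U = np.linalg.eigh(L)                        # eigendecomposition (O(n^3))
    psi_s = U @ np.diag(np.exp(-s * lam)) @ U.T       # G_s = diag(e^{-s * lambda_i})
    psi_s_inv = U @ np.diag(np.exp(s * lam)) @ U.T    # heat kernel evaluated at -s
    return psi_s, psi_s_inv

def wavelet_convolution(x, h, psi_s, psi_s_inv):
    """Graph convolution in the wavelet domain (Equation 3):
    x *_G h = psi_s ((psi_s^{-1} x) ⊙ (psi_s^{-1} h))."""
    return psi_s @ ((psi_s_inv @ x) * (psi_s_inv @ h))

# Tiny example: a 4-node path graph
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
psi_s, psi_s_inv = heat_wavelet_basis(A, s=1.0)
x = np.array([1.0, 0.0, 0.0, 0.0])   # signal concentrated on node 0
h = np.ones(4)                        # trivial all-pass filter
print(wavelet_convolution(x, h, psi_s, psi_s_inv))
```

Because the heat kernel decays with distance on the graph, the output of the convolution remains concentrated around node 0, which is the locality property discussed above.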
Graph Wavelet Neural Network Architecture
A GWNN layer takes an input feature matrix X^m ∈ R^{n×p} and transforms it into an output matrix X^{m+1} ∈ R^{n×q}. The original layer definition (Equation 4) uses a spectral filter matrix for each pair of input and output features, leading to a large number of parameters, O(n × p × q).
To address the high parameter complexity, especially crucial for semi-supervised learning with limited labels, the paper introduces a key implementation technique: detaching feature transformation from graph convolution. Each layer is split into two stages:
- Feature Transformation: A standard linear transformation is applied to the input features: X^{m'} = X^m W, where W ∈ R^{p×q} is the weight matrix (Equation 8).
- Graph Convolution: The transformed features X^{m'} are convolved using the graph wavelet transform: X^{m+1} = h(ψ_s Σ^m ψ_s^{-1} X^{m'}), where Σ^m is a diagonal matrix of learnable kernel coefficients in the wavelet domain and h is a non-linear activation (Equation 9).
This separation reduces the parameter complexity per layer to O(p × q) for the feature-transformation weights W plus O(n) for the diagonal spectral kernel Σ^m, i.e., O(n + p × q) in total. This is significantly lower than O(n × p × q) and comparable to GCN's O(p × q), with only the added O(n) for the diagonal kernel.
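A minimal sketch of one detached GWNN layer under these definitions (names like gwnn_layer, theta, and psi_s_inv are illustrative; ψ_s and ψ_s^{-1} are assumed to be precomputed, e.g. as in the earlier sketch):

```python
import numpy as np

def gwnn_layer(X, W, theta, psi_s, psi_s_inv,
               activation=lambda z: np.maximum(z, 0.0)):
    """One detached GWNN layer (a sketch of Eqs. 8-9).

    X      : (n, p) input node features
    W      : (p, q) feature-transformation weights         -> O(p*q) parameters
    theta  : (n,)   diagonal wavelet-domain kernel Sigma^m  -> O(n) parameters
    psi_s, psi_s_inv : precomputed (ideally sparse) wavelet bases
    """
    X_prime = X @ W                                        # Eq. 8: feature transformation
    Z = psi_s @ (theta[:, None] * (psi_s_inv @ X_prime))   # Eq. 9: wavelet-domain convolution
    return activation(Z)
```

Keeping the kernel Σ^m diagonal is what collapses the per-layer parameter count to O(n + p × q), as described above.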
For semi-supervised node classification, the paper uses a two-layer GWNN:
- Layer 1: ReLU activation (Equation 5)
- Layer 2: Softmax activation for class probabilities (Equation 6)
The model is trained using cross-entropy loss on the labeled nodes.
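Reusing the gwnn_layer sketch above, a hedged sketch of the two-layer forward pass and the masked cross-entropy objective on labeled nodes could look as follows (all names are illustrative; no training loop is shown):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gwnn_forward(X, params, psi_s, psi_s_inv):
    """Two-layer GWNN for semi-supervised node classification (sketch).
    params = (W1, theta1, W2, theta2), with shapes as in gwnn_layer."""
    W1, theta1, W2, theta2 = params
    H = gwnn_layer(X, W1, theta1, psi_s, psi_s_inv)                        # layer 1, ReLU (Eq. 5)
    logits = gwnn_layer(H, W2, theta2, psi_s, psi_s_inv,
                        activation=lambda z: z)                            # layer 2, pre-softmax
    return softmax(logits)                                                 # class probabilities (Eq. 6)

def masked_cross_entropy(probs, labels_onehot, labeled_mask):
    """Cross-entropy evaluated only on the labeled nodes."""
    ce = -(labels_onehot * np.log(probs + 1e-12)).sum(axis=1)
    return ce[labeled_mask].mean()
```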
Implementation Considerations and Experiments
- Fast Wavelet Computation: The practical implementation relies on the Chebyshev polynomial approximation of ψ_s and ψ_s^{-1} to avoid eigendecomposition (Appendix D); see the sketch after this list.
- Sparsity Threshold: For computational efficiency, elements of ψ_s and ψ_s^{-1} smaller than a threshold t are set to zero.
- Hyperparameter Tuning: The scale parameter s and the sparsity threshold t are tuned on a validation set. The paper observes that accuracy generally increases with s up to a point and then decreases, while t has less influence (Appendix B).
- Datasets: Experiments are conducted on standard citation network datasets: Cora, Citeseer, and Pubmed, using the same semi-supervised split as GCN (20 labels per class for training).
- Baselines: GWNN is compared against traditional methods, spectral methods (Spectral CNN, ChebyNet, GCN), and spatial methods (MoNet).
- Results:
- Detaching feature transformation is shown to be effective, especially on datasets with low label rates like Pubmed, improving accuracy and significantly reducing parameters (Table 2).
- GWNN consistently outperforms Spectral CNN, ChebyNet, GCN, and MoNet on node classification accuracy across all three datasets (Table 3).
- Sparsity analysis confirms that wavelet transform matrices and projected signals are much sparser than their Fourier counterparts on the Cora dataset (Table 4).
- Interpretability analysis demonstrates how the localized nature of wavelets allows interpreting projected signals as correlations between features (words) and nodes (documents), with top-activating nodes for a specific word concentrating in relevant parts of the graph (Figure 3).
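As referenced above, the following is a minimal sketch of the Chebyshev-polynomial route to the heat-kernel wavelets, together with the sparsity threshold t. It follows the general approach of Hammond et al. rather than reproducing the paper's exact Appendix D procedure; the function name, default order, and threshold value are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import identity, csr_matrix
from scipy.sparse.linalg import eigsh

def chebyshev_heat_wavelets(L, s, order=20, threshold=1e-4):
    """Approximate the heat-kernel wavelet matrix psi_s ~= exp(-s L)
    with a truncated Chebyshev expansion, avoiding eigendecomposition.

    L         : sparse symmetric normalized Laplacian (n x n, scipy format)
    s         : wavelet scale
    order     : polynomial order m
    threshold : entries below this value are zeroed to keep psi_s sparse
    psi_s^{-1} can be approximated the same way by passing -s (heat kernel e^{+s*lambda}).
    """
    n = L.shape[0]
    lmax = eigsh(L, k=1, return_eigenvectors=False)[0]  # only the largest eigenvalue is needed

    # Chebyshev coefficients of g(x) = exp(-s x) on [0, lmax],
    # computed by collocation at Chebyshev points.
    N = max(order + 1, 40)
    theta = np.pi * (np.arange(N) + 0.5) / N
    x = 0.5 * lmax * (np.cos(theta) + 1.0)
    c = np.array([2.0 / N * np.sum(np.exp(-s * x) * np.cos(k * theta))
                  for k in range(order + 1)])

    # Evaluate g(L) via the Chebyshev recurrence on the shifted Laplacian.
    # Applying this recurrence to a single signal costs O(m * |E|);
    # the full matrix is materialized here only for clarity.
    L_tilde = (2.0 / lmax) * L - identity(n, format="csr")
    T_prev = identity(n, format="csr")   # T_0
    T_curr = L_tilde.copy()              # T_1
    psi = 0.5 * c[0] * T_prev + c[1] * T_curr
    for k in range(2, order + 1):
        T_next = 2.0 * L_tilde @ T_curr - T_prev
        psi = psi + c[k] * T_next
        T_prev, T_curr = T_curr, T_next

    # Sparsify: drop near-zero entries below the threshold t.
    psi = csr_matrix(psi)
    psi.data[np.abs(psi.data) < threshold] = 0.0
    psi.eliminate_zeros()
    return psi
```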
Practical Implications and Applications
The GWNN model provides a practical approach for applying deep learning to graph-structured data, particularly in semi-supervised settings where labeled data is scarce. Its key strengths for implementation are:
- Scalability: The use of efficient wavelet computation (via the Chebyshev approximation) and the detached architecture make GWNN applicable to larger graphs than standard spectral methods that require a full eigendecomposition. The O(m × |E|) complexity of the wavelet basis computation and the efficient sparse matrix multiplications are key here.
- Improved Performance: The experiments demonstrate state-of-the-art performance on benchmark node classification tasks, suggesting that graph wavelets provide a more suitable basis for defining graph convolution than Fourier bases or their polynomial approximations.
- Reduced Overfitting: The significantly reduced parameter count due to the detached architecture helps mitigate overfitting, which is crucial in semi-supervised scenarios with limited labels.
- Interpretability: The localized nature of wavelets offers insights into how features relate to nodes and how information propagates through the network during convolution. This can be valuable for debugging and understanding model predictions.
GWNN can be applied to various graph-based tasks beyond semi-supervised classification, such as graph regression, link prediction, and graph representation learning, especially when locality, interpretability, and efficiency on potentially large, sparse graphs are important considerations. However, like other spectral methods, GWNN is inherently transductive: it is tied to a fixed graph structure through the precomputed wavelet bases, although extensions to inductive settings may be possible by adapting the wavelet computation or incorporating sampling strategies. The memory required to store the potentially large ψ_s and ψ_s^{-1} matrices (even when sparse) can also be a limiting factor for extremely large graphs, motivating strategies such as computing and storing only selected wavelets or using on-the-fly approximations.
Related Papers
- Semi-Supervised Classification with Graph Convolutional Networks (2016)
- Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering (2016)
- Diffusion Improves Graph Learning (2019)
- Wavelets on Graphs via Spectral Graph Theory (2009)
- Learning Convolutional Neural Networks for Graphs (2016)