On the Connection Between MPNN and Graph Transformer (2301.11956v4)
Abstract: The Graph Transformer (GT) has recently emerged as a new paradigm for graph learning, outperforming the previously popular Message Passing Neural Network (MPNN) on multiple benchmarks. Previous work (Kim et al., 2022) shows that, with proper positional embeddings, GT can approximate MPNN arbitrarily well, implying that GT is at least as powerful as MPNN. In this paper, we study the inverse connection and show that MPNN with a virtual node (VN), a commonly used heuristic with little theoretical understanding, is powerful enough to arbitrarily approximate the self-attention layer of GT. In particular, we first show that for one type of linear transformer, the so-called Performer/Linear Transformer (Choromanski et al., 2020; Katharopoulos et al., 2020), MPNN + VN with only O(1) depth and O(1) width can approximate a self-attention layer. Next, via a connection between MPNN + VN and DeepSets, we prove that MPNN + VN with O(n^d) width and O(1) depth can approximate the self-attention layer arbitrarily well, where d is the input feature dimension. Lastly, under some assumptions, we provide an explicit construction of MPNN + VN with O(1) width and O(n) depth that approximates the self-attention layer in GT arbitrarily well. On the empirical side, we demonstrate that 1) MPNN + VN is a surprisingly strong baseline, outperforming GT on the recently proposed Long Range Graph Benchmark (LRGB) dataset, 2) our MPNN + VN improves over earlier implementations on a wide range of OGB datasets, and 3) MPNN + VN outperforms Linear Transformer and MPNN on the climate modeling task.
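To make the first claim concrete, here is a minimal NumPy sketch, not the paper's construction: the positive feature map `phi` and the helper names `linear_attention` and `mpnn_vn_round` are illustrative assumptions. It shows why Performer/Linear-Transformer attention is easy for a virtual node to reproduce: the only global quantities the attention needs, the sums over nodes of phi(k_j) v_j^T and of phi(k_j), can be aggregated at the virtual node in a single round and broadcast back, after which each node finishes the computation with its own query.

```python
import numpy as np

def phi(x):
    # Illustrative positive feature map; the actual kernel feature map in
    # Performer/Linear Transformer differs (this is an assumption for the sketch).
    return np.maximum(x, 0.0) + 1e-6

def linear_attention(Q, K, V):
    """Performer/Linear-Transformer-style attention:
    out_i = phi(q_i)^T (sum_j phi(k_j) v_j^T) / (phi(q_i)^T sum_j phi(k_j))."""
    Qf, Kf = phi(Q), phi(K)          # (n, m) kernel features
    S = Kf.T @ V                     # (m, d) global key-value summary
    z = Kf.sum(axis=0)               # (m,)   global normalizer
    return (Qf @ S) / (Qf @ z)[:, None]

def mpnn_vn_round(Q, K, V):
    """One aggregate-then-broadcast round through a virtual node.
    The VN stores the same two global sums; each node then combines
    them with its own query, reproducing linear attention exactly."""
    Qf, Kf = phi(Q), phi(K)
    n = Q.shape[0]
    vn_S = sum(np.outer(Kf[j], V[j]) for j in range(n))  # node -> VN messages
    vn_z = sum(Kf[j] for j in range(n))
    # VN -> node broadcast, followed by a local update at each node.
    return np.stack([(Qf[i] @ vn_S) / (Qf[i] @ vn_z) for i in range(n)])

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(6, 4)), rng.normal(size=(6, 4)), rng.normal(size=(6, 3))
assert np.allclose(linear_attention(Q, K, V), mpnn_vn_round(Q, K, V))
```

The sketch only illustrates the intuition behind the O(1)-depth, O(1)-width result for linear attention; the full softmax self-attention layer requires the DeepSets-based wide construction or the deep O(n)-depth construction described above.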
- On the bottleneck of graph neural networks and its practical implications. arXiv preprint arXiv:2006.05205, 2020.
- Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
- How attentive are graph attention networks? arXiv preprint arXiv:2105.14491, 2021.
- A note on over-smoothing for graph neural networks. arXiv preprint arXiv:2006.13318, 2020.
- Structure-aware transformer for graph representation learning. In International Conference on Machine Learning, pp. 3469–3489. PMLR, 2022.
- Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
- Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
- Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
- Deep learning for physical processes: Incorporating prior scientific knowledge. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=By4HsfWAZ.
- Deep learning for physical processes: Incorporating prior scientific knowledge. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124009, 2019. doi: 10.1088/1742-5468/ab3195. URL https://dx.doi.org/10.1088/1742-5468/ab3195.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- A generalization of transformer networks to graphs. arXiv preprint arXiv:2012.09699, 2020.
- Long range graph benchmark. arXiv preprint arXiv:2206.08164, 2022.
- ConViT: Improving vision transformers with soft convolutional inductive biases. In International Conference on Machine Learning, pp. 2286–2296. PMLR, 2021.
- Transformers meet directed graphs. arXiv preprint arXiv:2302.00049, 2023.
- Neural message passing for quantum chemistry. In International Conference on Machine Learning, pp. 1263–1272. PMLR, 2017.
- A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
- Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, 33:22118–22133, 2020.
- OGB-LSC: A large-scale challenge for machine learning on graphs. arXiv preprint arXiv:2103.09430, 2021.
- Improvements of the daily optimum interpolation sea surface temperature (DOISST) version 2.1. Journal of Climate, 34(8):2923–2939, 2021. doi: 10.1175/JCLI-D-20-0166.1. URL https://journals.ametsoc.org/view/journals/clim/34/8/JCLI-D-20-0166.1.xml.
- Global self-attention as a replacement for graph convolution. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 655–665, 2022.
- An analysis of virtual nodes in graph neural networks for link prediction. In Learning on Graphs Conference, 2022.
- AMMUS: A survey of transformer-based pretrained models in natural language processing. arXiv preprint arXiv:2108.05542, 2021.
- Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pp. 5156–5165. PMLR, 2020. URL https://arxiv.org/abs/2006.16236.
- Pure transformers are powerful graph learners. arXiv preprint arXiv:2207.02505, 2022.
- Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
- Rethinking graph transformers with spectral attention. Advances in Neural Information Processing Systems, 34:21618–21629, 2021.
- Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Sign and basis invariant networks for spectral graph representation learning. arXiv preprint arXiv:2202.13013, 2022.
- Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022, 2021.
- Invariant and equivariant graph networks. arXiv preprint arXiv:1812.09902, 2018.
- GraphiT: Encoding graph structure in transformers. arXiv preprint arXiv:2106.05667, 2021.
- Attending to graph transformers. arXiv preprint arXiv:2302.04181, 2023.
- Janossy pooling: Learning deep permutation-invariant functions for variable-size inputs. arXiv preprint arXiv:1811.01900, 2018.
- Graph neural networks exponentially lose expressive power for node classification. arXiv preprint arXiv:1905.10947, 2019.
- GRPE: Relative positional encoding for graph transformer. In ICLR 2022 Machine Learning for Drug Discovery, 2022.
- PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660, 2017.
- Recipe for a general, powerful, scalable graph transformer. arXiv preprint arXiv:2205.12454, 2022.
- Daily high-resolution blended analyses for sea surface temperature. Journal of Climate, 20:5473–5496, 2007.
- A simple neural network module for relational reasoning. Advances in Neural Information Processing Systems, 30, 2017.
- On universal equivariant set networks. arXiv preprint arXiv:1910.02421, 2019.
- Benchmarking Graphormer on large-scale molecular modeling datasets. arXiv preprint arXiv:2203.04810, 2022.
- Efficient transformers: A survey. ACM Computing Surveys (CSUR), 2020.
- Understanding over-squashing and bottlenecks on graphs via curvature. arXiv preprint arXiv:2111.14522, 2021.
- Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
- Universal approximation of functions on sets. Journal of Machine Learning Research, 23(151):1–56, 2022.
- Towards physics-informed deep learning for turbulent flow prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1457–1466, 2020a. doi: 10.1145/3394486.3403198.
- Meta-learning dynamics forecasting using task inference. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=BsSP7pZGFQO.
- Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020b.
- Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, 2020.
- NodeFormer: A scalable graph structure learning transformer for node classification. In Advances in Neural Information Processing Systems, 2022.
- Representing long-range context for graph neural networks with global attention. Advances in Neural Information Processing Systems, 34:13266–13279, 2021.
- Revisiting over-smoothing in deep GCNs. arXiv preprint arXiv:2003.13663, 2020.
- Do transformers really perform badly for graph representation? Advances in Neural Information Processing Systems, 34:28877–28888, 2021.
- Deep sets. Advances in Neural Information Processing Systems, 30, 2017.
- PairNorm: Tackling oversmoothing in GNNs. arXiv preprint arXiv:1909.12223, 2019.
- Exponential separations in symmetric neural networks. arXiv preprint arXiv:2206.01266, 2022.