Graph Convolutions Enrich the Self-Attention in Transformers! (2312.04234v5)

Published 7 Dec 2023 in cs.LG and cs.AI

Abstract: Transformers, renowned for their self-attention mechanism, have achieved state-of-the-art performance across a wide range of tasks in natural language processing, computer vision, time-series modeling, and other domains. However, one challenge for deep Transformer models is the oversmoothing problem, in which representations across layers converge to indistinguishable values, causing significant performance degradation. We interpret the original self-attention as a simple graph filter and redesign it from a graph signal processing (GSP) perspective. We propose graph-filter-based self-attention (GFSA), which learns a more general yet effective graph filter at the cost of a complexity slightly larger than that of the original self-attention mechanism. We demonstrate that GFSA improves the performance of Transformers in various fields, including computer vision, natural language processing, graph-level tasks, speech recognition, and code classification.
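
As a concrete illustration of the GSP view sketched in the abstract, the snippet below treats the row-stochastic softmax attention matrix as a graph shift operator and passes the values through a learnable low-order polynomial filter instead of the attention matrix alone. This is a minimal sketch of the general idea, not the paper's exact GFSA formulation; the filter order `K`, the per-head coefficients `w0`, `w1`, `wK`, and the module name are assumptions made for demonstration, and PyTorch is assumed.

```python
import torch
import torch.nn as nn


class PolynomialGraphFilterAttention(nn.Module):
    """Self-attention where the softmax matrix A is viewed as a graph
    shift operator and the output uses a learnable polynomial filter
    w0*I + w1*A + wK*A^K rather than A alone.

    Hedged illustration of the graph-signal-processing interpretation
    described in the abstract; not the paper's exact GFSA definition.
    """

    def __init__(self, dim: int, num_heads: int = 8, K: int = 3):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.K = K
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Learnable polynomial coefficients (one set per head), initialized
        # so the filter starts as ordinary self-attention (w1 = 1).
        self.w0 = nn.Parameter(torch.zeros(num_heads))
        self.w1 = nn.Parameter(torch.ones(num_heads))
        self.wK = nn.Parameter(torch.zeros(num_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, H, N, d)
        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        A = scores.softmax(dim=-1)                     # row-stochastic "adjacency"
        # Higher-order term A^K, computed by repeated multiplication.
        A_K = A
        for _ in range(self.K - 1):
            A_K = A_K @ A
        I = torch.eye(N, device=x.device).expand(B, self.num_heads, N, N)
        w0 = self.w0.view(1, -1, 1, 1)
        w1 = self.w1.view(1, -1, 1, 1)
        wK = self.wK.view(1, -1, 1, 1)
        H_filter = w0 * I + w1 * A + wK * A_K          # polynomial graph filter
        out = (H_filter @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

With `w1 = 1` and `w0 = wK = 0` at initialization, the module behaves like a standard multi-head attention block, so the extra cost over vanilla self-attention is limited to one additional matrix power and a few scalars per head, consistent with the abstract's remark that GFSA's complexity is only slightly larger than that of the original mechanism.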
