Setting the Record Straight on Transformer Oversmoothing (2401.04301v3)
Abstract: Transformer-based models have recently become wildly successful across a diverse set of domains. At the same time, recent work has shown empirically and theoretically that Transformers are inherently limited. Specifically, they argue that as model depth increases, Transformers oversmooth, i.e., inputs become more and more similar. A natural question is: How can Transformers achieve these successes given this shortcoming? In this work we test these observations empirically and theoretically and uncover a number of surprising findings. We find that there are cases where feature similarity increases but, contrary to prior results, this is not inevitable, even for existing pre-trained models. Theoretically, we show that smoothing behavior depends on the eigenspectrum of the value and projection weights. We verify this empirically and observe that the sign of layer normalization weights can influence this effect. Our analysis reveals a simple way to parameterize the weights of the Transformer update equations to influence smoothing behavior. We hope that our findings give ML researchers and practitioners additional insight into how to develop future Transformer-based models.
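Below is a minimal sketch of how one might probe the two quantities the abstract highlights: (1) whether token representations actually become more similar with depth in an existing pre-trained model, and (2) the eigenspectrum of the value and projection weights that the analysis ties to smoothing behavior. This is not the paper's exact protocol; the model choice (GPT-2 via Hugging Face `transformers`), the mean pairwise cosine-similarity metric, and the way the value weights are sliced out of GPT-2's packed `c_attn` matrix are all assumptions made for illustration.

```python
# Hedged sketch: probe oversmoothing in a pre-trained Transformer by tracking
# average pairwise cosine similarity of token features per layer, and inspect
# the eigenspectrum of each block's value/output-projection product.
# Assumptions: GPT-2 as the pre-trained model; heads and biases ignored;
# GPT-2 packs q, k, v in that order inside c_attn.
import torch
from transformers import GPT2Model, GPT2Tokenizer

model = GPT2Model.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

inputs = tokenizer("Transformers may or may not oversmooth.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

def mean_pairwise_cosine(h):
    """Average cosine similarity over all pairs of token vectors in one layer."""
    h = torch.nn.functional.normalize(h.squeeze(0), dim=-1)  # (seq, hidden)
    sim = h @ h.T                                            # (seq, seq)
    n = sim.shape[0]
    off_diag = sim.sum() - sim.diagonal().sum()
    return (off_diag / (n * (n - 1))).item()

# If tokens were collapsing toward a single direction, this value would drift
# toward 1 with depth; the paper's claim is that this is not inevitable.
for layer_idx, h in enumerate(outputs.hidden_states):
    print(f"layer {layer_idx:2d}: mean pairwise cosine = {mean_pairwise_cosine(h):.3f}")

# Eigenspectrum of the combined value / output-projection map per block.
# The value weights are taken as the last third of c_attn's columns (q, k, v order).
n_embd = model.config.n_embd
for layer_idx, block in enumerate(model.h):
    w_v = block.attn.c_attn.weight[:, 2 * n_embd:].detach()  # (hidden, hidden)
    w_o = block.attn.c_proj.weight.detach()                  # (hidden, hidden)
    eigvals = torch.linalg.eigvals(w_v @ w_o)
    print(f"layer {layer_idx:2d}: max |eigenvalue| = {eigvals.abs().max():.3f}")
```

Under the paper's framing, the relevant design choice is how these value/projection eigenvalues are parameterized; the sketch above only reads them out so that the per-layer similarity trend can be compared against the spectrum.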