On the Role of Attention Masks and LayerNorm in Transformers (2405.18781v2)
Abstract: Self-attention is the key mechanism of transformers, which are the essential building blocks of modern foundation models. Recent studies have shown that pure self-attention suffers from an increasing degree of rank collapse as depth increases, limiting model expressivity and hindering further utilization of model depth. The existing literature on rank collapse, however, has mostly overlooked other critical components in transformers that may alleviate the rank collapse issue. In this paper, we provide a general analysis of rank collapse under self-attention, taking into account the effects of attention masks and layer normalization (LayerNorm). In particular, we find that although pure masked attention still suffers from exponential collapse to a rank-one subspace, sparse or local masked attention can provably slow down the collapse rate. In the case of self-attention with LayerNorm, we first show that for certain classes of value matrices, collapse to a rank-one subspace still happens exponentially. However, by constructing nontrivial counterexamples, we then establish that with a proper choice of value matrices, a general class of sequences may not converge to a rank-one subspace, and the self-attention dynamics with LayerNorm can simultaneously possess a rich set of equilibria with any possible rank between one and full. Our result refutes the previous hypothesis that LayerNorm plays no role in the rank collapse of self-attention and suggests that self-attention with LayerNorm constitutes a much more expressive, versatile nonlinear dynamical system than was originally thought.
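The rank-collapse behavior described above can be probed numerically. Below is a minimal NumPy sketch, not the authors' code or experiments, that iterates pure softmax self-attention (no skip connections, no LayerNorm) and tracks a simple rank-one-collapse proxy, comparing a dense attention mask with a local banded mask. The dimensions, window width, random seed, and collapse metric are illustrative assumptions.

```python
# Illustrative sketch (not from the paper): iterate pure single-head softmax
# self-attention and measure how close the token matrix X gets to a rank-one
# matrix with identical rows, under a full mask vs. a local (banded) mask.
import numpy as np

def softmax(A, axis=-1):
    A = A - A.max(axis=axis, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=axis, keepdims=True)

def rank_one_residual(X):
    # Relative distance of X from the rank-one matrix with identical rows,
    # a simple proxy for collapse to a rank-one subspace.
    return np.linalg.norm(X - X.mean(axis=0, keepdims=True)) / np.linalg.norm(X)

def run(mask, n=32, d=16, depth=24, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    residuals = []
    for _ in range(depth):
        WQ, WK, WV = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
        logits = X @ WQ @ WK.T @ X.T / np.sqrt(d)
        logits = np.where(mask, logits, -np.inf)   # masked entries get zero attention weight
        X = softmax(logits) @ X @ WV               # pure attention: no residual, no LayerNorm
        residuals.append(rank_one_residual(X))
    return residuals

n = 32
full_mask = np.ones((n, n), dtype=bool)                                   # dense attention
local_mask = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :]) <= 2   # width-5 band

for name, mask in [("full", full_mask), ("local", local_mask)]:
    res = run(mask, n=n)
    print(f"{name:5s} attention, rank-one residual every 4 layers:",
          " ".join(f"{r:.2e}" for r in res[::4]))
```

Under such a toy setup, one would expect the residual to shrink rapidly for full attention and more slowly under the local mask, mirroring the qualitative claim that sparse or local masks slow the collapse rate; LayerNorm and specific value-matrix constructions, which the paper shows can prevent collapse altogether, are deliberately left out of this sketch.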
- Xinyi Wu
- Amir Ajorlou
- Yifei Wang
- Stefanie Jegelka
- Ali Jadbabaie