Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators? (2307.14023v3)
Abstract: Existing analyses of the expressive capacity of Transformer models have required excessively deep layers for data memorization, leading to a discrepancy with the Transformers actually used in practice. This is primarily due to the interpretation of the softmax function as an approximation of the hardmax function. By clarifying the connection between the softmax function and the Boltzmann operator, we prove that a single layer of self-attention with low-rank weight matrices possesses the capability to perfectly capture the context of an entire input sequence. As a consequence, we show that one-layer and single-head Transformers have a memorization capacity for finite samples, and that Transformers consisting of one self-attention layer with two feed-forward neural networks are universal approximators for continuous permutation equivariant functions on a compact domain.
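To make the abstract's central idea concrete, here is a minimal numerical sketch of the link between softmax attention and the Boltzmann operator, assuming nothing beyond standard NumPy. It is not the paper's construction: the function names, the rank-1 parameterization `W_QK = u v^T`, and the one-dimensional value projection are assumptions chosen for illustration. The point it demonstrates is that when the query-key matrix has rank one, every row of softmax self-attention output is a Boltzmann-style weighted average of scalar token scores, so a single low-rank head can aggregate information from the entire sequence.

```python
import numpy as np

def boltzmann_operator(x, beta):
    """Boltzmann operator: the softmax(beta * x)-weighted average of the entries of x.
    As beta -> infinity it approaches max(x); as beta -> 0 it approaches the mean."""
    w = np.exp(beta * x - np.max(beta * x))  # subtract max for numerical stability
    w /= w.sum()
    return float(np.dot(w, x))

def rank_one_self_attention(X, u, v, w_v):
    """Single-head softmax attention whose query-key matrix is the rank-1 product
    W_QK = u v^T (hypothetical parameterization for this sketch).
    X: (n, d) token matrix; u, v, w_v: (d,) vectors.
    The attention logits A_ij = <u, x_i> <v, x_j> depend on the other tokens only
    through the scalar scores <v, x_j>, so each output row is a Boltzmann-style
    weighted average of the 1-D values <w_v, x_j>."""
    q = X @ u                              # per-token query-side scalar
    s = X @ v                              # per-token key-side scalar score
    logits = np.outer(q, s)                # rank-1 logits A_ij = <u, x_i><v, x_j>
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)      # row-wise softmax
    values = X @ w_v                       # 1-D value per token
    return P @ values                      # softmax-weighted averages, one per token

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 8))
    u, v, w_v = rng.normal(size=8), rng.normal(size=8), rng.normal(size=8)
    print(boltzmann_operator(X @ v, beta=2.0))     # Boltzmann average of the scores
    print(rank_one_self_attention(X, u, v, w_v))   # attention outputs, one per token
```

Because the logits factor as a product of two scalar projections, the softmax weights for token i depend on the rest of the sequence only through the scores <v, x_j>; this is the sense in which the abstract's low-rank single head can "capture the context of an entire input sequence" without a full-rank attention matrix.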
Authors: Tokio Kajitsuka, Issei Sato