Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators? (2307.14023v3)

Published 26 Jul 2023 in cs.LG

Abstract: Existing analyses of the expressive capacity of Transformer models have required excessively deep layers for data memorization, leading to a discrepancy with the Transformers actually used in practice. This is primarily due to the interpretation of the softmax function as an approximation of the hardmax function. By clarifying the connection between the softmax function and the Boltzmann operator, we prove that a single layer of self-attention with low-rank weight matrices possesses the capability to perfectly capture the context of an entire input sequence. As a consequence, we show that one-layer and single-head Transformers have a memorization capacity for finite samples, and that Transformers consisting of one self-attention layer with two feed-forward neural networks are universal approximators for continuous permutation equivariant functions on a compact domain.

Authors (2)
  1. Tokio Kajitsuka (2 papers)
  2. Issei Sato (82 papers)
Citations (11)

Summary

Analyzing the Expressive Capacity of Transformers with One Self-Attention Layer

The paper "Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators?" analyzes the expressive capacity of Transformers when their complexity is reduced to a single layer. It challenges the prevailing assumption that Transformers need many layers and attention heads to be universal approximators, proving instead that a one-layer, single-head Transformer with low-rank weight matrices, together with two feed-forward networks, is a universal approximator for continuous permutation equivariant functions on a compact domain.
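
To make the architecture class concrete, the following is a minimal NumPy sketch of a one-layer, single-head Transformer whose attention weight matrices are low-rank products, placed between two token-wise feed-forward networks. The dimensions, rank, initialization, and residual layout here are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class OneLayerTransformer:
    """Single-head softmax self-attention with low-rank weights, preceded and
    followed by token-wise feed-forward networks (illustrative shapes only)."""

    def __init__(self, d_model=16, rank=1, d_ff=64, seed=0):
        rng = np.random.default_rng(seed)
        # Low-rank attention parameters: query/key projections of width `rank`,
        # and a value matrix formed as a low-rank product.
        self.WQ = rng.normal(size=(d_model, rank))
        self.WK = rng.normal(size=(d_model, rank))
        self.WV = rng.normal(size=(d_model, rank)) @ rng.normal(size=(rank, d_model))
        # Two token-wise (position-wise) ReLU feed-forward networks.
        self.W1a, self.W1b = rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model))
        self.W2a, self.W2b = rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model))

    def ffn(self, X, Wa, Wb):
        return np.maximum(X @ Wa, 0.0) @ Wb  # applied independently to each token

    def forward(self, X):
        # X: (n_tokens, d_model); no positional encoding is used.
        X = X + self.ffn(X, self.W1a, self.W1b)
        scores = (X @ self.WQ) @ (X @ self.WK).T / np.sqrt(self.WQ.shape[1])
        X = X + softmax(scores, axis=-1) @ (X @ self.WV)  # single-head softmax attention
        return X + self.ffn(X, self.W2a, self.W2b)

tokens = np.random.default_rng(1).normal(size=(5, 16))
print(OneLayerTransformer().forward(tokens).shape)  # (5, 16)
```

Because every map is applied token-wise or through attention without positional encodings, permuting the input rows permutes the output rows identically, matching the permutation equivariance assumed in the approximation result.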

The authors begin with the contextual mapping capability of self-attention, i.e., its ability to capture dependencies across an entire input sequence. They observe that prior analyses required deep architectures or many attention heads largely because they interpreted the softmax function as an approximation of the hardmax function. By instead clarifying the relation between the softmax function and the Boltzmann operator, they prove that softmax-based attention, even with low-rank weight matrices, can serve as a contextual mapping.
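
For concreteness, the Boltzmann operator (Asadi and Littman, ICML 2017) is a softmax-weighted average, which is exactly the form of a softmax attention output; the identity below is a sketch of this connection rather than the paper's full argument, with $\beta$ an inverse-temperature parameter:

$$\mathrm{boltz}_{\beta}(x) \;=\; \frac{\sum_{i=1}^{n} x_i\, e^{\beta x_i}}{\sum_{j=1}^{n} e^{\beta x_j}} \;=\; \sum_{i=1}^{n} \operatorname{softmax}(\beta x)_i\, x_i .$$

As $\beta \to \infty$ this tends to $\max_i x_i$, which corresponds to hardmax attention; for finite $\beta$ it depends smoothly on every entry of $x$, which is what lets a single softmax attention layer distinguish whole input contexts rather than only their maxima.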

The research highlights a sharp contrast: a single hardmax-based attention layer cannot realize the contextual mappings needed here, whereas a single softmax layer can, because its output is a probability-weighted average, in the sense of the Boltzmann operator, that depends on the whole distribution of attention scores rather than only their maximum. The paper shows that this capability gives one-layer, single-head Transformers memorization capacity for finite samples, in contrast to previous analyses that required substantially larger architectures.
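
A toy numerical example (with made-up scores and values, not taken from the paper) illustrates the point: two score vectors with the same argmax are indistinguishable under hardmax attention but produce different softmax-weighted averages.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

values = np.array([1.0, 2.0, 3.0])      # values being attended over
scores_a = np.array([0.1, 0.2, 1.5])    # two score vectors sharing the same argmax
scores_b = np.array([1.4, 0.2, 1.5])

hard = lambda s: values[np.argmax(s)]   # hardmax attention: only the argmax survives
soft = lambda s: softmax(s) @ values    # softmax attention: every score contributes

print(hard(scores_a), hard(scores_b))   # identical outputs: contexts are conflated
print(soft(scores_a), soft(scores_b))   # distinct outputs: contexts are separated
```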

These findings bear directly on memory- and resource-efficient deployment. They suggest that the depth and the many attention heads of practical Transformer models are not strictly necessary for universal approximation of continuous permutation equivariant functions, which has implications for the design and computational efficiency of neural models across applications.

On the practical side, the result could influence how Transformer architectures are trained and tuned when computational resources are constrained, providing theoretical support for lighter-weight models that retain the expressive power usually attributed to deeper ones.

Finally, the paper is theoretical, and it leaves open empirical questions about how such simplified architectures perform on real-world tasks, from natural language processing to multi-modal data analysis, and about how these results should inform resource allocation and system design.

In conclusion, this work sharpens our understanding of Transformers' expressive capacity, reducing the perceived need for deep, multilayer structures while identifying minimal architectures that retain universal approximation. Future work could extend the theoretical framework toward the optimization and training of such models, aiming at practical deployment with low computational overhead.