Sumformer: Universal Approximation for Efficient Transformers (2307.02301v1)
Abstract: Natural language processing (NLP) made an impressive jump with the introduction of Transformers. ChatGPT is one of the most famous examples, changing the perception of the possibilities of AI even outside the research community. However, despite this impressive performance, the quadratic time and space complexity of Transformers with respect to sequence length poses significant limitations for handling long sequences. While efficient Transformer architectures such as Linformer and Performer, which have linear complexity, have emerged as promising solutions, their theoretical understanding remains limited. In this paper, we introduce Sumformer, a novel and simple architecture capable of universally approximating equivariant sequence-to-sequence functions. We use Sumformer to give the first universal approximation results for Linformer and Performer. Moreover, we derive a new proof for Transformers, showing that just one attention layer is sufficient for universal approximation.
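To make the abstract's claim concrete, below is a minimal PyTorch sketch of a Sumformer-style block, assuming the architecture aggregates the sequence by a plain sum (in the spirit of Deep Sets): each output token is computed as psi(x_i, sum_j phi(x_j)), which is permutation-equivariant and costs only O(n) in the sequence length. The class and parameter names (SumformerBlock, phi, psi, d_sum) are illustrative, not the paper's reference code.

```python
# Minimal sketch of a sum-aggregation ("Sumformer"-style) block.
# Assumption: output_i = psi(x_i, sum_j phi(x_j)), with phi and psi token-wise MLPs.
import torch
import torch.nn as nn


class SumformerBlock(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, d_sum: int):
        super().__init__()
        # phi: token-wise map whose outputs are summed over the sequence
        self.phi = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_sum)
        )
        # psi: token-wise map applied to each token together with the shared sum
        self.psi = nn.Sequential(
            nn.Linear(d_model + d_sum, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        s = self.phi(x).sum(dim=1, keepdim=True)   # (batch, 1, d_sum); linear in seq_len
        s = s.expand(-1, x.size(1), -1)            # broadcast the shared summary to every token
        return self.psi(torch.cat([x, s], dim=-1)) # token-wise map -> permutation-equivariant output


# Usage: permuting the input tokens permutes the output in the same way.
x = torch.randn(2, 16, 32)
block = SumformerBlock(d_model=32, d_hidden=64, d_sum=8)
perm = torch.randperm(16)
out = block(x)
assert torch.allclose(block(x[:, perm]), out[:, perm], atol=1e-5)
```

The sum over phi(x_j) is invariant under reordering of the tokens, while psi acts on each token separately; together this yields equivariance without any pairwise (quadratic) interaction, which is what distinguishes this construction from standard softmax attention.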
- Barron, A. R. Approximation and estimation bounds for artificial neural networks. Machine Learning, 14:115–133, 1994.
- Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
- Optimal approximation with sparsely connected deep neural networks. SIAM Journal on Mathematics of Data Science, 1(1):8–45, 2019.
- Briand, E. When is the algebra of multisymmetric polynomials generated by the elementary multisymmetric polynomials? Beiträge zur Algebra und Geometrie / Contributions to Algebra and Geometry, 45(2):353–368, 2004.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Representation theorem for multivariable totally symmetric functions. arXiv preprint arXiv:2211.15958, 2022.
- Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems, 34:15084–15097, 2021.
- Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
- Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
- BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Deep learning: new computational modelling techniques for genomics. Nature Reviews Genetics, 20(7):389–403, 2019.
- Error bounds for approximations with deep ReLU neural networks in W^{s,p} norms. Analysis and Applications, 18(05):803–859, 2020.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
- Hutter, M. On representing (anti) symmetric functions. arXiv preprint arXiv:2007.15298, 2020.
- Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021.
- Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pp. 5156–5165. PMLR, 2020.
- Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
- On the expressive power of self-attention matrices. arXiv preprint arXiv:2106.03764, 2021.
- The expressive power of neural networks: A view from the width. Advances in neural information processing systems, 30, 2017.
- Mhaskar, H. N. Neural networks for optimal approximation of smooth and analytic functions. Neural computation, 8(1):164–177, 1996.
- Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108:296–330, 2018.
- Provable approximation properties for deep neural networks. Applied and Computational Harmonic Analysis, 44(3):537–557, 2018.
- Long range arena: A benchmark for efficient transformers, 2020.
- Efficient transformers: A survey. ACM Computing Surveys, 55(6):1–28, 2022.
- Attention is all you need. Advances in neural information processing systems, 30, 2017.
- On the limitations of representing functions on sets. In International Conference on Machine Learning, pp. 6487–6494. PMLR, 2019.
- Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
- Transformers in time series: A survey. arXiv preprint arXiv:2202.07125, 2022.
- Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.
- Yarotsky, D. Universal approximations of invariant maps by neural networks. Constructive Approximation, 55(1):407–474, 2022.
- Are transformers universal approximators of sequence-to-sequence functions? arXiv preprint arXiv:1912.10077, 2019.
- O(n) connections are expressive enough: Universal approximability of sparse transformers. Advances in Neural Information Processing Systems, 33:13783–13794, 2020.
- Deep sets. Advances in neural information processing systems, 30, 2017.
- Big Bird: Transformers for longer sequences. Advances in neural information processing systems, 33:17283–17297, 2020.