Linear Log-Normal Attention with Unbiased Concentration (2311.13541v4)
Abstract: Transformer models have achieved remarkable results across a wide range of applications. However, their scalability is hampered by the quadratic time and memory complexity of the self-attention mechanism with respect to the sequence length. This limitation poses a substantial obstacle when dealing with long documents or high-resolution images. In this work, we study the self-attention mechanism by analyzing the distribution of the attention matrix and its concentration ability. Furthermore, we propose tools to measure these quantities and introduce a novel self-attention mechanism, Linear Log-Normal Attention, designed to emulate the distribution and concentration behavior of the original self-attention. Our experimental results on popular natural language benchmarks show that the proposed Linear Log-Normal Attention outperforms other linearized attention alternatives, offering a promising avenue for enhancing the scalability of transformer models.
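To make the complexity claim in the abstract concrete, below is a minimal NumPy sketch contrasting standard softmax attention, whose cost is quadratic in the sequence length, with a generic kernelized (linearized) attention that uses associativity to run in linear time. The `elu(x)+1` feature map here is a common placeholder assumption, not the log-normal-matching map proposed in the paper; the sketch only illustrates how linearized attention avoids forming the full n-by-n attention matrix.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard softmax attention: O(n^2) time and memory in sequence length n."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (n, n) attention logits
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # row-stochastic attention matrix
    return A @ V

def linear_attention(Q, K, V, phi=lambda x: np.where(x > 0, x + 1.0, np.exp(x))):
    """Kernelized attention: O(n * d^2) via (phi(Q) phi(K)^T) V = phi(Q) (phi(K)^T V).

    phi is a placeholder feature map (elu(x)+1); the paper instead designs the
    mapping so the resulting attention emulates the distribution and
    concentration behavior of softmax attention.
    """
    Qp, Kp = phi(Q), phi(K)        # (n, d) feature-mapped queries and keys
    KV = Kp.T @ V                  # (d, d) summary; never materializes an (n, n) matrix
    Z = Qp @ Kp.sum(axis=0)        # (n,) per-query normalizer
    return (Qp @ KV) / Z[:, None]

# Toy check on random inputs: both variants return (n, d) outputs.
rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = rng.standard_normal((3, n, d))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)
```

The design point illustrated is that reordering the matrix products replaces the (n, n) attention matrix with a (d, d) summary, so memory and time scale linearly with n; the quality of the approximation then hinges entirely on the choice of feature map, which is the question the paper addresses.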
Authors: Yury Nahshan, Joseph Kampeas, Emir Haleva