Linear Log-Normal Attention with Unbiased Concentration (2311.13541v4)

Published 22 Nov 2023 in cs.LG and cs.AI

Abstract: Transformer models have achieved remarkable results in a wide range of applications. However, their scalability is hampered by the quadratic time and memory complexity of the self-attention mechanism concerning the sequence length. This limitation poses a substantial obstacle when dealing with long documents or high-resolution images. In this work, we study the self-attention mechanism by analyzing the distribution of the attention matrix and its concentration ability. Furthermore, we propose instruments to measure these quantities and introduce a novel self-attention mechanism, Linear Log-Normal Attention, designed to emulate the distribution and concentration behavior of the original self-attention. Our experimental results on popular natural language benchmarks reveal that our proposed Linear Log-Normal Attention outperforms other linearized attention alternatives, offering a promising avenue for enhancing the scalability of transformer models.
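To make the scalability claim concrete, the sketch below contrasts standard softmax attention, which materializes an n x n attention matrix, with the generic linearized-attention family that Linear Log-Normal Attention belongs to, where a feature map phi lets the key-value product be computed once so the cost grows linearly in sequence length. This is a minimal NumPy illustration under assumed shapes; the elementwise-exponential feature map `phi` is an illustrative stand-in, not the moment-matched map the paper derives to reproduce the log-normal distribution and concentration of softmax attention.

```python
# Minimal sketch (not the authors' exact method): quadratic softmax attention
# versus the generic linearized-attention family the paper builds on.
import numpy as np

def softmax_attention(Q, K, V):
    # O(n^2) time and memory: the full n x n attention matrix is materialized.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linearized_attention(Q, K, V, phi=lambda x: np.exp(x)):
    # O(n) in sequence length: phi(K)^T V is a d' x d matrix computed once,
    # so the n x n attention matrix is never formed.
    Qf, Kf = phi(Q), phi(K)               # n x d'
    kv = Kf.T @ V                         # d' x d
    normalizer = Qf @ Kf.sum(axis=0)      # length-n normalization terms
    return (Qf @ kv) / normalizer[:, None]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 8, 4
    Q, K, V = rng.normal(size=(3, n, d))
    print(softmax_attention(Q, K, V).shape)     # (8, 4)
    print(linearized_attention(Q, K, V).shape)  # (8, 4)
```

The O(n) computation pattern is the same for any choice of phi; designing a feature map whose resulting attention weights match the distribution and concentration behavior of softmax attention is the question the paper addresses.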

Authors (3)
  1. Yury Nahshan (6 papers)
  2. Joseph Kampeas (5 papers)
  3. Emir Haleva (4 papers)
Citations (5)