Attention is Naturally Sparse with Gaussian Distributed Input (2404.02690v1)

Published 3 Apr 2024 in cs.LG, cs.AI, and cs.CL

Abstract: The computational intensity of LLMs is a critical bottleneck, primarily due to the $O(n^2)$ complexity of the attention mechanism in transformer architectures. Addressing this, sparse attention emerges as a key innovation, aiming to reduce computational load while maintaining model performance. This study presents a rigorous theoretical analysis of the sparsity in attention scores within LLMs, particularly under the framework of Gaussian inputs. By establishing a set of foundational assumptions and employing a methodical theoretical approach, we unravel the intrinsic characteristics of attention score sparsity and its implications on computational efficiency. Our main contribution lies in providing a detailed theoretical examination of how sparsity manifests in attention mechanisms, offering insights into the potential trade-offs between computational savings and model effectiveness. This work not only advances our understanding of sparse attention but also provides a scaffold for future research in optimizing the computational frameworks of LLMs, paving the way for more scalable and efficient AI systems.
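
As a rough illustration of the setting the abstract describes (not the paper's own analysis or definitions), the following minimal NumPy sketch draws i.i.d. standard Gaussian queries and keys, forms the scaled dot-product attention matrix, and measures how concentrated each row's softmax scores are. The sequence length, head dimension, top-k cutoff, and threshold below are arbitrary illustrative choices, not values taken from the paper.

```python
import numpy as np

# Minimal sketch (not the paper's formal analysis): empirically check how
# concentrated softmax attention scores are when queries and keys are
# i.i.d. standard Gaussian, mirroring the paper's input assumption.
# n, d, k, and eps are illustrative choices, not values from the paper.
rng = np.random.default_rng(0)
n, d = 1024, 64                                  # sequence length, head dimension

Q = rng.standard_normal((n, d))                  # Gaussian queries
K = rng.standard_normal((n, d))                  # Gaussian keys

logits = Q @ K.T / np.sqrt(d)                    # scaled dot-product logits
logits -= logits.max(axis=1, keepdims=True)      # numerical stabilization
A = np.exp(logits)
A /= A.sum(axis=1, keepdims=True)                # row-stochastic attention matrix

# How much attention mass do the k largest scores in each row capture?
k = 32
topk_mass = np.sort(A, axis=1)[:, -k:].sum(axis=1)

# How many scores per row exceed a small threshold eps?
eps = 1.0 / n                                    # uniform-attention baseline
above_eps = (A > eps).sum(axis=1)

print(f"mean top-{k} mass per row:        {topk_mass.mean():.3f}")
print(f"mean #scores > {eps:.4f} per row: {above_eps.mean():.1f}")
```

The logits are divided by sqrt(d) so their scale does not grow with the head dimension, matching the standard scaled dot-product formulation; varying n, d, and the cutoffs gives a quick empirical feel for the kind of score concentration the paper studies theoretically.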

Authors (3)
  1. Yichuan Deng (21 papers)
  2. Zhao Song (253 papers)
  3. Chiwun Yang (14 papers)
Citations (4)