Attention is Naturally Sparse with Gaussian Distributed Input (2404.02690v1)
Abstract: The computational intensity of LLMs is a critical bottleneck, primarily due to the $O(n^2)$ complexity of the attention mechanism in transformer architectures. Addressing this, sparse attention emerges as a key innovation, aiming to reduce computational load while maintaining model performance. This study presents a rigorous theoretical analysis of the sparsity in attention scores within LLMs, particularly under the framework of Gaussian inputs. By establishing a set of foundational assumptions and employing a methodical theoretical approach, we unravel the intrinsic characteristics of attention score sparsity and its implications for computational efficiency. Our main contribution lies in providing a detailed theoretical examination of how sparsity manifests in attention mechanisms, offering insights into the potential trade-offs between computational savings and model effectiveness. This work not only advances our understanding of sparse attention but also provides a scaffold for future research in optimizing the computational frameworks of LLMs, paving the way for more scalable and efficient AI systems.
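The abstract's central claim, that softmax attention scores computed from Gaussian queries and keys concentrate their mass on a small subset of entries per row, is easy to probe numerically. The sketch below is not taken from the paper: the sequence length, head dimension, and cutoffs are illustrative assumptions, and the paper's formal sparsity definition and proof conditions are not reproduced here. It simply samples Gaussian Q and K, forms the row-wise softmax attention scores, and reports two crude concentration measures.

```python
# Minimal numerical probe (illustrative, not the paper's setup): sample Gaussian
# queries/keys, compute row-wise softmax attention scores, and measure how
# concentrated each row's probability mass is. n, d, and the cutoffs below are
# arbitrary choices made for this sketch.
import numpy as np

rng = np.random.default_rng(0)
n, d = 4096, 64                              # sequence length, head dimension
Q = rng.standard_normal((n, d))              # Gaussian queries
K = rng.standard_normal((n, d))              # Gaussian keys

logits = Q @ K.T / np.sqrt(d)                # scaled dot-product logits
logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
A = np.exp(logits)
A /= A.sum(axis=1, keepdims=True)            # each row is an attention distribution

k = n // 100                                 # top 1% of entries per row
top_mass = np.sort(A, axis=1)[:, -k:].sum(axis=1)
frac_small = (A < 1.0 / n).mean()            # entries below the uniform level 1/n
print(f"mean mass in top {k}/{n} scores per row: {top_mass.mean():.3f}")
print(f"fraction of scores below 1/n:           {frac_small:.3f}")
```

Rescaling Q and K (and hence the variance of the logits) makes the concentration more or less pronounced; the paper's theoretical analysis makes the sparsity statement precise under its stated Gaussian-input assumptions, which this probe does not attempt to replicate.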
- Yichuan Deng
- Zhao Song
- Chiwun Yang