Simple linear attention language models balance the recall-throughput tradeoff (2402.18668v1)
Abstract: Recent work has shown that attention-based LLMs excel at recall, the ability to ground generations in tokens previously seen in context. However, the efficiency of attention-based models is bottlenecked during inference by the KV-cache's aggressive memory consumption. In this work, we explore whether we can improve LLM efficiency (e.g., by reducing memory consumption) without compromising on recall. By applying experiments and theory to a broad set of architectures, we identify a key tradeoff between a model's state size and recall ability. We show that efficient alternatives to attention (e.g., H3, Mamba, RWKV) maintain a fixed-size recurrent state but struggle at recall. We propose BASED, a simple architecture combining linear attention and sliding window attention. By varying the BASED window size and linear attention feature dimension, we can dial the state size and traverse the Pareto frontier of the recall-memory tradeoff, recovering the full quality of attention on one end and the small state size of attention alternatives on the other. We train LLMs up to 1.3b parameters and show that BASED matches the strongest sub-quadratic models (e.g., Mamba) in perplexity and outperforms them on real-world recall-intensive tasks by 6.22 accuracy points. Implementations of linear attention are often less efficient than optimized standard attention implementations. To make BASED competitive, we develop IO-aware algorithms that enable 24x higher throughput on language generation than FlashAttention-2 when generating 1024 tokens with 1.3b parameter models. Code for this work is provided at: https://github.com/HazyResearch/based.
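The two components named above, causal linear attention with a small feature dimension and exact softmax attention over a short sliding window, can be sketched directly. The snippet below is a minimal PyTorch illustration under assumed tensor shapes and an assumed elu-plus-one feature map; the paper's actual feature map, layer arrangement, and IO-aware kernels differ (see the repository linked above for the official implementation).

```python
# Minimal sketch of the two primitives named in the abstract: causal linear
# attention with a small feature dimension, and exact softmax attention over a
# sliding window. Shapes and the elu+1 feature map are illustrative assumptions,
# not the paper's implementation.
import torch
import torch.nn.functional as F


def causal_linear_attention(q, k, v, feature_map=lambda x: F.elu(x) + 1):
    """Linear attention via prefix sums; the running (feature_dim x head_dim)
    statistics are the fixed-size recurrent state the abstract refers to."""
    q, k = feature_map(q), feature_map(k)                   # (B, H, T, F)
    kv = torch.einsum("bhtf,bhtd->bhtfd", k, v).cumsum(2)   # prefix sum of k^T v
    z = k.cumsum(2)                                         # prefix sum of k
    num = torch.einsum("bhtf,bhtfd->bhtd", q, kv)
    den = torch.einsum("bhtf,bhtf->bht", q, z).clamp(min=1e-6)
    return num / den.unsqueeze(-1)


def sliding_window_attention(q, k, v, window):
    """Exact softmax attention restricted to the previous `window` positions."""
    T, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5                # (B, H, T, T)
    idx = torch.arange(T, device=q.device)
    banned = (idx[None, :] > idx[:, None]) | (idx[:, None] - idx[None, :] >= window)
    scores = scores.masked_fill(banned, float("-inf"))
    return scores.softmax(dim=-1) @ v


if __name__ == "__main__":
    B, H, T, D = 2, 4, 128, 16
    q, k, v = (torch.randn(B, H, T, D) for _ in range(3))
    print(causal_linear_attention(q, k, v).shape)            # torch.Size([2, 4, 128, 16])
    print(sliding_window_attention(q, k, v, window=64).shape)
```

Shrinking the linear attention feature dimension or the sliding window width shrinks the state that must be carried during generation, which is the knob the abstract describes for traversing the recall-memory tradeoff.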
- Zoology: Measuring and improving recall in efficient language models. International Conference on Learning Representations, 2023a.
- Attention is all you need. volume 30, 2017.
- In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022.
- Pretraining without attention. arXiv preprint arXiv:2212.10544, 2022.
- Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
- Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635, 2023.
- Hyena hierarchy: Towards larger convolutional language models. arXiv preprint arXiv:2302.10866, 2023a.
- RWKV: Reinventing RNNs for the transformer era. arXiv:2305.13048, 2023.
- Hungry Hungry Hippos: Towards language modeling with state space models. In International Conference on Learning Representations, 2023a.
- Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
- Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. 2023.
- FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022.
- The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=4g02l2N2Nx.
- On the computational complexity of self-attention. In 34th International Conference on Algorithmic Learning Theory, volume 201, pages 1–23, 2023.
- Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288, 2023.
- A framework for few-shot language model evaluation, December 2023. URL https://zenodo.org/records/10256836.
- Image transformer. In International conference on machine learning, pages 4055–4064. PMLR, 2018.
- Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
- Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
- Transformers are RNNs: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020.
- Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
- Hybrid random features. arXiv preprint arXiv:2110.04367, 2021.
- cosFormer: Rethinking softmax in attention. arXiv preprint arXiv:2202.08791, 2022a.
- Finetuning pretrained transformers into RNNs. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10630–10643, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.830. URL https://aclanthology.org/2021.emnlp-main.830.
- Linear transformers are secretly fast weight programmers. In International Conference on Machine Learning, pages 9355–9366. PMLR, 2021.
- Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
- Retentive network: A successor to transformer for large language models, 2023.
- In-context language learning: Architectures and algorithms. 2024. URL https://arxiv.org/abs/2401.12973.
- The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
- Simple hardware-efficient long convolutions for sequence modeling. arXiv preprint arXiv:2302.06646, 2023b.
- Transformer dissection: a unified understanding of transformer’s attention via the lens of kernel. arXiv preprint arXiv:1908.11775, 2019.
- NVIDIA. NVIDIA H100 Tensor Core GPU architecture, 2022.
- Fast transformers with clustered attention. 2020.
- RoFormer: Enhanced transformer with rotary position embedding, 2023.
- Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals. Nature Medicine, 27:1–3, April 2021.
- DOM-LM: Learning generalizable representations for HTML documents. 2022.
- Language models enable simple systems for generating structured views of heterogeneous data lakes. arXiv:2304.09433, 2023b.
- SuperGLUE: a stickier benchmark for general-purpose language understanding systems. Curran Associates Inc., Red Hook, NY, USA, 2019.
- HyenaDNA: Long-range genomic sequence modeling at single nucleotide resolution, 2023.
- NVIDIA. Getting started with CUDA graphs, 2019. URL https://developer.nvidia.com/blog/cuda-graphs/.
- Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
- Big Bird: Transformers for longer sequences. Proceedings of NeurIPS, 2020.
- Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
- Long-short transformer: Efficient transformers for language and vision. Advances in neural information processing systems, 34:17723–17736, 2021.
- Sumformer: Universal approximation for efficient transformers. arXiv preprint arXiv:2307.02301, 2023.
- Efficient transformers: A survey. ACM Computing Surveys, 55(6):1–28, 2022.
- Nyströmformer: A Nyström-based algorithm for approximating self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 14138–14148, 2021.
- Skyformer: Remodel self-attention with Gaussian kernel and Nyström method. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021a. URL https://openreview.net/forum?id=pZCYG7gjkKz.
- An exploration of softmax alternatives belonging to the spherical loss family. arXiv preprint arXiv:1511.05042, 2015.
- Scatterbrain: Unifying sparse and low-rank attention approximation. arXiv preprint arXiv:2110.15343, 2021b.
- The devil in linear transformer. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7025–7041, Abu Dhabi, United Arab Emirates, December 2022b. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.473. URL https://aclanthology.org/2022.emnlp-main.473.
- Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.
- An algorithm for the machine calculation of complex fourier series. Mathematics of computation, 19(90):297–301, 1965.
- CKConv: Continuous kernel convolution for sequential data. 2022.
- Diagonal state spaces are as effective as structured state spaces, 2022.
- On the parameterization and initialization of diagonal state space models, 2022.
- Long range language modeling via gated state spaces, 2022.
- Mega: Moving average equipped gated attention. arXiv preprint arXiv:2209.10655, 2022.
- A mathematical framework for transformer circuits. Transformer Circuits Thread, 1, 2021.
- Language modeling with gated convolutional networks. In International conference on machine learning, pages 933–941. PMLR, 2017.
- Sparse modular activation for efficient sequence modeling. arXiv preprint arXiv:2306.11197, 2023.
- Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889, 2023.
- Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
- FlashFFTConv: Efficient convolutions for long sequences with tensor cores. arXiv preprint arXiv:2311.05908, 2023c.
- Self-attention does not need O(n^2) memory. arXiv preprint arXiv:2112.05682, 2021.
- MnnFast: A fast and scalable system architecture for memory-augmented neural networks. In 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), pages 250–263, 2019.
- Blockwise parallel transformer for long context large models. arXiv preprint arXiv:2305.19370, 2023.
- Transformer quality in linear time. In International Conference on Machine Learning, pages 9099–9117. PMLR, 2022.
- StripedHyena: Moving Beyond Transformers with Hybrid Signal Processing Models. December 2023b. doi: 10.57967/hf/1595. URL https://github.com/togethercomputer/stripedhyena.
- Gaussian error linear units (GELUs), 2023.
- Genomic benchmarks: A collection of datasets for genomic sequence classification. bioRxiv, 2022. doi: 10.1101/2022.06.08.495248. URL https://www.biorxiv.org/content/early/2022/06/10/2022.06.08.495248.
- The LAMBADA dataset: Word prediction requiring a broad discourse context, 2016.
- HellaSwag: Can a machine really finish your sentence?, 2019.
- PIQA: Reasoning about physical commonsense in natural language, 2019.
- Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge, 2018.
- WinoGrande: An adversarial Winograd schema challenge at scale, 2019.
- OpenCeres: When open information extraction meets the semi-structured web. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3047–3056, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1309. URL https://aclanthology.org/N19-1309.
- Algebraic complexity theory, volume 315. Springer Science & Business Media, 2013.
- The one-way communication complexity of hamming distance. Theory of Computing, 4(1):129–135, 2008.
- Swastik Kopparty. Topics in algorithms and complexity theory: Spring 2020. 2020.
Authors: Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, Christopher Ré