Simple linear attention language models balance the recall-throughput tradeoff (2402.18668v1)

Published 28 Feb 2024 in cs.CL and cs.LG

Abstract: Recent work has shown that attention-based LLMs excel at recall, the ability to ground generations in tokens previously seen in context. However, the efficiency of attention-based models is bottlenecked during inference by the KV-cache's aggressive memory consumption. In this work, we explore whether we can improve LLM efficiency (e.g. by reducing memory consumption) without compromising on recall. By applying experiments and theory to a broad set of architectures, we identify a key tradeoff between a model's state size and recall ability. We show that efficient alternatives to attention (e.g. H3, Mamba, RWKV) maintain a fixed-size recurrent state, but struggle at recall. We propose BASED, a simple architecture combining linear and sliding window attention. By varying BASED window size and linear attention feature dimension, we can dial the state size and traverse the Pareto frontier of the recall-memory tradeoff curve, recovering the full quality of attention on one end and the small state size of attention-alternatives on the other. We train LLMs up to 1.3b parameters and show that BASED matches the strongest sub-quadratic models (e.g. Mamba) in perplexity and outperforms them on real-world recall-intensive tasks by 6.22 accuracy points. Implementations of linear attention are often less efficient than optimized standard attention implementations. To make BASED competitive, we develop IO-aware algorithms that enable 24x higher throughput on language generation than FlashAttention-2, when generating 1024 tokens using 1.3b parameter models. Code for this work is provided at: https://github.com/HazyResearch/based.

Authors (9)
  1. Simran Arora (64 papers)
  2. Sabri Eyuboglu (13 papers)
  3. Michael Zhang (81 papers)
  4. Aman Timalsina (6 papers)
  5. Silas Alberti (8 papers)
  6. Dylan Zinsley (2 papers)
  7. James Zou (232 papers)
  8. Atri Rudra (55 papers)
  9. Christopher Ré (194 papers)
Citations (38)

Summary

Simple Linear Attention LLMs Balance the Recall-Throughput Tradeoff

The paper "Simple linear attention LLMs balance the recall-throughput tradeoff" investigates the recall efficiency of attention-based LLMs and proposes a novel architecture named Based to enhance performance metrics by addressing the inherent memory consumption tradeoffs during inference.

Introduction and Problem Statement

Attention-based LLMs are well documented for their strong recall, effectively grounding generations in tokens seen earlier in the context. However, they suffer from significant memory inefficiency during inference because the KV-cache grows with the context. The paper centers on one question: can LLM efficiency, particularly memory consumption during inference, be improved without degrading recall?

Empirical Analysis and Tradeoffs

Empirical evaluations demonstrate a fundamental tradeoff between an LLM's recurrent state size (its memory consumption during inference) and its recall ability. Through a series of synthetic multi-query associative recall (MQAR) tasks and theoretical analyses, the authors characterize how a range of architectures trade recall quality against memory footprint.
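
To make the MQAR setup concrete, the following is a minimal sketch of what one such synthetic example might look like. This is an illustrative toy generator, not the paper's actual data pipeline; the token layout, vocabulary split, and function name are assumptions made here for clarity.

```python
import random

def make_mqar_example(num_pairs=4, num_queries=2, vocab=100, seed=0):
    """Build one toy multi-query associative recall (MQAR) prompt.

    The context lists key-value pairs; the queries repeat some of the keys,
    and the target for each query is the value originally bound to that key.
    """
    rng = random.Random(seed)
    keys = rng.sample(range(vocab), num_pairs)                # key tokens
    values = rng.sample(range(vocab, 2 * vocab), num_pairs)   # value tokens
    pairs = list(zip(keys, values))

    queried = rng.sample(pairs, num_queries)
    prompt = [tok for kv in pairs for tok in kv]   # k1 v1 k2 v2 ...
    prompt += [k for k, _ in queried]              # ... followed by the queries
    targets = [v for _, v in queried]              # values the model must recall
    return prompt, targets

prompt, targets = make_mqar_example()
print("prompt tokens  :", prompt)
print("expected values:", targets)
```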

Results:

  1. Recall-Memory Tradeoff:
    • Attention-based models achieve essentially perfect recall, but their state (the KV-cache) grows linearly with sequence length; a back-of-the-envelope comparison follows this list.
    • Efficient alternatives such as H3, Mamba, and RWKV maintain a fixed-size recurrent state but exhibit markedly weaker recall.
  2. Architecture Specifics:
    • Linear attention and sliding window attention alone fail to provide a satisfactory balance between memory and recall.
    • The combined architecture Based, incorporating both linear and sliding window attention, successfully traverses the Pareto frontier of the recall-memory tradeoff.
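
As referenced in point 1 above, a rough back-of-the-envelope calculation illustrates the memory gap. Every dimension below (24 layers, 16 heads, head size 64, fp16 storage, and a feature dimension of 273, i.e. 1 + 16 + 16² for an assumed 16-dimensional projection under the 2nd-order Taylor map) is a hypothetical placeholder, not the paper's exact 1.3b configuration.

```python
def kv_cache_bytes(seq_len, n_layers=24, n_heads=16, head_dim=64, dtype_bytes=2):
    """Rough KV-cache size for standard attention: keys and values per layer,
    growing linearly with the number of tokens kept in context."""
    return 2 * n_layers * n_heads * head_dim * seq_len * dtype_bytes

def linear_state_bytes(feature_dim=273, n_layers=24, n_heads=16, head_dim=64, dtype_bytes=2):
    """Rough recurrent-state size for Taylor linear attention: one
    (feature_dim x head_dim) matrix plus a feature_dim normalizer per head
    per layer, independent of sequence length."""
    return n_layers * n_heads * (feature_dim * head_dim + feature_dim) * dtype_bytes

for seq_len in (1024, 4096, 16384):
    print(f"seq_len={seq_len:6d}  "
          f"kv-cache ~ {kv_cache_bytes(seq_len) / 2**20:7.1f} MiB  "
          f"fixed linear-attention state ~ {linear_state_bytes() / 2**20:5.1f} MiB")
```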

The Based Architecture

Elements and Design:

  1. Linear Attention:
    • The core component is linear attention with a feature map based on a 2nd-order Taylor expansion of the softmax exponential (sketched in code after this subsection). This approximation preserves global token interactions while keeping a fixed-size recurrent state, whose size is controlled by the feature dimension (d').
  2. Sliding Window Attention:
    • Sliding window attention handles local token interactions exactly; small windows (e.g., 64 tokens) keep memory and latency low, while the linear attention component covers long-range dependencies.

The resulting model traverses the recall-memory tradeoff effectively: by adjusting the window size and feature dimension, it approaches the recall quality of full attention while keeping memory use close to that of fixed-state alternatives.
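
A minimal numpy sketch of the two components may help make this concrete. It follows the description above (a 2nd-order Taylor feature map for linear attention, plus exact attention over a short window), but the multi-head layout, normalization details, how the operators are interleaved across layers, and the IO-aware kernels in the official repository are all omitted; treat it as illustrative pseudocode rather than the authors' implementation.

```python
import numpy as np

def taylor_feature_map(x):
    """2nd-order Taylor feature map: phi(x) = [1, x, (x outer x) / sqrt(2)],
    so that phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2, an approximation of exp(q.k)."""
    n, d = x.shape
    outer = np.einsum("ni,nj->nij", x, x).reshape(n, d * d) / np.sqrt(2.0)
    return np.concatenate([np.ones((n, 1)), x, outer], axis=-1)

def causal_linear_attention(q, k, v):
    """Causal linear attention: the running sums act as a fixed-size recurrent state."""
    fq, fk = taylor_feature_map(q), taylor_feature_map(k)
    n, d_v = v.shape
    state = np.zeros((fq.shape[1], d_v))   # running sum of phi(k_j) v_j^T
    norm = np.zeros(fq.shape[1])           # running sum of phi(k_j)
    out = np.empty_like(v)
    for i in range(n):
        state += np.outer(fk[i], v[i])
        norm += fk[i]
        out[i] = fq[i] @ state / (fq[i] @ norm + 1e-6)
    return out

def sliding_window_attention(q, k, v, window=64):
    """Exact softmax attention restricted to the most recent `window` tokens."""
    n, d = q.shape
    out = np.empty_like(v)
    for i in range(n):
        lo = max(0, i - window + 1)
        scores = q[i] @ k[lo:i + 1].T / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        out[i] = (weights / weights.sum()) @ v[lo:i + 1]
    return out

# Tiny smoke test with random inputs (shapes only; no trained weights).
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((128, 16)) for _ in range(3))
print(causal_linear_attention(q, k, v).shape, sliding_window_attention(q, k, v).shape)
```

Because phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2, the running sums in `causal_linear_attention` form a state whose size depends only on the feature dimension, not on how many tokens have been processed; in the full model these operator types are mixed across layers along with the other components described in the paper.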

Theoretical Foundations

The authors establish lower bounds on the memory a recurrent model must carry to perform recall, underscoring that the tradeoff is fundamental rather than an artifact of particular architectures. Applying results from communication complexity theory, they also relate Based to the canonical gated-convolution architecture (BaseConv) through simulation arguments on recall tasks, and they bound the model's space complexity and the number of layers required to compute exact recall.
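
Stated very informally (the precise hypotheses, constants, and accompanying upper bounds are in the paper), the lower bound has roughly the following shape; the formulation below is a paraphrase for intuition, not a quotation of the theorem.

```latex
% Informal paraphrase of the memory lower bound; see the paper for the formal statement.
\textbf{Claim (informal).} Any recurrent model that processes a length-$N$ sequence
token by token while carrying $b$ bits of state between tokens, and that answers
associative-recall queries exactly on all inputs, must satisfy
\[
  b \;=\; \Omega(N).
\]
% In other words, a state whose size does not grow with the context length
% cannot guarantee exact recall.
```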

Experimental Results

Evaluations of LLMs trained at up to 1.3 billion parameters show that Based:

  1. Matches the strongest sub-quadratic architectures like Mamba in perplexity scores.
  2. Outperforms them on real-world, recall-intensive tasks by 6.22 accuracy points.
  3. Demonstrates up to 24x higher throughput on language generation than FlashAttention-2 when generating 1024 tokens with 1.3b-parameter models (see the sketch after this list).
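
Part of the intuition behind the throughput result is that, during generation, the state Based carries stays bounded: linear-attention layers update a fixed-size matrix and sliding-window layers keep only the last w tokens' keys and values. The sketch below shows that bookkeeping only, with both kinds of state collapsed into one illustrative object (in the model they belong to different layers); the class and method names are hypothetical, and the reported 24x speedup additionally depends on the IO-aware kernels released with the paper.

```python
import numpy as np
from collections import deque

class BoundedDecodeState:
    """Illustrative decode-time state: memory depends on the feature dimension
    and the window size, never on how many tokens have been generated."""

    def __init__(self, feature_dim, head_dim, window=64):
        self.lin_state = np.zeros((feature_dim, head_dim))  # running sum of phi(k) v^T
        self.lin_norm = np.zeros(feature_dim)               # running sum of phi(k)
        self.window_kv = deque(maxlen=window)               # last `window` (k, v) pairs

    def update(self, phi_k, k, v):
        # Linear-attention layers: constant-size update per generated token.
        self.lin_state += np.outer(phi_k, v)
        self.lin_norm += phi_k
        # Sliding-window layers: fixed-capacity buffer; the oldest pair is evicted.
        self.window_kv.append((k, v))
```

Contrast this with standard attention, where the KV-cache gains one (k, v) pair per layer for every generated token, so both memory use and per-token latency grow with the length of the generation.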

Implications and Future Directions

Practical Implications:

  • The findings demonstrate that Based is practical for real-world deployment, pairing high recall accuracy with substantially higher generation throughput.
  • Applications include information extraction, reading comprehension, and code generation, where accurately recalling tokens from the context directly improves task outcomes.

Theoretical Contributions:

  • The development and analysis of Based contribute to the theoretical understanding of the memory-recall tradeoff.
  • Future research could tune the state size to the needs of specific downstream tasks, investigate simpler feature maps, or extend the architecture to capture broader input dependencies.

Speculative Future Trends in AI:

  • As the field moves towards models balancing extensive context with efficient processing, architectures like Based set a precedent for future innovations targeting computational sustainability and accuracy.
  • Developments in AI hardware and specialized accelerators could further reduce the overhead that currently limits certain approximation methods.

By charting the Pareto frontier of the recall-throughput tradeoff, this paper advances the broader discussion of model efficiency and opens avenues for optimizing performance on complex, real-world language generation tasks.
