H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models (2306.14048v3)

Published 24 Jun 2023 in cs.LG

Abstract: LLMs, despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long-content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the KV cache, is stored in GPU memory in addition to model parameters, scaling linearly with the sequence length and batch size. In this paper, we introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint. Our approach is based on the noteworthy observation that a small portion of tokens contributes most of the value when computing attention scores. We call these tokens Heavy Hitters (H$_2$). Through a comprehensive investigation, we find that (i) the emergence of H$_2$ is natural and strongly correlates with the frequent co-occurrence of tokens in the text, and (ii) removing them results in significant performance degradation. Based on these insights, we propose Heavy Hitter Oracle (H$_2$O), a KV cache eviction policy that dynamically retains a balance of recent and H$_2$ tokens. We formulate the KV cache eviction as a dynamic submodular problem and prove (under mild assumptions) a theoretical guarantee for our novel eviction algorithm which could help guide future work. We validate the accuracy of our algorithm with OPT, LLaMA, and GPT-NeoX across a wide range of tasks. Our implementation of H$_2$O with 20% heavy hitters improves the throughput over three leading inference systems DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen by up to 29$\times$, 29$\times$, and 3$\times$ on OPT-6.7B and OPT-30B. With the same batch size, H2O can reduce the latency by up to 1.9$\times$. The code is available at https://github.com/FMInference/H2O.

This work tackles a practical challenge in using LLMs to generate long texts. During generation, the model must save intermediate “key” and “value” representations for every token it has seen (together known as the KV cache) so they can be reused when predicting the next token. As a conversation or story grows longer, the memory needed to store these KV pairs increases linearly, which puts heavy demands on GPU memory and slows inference.

Below is an explanation of the main ideas behind the approach:

Understanding the KV Cache Problem

  • During text generation, LLMs compute an “attention” score for every previous token when predicting a new one. This means if you have, say, a thousand tokens already, the system must look at all thousand of them again at each step.
  • The intermediate results for each token, the key–value embeddings, are stored in a cache. The memory required grows linearly with the number of tokens, which makes inference on long texts expensive and sometimes impractical; a back-of-the-envelope size estimate is sketched below.
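
To make the scale concrete, here is a rough estimate of KV-cache memory as a function of sequence length and batch size. It is a minimal sketch, not code from the paper; the model shapes (48 layers, 56 heads of dimension 128, fp16) are illustrative assumptions that roughly match OPT-30B.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, num_heads, head_dim, bytes_per_elem=2):
    """Each cached token stores one key and one value vector per layer (hence the factor 2)."""
    return 2 * batch_size * seq_len * num_layers * num_heads * head_dim * bytes_per_elem

# Illustrative OPT-30B-like shapes: 4096-token context, batch size 64, fp16 activations.
size = kv_cache_bytes(batch_size=64, seq_len=4096, num_layers=48,
                      num_heads=56, head_dim=128, bytes_per_elem=2)
print(f"KV cache: {size / 2**30:.1f} GiB")  # grows linearly in seq_len and batch_size
```

Even at these moderate sizes the cache runs to hundreds of gigabytes, which is why shrinking it translates directly into larger feasible batch sizes and higher throughput.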

Key Observations That Inspire the Approach

  • Although the models are trained with full dense attention (looking at every token in the past), in practice the attention scores tend to be very “sparse.” In other words, only a small fraction of the previous tokens really matter for predicting the next word.
  • Empirically, the aggregated attention scores follow a power-law distribution: a small set of tokens (the “heavy hitters”) receives most of the attention mass when the model computes what to say next. The bookkeeping behind this observation is sketched below.
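
The sketch below shows the kind of bookkeeping involved: sum each key position's attention weights over heads and query steps, then check how much of the total mass the top 20% of positions hold. Random tensors are used only so the snippet runs on its own; the power-law concentration reported in the paper appears when the attention maps come from a real pretrained model.

```python
import torch

torch.manual_seed(0)
seq_len, num_heads, head_dim = 128, 4, 64

# Stand-in queries and keys; in practice these come from a pretrained model's attention layers.
q = torch.randn(num_heads, seq_len, head_dim)
k = torch.randn(num_heads, seq_len, head_dim)
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

scores = (q @ k.transpose(-1, -2)) / head_dim ** 0.5
attn = torch.softmax(scores.masked_fill(~causal, float("-inf")), dim=-1)

# Accumulated attention each key position receives, summed over heads and query rows.
acc = attn.sum(dim=(0, 1))                       # shape: [seq_len]
top = torch.topk(acc, k=int(0.2 * seq_len))      # candidate "heavy hitter" positions
print("share of attention mass held by the top 20% of positions:",
      float(top.values.sum() / acc.sum()))
```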

The Heavy-Hitter Oracle (H2O) Approach

  • Recognizing that only a small group of tokens is really influential, the proposed method constructs a smarter eviction strategy for the KV cache. Instead of keeping all tokens or simply keeping only the most recent ones, H2O dynamically decides which tokens are the “heavy hitters.”
  • At every generation step, the method examines the attention scores and uses a greedy algorithm to determine which token to remove from the cache if necessary. The greedy algorithm is chosen because it is efficient and, under certain assumptions about the structure of attention (specifically, if it behaves like a submodular function), it can be nearly optimal.
  • The strategy blends the retention of recent tokens with that of tokens observed to carry high accumulated attention scores. In practice, even when the cache is shrunk to only 20% of its original memory requirement, the method maintains the quality of the generated text; a simplified version of the eviction rule is sketched below.
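
Below is a minimal sketch of that eviction rule as described above: always keep a window of the most recent tokens, then fill the remaining budget with the tokens whose accumulated attention scores are largest. The function name, its signature, and the assumption that acc_scores is indexed by absolute token position are illustrative choices, not the API of the released repository.

```python
import torch

def h2o_keep_set(acc_scores, cache_positions, budget, recent_window):
    """Return the token positions to keep in the KV cache (simplified H2O-style rule).

    acc_scores      -- 1-D tensor of accumulated attention per absolute token position
    cache_positions -- 1-D tensor of currently cached positions, in ascending order
    budget          -- total number of KV entries the cache is allowed to hold
    recent_window   -- number of most-recent tokens that are always retained
    """
    if cache_positions.numel() <= budget:
        return cache_positions                    # cache still fits, nothing to evict

    recent = cache_positions[-recent_window:]     # local context is always kept
    older = cache_positions[:-recent_window]

    # Greedily keep the older tokens with the largest accumulated attention
    # ("heavy hitters"), using whatever budget the recent window leaves over.
    n_heavy = budget - recent_window
    heavy = older[torch.topk(acc_scores[older], k=n_heavy).indices]

    return torch.cat([heavy.sort().values, recent])
```

At each decoding step, acc_scores would be updated with the newly computed attention weights before this rule is applied, so the cache stays at a fixed size while both the local window and the heavy hitters survive.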

Benefits and Theoretical Guarantees

  • Thanks to this approach, the memory footprint of the KV cache can be reduced significantly, leading to faster and more efficient inference.
  • The researchers provide a formal statement showing that, under some mild assumptions, the greedy algorithm’s performance is close to that of an ideal strategy. The idea is formulated as a “dynamic submodular maximization” problem, a way of mathematically capturing the goal of efficiently choosing a small but effective subset; the classical bound this style of analysis builds on is recalled after this list.
  • Experimental results on several LLM families (OPT, LLaMA, and GPT-NeoX) across a wide range of tasks show that H2O not only reduces memory use but also improves throughput (tokens generated per second) substantially: up to 29× over DeepSpeed Zero-Inference and Hugging Face Accelerate and up to 3× over FlexGen on OPT-6.7B and OPT-30B, with up to 1.9× lower latency at the same batch size, while preserving or even slightly improving the quality of the generated text.
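
For orientation, the classical result that this style of analysis builds on is the greedy guarantee for maximizing a monotone submodular set function under a cardinality constraint (Nemhauser et al., 1978). The paper’s own theorem is stated for its dynamic variant and under its own assumptions; the sketch below is only the standard static statement, with the cardinality k playing the role of the cache budget.

```latex
% Standard greedy guarantee for a monotone submodular set function f and budget k:
\[
  f(S_{\mathrm{greedy}}) \;\ge\; \Bigl(1 - \tfrac{1}{e}\Bigr) \max_{|S| \le k} f(S),
\]
% where S_greedy is built by repeatedly adding the element with the largest marginal gain.
```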

Practical Implications

  • This method is particularly important for applications that require long-context generation such as dialogue systems, story writing, or summarizing long documents.
  • By reducing the memory burden, systems can run faster and more economically, making advanced LLMs more accessible for a wider range of applications.

Overall, the paper presents an innovative solution to a key bottleneck in deploying LLMs. By identifying and retaining the critical “heavy hitter” tokens within the KV cache, the H2O approach allows for efficient text generation, reducing memory usage and increasing speed without sacrificing performance.

Authors (12)
  1. Zhenyu Zhang (249 papers)
  2. Ying Sheng (31 papers)
  3. Tianyi Zhou (172 papers)
  4. Tianlong Chen (202 papers)
  5. Lianmin Zheng (34 papers)
  6. Ruisi Cai (11 papers)
  7. Zhao Song (253 papers)
  8. Yuandong Tian (128 papers)
  9. Christopher Ré (194 papers)
  10. Clark Barrett (86 papers)
  11. Zhangyang Wang (374 papers)
  12. Beidi Chen (61 papers)