Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference (2403.09054v2)

Published 14 Mar 2024 in cs.LG, cs.AI, cs.AR, and cs.CL

Abstract: Transformers have emerged as the underpinning architecture for LLMs. In generative LLMs, the inference process involves two primary phases: prompt processing and token generation. Token generation, which constitutes the majority of the computational workload, primarily entails vector-matrix multiplications and interactions with the Key-Value (KV) cache. This phase is constrained by memory bandwidth due to the overhead of transferring weights and KV cache values from the memory system to the computing units. This memory bottleneck becomes particularly pronounced in applications that require long contexts and extensive text generation, both of which are increasingly crucial for LLMs. This paper introduces "Keyformer", an inference-time approach that mitigates the challenges associated with KV cache size and memory bandwidth utilization. Keyformer leverages the observation that approximately 90% of the attention weight in generative inference focuses on a specific subset of tokens, referred to as "key" tokens. Keyformer retains only these key tokens in the KV cache, identifying them with a novel score function. This approach reduces both the KV cache size and memory bandwidth usage without compromising model accuracy. We evaluate Keyformer across three foundational models: GPT-J, Cerebras-GPT, and MPT, which employ different positional embedding algorithms. Our assessment covers a variety of tasks, with a particular emphasis on summarization and conversation tasks involving extended contexts. By shrinking the KV cache, Keyformer reduces inference latency by 2.1x and improves token generation throughput by 2.4x, while preserving the model's accuracy.
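To make the idea concrete, here is a minimal sketch of key-token selection for a single attention head. It keeps a fixed budget of cached tokens: the most recent ones plus the tokens that receive the most attention from the current query. The function name, the recent-window heuristic, and the scoring rule are illustrative assumptions for this sketch only; the abstract describes Keyformer's actual criterion only as a "novel score function", so this is not the paper's method.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prune_kv_cache(keys, values, query, cache_budget, recent_window=4):
    """Keep only `cache_budget` tokens in the KV cache: the most recent
    `recent_window` tokens plus the highest-scoring older tokens, where the
    score here is simply the current query's attention weight (an assumption
    of this sketch, not Keyformer's score function).

    keys, values: (seq_len, d) arrays for one attention head.
    query:        (d,) vector for the current decoding step.
    """
    seq_len, d = keys.shape
    if seq_len <= cache_budget:
        return keys, values

    # Attention weights of the current query over all cached tokens.
    logits = keys @ query / np.sqrt(d)
    weights = softmax(logits)

    # Always keep the most recent tokens; fill the rest of the budget
    # with the older tokens that carry the most attention mass.
    recent = np.arange(seq_len - recent_window, seq_len)
    candidates = np.arange(seq_len - recent_window)
    n_key = cache_budget - recent_window
    key_tokens = candidates[np.argsort(weights[candidates])[-n_key:]]

    keep = np.sort(np.concatenate([key_tokens, recent]))
    return keys[keep], values[keep]

# Toy usage: a 16-token cache pruned to an 8-token budget.
rng = np.random.default_rng(0)
K, V = rng.normal(size=(16, 64)), rng.normal(size=(16, 64))
q = rng.normal(size=64)
K_small, V_small = prune_kv_cache(K, V, q, cache_budget=8)
print(K_small.shape, V_small.shape)  # (8, 64) (8, 64)
```

In a real decoder this selection would run per layer and per head at every generation step, and the scores would typically be accumulated across steps rather than taken from a single query, but the sketch shows why discarding non-key entries shrinks both the cache footprint and the memory traffic per generated token.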

Authors (6)
  1. Muhammad Adnan
  2. Akhil Arunkumar
  3. Gaurav Jain
  4. Prashant J. Nair
  5. Ilya Soloveychik
  6. Purushotham Kamath
Citations (28)