CORM: Cache Optimization with Recent Message for Large Language Model Inference (2404.15949v2)

Published 24 Apr 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs, despite their remarkable performance across a wide range of tasks, require substantial GPU memory and consume significant computational resources. Beyond the memory taken up by model weights, the memory used by the KV cache grows linearly with sequence length, becoming a primary bottleneck for inference. In this paper, we introduce an innovative method for optimizing the KV cache, which considerably reduces its memory footprint. Upon thorough investigation, we discover that in most Transformer models, (i) there is a striking similarity between adjacent tokens' query vectors, and (ii) the attention calculation of the current query can rely exclusively on the attention information of a small fraction of preceding queries. Based on these observations, we present CORM, a KV cache eviction policy that dynamically retains essential key-value pairs for inference without requiring model fine-tuning. Our validation shows that CORM reduces the inference memory usage of the KV cache by up to 70% with negligible performance degradation across six tasks in LongBench. Furthermore, we demonstrate that CORM is compatible with grouped-query attention (GQA) for a further increase in compression rate.
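To make the eviction idea concrete, below is a minimal PyTorch sketch of a recency-informed KV cache eviction step built on the two observations above. It is an illustration, not the paper's exact algorithm: the function name compress_kv_cache, the recent_window and threshold parameters, and all tensor shapes are assumptions chosen for clarity. The policy always keeps a trailing window of recent tokens and drops older key-value pairs that none of the last few queries attended to strongly, using recent queries as a proxy for future ones (justified by observation (i), the similarity of adjacent query vectors).

```python
import torch

def compress_kv_cache(keys, values, recent_attn, recent_window=32, threshold=0.0):
    """Sketch of a recency-informed KV cache eviction step (not CORM's exact rule).

    keys, values : [seq_len, head_dim] cached K/V for one attention head
    recent_attn  : [r, seq_len] attention weights of the last r queries
                   over the cached positions

    Positions inside the trailing `recent_window` are always kept; older
    positions are kept only if at least one recent query gave them an
    attention weight above `threshold`.
    """
    seq_len = keys.shape[0]
    # A position counts as important if any recent query attended to it above
    # the threshold; recent queries stand in for upcoming ones, since adjacent
    # query vectors are observed to be highly similar.
    important = (recent_attn > threshold).any(dim=0)          # bool [seq_len]
    # Always retain the most recent window of tokens.
    idx = torch.arange(seq_len)
    keep = important | (idx >= seq_len - recent_window)
    return keys[keep], values[keep], keep

# Toy usage with random tensors (shapes are illustrative only).
seq_len, head_dim, r = 512, 64, 8
keys   = torch.randn(seq_len, head_dim)
values = torch.randn(seq_len, head_dim)
recent_attn = torch.softmax(torch.randn(r, seq_len), dim=-1)
k2, v2, mask = compress_kv_cache(keys, values, recent_attn,
                                 recent_window=32, threshold=1.0 / seq_len)
print(f"kept {mask.sum().item()} of {seq_len} positions")
```

A threshold of 1/seq_len corresponds to the uniform-attention baseline, so positions attended to less than average are candidates for eviction; CORM's actual importance criterion over recent queries may differ, and this sketch only approximates it.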

References (34)
  1. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
  2. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  3. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
  4. LLM.int8(): 8-bit matrix multiplication for transformers at scale, 2022.
  5. Sheared LLaMA: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694, 2023.
  6. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
  7. Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
  8. Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019.
  9. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.
  10. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.
  11. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. Advances in Neural Information Processing Systems, 36, 2024.
  12. H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36, 2024.
  13. Transformers are multi-state RNNs. arXiv preprint arXiv:2401.06104, 2024.
  14. On the efficacy of eviction policy for key-value constrained generative language model inference. arXiv preprint arXiv:2402.06262, 2024.
  15. Are sixteen heads really better than one? Advances in Neural Information Processing Systems, 32, 2019.
  16. LongBench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023.
  17. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  18. Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507, 2019.
  19. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328, 2018.
  20. A dataset of information-seeking questions and answers anchored in research papers. arXiv preprint arXiv:2105.03011, 2021.
  21. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018.
  22. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. arXiv preprint arXiv:2011.01060, 2020.
  23. MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022.
  24. DuReader: A Chinese machine reading comprehension dataset from real-world applications. arXiv preprint arXiv:1711.05073, 2017.
  25. Efficient attentions for long document summarization. arXiv preprint arXiv:2104.02112, 2021.
  26. QMSum: A new benchmark for query-based multi-domain meeting summarization. arXiv preprint arXiv:2104.05938, 2021.
  27. Multi-News: A large-scale multi-document summarization dataset and abstractive hierarchical model. arXiv preprint arXiv:1906.01749, 2019.
  28. VCSum: A versatile Chinese meeting summarization dataset. arXiv preprint arXiv:2305.05280, 2023.
  29. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics, 2002.
  30. Task definition for large scale text categorization at NLPCC 2014. http://tcci.ccf.org.cn/conference/2014/dldoc/evatask6.pdf, 2014.
  31. SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization. arXiv preprint arXiv:1911.12237, 2019.
  32. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.
  33. LongCoder: A long-range pre-trained language model for code completion. In International Conference on Machine Learning, pages 12098–12107. PMLR, 2023.
  34. RepoBench: Benchmarking repository-level code auto-completion systems. arXiv preprint arXiv:2306.03091, 2023.
Authors (7)
  1. Jincheng Dai
  2. Zhuowei Huang
  3. Haiyun Jiang
  4. Chen Chen
  5. Deng Cai
  6. Wei Bi
  7. Shuming Shi
Citations (1)