
KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing (2410.18517v1)

Published 24 Oct 2024 in cs.LG, cs.AI, and cs.CL

Abstract: The development of LLMs has significantly expanded model sizes, resulting in substantial GPU memory requirements during inference. The key and value storage of the attention map in the KV (key-value) cache accounts for more than 80% of this memory consumption. Most existing KV cache compression methods focus on intra-layer compression within a single Transformer layer, and few consider layer-wise compression. In this paper, we propose a plug-and-play method called KVSharer, which shares the KV cache between layers to achieve layer-wise compression. Rather than sharing based on higher similarity, as intuition might suggest, we discover a counterintuitive phenomenon: sharing dissimilar KV caches better preserves model performance. Experiments show that KVSharer can reduce KV cache computation by 30%, lowering memory consumption without significantly impacting model performance, and it achieves at least 1.3 times faster generation. Additionally, we verify that KVSharer is compatible with existing intra-layer KV cache compression methods, and combining both can further save memory.
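The sharing strategy described in the abstract can be illustrated with a short sketch: estimate how dissimilar each pair of layers' KV caches is on a small calibration run, then let a few layers reuse another layer's cache instead of storing their own. The snippet below is a minimal sketch under that reading, not the authors' implementation; the flattened-cosine dissimilarity measure, the tensor shapes, and helper names such as `build_sharing_map` are illustrative assumptions, and any validation the full method performs before adopting a sharing configuration is omitted here.

```python
# Illustrative sketch of layer-wise KV cache sharing (hypothetical helpers,
# not the paper's code). Assumes per-layer KV caches collected on a small
# calibration run, each a (key, value) pair of tensors shaped
# (batch, heads, seq_len, head_dim).
import torch
import torch.nn.functional as F

def flatten_kv(kv):
    """Concatenate a layer's key and value tensors into one flat vector."""
    k, v = kv
    return torch.cat([k.flatten(), v.flatten()])

def rank_pairs_by_dissimilarity(kv_caches):
    """Rank layer pairs from most to least dissimilar (1 - cosine similarity)."""
    flat = [flatten_kv(kv) for kv in kv_caches]
    pairs = []
    for i in range(len(flat)):
        for j in range(i + 1, len(flat)):
            sim = F.cosine_similarity(flat[i], flat[j], dim=0)
            pairs.append((1.0 - sim.item(), i, j))
    return sorted(pairs, reverse=True)  # most dissimilar pairs first

def build_sharing_map(kv_caches, num_to_share):
    """Greedily pick layers that will reuse another layer's KV cache."""
    sharing = {}  # replaced layer index -> source layer index
    for _, i, j in rank_pairs_by_dissimilarity(kv_caches):
        if len(sharing) >= num_to_share:
            break
        # Skip if layer j is already replaced or serves as a source,
        # or if layer i's own cache has itself been replaced.
        if j in sharing or j in sharing.values() or i in sharing:
            continue
        sharing[j] = i  # layer j drops its cache and reuses layer i's
    return sharing

# Toy example: random stand-in KV caches for an 8-layer model.
torch.manual_seed(0)
caches = [(torch.randn(1, 4, 16, 32), torch.randn(1, 4, 16, 32)) for _ in range(8)]
print(build_sharing_map(caches, num_to_share=2))
```

At inference time, a layer mapped in `sharing` would simply read the source layer's cached keys and values instead of writing its own, which is where the memory saving comes from in this sketch.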

