KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing (2410.18517v1)

Published 24 Oct 2024 in cs.LG, cs.AI, and cs.CL

Abstract: The development of LLMs has significantly expanded model sizes, resulting in substantial GPU memory requirements during inference. The key and value storage of the attention map in the KV (key-value) cache accounts for more than 80% of this memory consumption. Most existing KV cache compression methods focus on intra-layer compression within a single Transformer layer, and few consider layer-wise compression. In this paper, we propose a plug-and-play method called KVSharer, which shares the KV cache between layers to achieve layer-wise compression. Rather than intuitively sharing based on higher similarity, we discover a counterintuitive phenomenon: sharing dissimilar KV caches better preserves model performance. Experiments show that KVSharer can reduce KV cache computation by 30%, thereby lowering memory consumption without significantly impacting model performance, and it can also achieve at least 1.3x generation acceleration. Additionally, we verify that KVSharer is compatible with existing intra-layer KV cache compression methods, and combining both can further save memory.
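
The abstract describes the strategy only at a high level, so the following is a minimal sketch of how dissimilarity-driven layer-wise KV cache sharing could be planned. The Euclidean distance metric, the greedy pairing loop, the 30% sharing budget, and all function names are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of planning layer-wise KV cache sharing based on dissimilarity.
# Assumptions (not from the paper): Euclidean distance on flattened caches,
# a greedy most-dissimilar-first pairing, and a fixed sharing budget.
import torch


def layer_dissimilarity(kv_a: torch.Tensor, kv_b: torch.Tensor) -> float:
    """Dissimilarity between two layers' KV caches (assumed: Euclidean distance)."""
    return torch.dist(kv_a.flatten().float(), kv_b.flatten().float()).item()


def plan_sharing(layer_kv: list[torch.Tensor], share_ratio: float = 0.3) -> dict[int, int]:
    """Map each 'sharing' layer to the donor layer whose KV cache it will reuse.

    Layer pairs are ranked by dissimilarity (most dissimilar first) and greedily
    paired until roughly `share_ratio` of the layers reuse another layer's cache,
    so those layers no longer need to store their own KV tensors.
    """
    n = len(layer_kv)
    pairs = sorted(
        ((layer_dissimilarity(layer_kv[i], layer_kv[j]), i, j)
         for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    mapping: dict[int, int] = {}  # sharing layer -> donor layer
    budget = int(n * share_ratio)
    for _, i, j in pairs:
        if len(mapping) >= budget:
            break
        # Simplification: only pair layers not already involved in sharing.
        if i not in mapping and j not in mapping and i not in mapping.values() and j not in mapping.values():
            mapping[j] = i  # later layer reuses the earlier layer's cache
    return mapping


if __name__ == "__main__":
    # Toy "calibration" caches: one [seq_len, hidden] KV tensor per layer.
    torch.manual_seed(0)
    caches = [torch.randn(16, 64) for _ in range(8)]
    print(plan_sharing(caches))  # prints a sharer->donor mapping, e.g. {5: 2, 7: 0}
```

In a real setup the mapping would presumably be computed once on a small calibration set and then applied at inference time, with each sharing layer reading the donor layer's cached keys and values instead of writing its own.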
