KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing (2410.18517v1)
Abstract: The development of LLMs has significantly expanded model sizes, resulting in substantial GPU memory requirements during inference. The KV (key-value) cache, which stores the attention keys and values, accounts for more than 80\% of this memory consumption. Most existing KV cache compression methods focus on intra-layer compression within a single Transformer layer, and few works consider layer-wise compression. In this paper, we propose a plug-and-play method called \textit{KVSharer}, which shares the KV cache between layers to achieve layer-wise compression. Rather than intuitively sharing based on higher similarity, we discover a counterintuitive phenomenon: sharing dissimilar KV caches better preserves model performance. Experiments show that \textit{KVSharer} can reduce KV cache computation by 30\%, thereby lowering memory consumption without significantly impacting model performance, while achieving at least 1.3 times generation acceleration. Additionally, we verify that \textit{KVSharer} is compatible with existing intra-layer KV cache compression methods, and combining the two can further save memory.
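To make the layer-wise, dissimilarity-driven sharing idea concrete, below is a minimal sketch of how such a strategy could be implemented. It is not the authors' released code: the function names, the use of Euclidean distance over flattened calibration-set KV caches, and the greedy pairing are illustrative assumptions based only on the abstract's description.

```python
import torch

def rank_layer_pairs_by_dissimilarity(kv_caches):
    """Rank all layer pairs by Euclidean distance between their flattened
    KV caches (e.g. averaged over a small calibration set) -- most dissimilar first."""
    flat = [kv.flatten().float() for kv in kv_caches]
    pairs = []
    for i in range(len(flat)):
        for j in range(i + 1, len(flat)):
            dist = torch.dist(flat[i], flat[j]).item()  # Euclidean distance
            pairs.append((dist, i, j))
    # Counterintuitive ordering: prefer the *most dissimilar* pairs for sharing.
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs

def select_sharing_map(pairs, target_shared):
    """Greedily pick layer pairs until `target_shared` layers reuse another
    layer's KV cache (e.g. ~30% of layers for ~30% KV-cache savings)."""
    share_map = {}          # layer j -> layer i whose KV cache it reuses
    for _, i, j in pairs:
        if len(share_map) >= target_shared:
            break
        if i in share_map or j in share_map:
            continue        # keep each layer in at most one sharing pair
        share_map[j] = i    # at inference time, layer j reads layer i's cache
    return share_map
```

In a full pipeline one would additionally check, on held-out calibration prompts, that the model's outputs remain close to the original before accepting a sharing map; that validation step is omitted here for brevity.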