Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference (2403.09636v2)
Abstract: Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient due to the need to store in memory a cache of key-value representations for past tokens, whose size scales linearly with the input sequence length and batch size. As a solution, we propose Dynamic Memory Compression (DMC), a method for online key-value cache compression at inference time. Most importantly, the model learns to apply different compression ratios in different heads and layers. We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to a 7x throughput increase during auto-regressive inference on an NVIDIA H100 GPU. DMC is applied via continued pre-training on a negligible percentage of the original data without adding any extra parameters. DMC preserves the original downstream performance with up to 4x cache compression, outperforming up-trained grouped-query attention (GQA) and key-value eviction policies (H$_2$O, TOVA). GQA and DMC can even be combined to obtain compounded gains. Hence, DMC can serve as a drop-in replacement for KV caching in existing LLMs to fit longer contexts and larger batches within any given memory budget.
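The abstract describes DMC as learning, separately per head and layer, whether to grow the KV cache with each new token or to fold the incoming key-value pair into an existing entry. The PyTorch sketch below illustrates that kind of online cache update under stated assumptions: a per-step append-vs-accumulate decision `alpha_t` and an importance weight `omega_t` drive a running weighted average over the last cache slot. The function name `dmc_cache_update`, the `z_cache` normalizer, and the exact weighting scheme are illustrative assumptions, not the paper's released code.

```python
import torch

def dmc_cache_update(k_cache, v_cache, z_cache, k_t, v_t, alpha_t, omega_t):
    """One decoding step of a DMC-style KV cache update for a single head.

    k_cache, v_cache: [slots, head_dim] compressed cache; z_cache: [slots]
    running sums of importance weights. alpha_t decides whether to append a
    new slot (1) or accumulate into the last one (0); omega_t weights the
    running average. Illustrative sketch only, not the paper's implementation.
    """
    if alpha_t > 0.5:
        # Append: open a fresh cache slot for the new key/value pair.
        k_cache = torch.cat([k_cache, k_t[None, :]], dim=0)
        v_cache = torch.cat([v_cache, v_t[None, :]], dim=0)
        z_cache = torch.cat([z_cache, omega_t[None]], dim=0)
    else:
        # Accumulate: merge the new key/value into the last slot as a
        # weighted running average, so the cache length does not grow.
        z_new = z_cache[-1] + omega_t
        k_cache[-1] = (z_cache[-1] * k_cache[-1] + omega_t * k_t) / z_new
        v_cache[-1] = (z_cache[-1] * v_cache[-1] + omega_t * v_t) / z_new
        z_cache[-1] = z_new
    return k_cache, v_cache, z_cache


# Toy usage: a 4-dim head starting from a single cached token.
head_dim = 4
k_cache = torch.randn(1, head_dim)
v_cache = torch.randn(1, head_dim)
z_cache = torch.ones(1)
k_t, v_t = torch.randn(head_dim), torch.randn(head_dim)
k_cache, v_cache, z_cache = dmc_cache_update(
    k_cache, v_cache, z_cache, k_t, v_t,
    alpha_t=torch.tensor(0.0),   # accumulate: cache stays at 1 slot
    omega_t=torch.tensor(0.7),
)
```

Because heads and layers make these decisions independently, each ends up with its own effective compression ratio, which is how the method reaches the reported average compression without a fixed global schedule.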
- GQA: Training generalized multi-query transformer models from multi-head checkpoints. ArXiv, abs/2305.13245, 2023.
- Dynamic context pruning for efficient and interpretable autoregressive transformers. ArXiv, abs/2305.15805, 2023.
- Neural machine translation by jointly learning to align and translate. ArXiv, abs/1409.0473, 2014.
- Longformer: The long-document transformer. ArXiv, abs/2004.05150, 2020.
- PIQA: Reasoning about physical commonsense in natural language. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05), Apr. 2020.
- Token merging: Your ViT but faster. ArXiv, abs/2210.09461, 2022.
- Evaluating large language models trained on code. ArXiv, abs/2107.03374, 2021.
- Generating long sequences with sparse transformers. ArXiv, abs/1904.10509, 2019.
- Rethinking attention with Performers. ArXiv, abs/2009.14794, 2020.
- BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
- Think you have solved question answering? try ARC, the AI2 reasoning challenge. ArXiv, abs/1803.05457, 2018.
- FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35. Curran Associates, Inc., 2022.
- Hungry hungry hippos: Towards language modeling with state space models. In The Eleventh International Conference on Learning Representations, 2023.
- Model tells you what to discard: Adaptive KV cache compression for LLMs. ArXiv, abs/2310.01801, 2023.
- Mamba: Linear-time sequence modeling with selective state spaces. ArXiv, abs/2312.00752, 2023.
- Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021.
- The curious case of neural text degeneration. ArXiv, abs/1904.09751, 2019.
- Mistral 7B. ArXiv, abs/2310.06825, 2023.
- Length-adaptive Transformer: Train once with length drop, use anytime with search. In Annual Meeting of the Association for Computational Linguistics, 2020.
- Efficient memory management for large language model serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles, 2023.
- Generating wikipedia by summarizing long sequences. ArXiv, abs/1801.10198, 2018.
- Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. ArXiv, abs/2305.17118, 2023.
- Learning to compress prompts with gist tokens. ArXiv, abs/2304.08467, 2023.
- Efficient large-scale language model training on GPU clusters using Megatron-LM. SC21: International Conference for High Performance Computing, Networking, Storage and Analysis, 2021.
- Efficient transformers with dynamic token pooling. In Annual Meeting of the Association for Computational Linguistics, 2022.
- Carbon emissions and large neural network training. ArXiv, abs/2104.10350, 2021.
- Efficiently scaling transformer inference. ArXiv, abs/2211.05102, 2022.
- Compressive transformers for long-range sequence modelling. ArXiv, abs/1911.05507, 2019.
- WinoGrande: An adversarial winograd schema challenge at scale. Commun. ACM, 64(9), 2021.
- Fast transformer decoding: One write-head is all you need. ArXiv, abs/1911.02150, 2019.
- High-throughput generative inference of large language models with a single GPU. In International Conference on Machine Learning, 2023.
- Llama 2: Open foundation and fine-tuned chat models. ArXiv, abs/2307.09288, 2023.
- Efficient methods for natural language processing: A survey. Transactions of the Association for Computational Linguistics, 11, 2022.
- Attention is all you need. In Neural Information Processing Systems, 2017.
- SpAtten: Efficient sparse attention architecture with cascade token and head pruning. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2021.
- HellaSwag: Can a machine really finish your sentence? In Korhonen, A., Traum, D., and Màrquez, L. (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, July 2019. Association for Computational Linguistics.
- Accelerating neural transformer via an average attention network. ArXiv, abs/1805.00631, 2018.
- H$_2$O: Heavy-hitter oracle for efficient generative inference of large language models. ArXiv, abs/2306.14048, 2023.