Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference (2403.09636v2)

Published 14 Mar 2024 in cs.CL

Abstract: Transformers have emerged as the backbone of LLMs. However, generation remains inefficient due to the need to store in memory a cache of key-value representations for past tokens, whose size scales linearly with the input sequence length and batch size. As a solution, we propose Dynamic Memory Compression (DMC), a method for online key-value cache compression at inference time. Most importantly, the model learns to apply different compression ratios in different heads and layers. We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to 7x throughput increase during auto-regressive inference on an NVIDIA H100 GPU. DMC is applied via continued pre-training on a negligible percentage of the original data without adding any extra parameters. DMC preserves the original downstream performance with up to 4x cache compression, outperforming up-trained grouped-query attention (GQA) and key-value eviction policies (H$_2$O, TOVA). GQA and DMC can be even combined to obtain compounded gains. Hence, DMC can serve as a drop-in replacement for KV caching in existing LLMs to fit longer contexts and larger batches within any given memory budget.

Summary

  • The paper introduces DMC, a dynamic compression technique that retrofits LLMs to reduce cache memory by up to 4x while preserving model performance.
  • It details an online method that learns layer- and head-specific compression rates and adjusts key-value storage dynamically during inference.
  • The research demonstrates that DMC boosts inference throughput by approximately 3.7x, enabling more efficient processing in resource-constrained environments.

Retrofitting LLMs for Efficient Inference with Dynamic Memory Compression

Introduction to Dynamic Memory Compression

Transformer-based LLMs such as GPT and Llama 2 have become central to many NLP tasks. However, their deployment is constrained by inference inefficiency: auto-regressive generation requires storing key-value representations of past tokens in a cache whose size grows linearly with sequence length and batch size. This paper introduces Dynamic Memory Compression (DMC), a technique for compressing this key-value cache online at inference time. Unlike previous methods that trade performance for efficiency, DMC learns to adjust compression rates across different heads and layers, compressing the cache dynamically without adding extra parameters or significantly sacrificing downstream performance.
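To make the memory pressure concrete, here is a back-of-the-envelope estimate (not from the paper). It assumes the publicly documented Llama 2 7B configuration (32 layers, 32 key-value heads, head dimension 128) and 16-bit cache entries; the function name and the 4x figure used for comparison are illustrative.

```python
def kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                   seq_len=4096, batch_size=1, bytes_per_elem=2):
    """Size of the key-value cache: keys and values are both stored, hence the 2x."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

full = kv_cache_bytes()
print(f"Uncompressed KV cache: {full / 2**30:.1f} GiB per 4k-token sequence")   # ~2.0 GiB
print(f"With 4x DMC compression: {full / 4 / 2**30:.1f} GiB per sequence")      # ~0.5 GiB
```

At batch size 32, the uncompressed cache alone would occupy roughly 64 GiB, which is why cache size, rather than model weights, often bounds serving throughput.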

Key Contributions

The research presents several noteworthy contributions:

  • DMC's Novel Approach: DMC performs online compression during inference, deciding how to compress the cache based on the content of the sequence. This contrasts with fixed compression rates and token-pruning strategies, offering a more flexible, context-sensitive solution.
  • Preserved Model Performance: When pre-trained LLMs such as Llama 2 (7B, 13B, and 70B) are retrofitted with DMC, they maintain their original downstream task performance with up to 4x cache compression. This is achieved with only a small amount of continued pre-training and no additional parameters.
  • Compatibility with Grouped Query Attention: DMC can be combined with Grouped Query Attention (GQA), and models that already use GQA obtain compounded gains, showcasing DMC's broad applicability.
  • Insights on Internal Model Structure: The learned compression schema reveals a preference for compressing higher layers more heavily, offering new insight into how the model processes information internally (see the sketch after this list for one way such per-layer ratios could be inspected).
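As a hypothetical way to visualize that last observation, the helper below computes per-layer and per-head compression ratios from a boolean array of append/accumulate decisions. The array shape, the random dummy data, and the helper itself are assumptions made for illustration, not an interface provided by the paper.

```python
import numpy as np

def compression_ratios(decisions: np.ndarray) -> np.ndarray:
    """decisions: bool array [n_layers, n_heads, seq_len]; True means the token's
    key-value pair was merged into the previous cache slot instead of appended.
    Returns tokens-seen / cache-slots-kept for each layer and head."""
    seq_len = decisions.shape[-1]
    slots_kept = seq_len - decisions.sum(axis=-1)  # each non-merge opens a new slot
    return seq_len / np.maximum(slots_kept, 1)

# Dummy decisions with a higher merge rate in upper layers, just to exercise the helper.
rng = np.random.default_rng(0)
merge_prob = np.linspace(0.3, 0.9, 32)[:, None, None]      # layers 0..31
decisions = rng.random((32, 32, 4096)) < merge_prob
print(compression_ratios(decisions).mean(axis=1).round(1))  # per-layer average ratios
```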

Methodology and Results

At each time step, DMC decides, separately for every attention head and layer, whether to append the current key-value pair to the cache as a new entry or to accumulate it into the most recent entry as a weighted average governed by a learned importance score. This behavior is learned through continued pre-training on a small fraction of the original data and is then reproduced exactly at inference time. Through this process, DMC LLMs achieve significant throughput increases (up to ~3.7x) compared to the original models without performance degradation.
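The sketch below illustrates such an append-or-accumulate update for a single head at inference time. It is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the decision alpha_t and importance weight omega_t have already been predicted by the model, and the function name, tensor layout, and normalization details are chosen for readability.

```python
import torch

def dmc_cache_update(k_cache, v_cache, weight_sum, k_t, v_t, alpha_t, omega_t):
    """One inference step of a DMC-style cache update for a single head (sketch).

    k_cache, v_cache: [n_slots, head_dim] compressed key/value cache.
    weight_sum: accumulated importance weight of the newest cache slot.
    alpha_t: binary decision (1 = merge into the newest slot, 0 = append).
    omega_t: importance weight predicted for the incoming token.
    """
    if alpha_t == 1 and k_cache.shape[0] > 0:
        # Merge: update the newest slot with a weighted running average.
        total = weight_sum + omega_t
        k_cache[-1] = (weight_sum * k_cache[-1] + omega_t * k_t) / total
        v_cache[-1] = (weight_sum * v_cache[-1] + omega_t * v_t) / total
        weight_sum = total
    else:
        # Append: the incoming key-value pair opens a new cache slot.
        k_cache = torch.cat([k_cache, k_t[None, :]])
        v_cache = torch.cat([v_cache, v_t[None, :]])
        weight_sum = omega_t
    return k_cache, v_cache, weight_sum

# Toy usage: an empty cache growing over four steps.
head_dim = 4
k_cache = torch.empty(0, head_dim)
v_cache = torch.empty(0, head_dim)
weight_sum = 0.0
for alpha in (0, 1, 1, 0):  # append, merge, merge, append -> 2 slots for 4 tokens
    k_cache, v_cache, weight_sum = dmc_cache_update(
        k_cache, v_cache, weight_sum,
        torch.randn(head_dim), torch.randn(head_dim), alpha, omega_t=1.0)
print(k_cache.shape)  # torch.Size([2, 4])
```

Because each head and layer makes its own decisions, the effective compression ratio can differ across the network, which is what the learned schema discussed above reflects.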

Comparative analysis with GQA highlighted DMC's superiority in both sample efficiency and final task performance, establishing it as a preferable choice for efficient Transformer deployment. The research further illustrated that DMC's benefits extend to various model scales and compression targets.

Implications and Future Directions

The development of DMC presents practical implications for deploying LLMs in resource-constrained environments. By reducing the memory load of the key-value cache, DMC enables longer context processing and larger batch sizes within the same memory budget, facilitating faster and more efficient inference.
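As a rough illustration of that trade-off (again an assumption-laden sketch rather than a result from the paper), the snippet below reuses the ~512 KiB-per-token figure from the earlier Llama 2 7B estimate and asks how many cached tokens, split between sequence length and batch size, fit into an arbitrary 16 GiB cache budget at different compression ratios.

```python
def max_cached_tokens(budget_gib, bytes_per_token=512 * 1024, compression_ratio=1.0):
    """Tokens (seq_len * batch_size) whose compressed KV entries fit in the budget."""
    return int(budget_gib * 2**30 * compression_ratio / bytes_per_token)

for cr in (1, 2, 4):
    print(f"{cr}x compression: {max_cached_tokens(16, compression_ratio=cr):,} "
          "cached tokens in a 16 GiB budget")
```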

This work opens various avenues for future exploration. Investigating DMC's applicability to a broader range of model architectures and tasks, its synergies with other efficiency-enhancing techniques, and deeper analysis of the learned compression schemata can provide further insights into making LLMs more accessible and environmentally sustainable.

Conclusion

Dynamic Memory Compression represents a significant step towards addressing the efficiency challenges of deploying LLMs in practical applications. By preserving model performance while substantially reducing the memory footprint, DMC paves the way for wider adoption and utility of LLMs across diverse computational settings.
