Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference (2403.09636v2)
Abstract: Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient due to the need to store in memory a cache of key-value representations for past tokens, whose size scales linearly with the input sequence length and batch size. As a solution, we propose Dynamic Memory Compression (DMC), a method for online key-value cache compression at inference time. Most importantly, the model learns to apply different compression ratios in different heads and layers. We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to a 7x throughput increase during auto-regressive inference on an NVIDIA H100 GPU. DMC is applied via continued pre-training on a negligible percentage of the original data without adding any extra parameters. DMC preserves the original downstream performance with up to 4x cache compression, outperforming up-trained grouped-query attention (GQA) and key-value eviction policies (H$_2$O, TOVA). GQA and DMC can even be combined to obtain compounded gains. Hence, DMC can serve as a drop-in replacement for KV caching in existing LLMs to fit longer contexts and larger batches within any given memory budget.
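The abstract describes DMC as learning, separately per head and layer, whether to grow the KV cache with each new token or to fold the incoming key-value pair into an existing entry. The PyTorch sketch below illustrates that kind of online cache update under stated assumptions: a per-step append-vs-accumulate decision `alpha_t` and an importance weight `omega_t` drive a running weighted average over the last cache slot. The function name `dmc_cache_update`, the `z_cache` normalizer, and the exact weighting scheme are illustrative assumptions, not the paper's released code.

```python
import torch

def dmc_cache_update(k_cache, v_cache, z_cache, k_t, v_t, alpha_t, omega_t):
    """One decoding step of a DMC-style KV cache update for a single head.

    k_cache, v_cache: [slots, head_dim] compressed cache; z_cache: [slots]
    running sums of importance weights. alpha_t decides whether to append a
    new slot (1) or accumulate into the last one (0); omega_t weights the
    running average. Illustrative sketch only, not the paper's implementation.
    """
    if alpha_t > 0.5:
        # Append: open a fresh cache slot for the new key/value pair.
        k_cache = torch.cat([k_cache, k_t[None, :]], dim=0)
        v_cache = torch.cat([v_cache, v_t[None, :]], dim=0)
        z_cache = torch.cat([z_cache, omega_t[None]], dim=0)
    else:
        # Accumulate: merge the new key/value into the last slot as a
        # weighted running average, so the cache length does not grow.
        z_new = z_cache[-1] + omega_t
        k_cache[-1] = (z_cache[-1] * k_cache[-1] + omega_t * k_t) / z_new
        v_cache[-1] = (z_cache[-1] * v_cache[-1] + omega_t * v_t) / z_new
        z_cache[-1] = z_new
    return k_cache, v_cache, z_cache


# Toy usage: a 4-dim head starting from a single cached token.
head_dim = 4
k_cache = torch.randn(1, head_dim)
v_cache = torch.randn(1, head_dim)
z_cache = torch.ones(1)
k_t, v_t = torch.randn(head_dim), torch.randn(head_dim)
k_cache, v_cache, z_cache = dmc_cache_update(
    k_cache, v_cache, z_cache, k_t, v_t,
    alpha_t=torch.tensor(0.0),   # accumulate: cache stays at 1 slot
    omega_t=torch.tensor(0.7),
)
```

Because heads and layers make these decisions independently, each ends up with its own effective compression ratio, which is how the method reaches the reported average compression without a fixed global schedule.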
- GQA: Training generalized multi-query transformer models from multi-head checkpoints. ArXiv, abs/2305.13245, 2023.
- Dynamic context pruning for efficient and interpretable autoregressive transformers. ArXiv, abs/2305.15805, 2023.
- Neural machine translation by jointly learning to align and translate. ArXiv, abs/1409.0473, 2014.
- Longformer: The long-document transformer. ArXiv, abs/2004.05150, 2020.
- PIQA: Reasoning about physical commonsense in natural language. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05), Apr. 2020.
- Token merging: Your ViT but faster. ArXiv, abs/2210.09461, 2022.
- Evaluating large language models trained on code. ArXiv, abs/2107.03374, 2021.
- Generating long sequences with sparse transformers. ArXiv, abs/1904.10509, 2019.
- Rethinking attention with Performers. ArXiv, abs/2009.14794, 2020.
- BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
- Think you have solved question answering? try ARC, the AI2 reasoning challenge. ArXiv, abs/1803.05457, 2018.
- FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35. Curran Associates, Inc., 2022.
- Hungry hungry hippos: Towards language modeling with state space models. In The Eleventh International Conference on Learning Representations, 2023.
- Model tells you what to discard: Adaptive KV cache compression for LLMs. ArXiv, abs/2310.01801, 2023.
- Mamba: Linear-time sequence modeling with selective state spaces. ArXiv, abs/2312.00752, 2023.
- Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021.
- The curious case of neural text degeneration. ArXiv, abs/1904.09751, 2019.
- Mistral 7B. ArXiv, abs/2310.06825, 2023.
- Length-adaptive Transformer: Train once with length drop, use anytime with search. In Annual Meeting of the Association for Computational Linguistics, 2020.
- Efficient memory management for large language model serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles, 2023.
- Generating wikipedia by summarizing long sequences. ArXiv, abs/1801.10198, 2018.
- Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. ArXiv, abs/2305.17118, 2023.
- Learning to compress prompts with gist tokens. ArXiv, abs/2304.08467, 2023.
- Efficient large-scale language model training on GPU clusters using Megatron-LM. SC21: International Conference for High Performance Computing, Networking, Storage and Analysis, 2021.
- Efficient transformers with dynamic token pooling. In Annual Meeting of the Association for Computational Linguistics, 2022.
- Carbon emissions and large neural network training. ArXiv, abs/2104.10350, 2021.
- Efficiently scaling transformer inference. ArXiv, abs/2211.05102, 2022.
- Compressive transformers for long-range sequence modelling. ArXiv, abs/1911.05507, 2019.
- WinoGrande: An adversarial winograd schema challenge at scale. Commun. ACM, 64(9), 2021.
- Fast transformer decoding: One write-head is all you need. ArXiv, abs/1911.02150, 2019.
- High-throughput generative inference of large language models with a single GPU. In International Conference on Machine Learning, 2023.
- Llama 2: Open foundation and fine-tuned chat models. ArXiv, abs/2307.09288, 2023.
- Efficient methods for natural language processing: A survey. Transactions of the Association for Computational Linguistics, 11, 2022.
- Attention is all you need. In Neural Information Processing Systems, 2017.
- SpAtten: Efficient sparse attention architecture with cascade token and head pruning. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2021.
- HellaSwag: Can a machine really finish your sentence? In Korhonen, A., Traum, D., and Màrquez, L. (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, July 2019. Association for Computational Linguistics.
- Accelerating neural transformer via an average attention network. ArXiv, abs/1805.00631, 2018.
- H$_2$O: Heavy-hitter oracle for efficient generative inference of large language models. ArXiv, abs/2306.14048, 2023.