Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries (2412.08890v1)

Published 12 Dec 2024 in cs.LG

Abstract: We introduce Lexico, a novel KV cache compression method that leverages sparse coding with a universal dictionary. Our key finding is that key-value cache in modern LLMs can be accurately approximated using sparse linear combination from a small, input-agnostic dictionary of ~4k atoms, enabling efficient compression across different input prompts, tasks and models. Using orthogonal matching pursuit for sparse approximation, Lexico achieves flexible compression ratios through direct sparsity control. On GSM8K, across multiple model families (Mistral, Llama 3, Qwen2.5), Lexico maintains 90-95% of the original performance while using only 15-25% of the full KV-cache memory, outperforming both quantization and token eviction methods. Notably, Lexico remains effective in low memory regimes where 2-bit quantization fails, achieving up to 1.7x better compression on LongBench and GSM8K while maintaining high accuracy.

Summary

  • The paper introduces Lexico, a novel method that compresses KV caches in LLMs via sparse coding with a universal dictionary.
  • It demonstrates up to 85% memory reduction while maintaining 90–95% of the original model performance, outperforming 2-bit quantization.
  • The method offers flexible compression ratios and an off-the-shelf solution for deploying large models in resource-constrained environments.

Overview of "Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries"

The paper investigates the problem of Key-Value (KV) cache compression in LLMs, which is essential for improving memory efficiency during deployment without compromising performance. The authors propose a novel method, termed Lexico, that leverages sparse coding over universal dictionaries to achieve extreme KV cache compression.

Proposed Approach: Lexico

The core idea of Lexico is to represent KV cache entries sparsely over a universal dictionary for efficient compression. This dictionary is input-agnostic and contains approximately 4,000 atoms that serve as shared building blocks for representing KV vectors across prompts, tasks, and models. Key concepts are as follows:

  • Sparse Representation and Universal Dictionary: Each entry in the KV cache is approximated as a sparse linear combination of dictionary atoms. Orthogonal Matching Pursuit (OMP) computes this sparse approximation, giving direct control over the sparsity level and, consequently, the compression ratio (see the sketch after this list).
  • Flexibility of Compression Ratios: The method allows for flexible control over compression ratios, enabling trade-offs between memory savings and performance. This aspect is crucial, particularly when working in low-memory regimes where other methods may fail to maintain accuracy.
  • Practical Implementation: The dictionary is pre-trained once and applicable across models, tasks, and input prompts, making Lexico an off-the-shelf solution for compressing KV caches.
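
To make the mechanism concrete, here is a minimal sketch of OMP-based sparse coding of a single cached key/value vector against a universal dictionary. All names and sizes (the 128-dimensional vector, the random 4,096-atom dictionary, the sparsity level s) are illustrative assumptions, not the paper's released artifacts.

```python
import numpy as np

def omp(D: np.ndarray, x: np.ndarray, s: int):
    """Orthogonal Matching Pursuit: approximate x with at most s atoms of D.

    D has unit-norm columns (atoms); returns the chosen atom indices and
    their coefficients, so that D[:, idx] @ coef approximates x.
    """
    residual = x.copy()
    idx: list[int] = []
    coef = np.zeros(0)
    for _ in range(s):
        # Greedily pick the atom most correlated with the current residual.
        corr = D.T @ residual
        idx.append(int(np.argmax(np.abs(corr))))
        # Re-fit all selected coefficients by least squares, then update the residual.
        coef, *_ = np.linalg.lstsq(D[:, idx], x, rcond=None)
        residual = x - D[:, idx] @ coef
    return np.array(idx), coef

# Toy usage: compress one 128-dim cache vector with an assumed 4,096-atom dictionary.
rng = np.random.default_rng(0)
d, n_atoms, s = 128, 4096, 8
D = rng.standard_normal((d, n_atoms))
D /= np.linalg.norm(D, axis=0)   # unit-norm atoms
x = rng.standard_normal(d)
idx, coef = omp(D, x, s)
x_hat = D[:, idx] @ coef
print("relative reconstruction error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```

Only the s indices and s coefficients need to be stored per vector; raising or lowering s trades reconstruction accuracy against memory, which is the direct sparsity control the paper describes.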

Performance and Evaluation

The proposed method is evaluated on different benchmark tasks and compared against existing state-of-the-art approaches, including quantization methods and token eviction strategies. Key findings include:

  • Maintained Performance at High Compression Rates: On challenging tasks such as GSM8K, Lexico retains 90-95% of the original model performance while using only 15-25% of the KV-cache memory. This represents a significant improvement over baseline methods, both in compression rate and accuracy retention.
  • Advantage over 2-bit Quantization: In particularly memory-constrained scenarios, Lexico outperforms 2-bit quantization, achieving up to 1.7× better compression on LongBench and GSM8K while maintaining higher accuracy (a back-of-envelope memory comparison follows below).
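
The following back-of-envelope comparison illustrates why sparse coding can compete with 2-bit quantization in low-memory regimes. The head dimension, coefficient precision, and index width are assumptions for illustration, not the paper's exact storage format.

```python
# Approximate bytes per cached key or value vector under three schemes.
d = 128                          # assumed attention head dimension
fp16_bytes = d * 2               # 256 B: dense fp16 cache entry
quant2_bytes = d * 2 / 8         # 32 B: 2-bit quantization (ignoring scale metadata)

s = 12                           # assumed sparsity level
coef_bytes = s * 2               # fp16 coefficients
index_bytes = s * 2              # int16 atom indices (4,096 atoms fit in 12 bits)
sparse_bytes = coef_bytes + index_bytes

print(f"fp16: {fp16_bytes} B | 2-bit: {quant2_bytes:.0f} B | "
      f"sparse (s={s}): {sparse_bytes} B "
      f"({sparse_bytes / fp16_bytes:.0%} of fp16)")
```

With these assumed settings the sparse representation lands in the 15-25% range of the dense fp16 footprint quoted in the abstract, and the sparsity level can be lowered further, which is where the paper reports 2-bit quantization breaking down.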

Implications and Future Directions

This work opens several avenues for future research and development:

  • Adaptive Dictionary Learning: Although current universal dictionaries are effective, adaptive learning mechanisms could further optimize dictionary construction by incorporating input-specific context during generation.
  • Latency Considerations: While the method achieves impressive compression, the additional computational overhead associated with OMP needs further optimization, particularly for latency-critical applications. Exploring alternative sparse coding algorithms or hardware optimizations could be valuable.
  • Combination with Other Techniques: Integrating Lexico with other memory-saving techniques, such as attention pruning or dynamic context trimming, may offer compounded benefits, enabling larger LLMs to run on limited hardware.

In summary, the paper presents a sophisticated method for KV cache compression in LLMs, achieving substantial memory efficiency gains while maintaining performance. This approach offers practical implications for deploying large models in resource-constrained environments, with potential extensions that could drive further advancements in the field.
