Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference (2402.09398v2)

Published 14 Feb 2024 in cs.LG and cs.AI

Abstract: Many computational factors limit broader deployment of LLMs. In this paper, we focus on a memory bottleneck imposed by the key-value (KV) cache, a computational shortcut that requires storing previous KV pairs during decoding. While existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs to dramatically reduce the memory footprint of the cache, they can have limited success in tasks that require recollecting a majority of previous tokens. To alleviate this issue, we propose LESS, a simple integration of a (nearly free) constant sized cache with eviction-based cache methods, such that all tokens can be queried at later decoding steps. Its ability to retain information throughout time shows merit on a variety of tasks where we demonstrate LESS can help reduce the performance gap from caching everything, sometimes even matching it, all while being efficient. Relevant code can be found at https://github.com/hdong920/LESS.

Authors (6)
  1. Harry Dong (9 papers)
  2. Xinyu Yang (109 papers)
  3. Zhenyu Zhang (250 papers)
  4. Zhangyang Wang (375 papers)
  5. Yuejie Chi (109 papers)
  6. Beidi Chen (61 papers)
Citations (32)

Summary

Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference

The paper "Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference" addresses a pressing computational challenge associated with the deployment of LLMs: the memory constraints posed by the key-value (KV) cache during decoding. This work proposes LESS, a framework that integrates low-rank embeddings with sparse KV cache policies to alleviate KV cache memory bottlenecks while maintaining high performance across various tasks.

The KV cache in transformer-based LLMs, although crucial for avoiding recomputation by storing previously computed keys and values, incurs significant memory consumption that can exceed the footprint of the model itself. For instance, the KV cache of the Llama 2 7B model can demand 64 GB for typical workloads, dwarfing the 26 GB required for the model parameters. This challenge worsens as LLMs scale, posing practical limitations for broad deployment.
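To make the memory arithmetic concrete, here is a minimal sketch of how such a figure arises; the function name and the batch size and sequence length are illustrative assumptions, not values taken from the paper.

```python
# Back-of-the-envelope KV cache size for a Llama-2-7B-like model
# (32 layers, 32 heads, head dimension 128), assuming fp16 entries.
# The workload below (batch 32, sequence length 4096) is an assumed example.

def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Factor of 2 accounts for storing both keys and values.
    return 2 * n_layers * n_heads * head_dim * seq_len * batch * bytes_per_elem

size = kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128, seq_len=4096, batch=32)
print(f"{size / 2**30:.0f} GiB")  # 64 GiB, growing linearly with batch size and sequence length
```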

The proposed LESS methodology incorporates a constant-sized low-rank cache to capture and utilize information discarded by sparse policies, thereby achieving significant memory savings without substantial performance degradation. In particular, it bridges the performance gap between full KV caching and sparse-only approaches, and even improves end-to-end efficiency by reducing latency and increasing throughput.
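As a rough illustration of this idea (a simplified sketch, not the authors' exact formulation or code), the snippet below pairs exact attention over a small retained KV cache with a constant-size, linear-attention-style state that accumulates evicted pairs through a kernel feature map. The class name, the fixed-budget oldest-first eviction, and the ReLU feature map are placeholder assumptions; LESS instead learns small kernel networks and plugs into existing eviction policies.

```python
import numpy as np

def phi(x):
    # Placeholder feature map; LESS learns small per-layer kernels instead.
    return np.maximum(x, 0.0) + 1e-6  # keep features positive so denominators stay > 0

class LessLikeCache:
    """Toy single-head decoder cache: a small exact KV cache plus a
    constant-size low-rank state summarizing evicted pairs (illustrative only)."""

    def __init__(self, d, budget):
        self.budget = budget           # number of KV pairs kept exactly
        self.K = np.zeros((0, d))      # retained keys
        self.V = np.zeros((0, d))      # retained values
        self.H = np.zeros((d, d))      # accumulates outer(phi(k), v) of evicted pairs
        self.z = np.zeros(d)           # accumulates phi(k) of evicted pairs

    def append(self, k, v):
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])
        if len(self.K) > self.budget:
            # Evict the oldest pair (a stand-in for any eviction policy) and
            # fold it into the constant-size state instead of discarding it.
            k_old, v_old = self.K[0], self.V[0]
            self.H += np.outer(phi(k_old), v_old)
            self.z += phi(k_old)
            self.K, self.V = self.K[1:], self.V[1:]

    def attend(self, q):
        # Exact softmax attention over the retained pairs ...
        scores = np.exp(self.K @ q / np.sqrt(len(q)))
        # ... combined with the low-rank approximation of the evicted pairs.
        numerator = scores @ self.V + phi(q) @ self.H
        denominator = scores.sum() + phi(q) @ self.z
        return numerator / denominator
```

At each decoding step one would call append with the new key and value and attend with the current query; because H and z have a fixed size, the extra memory stays constant no matter how many tokens have been evicted, which is the property the paper exploits.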

Strong Results and Insights

LESS demonstrates its strength through empirical evaluations on language modeling, classification, and summarization tasks across various configurations, including different LLMs (e.g., Llama 2, Falcon), different sparse cache policies, and varying sparsity levels. Notably, it consistently outperforms similar-sized baselines in perplexity and ROUGE scores, substantially narrowing the gap to full-cache performance.

For instance, in language modeling on datasets like WikiText and PG-19, LESS delivers perplexity reductions exceeding 20% relative to baseline sparse methods. In summarization tasks, it significantly boosts ROUGE scores compared to sparse cache-only baselines, recovering much of the degradation incurred by restricted caching.

From a computational efficiency standpoint, LESS showcases considerable improvements in both latency and throughput over full KV caching across different batch sizes and sequence lengths, particularly in settings constrained by hardware memory limitations.

Implications and Future Directions

The implications of LESS span both practice and theory. Practically, it provides a scalable solution that offers meaningful reductions in computational resource requirements while maintaining high inference quality, thus enabling broader deployment of LLMs in memory-constrained environments. Theoretically, the integration of low-rank recurrent structures offers a valuable insight into marrying traditional RNN attributes with transformer efficiencies, harnessing the strengths of both architectures to address modern challenges in AI deployment.

Future research could focus on refining the low-rank approximation techniques utilized within LESS, potentially exploring alternative embeddings or kernel functions for further improvements. Additionally, expanding the framework's capabilities to other forms of data compression or exploring its adaptability across different types of token selection strategies beyond those tested could provide enhanced versatility and performance gains.

Overall, LESS represents a significant step toward efficient LLM deployment, offering a compelling approach to mitigating memory bottlenecks while harnessing the full potential of large-scale models.