- The paper introduces SAGE-KV, a novel KV cache eviction method that uses self-attention scores to dynamically identify and retain important tokens during long-context LLM inference.
- Compared to methods like StreamingLLM and Quest, SAGE-KV demonstrates significantly higher memory efficiency (up to 4x) while maintaining or improving accuracy on benchmark tasks.
- This attention-guided approach lets LLMs handle longer contexts more efficiently by shrinking the KV cache without substantial accuracy loss, making long-context inference practical under tight memory budgets.
The paper "LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference" (2503.08879) introduces SAGE-KV (Self-Attention Guided Eviction for KV Cache), a method designed to improve the efficiency of long-context inference in LLMs by addressing the memory and latency bottlenecks associated with large KV caches and attention computation. The core idea is based on the observation that LLMs can identify less important tokens at the head level after the pre-filling stage.
The SAGE-KV method operates as follows:
- After the initial pre-filling, the full KV cache is partitioned into four segments: initial tokens (sink tokens), evicted tokens, recent tokens, and the last token's KV cache.
- Representative KV cache entries are selected from the middle "evicted" segment (the tokens that are candidates for eviction) based on the attention scores between the last token of the input sequence and those tokens. Specifically, the last token's query vector scores each candidate key, and the top-k entries are retained.
- A reduced KV cache is constructed by combining the sink token KV cache, the selected top-k token KV cache, the recent token KV cache, and the last KV cache.
- Output generation proceeds over the reduced KV cache. As each new token is generated, its KV pair is appended to the recent window, evicting the window's oldest entry to keep its size fixed; this continues until generation completes. A code sketch of the whole procedure follows this list.
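The end-to-end cache reduction can be summarized in a short PyTorch sketch. This is a minimal single-head illustration under assumed tensor shapes; the function name `sage_kv_reduce` and its arguments are illustrative, not from the paper:

```python
import torch

def sage_kv_reduce(keys, values, q_last, n_sink, n_recent, k):
    """Build the reduced KV cache after pre-filling (single head).

    keys, values: [seq_len, head_dim] KV cache from pre-filling
    q_last:       [head_dim]          query vector of the last input token
    n_sink:       number of initial (sink) tokens kept verbatim
    n_recent:     size of the recent-token window (excluding the last token)
    k:            number of middle tokens rescued via top-k attention scores
    """
    seq_len = keys.shape[0]
    # Partition boundaries: [sink | middle (evictable) | recent | last].
    mid_start = n_sink
    mid_end = seq_len - n_recent - 1

    # Score each middle token by the dot product between the last token's
    # query and that token's key (an unnormalized attention score).
    scores = keys[mid_start:mid_end] @ q_last            # [n_mid]
    k = min(k, scores.numel())
    topk = torch.topk(scores, k).indices + mid_start

    # Reduced cache = sink tokens + selected top-k + recent tokens + last.
    keep = torch.cat([
        torch.arange(mid_start),
        topk.sort().values,
        torch.arange(mid_end, seq_len),
    ])
    return keys[keep], values[keep]
```

During decoding, each newly generated token's KV pair would be appended to the recent portion of this reduced cache while the oldest recent entry is dropped, so the total token budget stays fixed.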
Token selection uses the attention scores between the last token and the evictable tokens, keeping the top-k highest-scoring tokens. Selection happens independently per attention head: a top-k is taken at the head level, yielding H_q groups of top-k KV entries, where H_q is the number of query heads (see the sketch below).
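A minimal sketch of the per-head selection, assuming keys have already been expanded to one set per query head (under grouped-query attention, a KV head's keys are shared by the query heads in its group); the function name is illustrative:

```python
import torch

def headwise_topk(keys, q_last, k, n_sink, n_recent):
    """Head-level top-k over the evictable middle segment.

    keys:   [H_q, seq_len, head_dim] keys, expanded per query head
    q_last: [H_q, head_dim]          last token's query, one per head
    Returns indices of shape [H_q, k]: one top-k group per query head.
    """
    seq_len = keys.shape[1]
    mid = keys[:, n_sink:seq_len - n_recent - 1]           # evictable region
    scores = torch.einsum("hld,hd->hl", mid, q_last)       # [H_q, n_mid]
    return torch.topk(scores, k, dim=-1).indices + n_sink  # [H_q, k]
```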
SAGE-KV is compared to other KV cache management methods:
- StreamingLLM: SAGE-KV achieves 4x higher memory efficiency with improved accuracy. StreamingLLM uses a static KV cache selection, always retaining sink tokens and recent tokens, whereas SAGE-KV dynamically selects tokens based on attention scores.
- Quest: SAGE-KV achieves 2x higher memory efficiency with better accuracy. Quest uses a block-wise top-k selection strategy, while SAGE-KV selects at the token level; the sketch after this list contrasts the two granularities.
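To make the granularity difference concrete, here is a minimal sketch contrasting the two selection styles over a vector of per-token scores. The block-wise variant is a deliberate simplification: Quest actually estimates block relevance from per-channel min/max key summaries against the query, not from exact per-token scores.

```python
import torch

def token_level_topk(scores, k):
    # SAGE-KV style: keep the k highest-scoring individual tokens.
    return torch.topk(scores, k).indices

def block_level_topk(scores, n_blocks, block_size):
    # Block-wise selection in the spirit of Quest (simplified).
    usable = scores.numel() // block_size * block_size
    block_scores = scores[:usable].view(-1, block_size).amax(dim=-1)
    top_blocks = torch.topk(block_scores, n_blocks).indices
    # A chosen block keeps all of its tokens, relevant or not.
    offsets = torch.arange(block_size)
    return (top_blocks[:, None] * block_size + offsets).flatten()
```

Token-level selection can match a given budget more tightly, since a selected block inevitably carries along low-scoring neighbors of its high-scoring tokens.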
The method's performance is evaluated on the LongBench benchmark, which spans tasks such as question answering, summarization, retrieval, and code completion. Accuracy is measured across these tasks, and memory efficiency is assessed via the token budget (the number of retained KV entries) needed to maintain accuracy.