LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference (2503.08879v1)

Published 11 Mar 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Efficient long-context inference is critical as LLMs adopt context windows ranging from 128K to 1M tokens. However, the growing key-value (KV) cache and the high computational complexity of attention create significant bottlenecks in memory usage and latency. In this paper, we find that attention in diverse long-context tasks exhibits sparsity, and LLMs implicitly "know" which tokens can be dropped or evicted at the head level after the pre-filling stage. Based on this insight, we propose Self-Attention Guided Eviction (SAGE-KV), a simple and effective KV cache eviction method for long-context inference. After prefilling, our method performs a one-time top-k selection at both the token and head levels to compress the KV cache, enabling efficient inference with the reduced cache. Evaluations on LongBench and three long-context LLMs (Llama3.1-8B-Instruct-128k, Llama3-8B-Prolong-512k-Instruct, and Qwen2.5-7B-Instruct-128k) show that SAGE-KV maintains accuracy comparable to full attention while significantly improving efficiency. Specifically, SAGE-KV achieves 4x higher memory efficiency with improved accuracy over the static KV cache selection method StreamLLM, and 2x higher memory efficiency with better accuracy than the dynamic KV cache selection method Quest.

Summary

  • The paper introduces SAGE-KV, a KV cache eviction method that uses self-attention scores after pre-filling to perform a one-time selection of the important tokens to retain during long-context LLM inference.
  • Compared to methods like StreamLLM and Quest, SAGE-KV demonstrates significantly higher memory efficiency (up to 4x) while maintaining or improving accuracy on benchmark tasks.
  • This attention-guided approach reduces KV cache size without substantial accuracy degradation, letting LLMs handle longer contexts with lower memory and latency costs in practice.

The paper "LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference" (2503.08879) introduces SAGE-KV (Self-Attention Guided Eviction for KV Cache), a method designed to improve the efficiency of long-context inference in LLMs by addressing the memory and latency bottlenecks associated with large KV caches and attention computation. The core idea is based on the observation that LLMs can identify less important tokens at the head level after the pre-filling stage.

The SAGE-KV method operates as follows:

  1. After the initial pre-filling, the full KV cache is partitioned into four segments: initial tokens (sink tokens), evicted tokens, recent tokens, and the last token's KV cache.
  2. Representative KV cache entries are selected from the "evicted tokens" based on the attention scores between the last token of the input sequence and the evicted tokens. Specifically, the query vector of the last token is used to select the top-k KV cache entries.
  3. A reduced KV cache is constructed by combining the sink token KV cache, the selected top-k token KV cache, the recent token KV cache, and the last token's KV cache.
  4. Output generation occurs using the reduced KV cache. As each new token is generated, its KV pair is added to the recent window, evicting the oldest entry in the recent window to maintain a fixed size. This process continues until generation is complete.
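
A minimal PyTorch-style sketch of this one-time compression step for a single layer is given below. The tensor layout, the budget parameters (n_sink, n_recent, top_k), and the function name are illustrative assumptions, not the paper's implementation.

```python
import torch

def compress_kv_cache(keys, values, last_query, n_sink=4, n_recent=64, top_k=256):
    """One-time KV cache compression after pre-filling (illustrative sketch).

    keys, values: (num_heads, seq_len, head_dim) KV cache of one layer.
    last_query:   (num_heads, head_dim) query vector of the last prompt token.
    """
    num_heads, seq_len, head_dim = keys.shape

    # Partition the sequence: [sink | evictable middle | recent | last token].
    mid_start, mid_end = n_sink, seq_len - n_recent - 1

    # Attention scores of the last token against the evictable middle region.
    mid_keys = keys[:, mid_start:mid_end]                                      # (H, M, d)
    scores = torch.einsum("hd,hmd->hm", last_query, mid_keys) / head_dim ** 0.5

    # Per-head top-k token selection inside the middle region.
    k = min(top_k, scores.shape[-1])
    top_idx = scores.topk(k, dim=-1).indices.sort(dim=-1).values + mid_start   # (H, k)

    # Stitch the reduced cache together: sink + selected + recent + last.
    gather = top_idx.unsqueeze(-1).expand(-1, -1, head_dim)
    new_keys = torch.cat([keys[:, :n_sink], keys.gather(1, gather), keys[:, mid_end:]], dim=1)
    new_values = torch.cat([values[:, :n_sink], values.gather(1, gather), values[:, mid_end:]], dim=1)
    return new_keys, new_values
```

During decoding, newly generated KV pairs would then be appended to the recent window of this reduced cache, evicting its oldest entry to keep the window size fixed.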

Token selection is based on the attention scores between the last token and the evicted tokens, with the top-k tokens having the highest attention scores being selected. Head selection involves a top-k selection at the head level, resulting in H_q groups of top-k KV caches, where H_q is the number of query heads.
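
Under grouped-query attention, several query heads share one KV head, which is how head-level selection can yield H_q token groups. The sketch below illustrates one plausible way to score per query head; the GQA handling and the names are our assumptions, not necessarily the paper's exact procedure.

```python
import torch

def per_query_head_topk(last_queries, keys, top_k=256):
    """Score evictable keys per query head and keep a separate top-k per head.

    last_queries: (H_q, head_dim)      last-token query vectors, one per query head.
    keys:         (H_kv, M, head_dim)  evictable keys of the (fewer) KV heads.
    Returns indices of shape (H_q, k): one retained-token group per query head.
    """
    h_q, head_dim = last_queries.shape
    h_kv = keys.shape[0]
    group = h_q // h_kv  # query heads sharing each KV head (GQA group size)

    # Repeat each KV head so every query head scores against its shared keys.
    keys_per_q = keys.repeat_interleave(group, dim=0)                          # (H_q, M, d)
    scores = torch.einsum("hd,hmd->hm", last_queries, keys_per_q) / head_dim ** 0.5

    k = min(top_k, scores.shape[-1])
    return scores.topk(k, dim=-1).indices.sort(dim=-1).values                  # (H_q, k)
```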

SAGE-KV is compared to other KV cache management methods:

  • StreamLLM: SAGE-KV achieves 4x higher memory efficiency with improved accuracy compared to StreamLLM. StreamLLM employs a static KV cache selection, retaining sink tokens and recent tokens, whereas SAGE-KV dynamically selects tokens based on attention scores.
  • Quest: SAGE-KV achieves 2x higher memory efficiency with better accuracy than Quest. Quest uses a block-wise top-k selection strategy, while SAGE-KV employs token-level selection.
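
To make these contrasts concrete, here is a toy sketch of the three selection policies over one head's attention scores; the block scoring is heavily simplified and not drawn from either paper's implementation.

```python
import torch

def static_selection(seq_len, n_sink=4, n_recent=64):
    """StreamLLM-style: always keep sink + recent positions, independent of content."""
    return list(range(n_sink)) + list(range(seq_len - n_recent, seq_len))

def blockwise_selection(scores, block_size=16, n_blocks=8):
    """Quest-style (simplified): rank fixed-size blocks and keep whole blocks."""
    usable = scores[: scores.numel() // block_size * block_size]
    block_scores = usable.view(-1, block_size).max(dim=-1).values
    top_blocks = block_scores.topk(min(n_blocks, block_scores.numel())).indices
    return [b.item() * block_size + i for b in top_blocks for i in range(block_size)]

def tokenwise_selection(scores, top_k=128):
    """SAGE-KV-style: keep the individual tokens with the highest attention scores."""
    return scores.topk(min(top_k, scores.numel())).indices.tolist()
```

The practical difference is granularity: static selection ignores the input, block-wise selection can only keep or drop whole blocks, and token-level selection can retain exactly the scattered tokens the last-token query attends to, which may help explain the smaller token budgets reported above.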

The method's performance is evaluated using the LongBench benchmark, which includes tasks like question answering, summarization, retrieval, and code analysis. Accuracy is measured across these tasks, and memory efficiency is assessed by measuring the token budget required to maintain performance.