
RazorAttention: Efficient KV Cache Compression Through Retrieval Heads (2407.15891v1)

Published 22 Jul 2024 in cs.LG and cs.CL

Abstract: The memory and computational demands of the Key-Value (KV) cache present significant challenges for deploying long-context LLMs. Previous approaches attempt to mitigate this issue by selectively dropping tokens, which irreversibly erases critical information that might be needed for future queries. In this paper, we propose a novel compression technique for the KV cache that preserves all token information. Our investigation reveals that: i) most attention heads primarily focus on the local context; ii) only a few heads, denoted as retrieval heads, can essentially pay attention to all input tokens. These key observations motivate us to use a separate caching strategy for different attention heads. Therefore, we propose RazorAttention, a training-free KV cache compression algorithm, which maintains a full cache for these crucial retrieval heads and discards the remote tokens in non-retrieval heads. Furthermore, we introduce a novel mechanism involving a "compensation token" to further recover the information in the dropped tokens. Extensive evaluations across a diverse set of LLMs demonstrate that RazorAttention reduces KV cache size by over 70% without noticeable impact on performance. Additionally, RazorAttention is compatible with FlashAttention, rendering it a plug-and-play solution that enhances LLM inference efficiency without overhead or retraining of the original model.
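The abstract describes a head-wise caching strategy: retrieval heads keep their full KV cache, while non-retrieval heads keep only recent tokens plus a compensation token summarizing what was dropped. The sketch below illustrates that idea only; it is not the authors' implementation, and the head indices, window size, and mean-based compensation rule are illustrative assumptions rather than details from the paper.

```python
# Minimal sketch of a RazorAttention-style per-head KV cache compression.
# Assumptions (not from the paper): which heads are retrieval heads, the
# recent-window size, and using a simple mean of dropped K/V rows as the
# compensation token.
import torch

def compress_kv_per_head(keys, values, retrieval_heads, window=128):
    """Compress a per-head KV cache.

    keys, values: tensors of shape [num_heads, seq_len, head_dim]
    retrieval_heads: set of head indices whose cache is kept in full
    window: number of most recent tokens kept for non-retrieval heads
    Returns a list (one entry per head) of (k, v) tensors.
    """
    num_heads, seq_len, _ = keys.shape
    compressed = []
    for h in range(num_heads):
        k, v = keys[h], values[h]
        if h in retrieval_heads or seq_len <= window + 1:
            # Retrieval heads attend to distant tokens, so keep everything.
            compressed.append((k, v))
        else:
            # Non-retrieval heads: drop remote tokens but summarize them with
            # one compensation token (here: the mean of the dropped K/V rows).
            dropped_k, recent_k = k[:-window], k[-window:]
            dropped_v, recent_v = v[:-window], v[-window:]
            comp_k = dropped_k.mean(dim=0, keepdim=True)
            comp_v = dropped_v.mean(dim=0, keepdim=True)
            compressed.append((torch.cat([comp_k, recent_k], dim=0),
                               torch.cat([comp_v, recent_v], dim=0)))
    return compressed

# Example: 32 heads, 4k tokens, with heads 3 and 17 treated as retrieval heads.
if __name__ == "__main__":
    H, T, D = 32, 4096, 128
    k = torch.randn(H, T, D)
    v = torch.randn(H, T, D)
    cache = compress_kv_per_head(k, v, retrieval_heads={3, 17}, window=128)
    kept = sum(pair[0].shape[0] for pair in cache)
    print(f"KV entries kept: {kept} / {H * T} ({kept / (H * T):.1%})")
```

With only a couple of heads kept in full and a short window elsewhere, the retained cache is a small fraction of the original, consistent with the 70%+ reduction reported in the abstract.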

Authors (7)
  1. Hanlin Tang (34 papers)
  2. Yang Lin (34 papers)
  3. Jing Lin (52 papers)
  4. Qingsen Han (1 paper)
  5. Shikuan Hong (1 paper)
  6. Yiwu Yao (11 papers)
  7. Gongyi Wang (5 papers)
Citations (11)