
RazorAttention: Efficient KV Cache Compression Through Retrieval Heads (2407.15891v1)

Published 22 Jul 2024 in cs.LG and cs.CL

Abstract: The memory and computational demands of the Key-Value (KV) cache present significant challenges for deploying long-context LLMs. Previous approaches attempt to mitigate this issue by selectively dropping tokens, which irreversibly erases critical information that might be needed for future queries. In this paper, we propose a novel compression technique for the KV cache that preserves all token information. Our investigation reveals that: i) most attention heads primarily focus on the local context; ii) only a few heads, denoted as retrieval heads, can essentially pay attention to all input tokens. These key observations motivate us to use a separate caching strategy for different attention heads. Therefore, we propose RazorAttention, a training-free KV cache compression algorithm, which maintains a full cache for these crucial retrieval heads and discards the remote tokens in non-retrieval heads. Furthermore, we introduce a novel mechanism involving a "compensation token" to further recover the information in the dropped tokens. Extensive evaluations across a diverse set of LLMs demonstrate that RazorAttention reduces KV cache size by over 70% without noticeable impact on performance. Additionally, RazorAttention is compatible with FlashAttention, rendering it a plug-and-play solution that enhances LLM inference efficiency without overhead or retraining of the original model.
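The abstract describes a head-wise caching strategy: retrieval heads keep their full KV cache, while non-retrieval heads keep only recent tokens plus a compensation token summarizing what was dropped. The sketch below illustrates that idea only; it is not the authors' implementation, and the head indices, window size, and mean-based compensation rule are illustrative assumptions rather than details from the paper.

```python
# Minimal sketch of a RazorAttention-style per-head KV cache compression.
# Assumptions (not from the paper): which heads are retrieval heads, the
# recent-window size, and using a simple mean of dropped K/V rows as the
# compensation token.
import torch

def compress_kv_per_head(keys, values, retrieval_heads, window=128):
    """Compress a per-head KV cache.

    keys, values: tensors of shape [num_heads, seq_len, head_dim]
    retrieval_heads: set of head indices whose cache is kept in full
    window: number of most recent tokens kept for non-retrieval heads
    Returns a list (one entry per head) of (k, v) tensors.
    """
    num_heads, seq_len, _ = keys.shape
    compressed = []
    for h in range(num_heads):
        k, v = keys[h], values[h]
        if h in retrieval_heads or seq_len <= window + 1:
            # Retrieval heads attend to distant tokens, so keep everything.
            compressed.append((k, v))
        else:
            # Non-retrieval heads: drop remote tokens but summarize them with
            # one compensation token (here: the mean of the dropped K/V rows).
            dropped_k, recent_k = k[:-window], k[-window:]
            dropped_v, recent_v = v[:-window], v[-window:]
            comp_k = dropped_k.mean(dim=0, keepdim=True)
            comp_v = dropped_v.mean(dim=0, keepdim=True)
            compressed.append((torch.cat([comp_k, recent_k], dim=0),
                               torch.cat([comp_v, recent_v], dim=0)))
    return compressed

# Example: 32 heads, 4k tokens, with heads 3 and 17 treated as retrieval heads.
if __name__ == "__main__":
    H, T, D = 32, 4096, 128
    k = torch.randn(H, T, D)
    v = torch.randn(H, T, D)
    cache = compress_kv_per_head(k, v, retrieval_heads={3, 17}, window=128)
    kept = sum(pair[0].shape[0] for pair in cache)
    print(f"KV entries kept: {kept} / {H * T} ({kept / (H * T):.1%})")
```

With only a couple of heads kept in full and a short window elsewhere, the retained cache is a small fraction of the original, consistent with the 70%+ reduction reported in the abstract.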

Authors (7)
  1. Hanlin Tang (34 papers)
  2. Yang Lin (34 papers)
  3. Jing Lin (52 papers)
  4. Qingsen Han (1 paper)
  5. Shikuan Hong (1 paper)
  6. Yiwu Yao (11 papers)
  7. Gongyi Wang (5 papers)
Citations (11)