PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling (2406.02069v3)

Published 4 Jun 2024 in cs.CL and cs.AI

Abstract: In this study, we investigate whether attention-based information flow inside LLMs is aggregated through noticeable patterns for long-context processing. Our observations reveal that LLMs aggregate information through Pyramidal Information Funneling, where attention scatters widely in lower layers, progressively consolidates within specific contexts, and ultimately focuses on critical tokens (a.k.a. massive activations or attention sinks) in higher layers. Motivated by these insights, we developed PyramidKV, a novel and effective KV cache compression method. This approach dynamically adjusts the KV cache size across layers, allocating more cache to lower layers and less to higher ones, diverging from traditional methods that maintain a uniform KV cache size. Our experimental evaluations, using the LongBench benchmark, show that PyramidKV matches the performance of models with a full KV cache while retaining only 12% of the KV cache, thus significantly reducing memory usage. In scenarios emphasizing memory efficiency, where only 0.7% of the KV cache is maintained, PyramidKV surpasses other KV cache compression techniques, achieving up to a 20.5-point absolute accuracy improvement on the TREC dataset. In the Needle-in-a-Haystack experiment, PyramidKV outperforms competing methods in maintaining long-context comprehension in LLMs; notably, retaining just 128 KV cache entries enables the LLAMA-3-70B model to achieve 100% accuracy, matching that of a full KV cache.
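To make the layer-wise allocation idea concrete, here is a minimal Python/PyTorch sketch. It assumes a linear budget schedule that shrinks from the bottom layer to the top (the `ratio` parameter, function names, and the recent-window top-k selection heuristic are illustrative assumptions, not the authors' exact implementation):

```python
# Sketch of PyramidKV-style per-layer KV cache budgeting (assumptions noted above).
import torch

def pyramid_budgets(num_layers: int, total_budget: int,
                    ratio: float = 8.0) -> list[int]:
    """Allocate per-layer KV cache sizes that shrink from bottom to top.

    The bottom layer gets roughly `ratio` times the top layer's budget;
    intermediate layers are interpolated linearly so the sizes sum to
    approximately `total_budget`. The linear schedule is an assumption.
    """
    top = 2 * total_budget / (num_layers * (1 + ratio))
    bottom = ratio * top
    step = (bottom - top) / max(num_layers - 1, 1)
    return [max(1, round(bottom - i * step)) for i in range(num_layers)]

def select_kv(keys, values, attn_scores, budget, window=8):
    """Compress one layer's cache to `budget` entries.

    Keeps the most recent `window` tokens plus the earlier tokens that
    received the highest pooled attention (attn_scores: one score per
    position). keys/values have shape (seq_len, head_dim).
    """
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values
    recent = torch.arange(seq_len - window, seq_len)
    k = max(budget - window, 0)
    past_scores = attn_scores[: seq_len - window]
    topk = torch.topk(past_scores, min(k, past_scores.numel())).indices
    idx = torch.cat([topk.sort().values, recent])
    return keys[idx], values[idx]

# Example: a 32-layer model with 2048 total cached entries gets
# ~114 entries at layer 0 tapering to ~14 at layer 31.
print(pyramid_budgets(32, 2048))
```

The key contrast with uniform-budget methods is in `pyramid_budgets`: rather than giving every layer `total_budget / num_layers` entries, the schedule mirrors the funneling pattern, spending capacity where attention is diffuse and saving it where attention has collapsed onto a few sink tokens.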

Authors (11)
  1. Yichi Zhang (184 papers)
  2. Bofei Gao (15 papers)
  3. Tianyu Liu (177 papers)
  4. Keming Lu (35 papers)
  5. Wayne Xiong (10 papers)
  6. Yue Dong (61 papers)
  7. Baobao Chang (80 papers)
  8. Junjie Hu (111 papers)
  9. Wen Xiao (32 papers)
  10. Zefan Cai (26 papers)
  11. Yuliang Liu (82 papers)
Citations (27)