DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance (2502.16886v2)
Abstract: To alleviate the memory burden during LLM inference, numerous studies have focused on compressing the KV cache by exploiting properties such as attention sparsity. These techniques are often designed with a pre-defined KV budget; however, because the optimal budget varies with input length and task type, a fixed budget can result in inconsistent performance across inputs from diverse domains. To address this limitation, we propose a new KV cache compression objective: always match full-cache performance regardless of the specific input, while pruning the KV cache as aggressively as possible. To achieve this goal, we introduce a novel KV cache compression method dubbed DBudgetKV, which uses an attention-based metric to signal when the remaining KV cache is unlikely to match full-cache performance, at which point pruning halts. Empirical evaluation spanning diverse context lengths, task types, and model sizes shows that our method achieves lossless KV pruning effectively and robustly, exceeding a 25% compression ratio on average. Furthermore, our method is easy to integrate into LLM inference, not only reducing memory usage but also lowering inference time compared to existing methods.
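The abstract does not specify the exact halting metric, but the core idea of pruning low-attention KV entries until an attention-based signal indicates that further removal risks degrading quality can be sketched as below. The function name `dbudget_prune` and the retained-attention-mass threshold `tau` are illustrative assumptions, not the paper's actual criterion.

```python
import torch

def dbudget_prune(keys, values, attn_weights, tau=0.98):
    """Illustrative dynamic-budget KV pruning for one attention head.

    keys, values:  (seq_len, head_dim) cached tensors
    attn_weights:  (num_queries, seq_len) attention probabilities
    tau:           assumed threshold on retained attention mass; the
                   paper's actual halting metric is not given here.
    """
    # Score each cached position by the total attention it receives.
    scores = attn_weights.sum(dim=0)                 # (seq_len,)
    order = torch.argsort(scores, descending=True)   # most important first

    # Keep positions until the retained attention mass reaches tau,
    # i.e. halt pruning once further removal risks losing quality.
    cum_mass = torch.cumsum(scores[order], dim=0) / scores.sum()
    budget = min(int((cum_mass < tau).sum().item()) + 1, scores.numel())

    keep = torch.sort(order[:budget]).values         # restore positional order
    return keys[keep], values[keep]
```

With `tau` close to 1, almost nothing is pruned; lowering it trades fidelity for memory. The paper's contribution is choosing this halting point adaptively per input rather than fixing the KV budget in advance.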