
KVMerger: Adaptive KV Cache Compression

Updated 17 March 2026
  • KVMerger is a similarity-driven key–value cache compression algorithm that merges contiguous tokens with high cosine similarity to reduce memory usage during LLM inference.
  • It leverages Gaussian kernel weighted merging and empirical key similarity phenomena to fuse clusters while preserving crucial contextual information without retraining the model.
  • Empirical evaluations show that KVMerger maintains high model accuracy and throughput, outperforming standard eviction and quantization methods at significant compression rates.

KVMerger is a family of adaptive key–value (KV) cache compression algorithms designed for LLM inference under constrained memory budgets. It targets the linear growth of the KV cache with sequence length during autoregressive decoding, enabling longer context windows, higher throughput, and lower system memory requirements with little loss in model accuracy. KVMerger exploits the empirical (and in some cases theoretically supported) observation that key vectors produced by LLM attention modules exhibit strong intra-sequence similarity, which allows them to be merged using similarity-aware, information-preserving strategies. By favoring information fusion over simple eviction, KVMerger and its derivatives represent a significant advancement over prior pure-dropping or quantization-based approaches in long-context LLM serving.

1. Motivation and Context

The KV cache stores the key and value states for every token at each attention layer and head during incremental decoding. This memory grows linearly with sequence length $T$, number of layers $L$, and number of heads $H$, causing prohibitive GPU memory requirements for large-scale inference (e.g., over 1.2 TB for GPT-3 with $T=4096$ and a batch size of 64) (Wang et al., 2024, Liu et al., 13 Mar 2025). Standard solutions such as quantization and cache eviction reduce memory, but eviction irreversibly discards context, harming quality on tasks that require access to long-range dependencies. Merging-based strategies, such as KVMerger, address this by fusing similar KV entries many-to-one, retaining more contextual information for future inference.
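
The 1.2 TB figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes the published GPT-3 175B shape (96 layers, 96 heads, head dimension 128) and 16-bit storage; these values are not stated in the text above and are assumptions of this example, as is the helper name `kv_cache_bytes`.

```python
def kv_cache_bytes(seq_len, batch, n_layers, n_heads, head_dim, bytes_per_elem=2):
    """KV cache size: one key and one value vector per token, layer, and head."""
    per_token = 2 * n_layers * n_heads * head_dim * bytes_per_elem  # 2 = key + value
    return per_token * seq_len * batch

# Assumed GPT-3 175B shape (96 layers, 96 heads, head_dim 128), fp16 storage.
total = kv_cache_bytes(seq_len=4096, batch=64, n_layers=96, n_heads=96, head_dim=128)
print(f"{total / 1e12:.2f} TB")  # 1.24 TB
```

Every factor enters multiplicatively, so halving the number of retained tokens halves the footprint, which is the lever that cache compression pulls.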

Two key empirical phenomena motivate KVMerger (Wang et al., 2024):

  • Token-level key similarity: Adjacent key vectors frequently have cosine similarity $> 0.9$ across heads/layers (notably under rotary positional encoding), yielding strong “locality” and justifying their fusion.
  • Model-level KV sparsity: When merged according to a global similarity threshold, the resulting compression ratio is stable across datasets, tasks, and even layers, indicating that a model-specific threshold can reliably target desired compression rates.

2. Merging Set Identification and Gaussian-Kernel Weighted Merging

KVMerger merges “clusters” (merging sets) of consecutive tokens whose key similarity exceeds a tunable threshold $\epsilon$ (default 0.75), as measured by cosine similarity

$$\delta(k_i, k_j) = \frac{k_i \cdot k_j}{\|k_i\|\,\|k_j\|}.$$

Merging set identification is formulated as greedy agglomerative hierarchical clustering with the locality constraint that clusters are contiguous (Wang et al., 2024).
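
A minimal sketch of contiguous merging-set identification follows. It assumes a simplified rule in which a set keeps growing while the cosine similarity between adjacent keys stays above the threshold; the exact procedure in (Wang et al., 2024) may differ, and the function names are hypothetical. Keys are assumed to be nonzero vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def find_merging_sets(keys, eps=0.75):
    """Greedily group consecutive keys while adjacent similarity exceeds eps.

    Returns (start, end) index pairs, end exclusive, for sets of two or more tokens.
    """
    sets = []
    start = 0
    for i in range(1, len(keys) + 1):
        # Close the current run at the end of the sequence or when similarity drops.
        if i == len(keys) or cosine(keys[i - 1], keys[i]) <= eps:
            if i - start >= 2:
                sets.append((start, i))
            start = i
    return sets

keys = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [0.1, 1.0], [1.0, -1.0]]
print(find_merging_sets(keys))  # [(0, 3), (3, 5)]
```

The single left-to-right pass visits each adjacent pair once, so the cost is linear in the number of cached tokens.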

For each merging set $\mathcal{S}$, KVMerger selects a pivotal token $p$ with maximal aggregated attention and computes Gaussian kernel affinities

$$g_{pi} = \exp\left(-\frac{\|k_p - k_i\|^2}{2\sigma^2}\right)$$

(normalized to weights $w_i$). The merged key and value at position $p$ are then

$$k_p^* = \sum_{i \in \mathcal{S}} w_i k_i, \qquad v_p^* = \sum_{i \in \mathcal{S}} w_i v_i$$

with bandwidth $\sigma$ set adaptively per group. This kernel-based fusion preserves local information, yielding low compression-induced degradation.
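
The fusion step can be sketched for a single merging set as below. The code assumes the affinities are normalized to sum to one and that the pivot index and bandwidth are supplied by the caller; both are assumptions of this illustration, and `gaussian_merge` is a hypothetical name rather than the authors' implementation.

```python
import math

def gaussian_merge(keys, values, pivot, sigma):
    """Fuse one merging set into a single (key, value) pair at the pivot token."""
    k_p = keys[pivot]
    # g_pi = exp(-||k_p - k_i||^2 / (2 sigma^2)) for every token i in the set.
    g = [math.exp(-sum((a - b) ** 2 for a, b in zip(k_p, k)) / (2 * sigma ** 2))
         for k in keys]
    total = sum(g)
    w = [gi / total for gi in g]  # assumed normalization: weights sum to 1
    n = len(keys)
    merged_k = [sum(w[i] * keys[i][d] for i in range(n)) for d in range(len(keys[0]))]
    merged_v = [sum(w[i] * values[i][d] for i in range(n)) for d in range(len(values[0]))]
    return merged_k, merged_v, w

keys = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3]]
values = [[1.0], [2.0], [3.0]]
k_star, v_star, w = gaussian_merge(keys, values, pivot=0, sigma=5.0)
print(w)  # weights sum to 1; the pivot (index 0) receives the largest weight
```

A small bandwidth concentrates the weight on the pivot, while a large one approaches a plain average of the set.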

Exclusion heuristics are employed: heavy-hitter tokens (top attention per group) and a window of the most recent tokens are left unmerged to prevent critical information loss. Merging is applied to the remaining slots, ensuring compliance with a target budget (e.g., 35% or 50% of original size) (Wang et al., 2024).
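
As an illustration of the exclusion step, the sketch below marks which positions stay unmerged. It assumes one aggregated attention score per cached token and fixed counts for the heavy-hitter and recent sets; the function name and parameters are hypothetical.

```python
def protected_positions(attn_scores, n_heavy, n_recent):
    """Indices left unmerged: the n_recent latest tokens plus the n_heavy
    highest-attention tokens among the older ones."""
    T = len(attn_scores)
    recent = set(range(max(0, T - n_recent), T))
    older = [i for i in range(T) if i not in recent]
    heavy = sorted(older, key=lambda i: attn_scores[i], reverse=True)[:n_heavy]
    return sorted(recent | set(heavy))

scores = [0.30, 0.02, 0.01, 0.25, 0.03, 0.04, 0.20, 0.15]
print(protected_positions(scores, n_heavy=2, n_recent=2))  # [0, 3, 6, 7]
```

Positions not returned here are the candidates passed on to merging-set identification.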

3. Memory, Computational Complexity, and Implementation

KVMerger achieves substantial memory savings. For a typical 50% budget, about 17% of slots are preserved for most-recent tokens, 12% for attention heavy-hitters, and the remaining ~71% are merged into 35% of slots, for a net ~50% retention. At 35% budget, even higher merging rates are possible (Wang et al., 2024).

The merging set identification and fusion are $\mathcal{O}(T)$ per layer/head, negligible compared to self-attention's $\mathcal{O}(T^2 d)$ complexity. The overall system overhead is modest relative to baseline attention modules.

Integration is straightforward: after each KV append, compression is triggered when the cache exceeds budget. Merging is applied per layer/head except potentially in the first and last layers, or under short-context regimes where key similarity structure is absent. KVMerger operates on the raw cache, requiring no retraining or alteration to model weights.
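
The append-then-compress trigger can be sketched as follows. The compressor used here is a deliberately simple placeholder that averages the two oldest entries; it stands in for the similarity-based merging described above and is not KVMerger itself. All names are hypothetical.

```python
def mean_merge_oldest(cache, budget):
    """Placeholder compressor: average the two oldest entries until within budget."""
    cache = list(cache)
    while len(cache) > budget:
        merged = [(a + b) / 2 for a, b in zip(cache[0], cache[1])]
        cache[:2] = [merged]
    return cache

def append_kv(cache, entry, budget, compress=mean_merge_oldest):
    """Append one token's entry, then compress only if the budget is exceeded."""
    cache = cache + [entry]
    return compress(cache, budget) if len(cache) > budget else cache

cache = []
for t in range(6):
    cache = append_kv(cache, [float(t)], budget=4)
print(len(cache))  # 4
```

Because compression runs only when the budget is exceeded, short sequences pass through untouched, which matches the observation that merging has little to offer in short-context regimes.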

4. Empirical Performance and Ablation

Across Llama2-7B-chat, Llama2-13B-chat, and multiple long-context benchmarks (LongBench, ZeroScrolls, RAG retrieval), KVMerger consistently outperforms competing cache compression schemes such as H2O (eviction-based) and CaM (value-state merging plus eviction). For instance (Wang et al., 2024):

| Method | LongBench avg (7B) | ZeroScrolls avg | RAG hit rate | Budget |
| --- | --- | --- | --- | --- |
| Full KV | 35.22 | 15.30 | Highest | 100% |
| H2O | 34.00 | 14.36 | Lower | 50% |
| CaM | 33.67 | 14.24 | Lower | 50% |
| KVMerger | 35.02 | 15.21 | Highest | 50% |
| KVMerger | 34.50 | 14.84 | Highest | 35% |

Ablations show best performance for per-group $\sigma \approx 5$. Pivotal-token selection via top aggregated attention outperforms random choice. Aggressive thresholds ($\epsilon < 0.6$) can lead to over-compression and degraded accuracy, especially in early/late layers.

A similar “KVMerger”–style mechanism is embedded within ZeroMerge (ZSMerge) (Liu et al., 13 Mar 2025), which generalizes the approach with per-head importance allocation and residual merging, supporting parameter-free, zero-shot application across models and up to 20:1 compression (5% cache).

5. Theoretical and Algorithmic Limitations

While KVMerger significantly reduces memory and supports robust generation, it is subject to two notable limitations:

  • In merging, convex combinations of keys and values may induce “attention sag,” i.e., the attention mass of a merged key is strictly less than the sum of the originals, so the attention output after merging, $o'_t$, deviates from the original output $o_t$: $\|o'_t - o_t\| > 0$ (Tian et al., 14 Apr 2025).
  • Purely similarity-driven clustering can, under low similarity thresholds, group semantically disparate tokens, introducing noise.
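
The attention-sag effect in the first point can be seen in a toy softmax calculation. The scores below are made-up numbers chosen for illustration: two cached tokens with identical keys (score 1.0 each) and one unrelated token (score 0.0).

```python
import math

def attention_weights(scores):
    """Numerically stable softmax over query-key scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Before merging: tokens 0 and 1 have identical keys, token 2 is unrelated.
before = attention_weights([1.0, 1.0, 0.0])
# After merging tokens 0 and 1: a convex combination of identical keys keeps
# the score 1.0, but the pair now occupies a single slot in the softmax.
after = attention_weights([1.0, 0.0])

mass_before = before[0] + before[1]
mass_after = after[0]
print(round(mass_before, 3), round(mass_after, 3))  # 0.845 0.731
```

The merged slot receives less attention than the two original slots did together, so the unrelated token's value is over-weighted in the output.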

Later work, such as KeepKV (Tian et al., 14 Apr 2025), explicitly quantifies and eliminates this output drift by introducing vote-tracking and a zero-perturbation ZIP-Merging scheme, pushing per-step output deviation to zero and enabling provable guarantees on downstream accuracy.

6. Relationship to Alternative Merging and Compression Approaches

KVMerger sits within an ecosystem of KV cache optimizers:

  • Pure Eviction (e.g., H2O, StreamingLLM): Selectively drops low-importance entries based on attention or positional heuristics, incurring irreversible information loss.
  • Parameterization and Learned Fusion (e.g., LESS): Learns data-driven fusion networks with additional tuning requirements.
  • Asymmetric/Hessian-based Merging (e.g., AsymKV, KVSlimmer): Employs rigorous second-order analysis to capture key–value asymmetry and derive mathematically principled fusion rules (Liu et al., 1 Mar 2026).
  • Zero-Shot Residual Merging (ZeroMerge/ZSMerge): Incorporates head-granular importance, momentum-based residual fusion, and softmax compensation, wholly eliminating the need for model retraining (Liu et al., 13 Mar 2025).
  • Zero-perturbation Merging (KeepKV): Introduces electoral votes and exact preservation of attention aggregates, theoretically eliminating stepwise output drift (Tian et al., 14 Apr 2025).

KVMerger occupies the intermediate space, balancing algorithmic simplicity, model-agnostic deployment, and high empirical efficacy, albeit with potential for small perturbations in output.

7. Practical Integration and Tuning

Optimal deployment of KVMerger involves:

  • Setting the similarity threshold $\epsilon \approx 0.75$ to achieve desired budget-compression trade-offs.
  • Excluding critical tokens (recent and attention-heavy) from merging.
  • Computing the Gaussian kernel bandwidth $\sigma$ per merging set (typically close to 5).
  • Disabling or tightening merging in early and late layers, particularly for long-context tasks.
  • For extremely large models, grouping keys block-wise or using approximate similarity search for scalability.
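
These settings could be gathered into a small configuration object, sketched below. The field names, the `skip_edge_layers` option, and the helper `merging_enabled` are hypothetical; only the default values for the threshold, bandwidth, and budget come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MergeConfig:
    epsilon: float = 0.75      # cosine-similarity threshold for merging sets
    sigma: float = 5.0         # starting Gaussian bandwidth; the text sets it per set
    budget: float = 0.5        # target fraction of the original cache to keep
    skip_edge_layers: int = 1  # hypothetical: layers at each end left unmerged

def merging_enabled(cfg, layer, n_layers):
    """True if this layer is merged under the edge-layer exclusion."""
    return cfg.skip_edge_layers <= layer < n_layers - cfg.skip_edge_layers

cfg = MergeConfig()
print([l for l in range(6) if merging_enabled(cfg, l, 6)])  # [1, 2, 3, 4]
```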

Limitations arise with negligible similarity structure (short contexts, first/last layers), where merging has limited utility or may degrade model performance. Hyperparameters are typically robust to model and benchmark variation, but may require further calibration for best results in extreme compression regimes.


KVMerger has established itself as a baseline for practical, model-agnostic, similarity-driven KV cache compression for LLMs. Its core principles have been subsumed and extended in subsequent algorithmic innovations, but its fusion of simplicity, empirical effectiveness, and broad applicability remains influential in the ongoing evolution of memory-efficient long-context language modeling (Wang et al., 2024, Liu et al., 13 Mar 2025, Tian et al., 14 Apr 2025, Liu et al., 1 Mar 2026).
