KVMerger: Adaptive KV Cache Compression
- KVMerger is a similarity-driven key–value cache compression algorithm that merges contiguous tokens with high cosine similarity to reduce memory usage during LLM inference.
- It leverages Gaussian kernel weighted merging and empirical key similarity phenomena to fuse clusters while preserving crucial contextual information without retraining the model.
- Empirical evaluations show that KVMerger maintains high model accuracy and throughput, outperforming standard eviction and quantization methods at significant compression rates.
KVMerger is a family of adaptive key–value (KV) cache compression algorithms designed for LLM inference under constrained memory budgets. It targets the linear growth of the KV cache with sequence length during autoregressive decoding, enabling longer context windows, higher throughput, and lower system memory requirements with little loss in model accuracy. KVMerger exploits the empirical observation that key vectors produced by LLM attention modules exhibit strong intra-sequence similarity, which allows them to be merged using similarity-aware, information-preserving strategies. By favoring information fusion over simple eviction, KVMerger and its derivatives represent a significant advancement over prior pure-dropping or quantization-based approaches in long-context LLM serving.
1. Motivation and Context
The KV cache stores hidden states (keys and values) for every token at each attention layer and head during incremental decoding. This memory grows linearly with sequence length, number of layers, and number of heads, causing prohibitive GPU memory requirements for large-scale inference (e.g., over 1.2 TB for GPT-3 in the configuration considered by the cited works) (Wang et al., 2024, Liu et al., 13 Mar 2025). Standard solutions like quantization and cache eviction reduce memory, but eviction irreversibly discards context, harming quality on tasks that require access to long-range dependencies. Merging-based strategies, such as KVMerger, address this by fusing similar KV entries many-to-one, retaining more contextual information for future inference.
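The linear scaling described above can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the function name and parameters are hypothetical, and the shape used (32 layers, 32 heads, head dimension 128, fp16) is that of a Llama-2-7B-class model, not the GPT-3 setting cited in the text.

```python
def kv_cache_bytes(seq_len, n_layers, n_heads, head_dim, batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for `seq_len` tokens."""
    # Factor 2: one key vector and one value vector per token, layer, and head.
    return 2 * batch * seq_len * n_layers * n_heads * head_dim * bytes_per_elem


# Llama-2-7B-like shape in fp16 (2 bytes per element), single sequence.
full = kv_cache_bytes(seq_len=4096, n_layers=32, n_heads=32, head_dim=128)
print(f"{full / 2**30:.1f} GiB at 4096 tokens")

# Doubling the sequence length doubles the cache: growth is linear, not exponential.
double = kv_cache_bytes(seq_len=8192, n_layers=32, n_heads=32, head_dim=128)
print(f"{double / 2**30:.1f} GiB at 8192 tokens")
```

Under these assumptions, a 50% cache budget halves the figure for a given sequence length, which is the quantity the merging strategies discussed here aim to reduce.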
Two key empirical phenomena motivate KVMerger (Wang et al., 2024):
- Token-level key similarity: Adjacent key vectors frequently have high cosine similarity across heads and layers (notably under rotary positional encoding), yielding strong “locality” and justifying their fusion.
- Model-level KV sparsity: When merged according to a global similarity threshold, the resulting compression ratio is stable across datasets, tasks, and even layers, indicating that a model-specific threshold can reliably target desired compression rates.
2. Merging Set Identification and Gaussian-Kernel Weighted Merging
KVMerger merges “clusters” (merging sets) of consecutive tokens whose key similarity exceeds a tunable threshold (default 0.75), as measured by cosine similarity

$$\mathrm{sim}(k_i, k_j) = \frac{k_i^\top k_j}{\lVert k_i \rVert \, \lVert k_j \rVert}.$$

Merging set identification is formulated as greedy agglomerative hierarchical clustering with the locality constraint that clusters are contiguous (Wang et al., 2024).
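A minimal sketch of contiguous merging-set identification follows. It is a simplification, not the procedure of Wang et al.: it chains tokens using only the cosine similarity of adjacent key pairs, and the function name `find_merging_sets` is hypothetical. Keys are assumed to be nonzero vectors for a single layer and head.

```python
import numpy as np


def find_merging_sets(keys, threshold=0.75):
    """Group consecutive tokens whose adjacent keys have cosine similarity >= threshold.

    keys: array of shape (seq_len, head_dim) for one layer/head.
    Returns a list of (start, end) index pairs (end exclusive), each covering >= 2 tokens.
    """
    unit = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    adjacent_sim = np.sum(unit[:-1] * unit[1:], axis=1)  # sim(k_i, k_{i+1})

    sets, start = [], 0
    for i, sim in enumerate(adjacent_sim):
        if sim < threshold:
            # The similarity chain breaks between token i and token i + 1.
            if i + 1 - start >= 2:
                sets.append((start, i + 1))
            start = i + 1
    if len(keys) - start >= 2:
        sets.append((start, len(keys)))
    return sets


keys = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0], [0.0, 1.0], [-1.0, 0.0]])
merging_sets = find_merging_sets(keys, threshold=0.75)
print(merging_sets)
```

Because the locality constraint restricts clusters to contiguous runs, a single linear pass suffices in this simplified form; tokens that end up alone are left as singletons and are not reported.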
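For each merging set $S$, KVMerger selects a pivotal token $p \in S$ with maximal aggregated attention and computes Gaussian kernel affinities

$$g_{pi} = \exp\!\left(-\frac{\lVert k_p - k_i \rVert^2}{2\sigma^2}\right), \quad i \in S,$$

normalized to weights $w_i = g_{pi} / \sum_{j \in S} g_{pj}$. The merged key and value at position $p$ are then

$$k_M = \sum_{i \in S} w_i k_i, \qquad v_M = \sum_{i \in S} w_i v_i,$$

with bandwidth $\sigma$ set adaptively per group. This kernel-based fusion preserves local information, yielding low compression-induced degradation.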
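The Gaussian-kernel weighted fusion of one merging set can be sketched as below. This assumes the standard Gaussian kernel on the distance between each key and the pivotal key; the function name `gaussian_merge` is hypothetical, and the bandwidth `sigma` is passed in explicitly because the adaptive rule for choosing it is not specified here.

```python
import numpy as np


def gaussian_merge(keys, values, pivot, sigma):
    """Fuse one merging set into a single (key, value) pair.

    keys, values: arrays of shape (n, head_dim) for the tokens in the set.
    pivot: index (within the set) of the pivotal token.
    sigma: Gaussian bandwidth (assumed given; its adaptive choice is not modeled).
    """
    sq_dist = np.sum((keys - keys[pivot]) ** 2, axis=1)
    affinity = np.exp(-sq_dist / (2.0 * sigma**2))  # equals 1 at the pivot itself
    weights = affinity / affinity.sum()             # convex weights summing to 1
    return weights @ keys, weights @ values, weights


keys = np.array([[1.0, 0.0], [1.0, 0.2], [3.0, 0.0]])
values = np.array([[0.5, 0.5], [0.4, 0.6], [9.0, 9.0]])
agg_attention = np.array([0.7, 0.2, 0.1])      # hypothetical aggregated attention scores
pivot = int(np.argmax(agg_attention))          # pivotal token = maximal aggregated attention

merged_key, merged_value, weights = gaussian_merge(keys, values, pivot, sigma=1.0)
print(weights, merged_key, merged_value)
```

Tokens whose keys lie far from the pivot (the third token in this toy input) receive small weights, so the merged entry stays close to the pivotal token.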
Exclusion heuristics are employed: heavy-hitter tokens (top attention per group) and a window of the most recent tokens are left unmerged to prevent critical information loss. Merging is applied to the remaining slots so that the cache meets a target budget (e.g., 35% or 50% of the original size) (Wang et al., 2024).
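The exclusion step can be illustrated with a boolean mask over cache positions. This is a sketch under simplifying assumptions: heavy hitters are chosen as a global top-k over aggregated attention rather than per group, and `mergeable_mask`, `n_heavy`, and `n_recent` are hypothetical names.

```python
import numpy as np


def mergeable_mask(agg_attention, n_heavy, n_recent):
    """Return a boolean mask that is True where a cached token may be merged."""
    mask = np.ones(len(agg_attention), dtype=bool)
    if n_recent > 0:
        mask[-n_recent:] = False                       # protect the recent-token window
    if n_heavy > 0:
        heavy = np.argsort(agg_attention)[-n_heavy:]   # indices of the top-attention tokens
        mask[heavy] = False                            # protect heavy hitters
    return mask


agg_attention = np.array([0.10, 0.90, 0.20, 0.05, 0.30, 0.40])
mask = mergeable_mask(agg_attention, n_heavy=1, n_recent=2)
print(mask)
```

Only positions where the mask is True would then be passed to merging-set identification and fusion.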
3. Memory, Computational Complexity, and Implementation
KVMerger achieves substantial memory savings. For a typical 50% budget, the retained slots are split roughly as follows: about 17% hold the most recent tokens, 12% hold attention heavy-hitters, and the remaining 71% hold merged entries, which therefore occupy about 35% of the original cache size (0.71 × 0.50 ≈ 0.355). At a 35% budget, even higher merging rates are possible (Wang et al., 2024).
Merging set identification and fusion run independently per layer and head, and their cost is negligible compared to that of self-attention. The overall system overhead is modest relative to baseline attention modules.
Integration is straightforward: after each KV append, compression is triggered when the cache exceeds budget. Merging is applied per layer/head except potentially in the first and last layers, or under short-context regimes where key similarity structure is absent. KVMerger operates on the raw cache, requiring no retraining or alteration to model weights.
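The append-then-compress trigger described above can be sketched with a toy single-head cache. Everything here is hypothetical scaffolding: `BudgetedKVCache` is not an API of any library, and the compressor is a deliberately simple placeholder that averages the most similar adjacent pair, standing in for KVMerger's merging-set identification and Gaussian-weighted fusion.

```python
import numpy as np


def _fuse_pair(x, i):
    """Replace rows i and i + 1 of x by their mean."""
    return np.concatenate([x[:i], x[i:i + 2].mean(axis=0, keepdims=True), x[i + 2:]])


def average_most_similar_pair(keys, values, budget):
    """Placeholder compressor (NOT KVMerger): average adjacent pairs until within budget."""
    while len(keys) > budget:
        unit = keys / np.linalg.norm(keys, axis=1, keepdims=True)
        i = int(np.argmax(np.sum(unit[:-1] * unit[1:], axis=1)))
        keys, values = _fuse_pair(keys, i), _fuse_pair(values, i)
    return keys, values


class BudgetedKVCache:
    """Toy cache for one layer/head that compresses whenever it exceeds `budget` slots."""

    def __init__(self, head_dim, budget, compress_fn):
        self.keys = np.empty((0, head_dim))
        self.values = np.empty((0, head_dim))
        self.budget = budget
        self.compress_fn = compress_fn

    def append(self, key, value):
        self.keys = np.concatenate([self.keys, key[None, :]])
        self.values = np.concatenate([self.values, value[None, :]])
        if len(self.keys) > self.budget:  # trigger compression only when over budget
            self.keys, self.values = self.compress_fn(self.keys, self.values, self.budget)


rng = np.random.default_rng(0)
cache = BudgetedKVCache(head_dim=8, budget=4, compress_fn=average_most_similar_pair)
for _ in range(10):
    cache.append(rng.normal(size=8), rng.normal(size=8))
print(cache.keys.shape, cache.values.shape)
```

Because the compressor is injected as a function, the same trigger logic could wrap a real merging routine, and layers where merging is disabled could simply receive an identity function.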
4. Empirical Performance and Ablation
Across Llama2-7B-chat, Llama2-13B-chat, and multiple long-context benchmarks (LongBench, ZeroScrolls, RAG retrieval), KVMerger consistently outperforms competing cache compression schemes such as H2O (eviction-based) and CaM (value-state merging plus eviction). For instance (Wang et al., 2024):
| Method | LongBench avg (7B) | ZeroScrolls avg | RAG Hit-rate | Budget % |
|---|---|---|---|---|
| Full KV | 35.22 | 15.30 | Highest | 100% |
| H2O | 34.00 | 14.36 | Lower | 50% |
| CaM | 33.67 | 14.24 | Lower | 50% |
| KVMerger | 35.02 | 15.21 | Highest | 50% |
| KVMerger | 34.50 | 14.84 | Highest | 35% |
Ablations show the best performance when the Gaussian bandwidth is set per group. Pivotal-token selection via top aggregated attention outperforms random choice. Aggressively low similarity thresholds can lead to over-compression and degraded accuracy, especially in early and late layers.
A similar “KVMerger”–style mechanism is embedded within ZeroMerge (ZSMerge) (Liu et al., 13 Mar 2025), which generalizes the approach with per-head importance allocation and residual merging, supporting parameter-free, zero-shot application across models and up to 20:1 compression (5% cache).
5. Theoretical and Algorithmic Limitations
While KVMerger significantly reduces memory and supports robust generation, it is subject to two notable limitations:
- In merging, convex combinations of keys and values may induce “attention sag,” i.e., the attention mass of a merged key is strictly less than the sum of the originals, resulting in drift of the attention output (Tian et al., 14 Apr 2025).
- Purely similarity-driven clustering can, under low similarity thresholds, group semantically disparate tokens, introducing noise.
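The first limitation above can be checked numerically. The sketch below compares the unnormalized attention mass (the exponentiated query-key score, with the usual scaling factor omitted for brevity) of two separate keys against that of their convex combination; the vectors are arbitrary illustrative values.

```python
import numpy as np

q = np.array([1.0, 0.5])     # query vector
k1 = np.array([2.0, 0.0])    # first original key
k2 = np.array([0.0, 2.0])    # second original key

w = 0.5                                # convex merging weight
k_merged = w * k1 + (1.0 - w) * k2     # merged key

mass_separate = np.exp(q @ k1) + np.exp(q @ k2)  # mass the two keys attract together
mass_merged = np.exp(q @ k_merged)               # mass the single merged key attracts

print(mass_merged, mass_separate, mass_merged / mass_separate)
```

Since the exponential is convex, the merged key's mass is at most the weighted average of the two original masses, which is in turn strictly smaller than their sum. The shortfall is what shifts the softmax normalization and, with it, the attention output.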
Later work, such as KeepKV (Tian et al., 14 Apr 2025), explicitly quantifies and eliminates this output drift by introducing vote-tracking and a zero-perturbation ZIP-Merging scheme, pushing per-step output deviation to zero and enabling provable guarantees on downstream accuracy.
6. Relationship to Alternative Merging and Compression Approaches
KVMerger sits within an ecosystem of KV cache optimizers:
- Pure Eviction (e.g., H2O, StreamingLLM): Selectively drops low-importance entries based on attention or positional heuristics, incurring irreversible information loss.
- Parameterization and Learned Fusion (e.g., LESS): Learns data-driven fusion networks with additional tuning requirements.
- Asymmetric/Hessian-based Merging (e.g., AsymKV, KVSlimmer): Employs rigorous second-order analysis to capture key–value asymmetry and derive mathematically principled fusion rules (Liu et al., 1 Mar 2026).
- Zero-Shot Residual Merging (ZeroMerge/ZSMerge): Incorporates head-granular importance, momentum-based residual fusion, and softmax compensation, wholly eliminating the need for model retraining (Liu et al., 13 Mar 2025).
- Zero-perturbation Merging (KeepKV): Introduces electoral votes and exact preservation of attention aggregates, theoretically eliminating stepwise output drift (Tian et al., 14 Apr 2025).
KVMerger occupies the intermediate space, balancing algorithmic simplicity, model-agnostic deployment, and high empirical efficacy, albeit with potential for small perturbations in output.
7. Practical Integration and Tuning
Optimal deployment of KVMerger involves:
- Setting the similarity threshold to achieve desired budget-compression trade-offs.
- Excluding critical tokens—recent and attention-heavy—from merging.
- Computing the Gaussian kernel bandwidth per merging set (typically close to 5).
- Disabling or tightening merging in early and late layers, particularly for long-context tasks.
- For extremely large models, grouping keys block-wise or using approximate similarity search for scalability.
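As an illustration of the first tuning step, the sketch below sweeps candidate similarity thresholds and picks the least aggressive one that meets a target retained fraction. It relies on strong simplifying assumptions: merging sets are formed by chaining adjacent keys only, recent and heavy-hitter exclusions are ignored, and the names `retained_fraction` and `pick_threshold` are hypothetical.

```python
import numpy as np


def retained_fraction(keys, threshold):
    """Fraction of slots left if every adjacent pair with sim >= threshold is chained."""
    unit = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    adjacent_sim = np.sum(unit[:-1] * unit[1:], axis=1)
    n_links = int(np.sum(adjacent_sim >= threshold))  # each link removes one slot
    return (len(keys) - n_links) / len(keys)


def pick_threshold(keys, target_fraction, candidates=None):
    """Largest candidate threshold whose retained fraction is <= target, or None."""
    if candidates is None:
        candidates = np.linspace(0.5, 0.95, 10)
    feasible = [float(t) for t in candidates if retained_fraction(keys, t) <= target_fraction]
    return max(feasible) if feasible else None


# Toy keys on the unit circle: three tight groups separated by large angular jumps.
angles = np.array([0.0, 0.1, 0.2, 2.0, 2.1, 4.0])
keys = np.stack([np.cos(angles), np.sin(angles)], axis=1)

print(retained_fraction(keys, 0.75))
print(pick_threshold(keys, target_fraction=0.5))
```

Raising the threshold can only reduce the number of links, so the retained fraction is non-decreasing in the threshold; choosing the largest feasible value therefore merges only as much as the budget requires.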
Limitations arise with negligible similarity structure (short contexts, first/last layers), where merging has limited utility or may degrade model performance. Hyperparameters are typically robust to model and benchmark variation, but may require further calibration for best results in extreme compression regimes.
KVMerger has established itself as a baseline for practical, model-agnostic, similarity-driven KV cache compression for LLMs. Its core principles have been subsumed and extended in subsequent algorithmic innovations, but its fusion of simplicity, empirical effectiveness, and broad applicability remains influential in the ongoing evolution of memory-efficient long-context language modeling (Wang et al., 2024, Liu et al., 13 Mar 2025, Tian et al., 14 Apr 2025, Liu et al., 1 Mar 2026).