CommonKV: Optimizing Transformer KV Caches

Updated 2 February 2026
  • CommonKV is a set of techniques that compress, merge, and share key-value caches across transformer layers and agents to reduce memory and latency overhead.
  • It leverages SVD-based decompositions and adaptive group merging to extract and retain dominant shared information while minimizing redundant computations.
  • CommonKV methods are training-free and architecturally non-intrusive, and have proven effective at improving performance in language models, collaborative recommendation, and multi-agent systems.

CommonKV refers to a class of techniques and principles for compressing, merging, and sharing key-value (KV) caches in large transformer models—either to reduce memory and latency overhead in autoregressive LLMs or to leverage collaborative information and minimize redundancy across layers, users, or agents. CommonKV methods are characterized by exploiting observed similarities—across tokens, layers, or entities—in the structure of key and value caches, and by decoupling shared and idiosyncratic components in their information content. This concept spans cross-layer parameter sharing (Wang et al., 22 Aug 2025), collaborative cross-user pools in recommendation (Li et al., 27 Jan 2026), and cross-context communication or reuse in multi-agent LLM systems (Ye et al., 14 Oct 2025).

1. Fundamental Problem and Motivations

Transformer-based LLMs must retain, at inference time, the token-wise key and value vectors of every layer. For sequence length T, hidden dimension d, and L layers, the cumulative memory for the KV cache is O(2LTd), quickly exceeding GPU memory for long contexts (a back-of-the-envelope example is given at the end of this section). In related domains (recommendation, multi-agent systems), this is mirrored by the storage and latency demands of per-user or per-context caches. CommonKV approaches address the limits of existing cache compression techniques, which either:

  • Compress individual tokens or layers but ignore redundancy across the hierarchy,
  • Aggressively merge or quantize without structure, causing sharp performance drops at high compression ratios.

By leveraging empirical measurements (cosine similarity, SVD analysis) indicating high cross-layer, cross-user, or cross-agent similarity in cache contents, CommonKV frameworks aim to retain “dominant” or “common” information in an efficient form while preserving or even enhancing downstream predictive fidelity (Wang et al., 22 Aug 2025, Li et al., 27 Jan 2026, Ye et al., 14 Oct 2025).
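
To make the memory pressure concrete, the following back-of-the-envelope calculation (illustrative numbers for a roughly 8B-parameter decoder with full multi-head attention and fp16 storage, not figures from the cited papers) evaluates the O(2LTd) formula for a 128K-token context:

```python
# Back-of-the-envelope KV-cache size from the O(2LTd) formula.
bytes_per_elem = 2        # fp16 / bf16
L = 32                    # decoder layers
d = 4096                  # hidden dimension
T = 128_000               # context length in tokens

kv_bytes = 2 * L * T * d * bytes_per_elem   # keys + values, every layer, every token
print(f"KV cache: {kv_bytes / 2**30:.1f} GiB per sequence")  # -> 62.5 GiB
# Grouped-query attention shrinks the KV width, but long contexts still dominate GPU memory.
```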

2. Technical Foundations: SVD and Shared Subspaces

Empirical analyses demonstrate that hidden states across adjacent layers (or across users in collaborative settings) are highly similar, with cross-layer cosine similarity ≈0.9 for hidden states and ≈0.6 for key/value projections (Wang et al., 22 Aug 2025). SVD-based decompositions further reveal that the bulk of the information in K and V lies within a low-rank, shareable subspace.

In the context of collaborative recommendation, SVD partitioning of K_u for user u yields

K_u = U_u \Sigma_u V_u^\top

and partitioning into retained (K_u^p) and residual (K_u^r) components shows that the principal component is highly correlated across users, while the residual is idiosyncratic (Li et al., 27 Jan 2026). This structure motivates factoring the cache into globally shared (common) KV pools and small user-specific (or layer-specific) contributions.
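
A minimal numpy sketch of this split (the toy data, rank, and subspace-overlap check below are illustrative assumptions, not the CollectiveKV implementation):

```python
import numpy as np

def split_principal_residual(K_u, r):
    """Split a user's key cache K_u (T x d) into a rank-r principal part and a residual."""
    U, S, Vt = np.linalg.svd(K_u, full_matrices=False)
    K_p = U[:, :r] @ np.diag(S[:r]) @ Vt[:r]   # retained (dominant) component K_u^p
    K_r = K_u - K_p                            # idiosyncratic residual K_u^r
    return K_p, K_r, Vt[:r]                    # Vt[:r]: the user's dominant subspace

# Toy check: two "users" whose caches share a common low-rank structure plus noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=(4, 64))                                     # common rank-4 subspace
K_a = rng.normal(size=(128, 4)) @ shared + 0.05 * rng.normal(size=(128, 64))
K_b = rng.normal(size=(128, 4)) @ shared + 0.05 * rng.normal(size=(128, 64))

_, _, Va = split_principal_residual(K_a, r=4)
_, _, Vb = split_principal_residual(K_b, r=4)
# Cosines of the principal angles between the two users' dominant subspaces; values near 1
# indicate that the principal components are highly shared across users.
print(np.round(np.linalg.svd(Va @ Vb.T, compute_uv=False), 3))
```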

In cross-layer LLM cache compression, a similar SVD over concatenated projection matrices for groups of layers,

W_{g_i} = [W^l_k; W^l_v; \ldots; W^{l+G-1}_v] \approx A^{g_i} [B^l_k; B^l_v; \ldots; B^{l+G-1}_v]

enforces that all x^l (layer-l hidden states) project into a common subspace A^{g_i}, yielding highly mergeable “latent” caches across layers (Wang et al., 22 Aug 2025).
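
A hedged numpy sketch of the offline factorization and the resulting latent cache (shapes and the truncation rule are assumptions; the released method may organize the per-layer slices differently):

```python
import numpy as np

def factor_layer_group(W_list, rank):
    """Offline: factor one group's concatenated K/V projections as W_g ~= A @ B.
    W_list: per-layer projection matrices [W_k^l, W_v^l, ...], each (d_model, d_kv)."""
    W_g = np.concatenate(W_list, axis=1)        # (d_model, 2 * G * d_kv)
    U, S, Vt = np.linalg.svd(W_g, full_matrices=False)
    A = U[:, :rank] * S[:rank]                  # shared basis A^{g_i}: (d_model, rank)
    B = Vt[:rank]                               # (rank, 2 * G * d_kv); per-layer slices along columns
    return A, B

def latent_cache(x_l, A):
    """At inference: store only h_t^l = x_t^l A^{g_i} per token, instead of full keys and values."""
    return x_l @ A                              # (T, rank)
```

Keys and values for any layer in the group can later be recovered from the latent cache via the corresponding columns of B, or those columns can be folded into downstream projections (see Section 6).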

3. Algorithmic Mechanisms and Compression Strategies

CommonKV methodologies are training-free, relying on offline decomposition rather than model retraining, and require no architectural modification. Common algorithmic steps include:

  • Cross-Layer SVD Sharing: For each layer group, compute a truncated SVD of the concatenated projection matrices to obtain a shared basis A^{g_i} and group-specific B slices. At inference, cache only the latent vectors h^l_t = x^l_t A^{g_i} for each token t and layer l \in g_i (Wang et al., 22 Aug 2025).
  • Adaptive Group Merging: Score layer groups by the cosine similarity of their latent caches (e.g., between the first and last layers of the group). For groups exceeding a similarity threshold, merge their latent caches via a Fisher-weighted sum, with a global budget C controlling the overall compression ratio (Wang et al., 22 Aug 2025); see the first sketch after this list.
  • User/Context Pooling in Sequential Recommendation: Establish learnable, high-capacity global pools P_K, P_V and routers that assign each token/user to a subspace, concatenating the fetched “common” vectors with user-specific projections K^s_u, V^s_u (Li et al., 27 Jan 2026).
  • Cross-Context KV Reuse (Multi-Agent): Maintain anchor pools of observed KV deviations for shared token blocks under diverse prefixes, and interpolate offsets at runtime to realign and reuse KV caches across agent contexts, with theoretical bounds on the error (Ye et al., 14 Oct 2025); see the second sketch after this list.
  • Budget Allocation: Dynamically assign compression resources based on measured similarity; groups or users with lower similarity are allocated more cache resources to avoid over-compression (Wang et al., 22 Aug 2025).
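
A schematic numpy sketch of adaptive group merging and similarity-driven budget allocation (the similarity score, Fisher weights, and allocation rule below are simplified assumptions for illustration, not the exact CommonKV procedure):

```python
import numpy as np

def cosine(a, b):
    """Flattened cosine similarity between two latent caches of equal shape."""
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def merge_group_caches(latents, fisher, threshold):
    """latents: per-layer latent caches, each (T, r), for one layer group.
    fisher: per-layer importance weights. Merge only if the group's ends are similar enough."""
    if cosine(latents[0], latents[-1]) < threshold:
        return latents                                   # too dissimilar: keep layer-wise caches
    w = np.asarray(fisher, dtype=float)
    w = w / w.sum()
    merged = sum(w_i * h for w_i, h in zip(w, latents))  # Fisher-weighted sum
    return [merged]                                      # one shared cache for the whole group

def allocate_budget(group_scores, total_budget):
    """Give less-similar groups a larger share of the global budget C (inverse-similarity rule)."""
    need = 1.0 - np.asarray(group_scores, dtype=float)
    return total_budget * need / need.sum()
```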
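
And a heavily simplified sketch of anchor-based cross-context reuse (the anchor representation, the softmax interpolation, and the estimate_offset helper are hypothetical illustrations; KVCOMM's actual alignment scheme differs in detail):

```python
import numpy as np

def estimate_offset(prefix_emb, anchors):
    """Interpolate a KV offset for a new prefix from observed (prefix_embedding, kv_offset) anchors."""
    embs = np.stack([e for e, _ in anchors])
    sims = embs @ prefix_emb / (np.linalg.norm(embs, axis=1) * np.linalg.norm(prefix_emb) + 1e-8)
    w = np.exp(sims - sims.max())
    w = w / w.sum()                                      # softmax weights over anchors
    return sum(w_i * off for w_i, (_, off) in zip(w, anchors))

def reuse_kv(cached_kv, prefix_emb, anchors):
    """Realign a shared token block's cached KV to a new agent context by adding the offset."""
    return cached_kv + estimate_offset(prefix_emb, anchors)
```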

4. Theoretical Guarantees and Quality Bounds

The efficacy of CommonKV is supported by:

  • SVD Approximation: The best rank-r SVD approximation minimizes the Frobenius error across concatenated layers/users, ensuring that the retained shared subspace faithfully represents the dominant variance structure.
  • End-to-End Loss Bounds: For attention outputs, the discrepancy due to cache compression is upper-bounded by the norm of the omitted value components, provided the reconstruction minimizes \|V - V'\|_F as in CUR-based methods (Sengupta et al., 18 Sep 2025); a short norm argument is sketched after this list.
  • Cache Reuse Guarantees (Multi-Agent): Under Lipschitz continuity of model components, similarity in token or context embeddings ensures bounded error in reconstructed caches from anchor-based interpolation (Ye et al., 14 Oct 2025).
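
One way to see the value-omission bound (a standard norm argument, not necessarily the exact derivation in the cited work) is to write the attention weights as the row-stochastic matrix A = \mathrm{softmax}(QK^\top / \sqrt{d_k}), so that the output discrepancy for a compressed value cache V' is

O - O' = A\,(V - V')

Since each of the T rows of A sums to one, \|A\|_2 \le \|A\|_F \le \sqrt{T}, giving \|O - O'\|_F \le \sqrt{T}\, \|V - V'\|_F; minimizing \|V - V'\|_F therefore directly controls the end-to-end output error.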

5. Empirical Results and Benchmarks

CommonKV variants consistently outperform prior cross-layer and low-rank baselines at high compression ratios. Representative results include:

  • On LongBench with Llama-3.1-8B-Instruct, CommonKV achieves average task scores of 72.31 (C=0.3), 71.59 (C=0.5), and 68.19 (C=0.6), outperforming MiniCache, ThinK, and Palu, which rapidly degrade above C=0.2–0.5 (Wang et al., 22 Aug 2025).
  • In Collaborative Recommendation, CollectiveKV attains cache compression ratios down to 0.8% (i.e., 125× reduction), while maintaining or even improving GAUC/AUC versus the full-cache baseline (Li et al., 27 Jan 2026).
  • KVCOMM achieves up to 7.8× speedup (TTFT reduction from ~430 ms to ~55 ms) in five-agent multi-agent LLM pipelines, with average cache reuse rates of ~70% and negligible quality loss (Ye et al., 14 Oct 2025).

Performance is robust under varied compression ratios, model sizes, and task domains. Ablation studies confirm the necessity of adaptive merging, pool-size tuning, and loss balancing, with clear quality drops if any component is omitted.

6. Implementation, Integration, and Orthogonality

CommonKV methods are characterized by architectural non-intrusiveness and composability:

  • Training-Free Offline SVD and Rewrite: The only modification is an offline SVD and a rewrite of the projection matrices as A^{g_i} B^l_*. No fine-tuning or additional training data is required (Wang et al., 22 Aug 2025).
  • No Inference Overhead: After cache fusion, decode speed matches the uncompressed baseline, since all necessary projections can be fused and reconstructed efficiently (see the sketch after this list). Prefill runtime increases by ~5% due to sorting for budget allocation, compared to prohibitive online SVD costs in other baselines (Wang et al., 22 Aug 2025).
  • Compatibility with Eviction and Quantization: CommonKV is orthogonal to eviction (SnapKV-style token dropping) and quantization (KVQuant 4-bit), enabling extreme cumulative compression (>98%) without significant loss (Wang et al., 22 Aug 2025, Li et al., 27 Jan 2026).
  • Practical Considerations: Grouping size (G), SVD rank (r), and merge budget (C) should be tuned based on cross-layer similarity analyses. For sequential recommendation, the user-specific dimension (d_u) and global pool size (m) must be balanced for model fit and maximal compression (Li et al., 27 Jan 2026).
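
As an illustration of why decode adds no overhead (a hedged sketch under the W ≈ A B factorization from Section 2, assuming a [K_l; V_l; K_{l+1}; V_{l+1}; ...] ordering of the B columns):

```python
import numpy as np

def keys_values_from_latent(h, B, layer_in_group, d_kv):
    """Recover one layer's keys and values from the group's shared latent cache h (T, r)."""
    off = layer_in_group * 2 * d_kv
    K = h @ B[:, off : off + d_kv]
    V = h @ B[:, off + d_kv : off + 2 * d_kv]
    return K, V
```

In practice the key slice of B can be folded into the query projection and the value slice into the output projection, so attention consumes the latent cache directly, which is consistent with decode speed matching the uncompressed baseline.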

7. Extensions, Limitations, and Outlook

CommonKV generalizes to multiple axes of sharing—layer, user, context, and agent—but current evidence is concentrated in LLMs, recommendation, and multi-agent text pipelines. Notable limitations include:

  • Cross-Layer Heterogeneity: Some groups of layers/users/contexts are inherently less similar, reducing the effectiveness of merging and requiring higher resource assignment.
  • Model Assumptions: Multi-agent reuse methods (e.g., KVCOMM) currently assume RoPE-based position encoding and model parameter homogeneity. Extensions to heterogeneous or multimodal systems are open.
  • Memory Footprint of Routing/Anchors: Global indexes and anchor pools introduce modest additional storage, but this cost is far outweighed by the savings from the large cache arrays they avoid.

A plausible implication is that as model scales and context lengths increase, CommonKV-style techniques may become essential to enable practical inference on commodity hardware, particularly when combined with orthogonal strategies such as quantization, adaptive token selection, and memory-efficient attention kernels.

