CommonKV: Optimizing Transformer KV Caches
- CommonKV is a set of techniques that compress, merge, and share key-value caches across transformer layers and agents to reduce memory and latency overhead.
- It leverages SVD-based decompositions and adaptive group merging to extract and retain dominant shared information while minimizing redundant computations.
- CommonKV methods are training-free and architecturally non-intrusive, and have demonstrably improved efficiency and downstream performance in language models, collaborative recommendation, and multi-agent systems.
CommonKV refers to a class of techniques and principles for compressing, merging, and sharing key-value (KV) caches in large transformer models—either to reduce memory and latency overhead in autoregressive LLMs or to leverage collaborative information and minimize redundancy across layers, users, or agents. CommonKV methods are characterized by exploiting observed similarities—across tokens, layers, or entities—in the structure of key and value caches, and by decoupling shared and idiosyncratic components in their information content. This concept spans cross-layer parameter sharing (Wang et al., 22 Aug 2025), collaborative cross-user pools in recommendation (Li et al., 27 Jan 2026), and cross-context communication or reuse in multi-agent LLM systems (Ye et al., 14 Oct 2025).
1. Fundamental Problem and Motivations
Transformer-based LLMs must retain, at inference time, the token-wise key and value vectors of every layer. For sequence length $n$, per-layer KV dimension $d$, and $L$ layers, the cumulative KV-cache memory scales as $O(2\,n\,d\,L)$, quickly exceeding GPU memory for long contexts (a back-of-envelope estimate appears at the end of this section). In related domains (recommendation, multi-agent systems), this is mirrored by the storage and latency demands of per-user or per-context caches. CommonKV approaches address the limits of existing cache compression techniques, which either:
- Compress individual tokens or layers but ignore redundancy across the hierarchy,
- Aggressively merge or quantize without structure, causing sharp performance drops at high compression ratios.
By leveraging empirical measurements (cosine similarity, SVD analysis) indicating high cross-layer, cross-user, or cross-agent similarity in cache contents, CommonKV frameworks aim to retain “dominant” or “common” information in an efficient form while preserving or even enhancing downstream predictive fidelity (Wang et al., 22 Aug 2025, Li et al., 27 Jan 2026, Ye et al., 14 Oct 2025).
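To make the scaling concrete, the following back-of-envelope estimate computes the cache footprint for a grouped-query-attention model with Llama-3.1-8B-like shapes; the dimensions and fp16 precision are illustrative assumptions, not figures from the cited papers.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Total KV-cache size: 2 (K and V) * layers * tokens * kv_heads * head_dim."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

# Illustrative Llama-3.1-8B-like shapes (32 layers, 8 KV heads, head_dim 128), fp16.
for n in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(n, n_layers=32, n_kv_heads=8, head_dim=128) / 2**30
    print(f"{n:>7} tokens -> {gib:5.1f} GiB per sequence")
```

At 128K tokens this already reaches roughly 16 GiB per sequence before counting weights or activations, which is the pressure CommonKV-style compression is meant to relieve.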
2. Technical Foundations: SVD and Shared Subspaces
Empirical analyses demonstrate that hidden states across adjacent layers (or across users in collaborative settings) are highly similar, with cross-layer cosine similarity ≈0.9 for hidden states and ≈0.6 for key/value projections (Wang et al., 22 Aug 2025). SVD-based decompositions further reveal that the bulk of the information in the key ($K$) and value ($V$) caches lies within a low-rank, shareable subspace.
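The similarity measurement itself is simple to reproduce; the sketch below computes mean adjacent-layer cosine similarity over per-layer cache matrices (synthetic, correlated tensors stand in for caches extracted from a real model).

```python
import numpy as np

def adjacent_layer_cosine(caches: list[np.ndarray]) -> list[float]:
    """Mean token-wise cosine similarity between caches of adjacent layers.

    Each entry of `caches` is a (num_tokens, dim) matrix, e.g. the flattened
    keys of one layer for a fixed prompt.
    """
    sims = []
    for a, b in zip(caches, caches[1:]):
        an = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
        bn = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
        sims.append(float((an * bn).sum(-1).mean()))
    return sims

# Stand-in data: correlated per-layer key caches (replace with real extracted caches).
rng = np.random.default_rng(0)
base = rng.standard_normal((256, 128))
caches = [base + 0.5 * rng.standard_normal((256, 128)) for _ in range(8)]
print(adjacent_layer_cosine(caches))
```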
In the context of collaborative recommendation, SVD partitioning of the cache $K_u$ (and analogously $V_u$) for user $u$ yields
$$
K_u \;=\; U_u \Sigma_u V_u^{\top} \;=\; \underbrace{U_u^{(r)} \Sigma_u^{(r)} V_u^{(r)\top}}_{\text{retained}} \;+\; \underbrace{U_u^{(\perp)} \Sigma_u^{(\perp)} V_u^{(\perp)\top}}_{\text{residual}},
$$
and partitioning into the retained rank-$r$ component and the residual shows that the principal component is highly correlated across users, while the residual is idiosyncratic (Li et al., 27 Jan 2026). This structure motivates factoring the cache into globally shared (common) KV pools and small user-specific (or layer-specific) contributions.
In cross-layer LLM cache compression, a similar truncated SVD over the concatenated key/value projection matrices of a group of $g$ layers,
$$
\big[\,W^{(1)}\;\cdots\;W^{(g)}\,\big] \;\approx\; U_r \Sigma_r V_r^{\top} \;=\; A\,\big[\,B^{(1)}\;\cdots\;B^{(g)}\,\big], \qquad A = U_r \Sigma_r,
$$
enforces that the hidden states of every layer $i$ in the group project into a common subspace spanned by $A$, yielding highly mergeable “latent” caches $c_t = h_t A$ across layers (Wang et al., 22 Aug 2025).
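A minimal numpy sketch of this cross-layer factorization, using the notation of the displayed equation (group size, rank, and matrix shapes are illustrative):

```python
import numpy as np

def shared_basis_factorization(W_layers: list[np.ndarray], rank: int):
    """Factor per-layer projections W^(i) of shape (d, d_k) into a shared
    down-projection A (d, r) and per-layer up-projections B^(i) (r, d_k),
    via a truncated SVD of the column-wise concatenation [W^(1) ... W^(g)]."""
    W_cat = np.concatenate(W_layers, axis=1)          # (d, g*d_k)
    U, S, Vt = np.linalg.svd(W_cat, full_matrices=False)
    A = U[:, :rank] * S[:rank]                        # shared basis, (d, r)
    B_cat = Vt[:rank]                                 # (r, g*d_k)
    d_k = W_layers[0].shape[1]
    B_layers = [B_cat[:, i * d_k:(i + 1) * d_k] for i in range(len(W_layers))]
    return A, B_layers

rng = np.random.default_rng(0)
W_group = [rng.standard_normal((512, 128)) for _ in range(2)]  # two layers in a group
A, Bs = shared_basis_factorization(W_group, rank=64)

# At inference, only the latent c_t = h_t @ A is cached for the whole group;
# a per-layer key is recovered on demand as k_t^(i) = c_t @ B^(i).
h = rng.standard_normal((1, 512))
latent = h @ A
k_layer0 = latent @ Bs[0]
```

Because $A$ is shared across the group, all layers in the group cache the same kind of latent vector, which is what makes the subsequent merging step possible.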
3. Algorithmic Mechanisms and Compression Strategies
CommonKV methodologies are largely training-free, relying on offline decomposition rather than model retraining or architectural modification. Common algorithmic steps include:
- Cross-Layer SVD Sharing: For each layer group, compute a truncated SVD of the concatenated projection matrices to obtain a shared basis and group-specific slices. At inference, cache only the latent vectors for $K$ and $V$ (i.e., $c_t = h_t A$), reconstructing per-layer caches from the group-specific slices when needed (Wang et al., 22 Aug 2025).
- Adaptive Group Merging: Score layer groups by the cosine similarity of their latent caches (e.g., between the first and last layers of the group). For groups exceeding a similarity threshold, merge their latent caches via a Fisher-weighted sum, with a global budget controlling the overall compression ratio; a schematic sketch appears after this list (Wang et al., 22 Aug 2025).
- User/Context Pooling in Sequential Recommendation: Establish learnable, high-capacity global pools and routers that assign each token/user to a subspace, concatenating the fetched “common” vectors with user-specific projections (Li et al., 27 Jan 2026).
- Cross-Context KV Reuse (Multi-Agent): Maintain anchor pools of observed KV deviations for shared token blocks under diverse prefixes, and interpolate offsets at runtime to realign and reuse KV caches across agent contexts, with theoretical bounds on error (Ye et al., 14 Oct 2025).
- Budget Allocation: Dynamically assign compression resources based on measured similarity; groups or users with lower similarity are allocated more cache resources to avoid over-compression (Wang et al., 22 Aug 2025).
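The sketch below illustrates the adaptive merging and budgeting logic in schematic form. The similarity scoring follows the description above, while the Fisher weights are treated as given inputs and the budget rule is a simplifying assumption rather than the exact procedure of Wang et al. (22 Aug 2025).

```python
import numpy as np

def group_similarity(latents: list[np.ndarray]) -> float:
    """Cosine similarity between the first and last layers' latent caches."""
    a, b = latents[0].ravel(), latents[-1].ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def merge_group(latents: list[np.ndarray], fisher: list[float]) -> np.ndarray:
    """Weighted sum of latent caches; `fisher` holds per-layer importance scores
    (however they are estimated), normalized to sum to one."""
    w = np.asarray(fisher, dtype=float)
    w = w / w.sum()
    return sum(wi * c for wi, c in zip(w, latents))

def compress(groups, fisher_scores, budget: int, threshold: float = 0.8):
    """Spend the global merge budget on the most similar groups first; groups
    below the threshold keep their separate (uncompressed) latent caches."""
    order = sorted(range(len(groups)), key=lambda g: -group_similarity(groups[g]))
    merged = list(groups)
    for g in order[:budget]:
        if group_similarity(groups[g]) >= threshold:
            merged[g] = [merge_group(groups[g], fisher_scores[g])]
    return merged

# Toy demo: four groups of two correlated latent caches each, budget of two merges.
rng = np.random.default_rng(0)
base = rng.standard_normal((16, 64))
groups = [[base + 0.1 * rng.standard_normal((16, 64)) for _ in range(2)] for _ in range(4)]
fisher = [[1.0, 0.5] for _ in range(4)]
out = compress(groups, fisher, budget=2)
print([len(g) for g in out])   # merged groups collapse to a single latent cache
```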
4. Theoretical Guarantees and Quality Bounds
The efficacy of CommonKV is supported by:
- SVD Approximation: By the Eckart–Young theorem, the best rank-$r$ SVD approximation minimizes the Frobenius error across concatenated layers/users, ensuring that the retained shared subspace faithfully captures the dominant variational structure (a small numerical check follows this list).
- End-to-End Loss Bounds: For attention outputs, the discrepancy due to cache compression is upper-bounded by the norm of the omitted value components, provided the reconstruction minimizes the corresponding Frobenius-norm residual, as in CUR-based methods (Sengupta et al., 18 Sep 2025).
- Cache Reuse Guarantees (Multi-Agent): Under Lipschitz continuity of model components, similarity in token or context embeddings ensures bounded error in reconstructed caches from anchor-based interpolation (Ye et al., 14 Oct 2025).
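The rank-$r$ optimality claim is easy to verify numerically; this small check (with illustrative shapes) confirms that the squared Frobenius error of the truncation equals the energy in the discarded singular values, per the Eckart–Young theorem.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((512, 128))            # e.g. a concatenated KV block
U, S, Vt = np.linalg.svd(X, full_matrices=False)

r = 32
X_r = (U[:, :r] * S[:r]) @ Vt[:r]              # best rank-r approximation

err2 = np.linalg.norm(X - X_r, "fro") ** 2
discarded = float((S[r:] ** 2).sum())
assert np.isclose(err2, discarded)
print(f"rank-{r} squared Frobenius error = {err2:.3f} (= sum of discarded s_i^2)")
```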
5. Empirical Results and Benchmarks
CommonKV variants consistently outperform prior cross-layer and low-rank baselines at high compression ratios. Representative results include:
- On LongBench with Llama-3.1-8B-Instruct, CommonKV achieves average task scores of 72.31, 71.59, and 68.19 at compression ratios C=0.3, 0.5, and 0.6, respectively, outperforming MiniCache, ThinK, and Palu, which degrade rapidly above C=0.2–0.5 (Wang et al., 22 Aug 2025).
- In Collaborative Recommendation, CollectiveKV attains cache compression ratios down to 0.8% (i.e., 125× reduction), while maintaining or even improving GAUC/AUC versus the full-cache baseline (Li et al., 27 Jan 2026).
- KVCOMM achieves up to 7.8× speedup (TTFT reduction from ~430 ms to ~55 ms) in five-agent multi-agent LLM pipelines, with average cache reuse rates of ~70% and negligible quality loss (Ye et al., 14 Oct 2025).
Performance is robust under varied compression ratios, model sizes, and task domains. Ablation studies confirm the necessity of adaptive merging, pool-size tuning, and loss balancing, with clear quality drops if any component is omitted.
6. Implementation, Integration, and Orthogonality
CommonKV methods are characterized by architectural non-intrusiveness and composability:
- Training-Free Offline SVD and Rewrite: The only modification is an offline SVD and a rewrite of the projection matrices as a shared basis times per-layer factors ($W^{(i)} \approx A\,B^{(i)}$). No fine-tuning or additional training data is required (Wang et al., 22 Aug 2025).
- No Inference Overhead: After cache fusion, decode speed matches the uncompressed baseline, since the necessary projections can be fused and reconstructed efficiently (a fusion sketch follows this list). Prefill runtime increases by ~5% due to sorting for budget allocation, which compares favorably with the prohibitive online SVD costs of other baselines (Wang et al., 22 Aug 2025).
- Compatibility with Eviction and Quantization: CommonKV is orthogonal to eviction (SnapKV-style token dropping) and quantization (KVQuant 4-bit), enabling extreme cumulative compression (>98%) without significant loss (Wang et al., 22 Aug 2025, Li et al., 27 Jan 2026).
- Practical Considerations: The grouping size, SVD rank $r$, and merge budget should be tuned based on cross-layer similarity analyses. For sequential recommendation, the user-specific dimension and global pool size must be balanced between model fit and maximal compression (Li et al., 27 Jan 2026).
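One way the no-overhead property can be realized is by absorbing the per-layer up-projections into the query and output paths, so keys and values are never materialized from the latent cache. The following sketch (illustrative shapes, not the exact CommonKV kernel) verifies the algebraic equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, r, T = 128, 64, 32                       # head dim, latent rank, cached tokens
B_K = rng.standard_normal((r, d_k))           # per-layer key up-projection
B_V = rng.standard_normal((r, d_k))           # per-layer value up-projection
C = rng.standard_normal((T, r))               # latent cache shared by the layer group
q = rng.standard_normal((1, d_k))             # current query

# Naive: materialize full keys/values from the latent cache, then attend.
K, V = C @ B_K, C @ B_V
scores = (q @ K.T) / np.sqrt(d_k)
attn = np.exp(scores - scores.max()); attn /= attn.sum()
out_naive = attn @ V

# Fused: absorb B_K into the query and B_V into the output,
# so K and V are never reconstructed token by token.
scores_f = ((q @ B_K.T) @ C.T) / np.sqrt(d_k)
attn_f = np.exp(scores_f - scores_f.max()); attn_f /= attn_f.sum()
out_fused = (attn_f @ C) @ B_V

assert np.allclose(out_naive, out_fused)
```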
7. Extensions, Limitations, and Outlook
CommonKV generalizes to multiple axes of sharing—layer, user, context, and agent—but current evidence is concentrated in LLMs, recommendation, and multi-agent text pipelines. Notable limitations include:
- Cross-Layer Heterogeneity: Some groups of layers/users/contexts are inherently less similar, reducing the effectiveness of merging and requiring higher resource assignment.
- Model Assumptions: Multi-agent reuse methods (e.g., KVCOMM) currently assume RoPE-based position encoding and model parameter homogeneity. Extensions to heterogeneous or multimodal systems are open.
- Memory Footprint of Routing/Anchors: Global indexes and anchor pools introduce modest additional storage, but this is far outweighed by the savings from avoiding full-size cache arrays.
A plausible implication is that as model scales and context lengths increase, CommonKV-style techniques may become essential to enable practical inference on commodity hardware, particularly when combined with orthogonal strategies such as quantization, adaptive token selection, and memory-efficient attention kernels.
Key references:
- "CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing" (Wang et al., 22 Aug 2025)
- "CollectiveKV: Decoupling and Sharing Collaborative Information in Sequential Recommendation" (Li et al., 27 Jan 2026)
- "KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems" (Ye et al., 14 Oct 2025)