Cross-layer KV Cache Sharing
- Cross-layer KV cache sharing is a family of methods for reducing memory overhead in transformer decoders by reusing, merging, or compressing KV caches across layers.
- Approaches fall into categories such as direct reuse, similarity-based gating, and low-rank subspace consolidation, achieving significant cache compression and throughput improvements.
- Empirical results demonstrate up to 50% memory reduction and enhanced inference speed while maintaining minimal accuracy loss in large language models.
Cross-layer Key-Value (KV) cache sharing is a methodological class for reducing the key and value memory overhead in multi-layer transformer decoders by reusing, merging, or compressing the KV caches across different layers, rather than treating each layer’s cache as fully independent. This paradigm addresses both the O(L·T·D) scaling of memory usage—where L is the number of layers, T the sequence length, and D the hidden dimension—and the practical deployment bottlenecks of LLM inference, especially for long-context or batch-parallel generation. Research in this area spans purely architectural innovations, data-driven one-shot merging strategies, orthogonal hardware-layer system integration, and composable hybrid techniques incorporating quantization, pruning, and low-rank subspace recovery.
1. Formal Definition and Core Motivations
In a standard $L$-layer transformer decoder, the inference-time KV cache comprises two tensors per layer, the keys $K_\ell$ and values $V_\ell$, accumulating across time into $K_\ell, V_\ell \in \mathbb{R}^{S \times D}$, with $S$ the unrolled sequence length. For all $L$ layers and both keys and values, the total storage (in “elements”) is

$$M_{\text{full}} = 2\,L\,S\,D.$$

Cross-layer KV cache sharing schemes seek to realize $M_{\text{shared}} \ll M_{\text{full}}$ by replacing or compressing layer-wise caches, motivated by the substantial (often >80%) fraction of on-device memory that K/V storage demands during LLM decoding and by empirical observations of redundancy or recoverability in K/V representations across the depth of modern decoders (Yang et al., 24 Oct 2024, Li et al., 27 Dec 2024).
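As a concrete illustration of this bound (a worked example with hypothetical model dimensions, not figures drawn from the cited papers), the following sketch evaluates $M_{\text{full}}$ in bytes and the effect of a $c$-fold grouped reduction:

```python
def kv_cache_bytes(num_layers: int, seq_len: int, hidden_dim: int, bytes_per_elem: int = 2) -> int:
    """M_full in bytes: 2 tensors (K and V) per layer, each seq_len x hidden_dim, fp16 by default."""
    return 2 * num_layers * seq_len * hidden_dim * bytes_per_elem

# Hypothetical 32-layer decoder with hidden size 4096 at an 8192-token context.
full = kv_cache_bytes(num_layers=32, seq_len=8192, hidden_dim=4096)
print(full / 2**30)          # 4.0 GiB for the full per-layer cache
print(full / 2 / 2**30)      # 2.0 GiB if every pair of layers shares one cache (c = 2)
```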
2. Methodological Taxonomy of Cross-Layer KV Sharing
Current approaches are categorized by the nature of sharing, adaptation strategy, and combinations with intra-layer methods:
- Direct reuse or pointer sharing: Later layers use the KV cache of earlier “anchor” layers without recomputation (Yang et al., 20 Oct 2024, Brandon et al., 21 May 2024).
- E.g., Cross-Layer Attention (CLA): partition the layers into groups of size c; the group's anchor layer computes KV and the follower layers reuse it, shrinking KV cache memory c-fold (Brandon et al., 21 May 2024). A minimal sketch of this grouping appears after this list.
- Similarity/Distance-gated reuse: Sharing is applied only if adjacent layers’ caches are sufficiently (dis)similar according to a specific metric (e.g., L1/L2/cosine distance of per-head values) (Yang et al., 24 Oct 2024, Roy et al., 7 Dec 2025).
- Low-rank subspace consolidation: SVD is used to merge KV caches of a group of G layers into a single low-rank basis plus per-layer coefficients (Chang et al., 24 Mar 2025, Wang et al., 22 Aug 2025).
- Fusion and asymmetric mapping: Upper-layer KV caches are constructed as learnable or rule-based fusions of specific lower-layer (e.g., middle/bottom) caches, motivated by empirical signal propagation asymmetry (Lin et al., 3 Dec 2025).
- Index or block sharing (system-level): Sparse or paged caches re-use index tables or blocks between adjacent layers/steps, with dynamic filtering (Zeng et al., 12 Jan 2025, Chen et al., 29 Jul 2025).
- Quantization and compression synergy: Cross-layer sharing is coupled with low-bit quantization and/or autoencoder-based per-layer compression, often in a plug-and-play manner (Yang et al., 13 Oct 2025, Yang et al., 20 Oct 2024, Roy et al., 7 Dec 2025).
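The direct-reuse pattern in the first two bullets can be made concrete with a small sketch: a CLA-style assignment of follower layers to anchor layers for a chosen group size $c$. The mapping below is a schematic reconstruction under assumed naming, not code from Brandon et al. (21 May 2024).

```python
def build_sharing_map(num_layers: int, group_size: int) -> dict[int, int]:
    """Map each layer to the anchor whose KV cache it reuses (anchors map to themselves)."""
    return {layer: (layer // group_size) * group_size for layer in range(num_layers)}

sharing_map = build_sharing_map(num_layers=8, group_size=2)
# {0: 0, 1: 0, 2: 2, 3: 2, 4: 4, 5: 4, 6: 6, 7: 6}
# Only the anchor layers (0, 2, 4, 6) materialize KV tensors; followers hold references,
# so cache memory shrinks by the group size (here 2x).
```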
3. Key Algorithms and Theoretical Analysis
Representative workflows center on three axes: similarity computation, merge/reuse decision, and post-processing/reconstruction.
3.1 Proxy similarity metrics
- Cosine similarity or $L_p$ distance between averaged/flattened KV vectors, e.g., $\mathrm{sim}(K_i, K_j) = \frac{\langle \operatorname{vec}(K_i), \operatorname{vec}(K_j)\rangle}{\lVert \operatorname{vec}(K_i)\rVert_2\,\lVert \operatorname{vec}(K_j)\rVert_2}$, or $d_p(K_i, K_j) = \lVert \operatorname{vec}(K_i) - \operatorname{vec}(K_j)\rVert_p$ (Yang et al., 24 Oct 2024, Roy et al., 7 Dec 2025).
- Per-head distances for fine-grained selective sharing, e.g., $d^{(h)}_{ij} = \lVert K_i^{(h)} - K_j^{(h)}\rVert_2$ for head $h$ of layers $i$ and $j$ (Roy et al., 7 Dec 2025).
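A minimal sketch of these proxy metrics, assuming KV tensors laid out as (heads, sequence length, head dimension) and using generic function names rather than either paper's implementation:

```python
import numpy as np

def cosine_similarity(kv_a: np.ndarray, kv_b: np.ndarray) -> float:
    """Cosine similarity between two layers' flattened KV caches."""
    a, b = kv_a.ravel(), kv_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def per_head_l2(kv_a: np.ndarray, kv_b: np.ndarray) -> np.ndarray:
    """L2 distance per attention head, shape (heads,), for selective per-head sharing."""
    diff = (kv_a - kv_b).reshape(kv_a.shape[0], -1)
    return np.linalg.norm(diff, axis=1)

# Example: compare the key caches of two layers, shape (heads=8, seq=128, head_dim=64).
rng = np.random.default_rng(0)
k_i, k_j = rng.standard_normal((8, 128, 64)), rng.standard_normal((8, 128, 64))
print(cosine_similarity(k_i, k_j), per_head_l2(k_i, k_j).round(1))
```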
3.2 Merge strategy & scheduling
- Greedy selection of sharing pairs by highest dissimilarity (counterintuitive result: sharing dissimilar caches preserves performance better) (Yang et al., 24 Oct 2024); see the sketch after this list.
- Layer-wise SVD and threshold-based grouping, with optional adaptive budget allocation via cosine similarity of latent keys (Chang et al., 24 Mar 2025, Wang et al., 22 Aug 2025).
- For asymmetric fusion, explicit constraints (e.g., only value caches from bottom, keys from bottom/middle) and preservation of positional information (e.g., RoPE-space fusion with symmetric weighting) (Lin et al., 3 Dec 2025).
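Below is a simplified sketch of the greedy, dissimilarity-driven selection from the first bullet above; the full KVSharer procedure additionally validates candidate strategies against calibration outputs, which is omitted here, and all names are illustrative.

```python
import numpy as np

def select_sharing_pairs(layer_caches: list[np.ndarray], budget: int) -> dict[int, int]:
    """Greedily map follower layers to source layers, most dissimilar pairs first."""
    num_layers = len(layer_caches)
    # Distance between flattened calibration-set KV caches for every layer pair (i < j).
    pairs = [
        (np.linalg.norm(layer_caches[i].ravel() - layer_caches[j].ravel()), i, j)
        for i in range(num_layers)
        for j in range(i + 1, num_layers)
    ]
    sharing: dict[int, int] = {}
    used: set[int] = set()
    for _, i, j in sorted(pairs, reverse=True):    # largest distance (most dissimilar) first
        if len(sharing) == budget:
            break
        if i not in used and j not in used:        # keep source/follower assignments disjoint
            sharing[j] = i                          # deeper layer j reuses layer i's cache
            used.update((i, j))
    return sharing
```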
3.3 Memory savings and compute model
- Relative compression: sharing $C$ out of $L$ layers yields $\frac{M_{\text{shared}}}{M_{\text{full}}} = \frac{L - C}{L}$ for direct reuse (Yang et al., 24 Oct 2024).
- For group-wise low-rank sharing, the footprint is governed by the bit-width $B$, the group size $G$, and the tensor length $n$: a single $B$-bit cache of length $n$ serves a group of $G$ layers, so the footprint relative to unshared fp16 per-layer caches is on the order of $\frac{B\,n}{16\,n\,G} = \frac{B}{16\,G}$ (Yang et al., 13 Oct 2025).
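A worked numerical check of these ratios, using illustrative parameter values rather than results from the cited papers (the group-wise formula follows the approximation stated above):

```python
def direct_reuse_ratio(num_layers: int, shared_layers: int) -> float:
    """Fraction of the full KV cache kept when shared_layers of num_layers reuse another layer's cache."""
    return (num_layers - shared_layers) / num_layers

def grouped_lowrank_ratio(bit_width: int, group_size: int, fp16_bits: int = 16) -> float:
    """Approximate relative footprint when one bit_width-bit cache serves group_size layers."""
    return bit_width / (fp16_bits * group_size)

print(direct_reuse_ratio(num_layers=32, shared_layers=8))        # 0.75 -> 25% memory saved
print(grouped_lowrank_ratio(bit_width=4, group_size=4))          # 0.0625 -> ~94% saved
```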
3.4 Example pseudocode: selective cross-layer reuse (Yang et al., 24 Oct 2024)
```python
# Selective cross-layer reuse (schematic): at each decoding step, a layer either
# copies the KV entries of its assigned source layer or computes its own.
for t in decoding_steps:                       # one iteration per generated token
    for layer in range(L):
        if layer in sharing_strategy:          # sharing_strategy: follower layer -> source layer i
            i = sharing_strategy[layer]
            k_t, v_t = kv_cache[i][t]          # copy from layer i, no recomputation
        else:
            k_t, v_t = compute_kv(layer, x, t) # standard per-layer K/V projection
        kv_cache[layer].append((k_t, v_t))     # append to the running cache
```
4. Empirical Results: Compression, Speed, and Accuracy
- Direct cross-layer reuse (CLA, YOCO):
- 50% cache memory reduction with <0.1–2% PPL loss (Brandon et al., 21 May 2024, Yang et al., 20 Oct 2024, Lin et al., 3 Dec 2025).
- Full stack throughput speedup up to 1.65×, especially in long-sequence regimes (Yang et al., 24 Oct 2024, Brandon et al., 21 May 2024).
- Dissimilarity-based strategy (KVSharer):
- 25% layers shared: 28% memory savings, 1.65× generation acceleration, ≤10% PPL increase (Yang et al., 24 Oct 2024).
- Memory savings scale to 30–35% as sequence length grows (e.g., the cache occupies 72% of the baseline's memory at 2048 tokens).
- Fusion (FusedKV/FusedKV-Lite):
- 50% memory reduction, lowest perplexity of all 50%-memory methods, outperforms both vanilla CLA/YOCO and GQA (Lin et al., 3 Dec 2025).
- Low-rank SVD consolidation (xKV, CommonKV):
- xKV: up to 6.8× higher compression (relative to competitive inter-layer baselines), 2.7% higher accuracy (Chang et al., 24 Mar 2025).
- CommonKV: up to 98% effective compression with negligible PPL loss when combined with quantization/eviction (Wang et al., 22 Aug 2025).
- Similarity-guided per-head reuse (KV-CAR):
- 6.6–12.5% savings at <0.5 PPL increase with careful head selection; up to 47.8% reduction when combined with per-layer autoencoder (Roy et al., 7 Dec 2025).
- System-level/Index-based reuse (MemShare, MPCache):
- Up to 85% throughput improvement (MemShare) (Chen et al., 29 Jul 2025).
- 1.9× faster, 5.9× lower communication than full-cache in MPC inference (MPCache) (Zeng et al., 12 Jan 2025).
5. Architectural and Systems Integration
- Architectural considerations: Most methods exploit the redundancy present in middle/deep layers, favoring sharing in those regions. Early layers often require distinct caches due to greater contextual variability (practices: share only for ℓ > L/3, or use dynamic gating) (Li et al., 27 Dec 2024, Wu et al., 18 Oct 2024).
- Hybridization and composability: Orthogonal stacking of cross-layer sharing with intra-layer pruning (token selection), quantization (int4/int2), and autoencoding is common. Techniques are often “plug-and-play,” requiring no model retraining (e.g., xKV, KVSharer, CLLA) (Yang et al., 24 Oct 2024, Chang et al., 24 Mar 2025, Yang et al., 13 Oct 2025).
- System-level orchestration: Enterprise LLM serving benefits from cross-query and cross-session cache sharing, enabling prefill offloading, tiered memory orchestration, and even agent-level prefix alignment (LMCache, KVCOMM) (Cheng et al., 8 Oct 2025, Ye et al., 14 Oct 2025).
- Scheduling and dynamic selection: Dynamic attention similarity estimation and online cost models (e.g., recompute/load tradeoff in Krul) enable optimal partitioning of which layers to share vs. recompute at restore time (Wen et al., 10 Jul 2025).
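As a rough illustration of the recompute-versus-load trade-off in the last bullet (a deliberately simplified cost model with assumed per-layer timings, not Krul's actual scheduler):

```python
def plan_restore(recompute_ms: list[float], load_ms: list[float]) -> list[str]:
    """Per-layer decision: rebuild the KV cache on-device or load it from storage."""
    return ["recompute" if rc < ld else "load" for rc, ld in zip(recompute_ms, load_ms)]

# Hypothetical per-layer timings in milliseconds for a 4-layer model.
print(plan_restore(recompute_ms=[1.0, 1.2, 3.5, 4.0], load_ms=[2.0, 2.0, 2.0, 2.0]))
# -> ['recompute', 'recompute', 'load', 'load']
```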
6. Trade-offs, Limitations, and Best Practices
| Challenge | Manifestation | Solution/Best Practice |
|---|---|---|
| Accuracy trade-off | Over-sharing leads to degraded PPL/accuracy | Adaptive similarity thresholds; hybrid with quant/AE |
| Layer/Head heterogeneity | Early layers/important heads are less redundant | Share only deeper layers or non-critical heads |
| Added compute overhead | SVD, similarity, or head alignment adds FLOPs | Use static mapping, efficient pre-pass calibration |
| System bottlenecks | CPU/GPU sync, data movement can become limiting | Batched movement, asynchronous scheduling, pointer tables |
| Hardware adaptability | Integration with paged attention, tensor parallelism | Minimal code change, modular connectors |
Empirical studies consistently find that moderate sharing rates (25–50% layers, or subset of heads) enable near-lossless (<2% PPL) compression and 1.3–2× throughput gains (Yang et al., 24 Oct 2024, Lin et al., 3 Dec 2025, Roy et al., 7 Dec 2025). More aggressive ratios require accompanying low-rank or quantization methods—e.g., CLLA achieves ≈2% total cache size with lossless performance via the combination of cross-layer sharing, low-rank latent compression, and int4 quantization (Yang et al., 20 Oct 2024).
7. Outlook, Extensions, and Open Problems
The cross-layer KV cache sharing paradigm is rapidly extending into areas such as:
- Fine-grained, semantic token-based sharing: LSH-based token matching for cache pointer redirection (across prompts and/or layers) (Zhao et al., 29 Sep 2025).
- Dynamic, workload-aware hybridization: Per-conversation or per-task optimization of sharing patterns and compression parameters, e.g., Krul’s token-wise attention pattern similarity (Wen et al., 10 Jul 2025).
- Fusion with system-level pipelining and orchestration: Multi-agent KV sharing, prefill offloading, and distributed cache management for LLM serving at scale (Cheng et al., 8 Oct 2025, Ye et al., 14 Oct 2025).
- Unified theoretical foundations: A general theory of recoverability, redundancy, and information preservation under cross-layer re-use remains open.
Common open questions include: precise characterization of which layers/heads are best merged; universal standards for calibration and validation; and boundary conditions under continued scaling or more exotic architectures (e.g., Mixture-of-Experts). Nevertheless, cross-layer KV cache sharing is a foundational ingredient for the next generation of efficient LLM inference (Li et al., 27 Dec 2024).