
Cross-layer KV Cache Sharing

Updated 10 December 2025
  • Cross-layer KV cache sharing is a method that reduces memory overhead in transformer decoders by reusing, merging, or compressing caches across layers.
  • Approaches span direct reuse, similarity-based gating, and low-rank subspace consolidation, yielding significant cache compression and throughput improvements.
  • Empirical results demonstrate up to 50% memory reduction and faster inference with minimal accuracy loss in large language models.

Cross-layer Key-Value (KV) cache sharing is a methodological class for reducing the key and value memory overhead in multi-layer transformer decoders by reusing, merging, or compressing the KV caches across different layers, rather than treating each layer’s cache as fully independent. This paradigm addresses both the O(L·T·D) scaling of memory usage—where L is the number of layers, T the sequence length, and D the hidden dimension—and the practical deployment bottlenecks of LLM inference, especially for long-context or batch-parallel generation. Research in this area spans purely architectural innovations, data-driven one-shot merging strategies, orthogonal hardware-layer system integration, and composable hybrid techniques incorporating quantization, pruning, and low-rank subspace recovery.

1. Formal Definition and Core Motivations

In a standard L-layer transformer decoder, the inference-time KV cache comprises two tensors per layer,

K_\ell(t) = W_K h_{\ell-1}(t) \in \mathbb{R}^{B \times H \times d_k}, \quad V_\ell(t) = W_V h_{\ell-1}(t) \in \mathbb{R}^{B \times H \times d_k},

accumulating across time into K_\ell \in \mathbb{R}^{B \times H \times S \times d_k}, with S the unrolled sequence length. For all L layers and both keys and values, the total storage (in elements) is

\text{Mem}_{kv} = 2 B L S D, \quad \text{with } D = H d_k.

Cross-layer KV cache sharing schemes seek to realize \text{Mem}^{\text{shared}}_{kv} < \text{Mem}_{kv} by replacing or compressing layer-wise caches, motivated by the substantial (often >80%) fraction of on-device memory that K/V storage demands during LLM decoding, and by empirical observations of redundancy or recoverability in K/V representations across the depth of modern decoders (Yang et al., 24 Oct 2024, Li et al., 27 Dec 2024).
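As a concrete illustration of the Mem_kv formula, the following sketch computes the full-precision KV cache footprint and the effect of direct layer reuse; the configuration values (32 layers, 32 heads, head dimension 128, 8K context) are assumptions chosen only for the example.

# Minimal sketch: KV cache footprint of an L-layer decoder (assumed example config).
def kv_cache_bytes(batch=1, layers=32, seq_len=8192, heads=32, head_dim=128, bytes_per_elem=2):
    d = heads * head_dim                           # D = H * d_k
    elements = 2 * batch * layers * seq_len * d    # Mem_kv = 2 * B * L * S * D (keys + values)
    return elements * bytes_per_elem               # fp16/bf16 -> 2 bytes per element

full = kv_cache_bytes()
shared = full * (1 - 8 / 32)                       # direct reuse of C = 8 of L = 32 layers removes C/L
print(f"full: {full / 2**30:.2f} GiB, with 8/32 layers shared: {shared / 2**30:.2f} GiB")

For this assumed configuration the dense cache is 4.0 GiB per sequence, and sharing a quarter of the layers removes 1.0 GiB of it.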

2. Methodological Taxonomy of Cross-Layer KV Sharing

Current approaches can be grouped along three dimensions: the nature of sharing (direct reuse of one layer's cache by others, similarity-based merging or gated fusion, and low-rank subspace consolidation across layer groups), the adaptation strategy (training-free, calibration-based one-shot merging versus architectures designed and trained for shared caches), and the way cross-layer sharing is combined with intra-layer methods such as token pruning and quantization.

3. Key Algorithms and Theoretical Analysis

Representative workflows center on three axes: similarity computation, merge/reuse decision, and post-processing/reconstruction.

3.1 Proxy similarity metrics

  • Cosine similarity or L_p distance between averaged/flattened KV vectors: D(i,j) = 1 - \langle \bar k_i, \bar k_j \rangle / (\| \bar k_i \|_2 \, \| \bar k_j \|_2), or E(i,j) = \| \bar k_i - \bar k_j \|_2 (Yang et al., 24 Oct 2024, Roy et al., 7 Dec 2025).
  • Per-head distances for fine-grained selective sharing: s^K_{\ell,i} = \frac{1}{T \cdot D} \sum_{t,j} \left| K_\ell[i,t,j] - K_{\ell+1}[i,t,j] \right| (Roy et al., 7 Dec 2025).
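A minimal sketch of these two proxy metrics, assuming the key cache of a layer is held as a tensor of shape [heads, T, head_dim]; tensor names and shapes are illustrative rather than taken from any specific implementation.

import torch

def layer_distance(k_bar_i, k_bar_j):
    # Cosine distance D(i, j) and Euclidean distance E(i, j) between the
    # averaged/flattened key vectors of layers i and j.
    cos = torch.dot(k_bar_i, k_bar_j) / (k_bar_i.norm() * k_bar_j.norm())
    return 1.0 - cos.item(), (k_bar_i - k_bar_j).norm().item()

def per_head_key_distance(K_l, K_next):
    # s^K_{l,h}: mean absolute difference between adjacent layers' key caches,
    # computed separately per head; inputs have shape [heads, T, head_dim].
    return (K_l - K_next).abs().mean(dim=(1, 2))   # -> one score per head

# Toy example with random caches (illustrative shapes only).
K1, K2 = torch.randn(8, 128, 64), torch.randn(8, 128, 64)
D, E = layer_distance(K1.flatten(), K2.flatten())
head_scores = per_head_key_distance(K1, K2)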

3.2 Merge strategy & scheduling

  • Greedy selection of sharing pairs by highest dissimilarity (counterintuitive result: dissimilar caches preserve performance better) (Yang et al., 24 Oct 2024).
  • Layer-wise SVD and threshold-based grouping, with optional adaptive budget allocation via cosine similarity of latent keys (Chang et al., 24 Mar 2025, Wang et al., 22 Aug 2025).
  • For asymmetric fusion, explicit constraints (e.g., only value caches from bottom, keys from bottom/middle) and preservation of positional information (e.g., RoPE-space fusion with symmetric weighting) (Lin et al., 3 Dec 2025).
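The greedy, dissimilarity-first selection described in the first item can be sketched as below: rank candidate layer pairs by a precomputed distance matrix and share the top pairs until a target count is reached. This is a simplified illustration under assumed inputs, not a reproduction of any paper's exact procedure.

def greedy_sharing_strategy(dist, num_shared):
    """dist[i][j]: dissimilarity between the calibration caches of layers i and j.
    Returns {target_layer: source_layer}, preferring the MOST dissimilar pairs first
    (the counterintuitive criterion reported for KVSharer-style selection)."""
    L = len(dist)
    pairs = sorted(((dist[i][j], i, j) for i in range(L) for j in range(i + 1, L)),
                   reverse=True)
    strategy = {}
    for _, i, j in pairs:
        if len(strategy) >= num_shared:
            break
        # Keep the mapping consistent: a layer is never both a source and a target.
        if j in strategy or j in strategy.values() or i in strategy:
            continue
        strategy[j] = i        # the deeper layer j reuses layer i's cache
    return strategy

In practice a brief calibration pass would check the perplexity impact of the resulting strategy before it is fixed for deployment.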

3.3 Memory savings and compute model

  • Relative compression, e.g., sharing C out of L layers yields

\Delta \text{Mem} = 1 - \frac{L-C}{L} = \frac{C}{L}

for direct reuse (Yang et al., 24 Oct 2024).

  • For groupwise low-rank sharing:

\text{Bits}_\text{tot} = 2 \cdot (L/G) \cdot n \cdot B,

where B is the bit-width, G is group size, and n is tensor length (Yang et al., 13 Oct 2025).
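Putting the two savings models together, a small helper can report the budget implied by a given configuration; the parameter values below are placeholders for illustration.

def direct_reuse_saving(L, C):
    # Fraction of KV memory removed when C of the L layers reuse another layer's cache.
    return C / L

def grouped_lowrank_bits(L, G, n, B):
    # Bits_tot = 2 * (L / G) * n * B: one shared key and one shared value representation
    # of length n, stored at bit-width B, per group of G consecutive layers.
    return 2 * (L // G) * n * B

print(direct_reuse_saving(L=32, C=8))                 # 0.25 -> a quarter of the cache removed
print(grouped_lowrank_bits(L=32, G=4, n=4096, B=4))   # total bit budget of the grouped caches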

A schematic decode loop for direct cross-layer reuse:

for each input x, for each decoding step t:
    for layer ℓ = 1 ... L:
        if ℓ in sharing_strategy:                 # layer ℓ reuses a source layer's cache
            i = sharing_strategy[ℓ]
            K_ℓ(t), V_ℓ(t) = K_i(t), V_i(t)       # copy (or alias) layer i's entries
        else:
            K_ℓ(t), V_ℓ(t) = W_K h_{ℓ-1}(t), W_V h_{ℓ-1}(t)   # compute as usual
        append K_ℓ(t), V_ℓ(t) to layer ℓ's running cache
This structure is modified in various ways for dynamic fusion, low-rank SVD, or selective block reuse.
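As a more concrete (though still simplified) illustration of direct reuse, the sketch below keeps per-layer cache lists and redirects reads for shared layers to their source layer; the class and variable names are hypothetical.

import torch

class SharedKVCache:
    """Per-layer KV cache in which selected layers alias another layer's storage."""
    def __init__(self, num_layers, sharing_strategy):
        # sharing_strategy: {target_layer: source_layer}; target layers store no tensors.
        self.sharing = sharing_strategy
        self.keys = [[] for _ in range(num_layers)]
        self.values = [[] for _ in range(num_layers)]

    def append(self, layer, k_t, v_t):
        if layer in self.sharing:             # shared layers never write their own cache
            return
        self.keys[layer].append(k_t)
        self.values[layer].append(v_t)

    def get(self, layer):
        src = self.sharing.get(layer, layer)  # redirect reads to the source layer
        return torch.stack(self.keys[src]), torch.stack(self.values[src])

# Toy usage: in an 8-layer model, layers 5-7 reuse layer 4's cache (a C/L = 3/8 reduction).
cache = SharedKVCache(num_layers=8, sharing_strategy={5: 4, 6: 4, 7: 4})
for t in range(3):                            # three decode steps with dummy projections
    for layer in range(8):
        cache.append(layer, torch.randn(8, 64), torch.randn(8, 64))
K, V = cache.get(6)                           # resolves to layer 4's cache, shape [3, 8, 64]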

4. Empirical Results: Compression, Speed, and Accuracy

Across the methods surveyed, reported results include up to 50% KV cache memory reduction and noticeably higher decoding throughput with minimal accuracy loss; representative figures and the conditions under which they hold are discussed alongside the trade-offs in Section 6.

5. Architectural and Systems Integration

  • Architectural considerations: Most methods exploit the redundancy present in middle/deep layers, favoring sharing in those regions. Early layers often require distinct caches due to greater contextual variability (practices: share only for ℓ > L/3, or use dynamic gating) (Li et al., 27 Dec 2024, Wu et al., 18 Oct 2024).
  • Hybridization and composability: Orthogonal stacking of cross-layer sharing with intra-layer pruning (token selection), quantization (int4/int2), and autoencoding is common. Techniques are often “plug-and-play,” requiring no model retraining (e.g., xKV, KVSharer, CLLA) (Yang et al., 24 Oct 2024, Chang et al., 24 Mar 2025, Yang et al., 13 Oct 2025).
  • System-level orchestration: Enterprise LLM serving benefits from cross-query and cross-session cache sharing, enabling prefill offloading, tiered memory orchestration, and even agent-level prefix alignment (LMCache, KVCOMM) (Cheng et al., 8 Oct 2025, Ye et al., 14 Oct 2025).
  • Scheduling and dynamic selection: Dynamic attention similarity estimation and online cost models (e.g., recompute/load tradeoff in Krul) enable optimal partitioning of which layers to share vs. recompute at restore time (Wen et al., 10 Jul 2025).
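The depth heuristic noted under architectural considerations (share only for ℓ > L/3) translates directly into a static sharing map; the sketch below builds one, with the specific grouping choice (each contiguous block reuses its first layer) being an assumption for illustration rather than any paper's prescription.

def depth_heuristic_strategy(num_layers, skip_fraction=1/3, group_size=2):
    """Leave the earliest layers untouched, then let each group of `group_size`
    deeper layers reuse the cache of the first layer in its group."""
    first_shared = int(num_layers * skip_fraction)
    strategy = {}
    for start in range(first_shared, num_layers, group_size):
        for layer in range(start + 1, min(start + group_size, num_layers)):
            strategy[layer] = start
    return strategy

# 32-layer model: layers 0-9 keep private caches; deeper layers are paired (11->10, 13->12, ...).
print(depth_heuristic_strategy(32))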

6. Trade-offs, Limitations, and Best Practices

| Challenge | Manifestation | Solution / Best Practice |
|---|---|---|
| Accuracy trade-off | Over-sharing degrades perplexity/accuracy | Adaptive similarity thresholds; hybridize with quantization/autoencoding |
| Layer/head heterogeneity | Early layers and important heads are less redundant | Share only deeper layers or non-critical heads |
| Added compute overhead | SVD, similarity, or head alignment adds FLOPs | Static mappings; efficient pre-pass calibration |
| System bottlenecks | CPU/GPU synchronization and data movement become limiting | Batched movement, asynchronous scheduling, pointer tables |
| Hardware adaptability | Integration with paged attention and tensor parallelism | Minimal code changes; modular connectors |

Empirical studies consistently find that moderate sharing rates (25–50% of layers, or a subset of heads) enable near-lossless compression (<2% perplexity degradation) and 1.3–2× throughput gains (Yang et al., 24 Oct 2024, Lin et al., 3 Dec 2025, Roy et al., 7 Dec 2025). More aggressive ratios require accompanying low-rank or quantization methods; for example, CLLA reaches ≈2% of the total cache size with lossless performance by combining cross-layer sharing, low-rank latent compression, and int4 quantization (Yang et al., 20 Oct 2024).
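To see how these techniques compose, the following back-of-the-envelope calculation multiplies a cross-layer sharing factor, a low-rank latent factor, and an int4 quantization factor; the individual numbers are assumed purely for illustration and are not the reported CLLA configuration.

# Hedged arithmetic sketch: how sharing, low-rank latents, and quantization compose.
layer_factor = 0.5      # assume half the layers reuse another layer's cache
rank_factor  = 0.125    # assume the latent dimension is 1/8 of the original
bit_factor   = 4 / 16   # int4 storage relative to fp16
remaining = layer_factor * rank_factor * bit_factor
print(f"remaining cache size: {remaining:.2%} of the dense fp16 baseline")   # ~1.56%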

7. Outlook, Extensions, and Open Problems

The cross-layer KV cache sharing paradigm is rapidly extending into areas such as:

  • Fine-grained, semantic token-based sharing: LSH-based token matching for cache pointer redirection (across prompts and/or layers) (Zhao et al., 29 Sep 2025).
  • Dynamic, workload-aware hybridization: Per-conversation or per-task optimization of sharing patterns and compression parameters, e.g., Krul’s token-wise attention pattern similarity (Wen et al., 10 Jul 2025).
  • Fusion with system-level pipelining and orchestration: Multi-agent KV sharing, prefill offloading, and distributed cache management for LLM serving at scale (Cheng et al., 8 Oct 2025, Ye et al., 14 Oct 2025).
  • Unified theoretical foundations: A general theory of recoverability, redundancy, and information preservation under cross-layer re-use remains open.

Common open questions include: precise characterization of which layers/heads are best merged; universal standards for calibration and validation; and boundary conditions under continued scaling or more exotic architectures (e.g., Mixture-of-Experts). Nevertheless, cross-layer KV cache sharing is a foundational ingredient for the next generation of efficient LLM inference (Li et al., 27 Dec 2024).
