
KV Sharing Variant: Methods & Advances

Updated 15 July 2025
  • KV Sharing Variant is an approach that modifies Transformer architectures to reuse key and value representations, cutting memory and computational overhead.
  • It employs intra-layer, cross-layer, and cross-request strategies, often combined with compression and low-rank techniques to optimize performance.
  • Advanced methods dynamically select sharing configurations to balance throughput, accuracy, and resource efficiency in high-demand inference settings.

A key–value (KV) sharing variant is any architectural or algorithmic modification to the standard Transformer or attention-based model that enables the sharing or re-use of key and/or value representations—originally computed during forward passes—to reduce memory and computational overhead. Such sharing can take place within a layer (intra-layer), across different layers (inter-layer or cross-layer), or across requests, and may employ various compression, grouping, or scheduling strategies. The concept is motivated by the observation that many KV representations are redundant, highly similar, or not uniformly needed, especially in long-context or high-throughput inference settings.

1. Fundamental Principles of KV Sharing Variants

A standard Transformer layer uses three projections—queries (Q), keys (K), and values (V)—to compute attention as $A = \mathrm{Softmax}(\alpha Q K^T) V$, where $\alpha$ is a scaling constant, and attention scores are computed independently at each layer. In contrast, KV sharing variants modify this scheme to share the K and/or V representations:

  • Intra-layer sharing: Keys and/or values are shared across multiple heads within a single attention layer, as in multi-query attention (MQA) or grouped-query attention (GQA) (2412.19442).
  • Cross-layer (inter-layer) sharing: Keys/values computed at a particular layer are reused at other layers instead of recomputing or caching separately (2410.14442, 2410.18517, 2507.08045).
  • Cross-request sharing: KV caches produced from similar or identical prefixes across batched or multi-tenant inference sessions are reused for different requests (2412.03594, 2503.16525).

Sharing may be combined with compression strategies (low-rank decomposition, quantization, token pruning) to further reduce the memory footprint.
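To make the intra-layer case concrete, here is a minimal sketch of grouped-query-style attention in which several query heads share one K/V head, so the cache stores only n_kv_heads (rather than n_q_heads) key and value vectors per token. The shapes, weight layout, and absence of masking and caching logic are illustrative assumptions, not any specific paper's implementation.

```python
# Minimal sketch of intra-layer KV sharing (grouped-query-attention style).
# Toy shapes, no causal mask, no KV cache bookkeeping; illustrative only.
# Wq: (d_model, d_model); Wk, Wv: (d_model, n_kv_heads * d_head).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(x, Wq, Wk, Wv, n_q_heads, n_kv_heads):
    """Each group of query heads attends against one shared K/V head."""
    T, d_model = x.shape
    d_head = d_model // n_q_heads
    group = n_q_heads // n_kv_heads                 # query heads per shared KV head

    q = (x @ Wq).reshape(T, n_q_heads, d_head)      # (T, Hq, d)
    k = (x @ Wk).reshape(T, n_kv_heads, d_head)     # (T, Hkv, d) -- the smaller cache
    v = (x @ Wv).reshape(T, n_kv_heads, d_head)

    outs = []
    for h in range(n_q_heads):
        kv = h // group                             # index of the shared KV head
        scores = softmax(q[:, h] @ k[:, kv].T / np.sqrt(d_head))
        outs.append(scores @ v[:, kv])
    return np.concatenate(outs, axis=-1)            # (T, d_model)
```

With n_kv_heads = 1 this reduces to multi-query attention, and with n_kv_heads = n_q_heads it recovers standard multi-head attention.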

2. Cross-Layer KV Sharing: Strategies and Frameworks

Recent research systematically examines techniques to share KV caches across Transformer layers. A unified framework (2410.14442) describes a mapping $\mathrm{kv}(i)$ that determines, for each layer $i$, which layer's KV cache it uses, allowing for a spectrum of sharing configurations:

  • "Bottom" (downward) sharing: Higher layers reuse KVs from earlier layers.
  • "Top" (upward) sharing: Lower layers reuse KVs from later layers.
  • "Middle" (intermediate) configurations: Flexible schemes that use a target position (bottom, top, or middle) within each group.

Partitioning strategies include:

  • Pizza: KV layers are grouped at the start.
  • Sandwich: KV layers split across beginning and end.
  • Lasagna: Layers are grouped uniformly.

Novel variants also consider cyclic dependencies (where a non-KV layer needs KVs that are not yet computed), which are resolved via token masking or iterative training (2410.14442). This systematic treatment enables tuning the trade-off between memory reduction, throughput, and accuracy.
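As a rough illustration of the mapping $\mathrm{kv}(i)$, the sketch below assigns each layer the index of the layer whose KV cache it reuses under a uniform ("lasagna"-like) grouping and under one plausible reading of the "pizza" partition; the exact definitions in (2410.14442) may differ.

```python
# Hedged sketch of a cross-layer sharing map kv(i): for each layer i, the index
# of the layer whose KV cache it reuses. Partition names are used loosely.
def kv_map(n_layers, n_kv_layers, strategy="lasagna", target="bottom"):
    if strategy == "lasagna":                        # uniform groups of layers
        group = n_layers // n_kv_layers
        owners = []
        for g in range(n_kv_layers):
            start, end = g * group, (g + 1) * group
            owner = start if target == "bottom" else end - 1
            owners.extend([owner] * group)
        return owners
    if strategy == "pizza":                          # KV-producing layers packed at the start
        return [min(i, n_kv_layers - 1) for i in range(n_layers)]
    raise ValueError(f"unknown strategy: {strategy}")

print(kv_map(12, 3))                 # [0, 0, 0, 0, 4, 4, 4, 4, 8, 8, 8, 8]
print(kv_map(12, 3, target="top"))   # [3, 3, 3, 3, 7, 7, 7, 7, 11, 11, 11, 11]
```

Choosing target="top" makes lower layers depend on caches produced later in the forward pass, which is exactly the kind of cyclic dependency that token masking or iterative training must resolve.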

Other approaches, such as KVSharer (2410.18517), propose a plug-and-play method that searches for dissimilar layer pairs (based on KV cache Euclidean distance) for sharing: surprisingly, reusing dissimilar rather than similar KV caches better preserves downstream performance.
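A minimal sketch of this search, assuming the per-layer caches are simply flattened and compared pairwise on a calibration batch (the exact distance computation and selection procedure in KVSharer may differ):

```python
# Hedged sketch of a KVSharer-style search: rank layer pairs by the Euclidean
# distance between their calibration-set KV caches and consider the most
# DISSIMILAR pairs first as sharing candidates. Flattening details are assumptions.
import numpy as np

def rank_layer_pairs(kv_caches):
    """kv_caches: list of per-layer KV arrays with identical shapes."""
    flat = [c.reshape(-1) for c in kv_caches]
    pairs = []
    for i in range(len(flat)):
        for j in range(i + 1, len(flat)):
            dist = np.linalg.norm(flat[i] - flat[j])
            pairs.append((dist, i, j))
    return sorted(pairs, reverse=True)   # most dissimilar first
```

Sharing could then, for example, be enabled greedily down this ranking until a target memory budget is reached.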

The Krul system (2507.08045) extends these ideas by dynamically adapting the sharing configuration per conversation, computing tokenwise (and batchwise) attention similarities, and preemptively excluding "input-sensitive" layers from sharing to minimize information loss for multi-turn responses.

3. Compression and Low-Rank Decomposition in KV Sharing

Many KV sharing variants integrate compression via low-rank or SVD-based techniques:

  • Channel shrinking: CSKV (2409.10593) identifies that the KV cache exhibits significant redundancy in the channel dimensions, using singular value decomposition to reduce representation dimensions.
  • Cross-layer SVD: xKV (2503.18893) finds that the dominant singular vectors of KV caches are highly aligned across multiple layers. By applying SVD jointly to a group of layers, xKV forms a shared low-rank subspace (e.g., $[X_{l_1}, X_{l_2}, \ldots] \approx U_r S_r V_r^T$), leading to compression rates of up to 6.8× over previous cross-layer merging techniques, sometimes even improving accuracy (see the sketch below).
  • Latent space sharing: CLLA (2410.15252) projects all layer hidden states to a shared latent space and stores latent KV states, allowing further sharing and quantization (down to 2% of the original KV size) across layers.
  • Distribution-aware merging: KeepKV (2504.09936) introduces an adaptive merging method with an "Electoral Votes" mechanism that tracks the merging history of each entry, adjusting attention contributions to ensure zero output perturbation.

Such strategies are designed to preserve key information while enabling groups of layers (not just a single layer) to share compressed representations, reducing the number of distinct KV caches that must be maintained and improving memory efficiency.
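As a concrete illustration of the cross-layer SVD idea, the sketch below concatenates the KV matrices of one layer group and keeps a shared rank-r subspace; the concatenation axis and which factors are actually stored are assumptions made for illustration.

```python
# Hedged sketch of cross-layer low-rank compression in the spirit of xKV:
# concatenate the KV matrices of one layer group and keep a shared rank-r
# subspace. The concatenation axis and the stored factors are assumptions.
import numpy as np

def joint_lowrank(kv_group, r):
    """kv_group: list of (T, d) KV matrices from layers in one sharing group."""
    X = np.concatenate(kv_group, axis=1)             # (T, d * n_layers_in_group)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    U_r, S_r, Vt_r = U[:, :r], S[:r], Vt[:r]         # shared rank-r factors
    X_hat = U_r @ np.diag(S_r) @ Vt_r                # low-rank reconstruction
    return U_r, S_r, Vt_r, X_hat
```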

4. Adaptive and Dynamic Sharing Algorithms

Recent methods incorporate dynamic and context-sensitive KV sharing strategies:

  • Dynamic selection based on attention similarity: Krul (2507.08045) performs runtime assessment of attention weights, only sharing between layers determined to exhibit "invariant" attention (such as those always focusing on the initial/recent tokens), with the decision re-evaluated per conversation.
  • Task-adaptive grouping: WindowKV (2503.17922) maintains semantic coherence in the cache by selecting contiguous windows of tokens with high attention scores, sharing the selection indices groupwise within a layer to reduce redundancy and compute cost.
  • Flexible route assignment: mixSGA (2506.13541) deploys a mixture-of-experts system, where tokens are dynamically routed to different experts (with varying KV grouping strategies) based on learned importance scores, allowing each token to receive proportional resources.
  • Hybrid cache management: TailorKV (2505.19586) adopts a hybrid model, quantizing "quantization-friendly" layers and dynamically offloading or retrieving dominant tokens in "sparsity-friendly" layers, optimizing both compute and memory.

These algorithms often combine online estimators, auxiliary loss for routing consistency, or per-batch/per-decode adaptation to balance accuracy, latency, and resource usage.
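A minimal sketch of the dynamic-selection idea described above, assuming sharing decisions are made by comparing head-averaged attention maps of adjacent layers and skipping layers flagged as input-sensitive; the similarity measure and threshold are illustrative rather than Krul's exact procedure.

```python
# Hedged sketch of dynamic sharing selection: enable sharing between adjacent
# layers only when their head-averaged attention maps are similar enough, and
# never for layers flagged as input-sensitive. Threshold is illustrative.
import numpy as np

def select_sharing(attn_maps, sensitive_layers, threshold=0.9):
    """attn_maps: list of per-layer (T, T) attention matrices, averaged over heads."""
    plan = {}                                        # layer -> layer whose KV it reuses
    for i in range(1, len(attn_maps)):
        if i in sensitive_layers:
            continue                                 # always recompute these layers
        a, b = attn_maps[i].ravel(), attn_maps[i - 1].ravel()
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        if cos >= threshold:
            plan[i] = i - 1                          # layer i reuses layer i-1's KV cache
    return plan
```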

5. System-Level KV Sharing in Multi-Tenant and Batched Settings

KV sharing is also applied at the scheduling and system level to enable cross-request reuse and enhance throughput:

  • Global prefix sharing and scheduling: BatchLLM (2412.03594) constructs a global prefix tree for each large batch, allowing the explicit scheduling and grouping of requests to maximize shared KV cache reuse. The system leverages memory-centric token batching and horizontally fused attention kernels for efficiency.
  • Semantic and differential cache reuse: KVShare (2503.16525) supports sharing caches across multi-tenant queries—not just for identical prefixes, but also for semantically similar prompts. A DELTA tree tracks fine-grained differences, marking placeholder tokens for selective recomputation, while a scheduler prioritizes requests with high KV hit rates.
  • Private and secure inference: In privacy-preserving scenarios, such as secure multi-party computation (MPCache) (2501.06807), static and query-aware dynamic selection algorithms are used to keep only the most relevant caches and share indices across adjacent layers, reducing secure protocol overhead.

Such mechanisms help reduce KV cache lifetimes, increase GPU utilization, compress caches across multiple user requests, and lower time-to-first-token (TTFT) while improving throughput.
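The cross-request case can be illustrated with a toy token-level prefix tree that maps shared prompt prefixes to previously computed KV blocks; real systems such as BatchLLM add scheduling, memory-centric batching, and fused kernels on top of this basic idea, so the sketch is only a conceptual starting point.

```python
# Toy token-level prefix tree for cross-request KV reuse: each node of the trie
# may hold a handle to the KV block computed for that prefix. Illustrative only.
class PrefixNode:
    def __init__(self):
        self.children = {}        # token id -> PrefixNode
        self.kv_block = None      # handle to the cached KV for this prefix

def insert(root, tokens, kv_block):
    node = root
    for t in tokens:
        node = node.children.setdefault(t, PrefixNode())
    node.kv_block = kv_block      # cache attached to the full prefix

def longest_shared_prefix(root, tokens):
    """Return (length, kv_block) of the longest cached prefix of `tokens`."""
    node, best = root, (0, None)
    for i, t in enumerate(tokens):
        if t not in node.children:
            break
        node = node.children[t]
        if node.kv_block is not None:
            best = (i + 1, node.kv_block)
    return best
```

A new request then reuses the KV block of its longest cached prefix and only needs to prefill the remaining suffix.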

6. Comparative Evaluation, Trade-offs, and Limitations

Empirical evaluations across language modeling benchmarks, conditional generation, and multi-turn settings show:

  • Throughput: KV sharing variants can more than double throughput and reduce per-token decoding latencies (e.g., Krul: 1.5×–2.68× TTFT reduction (2507.08045); BatchLLM: 1.3×–10.8× throughput improvement (2412.03594)).
  • Accuracy retention: Depending on configuration, reductions in cache size by a factor of 2–6× (and even 8×–50× for aggressive compression with quantization) are possible with negligible or modest degradation in model performance (2410.15252, 2503.18893); a back-of-the-envelope memory estimate follows this list.
  • Trade-offs: Aggressive cross-layer sharing increases the risk of accuracy loss, especially in input-sensitive or early layers, or when cyclic dependencies require iterative training and extra inference steps (2410.14442). The configuration (e.g., dynamic vs. static, bottom vs. top sharing) should be chosen based on the latency/memory/accuracy trade-off appropriate for the deployment scenario.
  • Compatibility: Plug-and-play approaches (e.g., KVSharer) can be combined with intra-layer compression for cumulative gains (2410.18517); joint approaches (e.g., hybrid quantization and offloading in TailorKV (2505.19586)) can be hardware-aware.
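To put the cache-size factors above in perspective, here is a back-of-the-envelope estimate under assumed model dimensions (a hypothetical 32-layer model with 8 KV heads of dimension 128 in fp16 at a 128k-token context); the share_group factor models cross-layer sharing across groups of layers.

```python
# Back-of-the-envelope KV cache size under assumed (hypothetical) dimensions:
# 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes), 128k-token context.
def kv_cache_gb(seq_len, n_layers, n_kv_heads, head_dim, bytes_per=2, share_group=1):
    distinct_layers = n_layers / share_group         # layers keeping their own cache
    per_token = 2 * distinct_layers * n_kv_heads * head_dim * bytes_per   # K and V
    return seq_len * per_token / 1e9

print(kv_cache_gb(128_000, 32, 8, 128))                   # ~16.8 GB baseline
print(kv_cache_gb(128_000, 32, 8, 128, share_group=4))    # ~4.2 GB with 4-layer sharing
```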

Scalability to very long contexts and robustness across diverse tasks remain ongoing research topics, along with the need for universality across model classes and practical integration into modern serving infrastructures.

7. Future Outlook and Research Directions

Open directions highlighted in the literature include:

  • Dynamic and fine-grained strategy selection: Automating layerwise or tokenwise adaptation based on input and task, possibly with online estimators or learned controllers (2507.08045, 2506.13541).
  • Cross-layer and cross-request combinatorics: Broader reuse of caches not only across layers but also across time (multi-turn, streaming) and requests (via semantic or syntactic match) (2503.16525).
  • Integration with emerging architectures: Extending the methods to Multi-Head Latent Attention, state-space models, or models with non-standard attention (2503.18893).
  • Lossless and theoretically grounded compression: Ensuring zero output perturbation (e.g., ZIP-Merging in KeepKV (2504.09936)), or balancing via discrepancy theory for streaming attention (2502.07861).
  • System and hardware co-optimization: Further leveraging asynchronous compute, dynamic pipeline balancing, and cache-aware scheduling to close the gap between theoretical and realized efficiency (2412.03594, 2505.19586).

In sum, KV sharing variants represent a rapidly advancing set of methods for scaling Transformer inference to long-context, high-throughput, and memory-constrained settings, with ongoing research aimed at deeper adaptivity, broader compatibility, and minimal loss of model accuracy.