
LayerKV: Layer-Aware KV Cache Compression

Updated 24 January 2026
  • LayerKV is a suite of techniques for managing and compressing the key-value cache in Transformer models, reducing the memory burden in long-context inference.
  • It employs methods like cross-layer sharing, adaptive eviction, and layer-wise quantization to achieve up to 90% memory reduction with minimal accuracy loss.
  • LayerKV techniques are integrated into serving stacks to enhance throughput and latency, enabling larger batch sizes and efficient offloading between CPU and GPU.

LayerKV is a general term for a family of techniques for layer-aware Key-Value (KV) cache management and compression in Transformer-based LLMs. These approaches aim to shrink the KV cache memory footprint, which grows linearly with sequence length and model depth and can account for over 80% of total inference memory in typical long-context deployments. Methods subsumed under LayerKV include cross-layer KV sharing, per-layer cache eviction, adaptive budget allocation, and layer-wise quantization. Recent literature shows that these methods can cut memory and latency substantially (commonly by up to 90% or more) while incurring negligible performance loss.

1. Theoretical Motivation and Background

The use of a per-layer KV cache in Transformer inference is essential for efficient generation: each token’s hidden state is projected to keys and values for each layer, which are then used during self-attention at subsequent timesteps. The cost of storing all KV pairs is

O(L · T · d)

where L is the number of layers, T the context length, and d the hidden dimension. For long-context inference, this may exceed GPU capacity, limiting throughput and maximum batch size (Yang et al., 2024, Li et al., 2024).
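
As a rough illustration of this cost, the sketch below computes the cache size for a hypothetical configuration; the layer count, head shapes, and fp16 storage are assumptions chosen for the example, not figures from the cited papers.

```python
def kv_cache_bytes(num_layers: int, context_len: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2, batch_size: int = 1) -> int:
    """Estimate KV cache size: one K and one V tensor per layer, per token."""
    return 2 * num_layers * batch_size * context_len * num_kv_heads * head_dim * bytes_per_elem

# Hypothetical 32-layer model with 8 KV heads of dimension 128 and an fp16 cache.
size = kv_cache_bytes(num_layers=32, context_len=128_000, num_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.1f} GiB")  # ~15.6 GiB for a single 128k-token request
```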

Empirically, not all layers require the same amount of cache to maintain model fidelity. Some tokens and layers can be pruned or quantized more aggressively, with limited accuracy degradation. LayerKV strategies exploit layerwise differences in attention patterns, hidden state similarity, and task sensitivity to optimize for both memory and compute (Wu et al., 2024, Xiong et al., 2024, Li et al., 2024).

2. Cross-Layer KV Sharing and Compression

Many LayerKV techniques use cross-layer sharing to structurally reduce redundancy:

  • Fixed Layer Sharing: Only a subset l of the L layers store and update their own KV cache, and all other layers reuse those stored KV pairs, reducing cache size proportionally to l/L (Wu et al., 2024, Wu et al., 2024). Supported variants include "pizza-style," "sandwich-style," and "lasagna-style" mappings, each controlling how KV pairs are allocated and reused across groups of layers (a minimal sketch of this producer/consumer pattern follows the table below).
  • Parameter Sharing with SVD Compression: Advanced methods unite adjacent layers into groups, applying truncated Singular Value Decomposition (SVD) on the concatenated projection matrices to construct a shared latent representation. During inference, the cache stores only these compressed latents, and all merged layers reconstruct their keys and values from the latent using layer-specific factors (Wang et al., 22 Aug 2025).
  • Layer-wise Dissimilarity-Based Sharing: Rather than merging similar layers, KVSharer (Yang et al., 2024) finds that sharing KV caches between more dissimilar layers better preserves downstream accuracy, contradicting the intuition that redundancy is what should be merged. It relies on an offline calibration pass to select which layer pairs share a cache, ensuring minimal hidden-state perturbation.

These sharing schemes yield immediate 2–3× (or higher) reductions in cache memory, often matching vanilla performance within 1–2% perplexity on both language modeling and downstream tasks (Wu et al., 2024, Wu et al., 2024).

Method | Core Mechanism | Compression Ratio | Performance Retention
Standard LayerKV | Share KV of every N-th layer | N-fold | 98–100% (large l)
SVD-based (CommonKV) | SVD parameter grouping | up to 98% | 90–95% (high α)
Dissimilarity-based | Selective sharing | 25–42% | >95% (moderate)
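
To make the producer/consumer pattern behind fixed layer sharing concrete, the following sketch stores KV only for every N-th layer and lets the remaining layers read from their assigned producer. The mapping function, class names, and the every-N-th assignment are illustrative assumptions; the pizza/sandwich/lasagna variants of the cited methods differ only in how consumers are mapped to producers.

```python
from typing import Dict, List, Tuple
import torch

def every_nth_mapping(num_layers: int, stride: int) -> Dict[int, int]:
    """Map each layer to the producer layer whose KV cache it reuses.

    A generic "share every N-th layer" scheme; other mappings in the
    literature are different assignments of the same producer->consumer idea.
    """
    producers = list(range(0, num_layers, stride))
    return {layer: max(p for p in producers if p <= layer) for layer in range(num_layers)}

class SharedKVCache:
    """Only producer layers append K/V; consumer layers read the producer's cache."""
    def __init__(self, mapping: Dict[int, int]):
        self.mapping = mapping
        self.store: Dict[int, List[Tuple[torch.Tensor, torch.Tensor]]] = {
            p: [] for p in set(mapping.values())
        }

    def update(self, layer: int, k: torch.Tensor, v: torch.Tensor) -> None:
        if layer in self.store:      # producer layer: write its own KV
            self.store[layer].append((k, v))
        # consumer layers write nothing -- that is where the memory saving comes from

    def get(self, layer: int) -> List[Tuple[torch.Tensor, torch.Tensor]]:
        return self.store[self.mapping[layer]]

mapping = every_nth_mapping(num_layers=8, stride=4)   # layers {0, 4} hold KV for all 8 layers
cache = SharedKVCache(mapping)
```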

3. Layer-Wise Budget Allocation and Adaptive Eviction

A major advance in LayerKV is the dynamic allocation of budget or cache slots to each layer according to token and layer importance. Several orthogonal metrics and algorithms are employed:

  • Attention-Mass Based Allocation: Methods such as PrefixKV, XKV, ZigZagKV, and CAKE estimate, per layer, the cumulative attention mass or importance score for tokens, often using the output of attention matrices computed during a mini-prefill phase or sliding window (Wang et al., 2024, Li et al., 2024, Zhong et al., 2024, Qin et al., 16 Mar 2025).
  • Uncertainty-Driven and Entropic Allocation: ZigZagKV uses the minimum per-layer budget required to retain, say, 90% attention from each head (LMBA) as an "uncertainty" score guiding the allocation (Zhong et al., 2024). CAKE aggregates entropy (spatial dispersion) and variance (temporal attention shift) as preference signals, allocating more cache to layers with diffuse and dynamic attention patterns (Qin et al., 16 Mar 2025).
  • Greedy Knapsack and Binary Search: Some approaches (e.g., XKV) pose cache allocation as a discrete knapsack optimization, solved by greedy selection, while PrefixKV employs a global binary search to meet total budget constraints with maximal retained "attention mass" (Li et al., 2024, Wang et al., 2024).

Dynamic allocation is typically far superior to uniform layer budgeting, consistently supporting larger memory reductions for the same accuracy loss and sometimes achieving up to 61.6% reduction in total cache with <2-point drop on multitask benchmarks (Li et al., 2024, Wang et al., 2024).
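
A minimal version of this globally budgeted, greedy allocation can be written as follows; the scoring inputs, tie-breaking, and helper names are assumptions for illustration rather than the exact XKV or PrefixKV procedures.

```python
import heapq
from typing import List

def allocate_layer_budgets(importance: List[List[float]], total_budget: int) -> List[int]:
    """Greedy global allocation of KV slots across layers.

    importance[l][i] is a proxy for the attention mass token i carries in
    layer l (e.g., accumulated during a prefill pass). Slots are handed out
    one at a time to the globally highest-scoring remaining token.
    """
    # Max-heap over every candidate (token, layer) pair, keyed by score.
    heap = [(-s, layer) for layer, scores in enumerate(importance) for s in scores]
    heapq.heapify(heap)
    budgets = [0] * len(importance)
    for _ in range(min(total_budget, len(heap))):
        _, layer = heapq.heappop(heap)
        budgets[layer] += 1          # give one more retained slot to this layer
    return budgets

# Toy example: layer 1 has the most concentrated attention mass, so it gets more slots.
per_layer_scores = [[0.1, 0.2, 0.1], [0.9, 0.8, 0.4], [0.3, 0.2, 0.1]]
print(allocate_layer_budgets(per_layer_scores, total_budget=5))  # -> [1, 3, 1]
```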

4. Per-Layer Quantization and Mixed Precision

LayerKV techniques increasingly combine selective eviction with layer-wise quantization:

  • Layer-Discriminative Low-Bit Quantization: MiniKV and KVmix deploy 2–4 bit quantization of key/value caches, but crucially, the number of bits per layer—and sometimes per-projection—depends on measured or gradient-based importance (Sharma et al., 2024, Li et al., 18 May 2025). In MiniKV, lower layers receive larger persistent caches and potentially higher precision, while upper layers are aggressively quantized, following a "pyramid" allocation.
  • CUDA/FlashAttention Integration: MiniKV designs custom CUDA kernels compatible with FlashAttention, which maintain column-wise accumulated scores and quantize/dequantize on the fly during attention matmuls (Sharma et al., 2024). Quant+Concat and Dequant+MatVec are fused operations that minimize the runtime cost of quantization.

Empirical results show that 2–4 bit per-layer quantization (with per-layer dynamic allocation) yields up to an 8× reduction in total KV cache memory with ≥98.5% accuracy recovery on standard tasks, outperforming uniform or token-only quantization (Sharma et al., 2024, Li et al., 18 May 2025).

Method | Quantization Bits | Compression Ratio | Avg Score Retained
MiniKV-Pyramid | 2 (layer-adaptive) | 86% | ≥98.5%
KVmix | 2.19 (K) / 2.38 (V) | 4.9× | +0.92% ΔAcc
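
The sketch below pairs a per-layer bit assignment with a simple symmetric quantizer to show the overall flow; the pyramid-style bit schedule, tensor shapes, and int8 storage (real kernels pack 2- or 4-bit values) are illustrative assumptions, not the MiniKV or KVmix kernels.

```python
import torch

def quantize_kv(x: torch.Tensor, bits: int):
    """Symmetric per-tensor quantization to `bits` (a simplified stand-in for
    the per-channel/per-group schemes used in practice)."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Hypothetical layer-adaptive bit assignment: lower layers keep more precision,
# upper layers are quantized aggressively, loosely following a "pyramid" shape.
num_layers = 8
bits_per_layer = [4 if layer < num_layers // 2 else 2 for layer in range(num_layers)]

k = torch.randn(1, 8, 128, 128)                 # (batch, kv_heads, seq, head_dim), assumed shape
q, scale = quantize_kv(k, bits_per_layer[6])    # layer 6 -> 2-bit in this toy schedule
k_hat = dequantize_kv(q, scale)
print((k - k_hat).abs().mean())                 # reconstruction error of the 2-bit cache
```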

5. Cache Eviction and Defensive Aggregation

Eviction-based LayerKV methods evaluate per-token importance for each layer and select the most significant entries to retain:

  • Robust Scoring and Risk Management: DefensiveKV and Layer-DefensiveKV propose robust linear-time aggregation (worst-case plus prior correction) to avoid the fragility of mean-based importance measures under non-stationary or outlier attention patterns (Feng et al., 15 Oct 2025). The aggregation procedure ensures that rare "spikes" in importance are not lost, thus sharply reducing worst-case generation quality loss.
  • Cascading and Shift-Tolerant Policies: CAKE and LAVa adapt to varying task types and shifting token importances by explicitly tracking mean and variance of attention indicators with a sliding window, enabling both spatial and temporal adaptivity in retained cache entries (Qin et al., 16 Mar 2025, Shen et al., 11 Sep 2025).

Layer-wise eviction methods (especially with mixed or defensive aggregation) dominate previous uniform or purely token-based approaches, commonly halving or quartering memory usage at the same accuracy (Feng et al., 15 Oct 2025, Qin et al., 16 Mar 2025, Shen et al., 11 Sep 2025).
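
As a rough illustration of robust (non-mean) aggregation for eviction, the sketch below keeps the strongest per-head evidence for each token and blends in a uniform prior before selecting the retained set; the aggregation formula and prior weight are assumptions, not the exact DefensiveKV objective.

```python
import torch

def robust_token_importance(attn: torch.Tensor, prior_weight: float = 0.1) -> torch.Tensor:
    """Aggregate per-head attention into a per-token keep score.

    attn has shape (heads, queries, keys). Instead of averaging over heads and
    queries (which can wash out rare but critical spikes), this sketch keeps
    the worst-case (max) evidence per token and blends in a uniform prior.
    """
    per_head = attn.amax(dim=1)             # strongest query for each (head, key)
    worst_case = per_head.amax(dim=0)       # strongest head for each key
    prior = torch.full_like(worst_case, 1.0 / attn.shape[-1])
    return (1 - prior_weight) * worst_case + prior_weight * prior

def evict(attn: torch.Tensor, keep: int) -> torch.Tensor:
    """Return indices of the `keep` tokens retained in this layer's cache."""
    scores = robust_token_importance(attn)
    return torch.topk(scores, k=min(keep, scores.numel())).indices

attn = torch.rand(8, 16, 64).softmax(dim=-1)    # toy (heads, queries, keys) attention
print(evict(attn, keep=16))
```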

6. Systems Integration and Performance Gains

LayerKV has substantial consequences for serving stacks:

  • Layer-Wise Block Scheduling and Offloading: The LayerKV system (Xiong et al., 2024) partitions KV blocks across CPU and GPU according to a per-layer schedule for each request. It overlaps PCIe transfer for swapped-out layers with on-GPU compute for retained layers, sharply reducing each request's instantaneous demand for GPU KV slots (a minimal prefetch-overlap sketch appears after this list).
  • SLO-Aware and Parallelism-Compatible Design: LayerKV extends standard serving engines (vLLM, FastGen) with a scheduler that respects both time-per-output-token (TPOT) and time-to-first-token (TTFT) service level objectives, and supports data, tensor, and pipeline parallelism transparently (Xiong et al., 2024).
  • Throughput and Latency: In heavy-load scenarios, LayerKV achieves up to 69× TTFT reductions, reduces SLO violation rates by up to 28.7%, and often enables a >5× increase in batch size at fixed hardware (e.g., with XKV) or even 26× faster decoding when combined with methods such as LCKV (Xiong et al., 2024, Li et al., 2024, Wu et al., 2024).
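
The block-scheduling bullet above can be illustrated with a generic prefetch-overlap pattern in PyTorch: copy the next layer's offloaded KV block on a side CUDA stream while attending over the current one. The block shapes, the round-robin loop, and the mean() stand-in for attention are assumptions; this is not the LayerKV scheduler itself.

```python
import torch

def prefetch_layer_kv(cpu_kv: torch.Tensor, stream: torch.cuda.Stream) -> torch.Tensor:
    """Asynchronously copy one layer's offloaded KV block from pinned host
    memory to the GPU on a side stream, so the PCIe transfer overlaps with
    the compute of the current layer."""
    with torch.cuda.stream(stream):
        return cpu_kv.to("cuda", non_blocking=True)

if torch.cuda.is_available():
    copy_stream = torch.cuda.Stream()
    # Hypothetical per-layer KV blocks kept in pinned CPU memory.
    cpu_blocks = [torch.randn(2, 8, 4096, 128).pin_memory() for _ in range(4)]
    gpu_block = cpu_blocks[0].to("cuda")

    for layer in range(4):
        # Kick off the copy of the *next* layer's KV before computing this one.
        nxt = prefetch_layer_kv(cpu_blocks[(layer + 1) % 4], copy_stream)
        _ = gpu_block.mean()                                  # stand-in for attention over this layer's KV
        torch.cuda.current_stream().wait_stream(copy_stream)  # ensure the prefetch has finished
        gpu_block = nxt
```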

7. Comparative Results and Integration with Other Techniques

LayerKV schemes are orthogonal to and compatible with other cache reduction techniques, including:

  • Token Eviction (e.g., SnapKV, H₂O): LayerKV can share a single token-eviction mask across merged layers or combine cross-layer sharing with intra-layer eviction for compounded savings (Wang et al., 22 Aug 2025, Yang et al., 2024); a minimal composition sketch follows this list.
  • Quantization: Post-SVD or post-eviction, on-the-fly quantization (int4, int2, mixed-precision) further compresses cached representations, supporting up to 98% cache reduction with <10% impact on downstream metrics (Wang et al., 22 Aug 2025, Sharma et al., 2024).
  • Fusion with Specialized Architectures: LayerKV shares mechanisms with techniques such as FusedKV, which reconstructs top-layer KV states by learned fusion (e.g., bottom and middle layer mixing) and preserves RoPE properties (Lin et al., 3 Dec 2025).
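
To illustrate how these pieces compose, the sketch below applies one shared token-eviction mask across a group of layers that share a cache, then quantizes whatever survives; the grouping, scoring input, and 4-bit symmetric quantizer are placeholder assumptions rather than any single paper's pipeline.

```python
import torch

def compress_group_kv(group_kv, importance: torch.Tensor, keep: int, bits: int = 4):
    """Apply one shared eviction mask across a group of merged layers, then
    quantize the retained entries.

    group_kv: list of (K, V) pairs for layers sharing a cache group, each of
              shape (heads, seq, head_dim).
    importance: per-token keep scores of shape (seq,), e.g. from a
              SnapKV-style scoring pass (assumed input).
    """
    keep_idx = torch.topk(importance, k=keep).indices        # one mask for the whole group
    qmax = 2 ** (bits - 1) - 1
    compressed = []
    for k, v in group_kv:
        k, v = k[:, keep_idx], v[:, keep_idx]                 # shared token eviction
        for t in (k, v):
            scale = t.abs().amax().clamp(min=1e-8) / qmax
            compressed.append((torch.round(t / scale).to(torch.int8), scale))
    return keep_idx, compressed

kv = [(torch.randn(8, 256, 128), torch.randn(8, 256, 128)) for _ in range(2)]
idx, packed = compress_group_kv(kv, importance=torch.rand(256), keep=64)
```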

Across benchmarks such as LongBench, RULER, and NeedleBench, LayerKV methods consistently outperform both static allocation and token-only approaches, especially under tight memory constraints (Feng et al., 15 Oct 2025, Qin et al., 16 Mar 2025, Shen et al., 11 Sep 2025).


In summary, LayerKV refers broadly to a family of methods that minimize the memory and system cost of LLM KV caching through adaptive, layer-wise sharing, eviction, and quantization. These techniques are now foundational to both research and practical deployment of long-context LLMs, supporting state-of-the-art efficiency and fidelity.
