
MiniKV: Efficient KV Cache Compression

Updated 10 February 2026
  • MiniKV is a memory-efficient technique that compresses transformer key-value caches using aggressive 2-bit quantization or low-rank approximations.
  • It integrates layer-discriminative heavy-hitter selection and fused CUDA kernels to dequantize on-the-fly, significantly reducing GPU memory usage.
  • The approach, including Grouped Query Attention variants, achieves substantial cache width reduction and maintains >98.5% baseline accuracy with improved throughput.

MiniKV is a class of memory-efficient Key-Value (KV) cache compression techniques for transformer-based LLMs, motivated by the need to enable fast, accurate inference on long-context tasks under severe GPU memory constraints. MiniKV combines aggressive quantization or low-rank approximation of the KV cache with algorithmic and systems-level innovations to dramatically reduce memory footprint while preserving accuracy and system throughput. The name “MiniKV” has been used for both a family of low-rank cache head compression techniques based on Grouped Query Attention (GQA) (Yu et al., 2024, Yan et al., 30 May 2025) and for a 2-bit layer-discriminative quantization scheme that outperforms prior quantizers while enabling high-throughput inference with long contexts (Sharma et al., 2024).

1. Motivation and Problem Definition

In LLM inference, the KV cache stores all past sequence Key ($K_h \in \mathbb{R}^{t \times d}$) and Value ($V_h \in \mathbb{R}^{t \times d}$) tensors for each decoder layer $h$, growing linearly with the sum of prompt ($L_\text{prompt}$) and generation ($L_\text{gen}$) lengths, model depth $H$, and hidden size $d$:

$$\#\text{bytes}_\text{KV} = 2 \times H \times d \times (L_\text{prompt} + L_\text{gen}) \times 2~\text{bytes (FP16)}$$

This rapid growth dominates GPU memory, limiting batch size and context length, and making efficient inference challenging for long-context LLM tasks. The goal of MiniKV approaches is to achieve high compression (target: 2-bit quantization or 50–75% cache width reduction) while recovering nearly all original accuracy, thus unlocking single-GPU long-context inference.
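The formula above can be evaluated directly; a minimal sketch, with illustrative LLaMA2-7B-like settings (32 layers, hidden size 4096, 32k-token context):

```python
def kv_cache_bytes(num_layers: int, hidden_size: int,
                   prompt_len: int, gen_len: int,
                   bytes_per_elem: int = 2) -> int:
    """2 tensors (K and V) x layers x hidden size x total tokens x bytes/element."""
    return 2 * num_layers * hidden_size * (prompt_len + gen_len) * bytes_per_elem

total = kv_cache_bytes(num_layers=32, hidden_size=4096,
                       prompt_len=30_000, gen_len=2_000)
print(f"{total / 2**30:.1f} GiB")  # → 15.6 GiB
```

At FP16 the cache alone approaches 16 GiB for a single 32k-token sequence, which is why batch size and context length become memory-bound.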

2. 2-Bit Layer-Discriminative Quantization

One line of MiniKV research implements an inference-only, 2-bit subchannel quantization applied in a layer-discriminative fashion (Sharma et al., 2024). Key technical aspects:

  • Heavy-hitter Selection: After a prompt-prefill phase, statically select a set of “heavy-hitter” token positions $S_h$ per layer; these persist through decoding and receive dedicated quantization.
  • Quantization Method: Each block of $g$ values in a selected sub-tensor is quantized as:
    • Compute $x_\text{min}$, $x_\text{max}$ in the block; scale $\Delta = (x_\text{max} - x_\text{min})/(2^b - 1)$ for $b = 2$.
    • Quantize $x_i$: $q_i = \mathrm{clip}(\mathrm{round}(x_i/\Delta) + z, 0, 3)$.
    • Dequantize: $\hat{x}_i = (q_i - z)\Delta$.
    • Sixteen 2-bit values are packed into one 32-bit word.
  • Layer-Discriminative Allocation: Budgets for heavy-hitter and recent-window tokens ($\alpha_{HH}, \alpha_{RW}$) are tuned per layer; empirically, “Pyramid” allocation (larger budgets in lower layers) performs best.
  • Integration with FlashAttention:
    • Selective FlashAttention: Modified kernel tracks cumulative softmax scores, enabling heavy-hitter selection during prefill.
    • Quantized Attention: Fused CUDA kernel dequantizes 2-bit blocks on-the-fly and applies matrix-vector multiplies, eliminating extra memory passes.
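The quantization steps above can be sketched in NumPy. This is a minimal reference implementation of the stated formulas, not the fused CUDA kernel; function names and the zero-point choice $z = \mathrm{round}(-x_\text{min}/\Delta)$ are illustrative assumptions:

```python
import numpy as np

def quantize_block(x: np.ndarray, bits: int = 2):
    """Asymmetric per-block quantization: q = clip(round(x/Δ) + z, 0, 2^b - 1)."""
    levels = 2**bits - 1                      # 3 for b = 2
    delta = (x.max() - x.min()) / levels      # scale Δ
    z = int(np.round(-x.min() / delta))       # integer zero point (assumed choice)
    q = np.clip(np.round(x / delta).astype(np.int64) + z, 0, levels)
    return q, delta, z

def dequantize_block(q: np.ndarray, delta: float, z: int) -> np.ndarray:
    return (q - z) * delta                    # x̂ = (q - z)Δ

def pack_16x2bit(q: np.ndarray) -> int:
    """Pack sixteen 2-bit codes into one 32-bit word (layout is illustrative)."""
    word = 0
    for i, v in enumerate(q[:16]):
        word |= int(v) << (2 * i)
    return word & 0xFFFFFFFF

rng = np.random.default_rng(0)
x = rng.normal(size=16).astype(np.float64)    # one block of g = 16 values
q, delta, z = quantize_block(x)
x_hat = dequantize_block(q, delta, z)
# Reconstruction error stays within Δ/2, matching the bound in the text.
assert np.max(np.abs(x - x_hat)) <= delta / 2 + 1e-9
```

The packed word holds 16 values in 4 bytes versus 32 bytes at FP16, the 2-bit compression the text describes (plus per-block $\Delta$ and $z$ metadata).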

Quantization error is bounded by $|x_i - \hat{x}_i| \leq \frac{1}{2}\Delta$, and the effect on $QK^\top$ is small if $\Delta$ is tuned layer-wise.

3. Low-Rank KV Cache and Head Compression (Grouped-Query Attention)

Earlier and co-evolving lines under the MiniKV name implement cache compression by leveraging empirical low-rank structure in multi-head KV matrices (Yu et al., 2024, Yan et al., 30 May 2025). The key procedures are:

  • Grouping and Factorization: Partition the $h$ heads into $g$ groups ($t = h/g$), compute joint Gram matrices for $K, V$ per group over calibration data, perform eigendecomposition or SVD, and select bases $\Psi_{i,r}, \Omega_{i,r}$.
  • Weight Fusion: Replace original projection matrices with low-rank fused weights:

$$\widehat{W}_{K,i} = [W_{K_{it}}, \ldots, W_{K_{it+t-1}}]\,\Psi_{i,r}^\top$$

$$\widehat{W}_{V,i} = [W_{V_{it}}, \ldots, W_{V_{it+t-1}}]\,\Omega_{i,r}^\top$$

$W_Q$ and $W_O$ are similarly updated for structural consistency, yielding an equivalent Grouped-Query Attention (GQA) model.

  • Rotary Position Embeddings: The basis for $K$ incorporates RoPE effects; keys are projected after applying the rotation for correctness.
  • Post-Compression Tuning: One or two epochs of low-rank adaptation (LoRA) or full-parameter fine-tuning can recover nearly all task performance.

The resulting GQA structure enables cache width reduction (and thus memory savings) by a factor of $g/h$, translating directly to faster decode throughput.
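The grouping-and-factorization step can be sketched as follows. This is a simplified single-group illustration under stated assumptions: the shared basis $\Psi$ is taken as the top-$r$ right singular vectors of concatenated calibration K activations (equivalent to eigendecomposition of the Gram matrix), and all shapes and names are hypothetical:

```python
import numpy as np

def group_low_rank_basis(K_cal: np.ndarray, rank: int) -> np.ndarray:
    """K_cal: (tokens, t*d) concatenated K activations of the t heads in one group.
    Returns Psi: (rank, t*d), a shared orthonormal row-space basis."""
    # Right singular vectors of K_cal == eigenvectors of the Gram matrix K_cal^T K_cal.
    _, _, vt = np.linalg.svd(K_cal, full_matrices=False)
    return vt[:rank]

def fuse_weights(W_heads: list, psi: np.ndarray) -> np.ndarray:
    """W_hat = [W_1, ..., W_t] Psi^T — one fused low-rank projection per group."""
    W_concat = np.concatenate(W_heads, axis=1)   # (hidden, t*d)
    return W_concat @ psi.T                      # (hidden, rank)

rng = np.random.default_rng(1)
hidden, d, t, rank = 64, 16, 4, 8
W_heads = [rng.normal(size=(hidden, d)) for _ in range(t)]
X = rng.normal(size=(256, hidden))               # calibration inputs
K_cal = X @ np.concatenate(W_heads, axis=1)      # (256, t*d)
psi = group_low_rank_basis(K_cal, rank)
W_fused = fuse_weights(W_heads, psi)
# The cache now stores rank-r codes per token instead of t full heads.
print(W_fused.shape)                             # → (64, 8)
```

Per token, the group's cache entry shrinks from $t \cdot d = 64$ values to $r = 8$, and the original K activations are approximated as the cached codes multiplied back through $\Psi$.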

4. Theoretical and Empirical Performance Analysis

Compression performance is characterized by memory, accuracy, and efficiency metrics:

  • 2-Bit MiniKV (Sharma et al., 2024): Achieves 86% compression ($\approx 0.14\times$ the FP16 cache size) with >98.5% of baseline accuracy on LongBench and similar benchmarks. The Pyramid allocation policy yields superior Pareto efficiency versus uniform and variance-based allocations.
  • Low-Rank MiniKV (Yu et al., 2024, Yan et al., 30 May 2025): Keeping half or a quarter of the KV heads achieves $2\times$–$4\times$ cache reduction with a <1.5 percentage-point accuracy drop after fine-tuning. On LLaMA2-7B ($32 \to 16$ heads), throughput increased by 66% ($8.05 \to 13.41$ tokens/s), and memory usage tracks the KV head reduction.
  • ReCalKV comparison (Yan et al., 30 May 2025): Head-wise Similarity-aware Reordering (HSR) outperforms naive grouping, and Offline Calibration and Matrix Fusion (OCMF) for value compression further reduces computational overhead while keeping the perplexity increase under 7% at 50–70% cache compression.

These results position MiniKV at the optimal accuracy–compression Pareto frontier for LLM KV cache compression.

5. Implementation, Integration, and Operational Constraints

MiniKV is an inference-only approach: no additional training or fine-tuning is required for 2-bit quantization (Sharma et al., 2024), and post-hoc fine-tuning is optional for low-rank GQA-based MiniKV (Yu et al., 2024). Key implementation details:

  • Two-Phase Inference: Prefill for heavy-hitter selection and quantization, then decode with mixed-precision/fused kernels (Sharma et al., 2024).
  • CUDA/FlashAttention Integration: Custom kernels fuse dequantization and attention, using shared-memory tiling and in-kernel accumulation. Peak memory remains $O(Ld)$.
  • Layer/Budget Tuning: Layer-discriminative allocation is necessary for maximizing accuracy under fixed memory; uniform policies yield non-optimal tradeoffs.
  • Supported Contexts and Latency: On A100 GPUs, MiniKV enables >40k token contexts (vs. 32k for FP16 and 40k for prior quantizers). Latency decreases by 10–30% at long contexts due to reduced memory movement.
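The prefill phase of the two-phase scheme can be sketched as below: accumulate each key position's softmax attention mass across prefill queries, then keep the top-$\alpha_{HH}$ positions plus a recent window. This mirrors the described heavy-hitter selection at a high level; shapes, budgets, and function names are illustrative assumptions:

```python
import numpy as np

def select_heavy_hitters(attn: np.ndarray, hh_budget: int, recent_window: int):
    """attn: (queries, keys) softmax weights from prefill.
    Returns the sorted kept token positions: heavy hitters + recent window."""
    n_keys = attn.shape[1]
    scores = attn.sum(axis=0)                        # cumulative softmax mass per key
    recent = set(range(max(0, n_keys - recent_window), n_keys))
    candidates = [i for i in range(n_keys) if i not in recent]
    heavy = sorted(candidates, key=lambda i: scores[i], reverse=True)[:hh_budget]
    return sorted(set(heavy) | recent)

rng = np.random.default_rng(2)
logits = rng.normal(size=(8, 32))                    # toy prefill attention logits
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
kept = select_heavy_hitters(attn, hh_budget=4, recent_window=4)
print(len(kept))  # → 8 positions kept: 4 heavy hitters + 4 recent
```

The kept set is frozen after prefill, which is exactly the static-selection property discussed below; only these positions receive dedicated 2-bit storage during decoding.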

The static nature of heavy-hitter selection may miss rare dynamic shifts in token importance, although in practice this rarely degrades standard tasks.

6. Applications, Benchmarks, and Limitations

MiniKV has been evaluated on:

  • Tasks: LongBench (single-/multi-doc QA, synthetic, code, summarization, few-shot), zero-shot QA, WikiText2, PTB.
  • Models: LLaMA2-7B/13B, Mistral-7B, BLOOMZ-7B1.
  • Baselines: FP16, KIVI (2-bit quantization), H₂O, SnapKV (selective quantization), Palu, ReCalKV, GQA.

MiniKV supports direct plug-in for any FlashAttention-based pipeline (2-bit variant) or models where weight fusion is feasible (GQA variant). Practical limitations include:

  • Static heavy-hitter set (may be suboptimal if token relevance shifts mid-generation),
  • Requirement for careful per-layer budget tuning,
  • Fixed quantization group size (overly small groups may degrade accuracy or increase metadata overhead).

7. Future Research Directions

Suggested advancements include:

  • Adaptive heavy-hitter set updates during generation,
  • Joint optimization of quantization group size versus performance,
  • Extensions to mixture-of-experts/routing-transformer architectures.

The combination of layer-aware, aggressive quantization or low-rank approximation, task-aware budget scheduling, and custom systems integration establishes MiniKV as a high-performance solution for KV cache compression in long-context and high-throughput LLM inference (Sharma et al., 2024, Yu et al., 2024, Yan et al., 30 May 2025).
