MiKV: Mixed-Precision KV Compression

Updated 2 June 2026
  • MiKV is a mixed-precision key-value cache compression method that allocates bit-widths based on token importance to reduce transformer memory usage.
  • It partitions tokens into high-importance and retained sets, storing critical tokens at high precision and quantizing others aggressively.
  • Empirical evaluations on Llama-2 and Mistral-7B models show that MiKV retains over 95% of baseline accuracy at a 20% cache size, substantially reducing the memory footprint.

The MiKV approach, or Mixed-precision Key-Value cache compression, is an advanced method designed to address the scale-induced memory bottleneck of transformer-based LLM inference. It enables substantial memory reductions for key-value (KV) caches in multi-layer, multi-head attention models by selectively allocating bit-widths to stored cache entries based on token-level importance. MiKV is empirically validated on Llama-2 and Mistral-7B-class LLMs across line retrieval, mathematical reasoning, code generation, knowledge, and instruction-following tasks, achieving state-of-the-art memory-quality tradeoffs over prior eviction or uniform quantization schemes (Yang et al., 2024).

1. Problem Scope and Motivation

Transformer models cache the output of their key and value projection layers during autoregressive generation, resulting in cumulative memory footprints that dramatically exceed model parameter sizes at large sequence lengths and batch sizes. For a transformer with batch size $B$, sequence length $T$, $L$ layers, $H$ heads, and per-head dimension $d_h$, the KV cache requires

$$M_{\text{full}} = B\,T\,L\,H\,d_h \cdot \text{bytes\_per\_element} \cdot 2$$

bytes (the factor 2 accounts for both $K$ and $V$). Standard approaches (eviction, uniform quantization, static downsampling) induce severe task degradation under moderate compression. MiKV's central motivation is preserving semantic and safety-critical information for reliable text generation with minimal memory overhead.
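As a worked instance of the formula above, the following sketch evaluates the full cache size for a Llama-2-7B-shaped model (32 layers, 32 heads, head dimension 128) at batch size 8 and 4096 tokens. The function name `kv_cache_bytes` is a hypothetical helper introduced here for illustration, and the element widths (2 bytes for FP16, 4 bytes for FP32) are assumptions about the storage format.

```python
def kv_cache_bytes(batch, seq_len, layers, heads, head_dim, bytes_per_element):
    """Full KV cache size in bytes; the factor 2 covers keys and values."""
    return batch * seq_len * layers * heads * head_dim * bytes_per_element * 2

# Llama-2-7B shape (32 layers, 32 heads, head dimension 128), batch 8, 4096 tokens.
shape = dict(batch=8, seq_len=4096, layers=32, heads=32, head_dim=128)
fp16_gb = kv_cache_bytes(**shape, bytes_per_element=2) / 1e9
fp32_gb = kv_cache_bytes(**shape, bytes_per_element=4) / 1e9
print(f"FP16: {fp16_gb:.2f} GB, FP32: {fp32_gb:.2f} GB")
# prints "FP16: 17.18 GB, FP32: 34.36 GB"
```

Under these assumptions, the 4-byte case reproduces the 34.36 GB full-cache figure quoted in Section 5, while 2-byte storage would give half of that.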

2. Importance-Aware Token Partitioning

MiKV utilizes an importance estimator over cached tokens to inform cache management. The core scoring function for token index $i$ is:

$$s_i = \sum_{l=1}^{L}\sum_{h=1}^{H}\sum_{t=i+1}^{T} A_{t,i}^{l,h}$$

where $A_{t,i}^{l,h}$ is the attention placed on token $i$ during the creation of token $t$, in layer $l$, head $h$. This score quantifies each token's influence on subsequent decoding steps. For a user-specified importance fraction (e.g., 20%), tokens are ranked by $s_i$ and split into two disjoint sets:

  • Important set: the top-scoring fraction of tokens.
  • Retained set: the remaining tokens.

This partitioning can leverage external importance heuristics, but the attention-based method directly exploits the transformer’s attribution behavior, ensuring that highly influential tokens remain losslessly represented.
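The scoring and partitioning step can be illustrated with a small sketch. It accumulates, for each token, the attention it receives from all later positions across layers and heads, then splits token indices by rank. The nested-list layout `attn[l][h][t][i]`, the function names, and the rounding rule for the set size are assumptions made for this example rather than details taken from the source.

```python
def importance_scores(attn):
    """attn[l][h][t][i]: attention from query position t to key position i (causal)."""
    num_tokens = len(attn[0][0])
    scores = [0.0] * num_tokens
    for layer in attn:
        for head in layer:
            for t in range(num_tokens):
                for i in range(t):  # only later positions t > i contribute
                    scores[i] += head[t][i]
    return scores

def partition(scores, fraction):
    """Return (important, retained) index lists; important = top `fraction` by score."""
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k]), sorted(ranked[k:])

# Toy example: one layer, one head, four tokens; each row sums to 1.
attn = [[[
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.6, 0.1, 0.3, 0.0],
    [0.5, 0.1, 0.3, 0.1],
]]]
scores = importance_scores(attn)
important, retained = partition(scores, fraction=0.25)
```

In this toy case the first token collects most of the later attention, so it alone forms the important set and the other three tokens are retained at lower precision.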

3. Mixed-Precision Quantization Mechanism

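For tokens in the important set, MiKV stores their keys and values at high precision (FP16 or 8-bit). For tokens in the retained set, KV pairs are aggressively quantized (4- or 2-bit). The quantization of a vector $x$ to $b$ bits employs asymmetric round-to-nearest quantization, which in its standard min-max form reads:

$$Q(x) = \mathrm{round}\!\left(\frac{x - \min(x)}{\alpha}\right), \qquad \alpha = \frac{\max(x) - \min(x)}{2^{b} - 1}$$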

Quantized KVs are dequantized by the inverse transformation during attention computation.
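A minimal sketch of asymmetric quantization and its inverse follows, assuming the common min-max form in which the step size comes from the value range and the offset is the minimum. The function names and the per-vector granularity are choices made for this illustration; the source does not specify them here.

```python
def quantize(x, bits):
    """Asymmetric min-max quantization of a list of floats to `bits`-bit integer codes."""
    lo, hi = min(x), max(x)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0  # guard against constant vectors
    codes = [round((v - lo) / scale) for v in x]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Inverse transformation: map integer codes back to approximate floats."""
    return [c * scale + lo for c in codes]

x = [-1.0, -0.2, 0.1, 0.5, 2.0]
codes, scale, offset = quantize(x, bits=2)
x_hat = dequantize(codes, scale, offset)
max_error = max(abs(a - b) for a, b in zip(x, x_hat))
```

With 2 bits there are only four levels across the range, so the reconstruction error per element can reach half a step (`scale / 2`). This is why the single large entry in `x` costs precision for all the smaller ones.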

Critical to avoiding quantization-induced degradation is outlier-aware channel balancing: a balancing factor for each layer, head, and channel is calculated during the prefill phase from the channel-wise maximum magnitudes.

Such channel-wise rescaling shifts outlier-magnitude variance from the keys to the queries, enabling accurate quantization at very low bit-widths (such as 2-bit) for non-important tokens while preserving softmax computational fidelity, as queries remain at high precision.
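The idea behind channel balancing can be sketched as follows. The exact balancing formula is not reproduced in this example; as a loudly labeled assumption, the factor for each channel is taken to be the maximum key magnitude observed in that channel, keys are divided by it, and the query is multiplied by it. All function names are hypothetical.

```python
def channel_factors(keys):
    """Per-channel factor: max |k| over the prefill tokens (assumed form, see lead-in)."""
    dim = len(keys[0])
    return [max(abs(k[c]) for k in keys) or 1.0 for c in range(dim)]

def balance_keys(keys, factors):
    """Divide each key channel by its factor so every channel spans at most [-1, 1]."""
    return [[k[c] / factors[c] for c in range(len(factors))] for k in keys]

def balance_query(query, factors):
    """Multiply the query by the same factors so dot products are unchanged."""
    return [query[c] * factors[c] for c in range(len(factors))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Channel 1 of the keys carries outliers roughly 100x larger than the others.
keys = [[0.5, -40.0, 0.2], [-0.3, 35.0, 0.1], [0.4, -38.0, -0.25]]
query = [1.0, 0.05, -2.0]

factors = channel_factors(keys)
keys_bal = balance_keys(keys, factors)
query_bal = balance_query(query, factors)
```

Because the scaling applied to the keys is undone on the query side, the attention logits are mathematically unchanged, while the balanced keys no longer have one channel dominating the quantization range. The outlier magnitude now sits in the query, which is kept at high precision.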

4. Operational Workflow and Implementation

The MiKV pipeline is composed of:

  • Prefill Phase: Compute channel maximums and balancing factors for each (layer, head, channel). Score prompt tokens by importance.
  • Generation Loop: For each new token, compute and store its KV at full or quantized precision according to its assignment to the important or retained set. Quantize non-important tokens using per-channel scaling.
  • Inference: During attention, high-importance caches are accessed at high precision, while low-importance entries are dequantized prior to matrix multiplication.

This design avoids the information loss of hard eviction and the context loss typical of static quantization, with negligible runtime impact.
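The workflow described in this section can be approximated by a toy single-head cache. This is a sketch under stated assumptions, not the authors' implementation: the class `MixedPrecisionCache`, the per-vector min-max quantizer, and the hand-picked importance flags are all hypothetical stand-ins, and channel balancing is omitted for brevity.

```python
import math

def quantize(x, bits):
    """Asymmetric min-max quantization; returns (codes, scale, offset)."""
    lo, hi = min(x), max(x)
    scale = (hi - lo) / ((1 << bits) - 1) if hi > lo else 1.0
    return [round((v - lo) / scale) for v in x], scale, lo

def dequantize(codes, scale, lo):
    return [c * scale + lo for c in codes]

class MixedPrecisionCache:
    """Toy single-head KV cache (hypothetical helper for illustration)."""

    def __init__(self, retained_bits=4):
        self.retained_bits = retained_bits
        self.entries = []  # one record per cached token, in sequence order

    def append(self, key, value, important):
        if important:
            # Important tokens keep their original floating-point vectors.
            self.entries.append(("full", key, value))
        else:
            # Retained tokens are stored as low-bit integer codes.
            self.entries.append(("quant",
                                 quantize(key, self.retained_bits),
                                 quantize(value, self.retained_bits)))

    def materialize(self):
        """Return all keys and values as floats, dequantizing where needed."""
        keys, values = [], []
        for kind, k, v in self.entries:
            if kind == "quant":
                k, v = dequantize(*k), dequantize(*v)
            keys.append(k)
            values.append(v)
        return keys, values

def attend(query, cache):
    """Scaled dot-product attention of one query over the whole cache."""
    keys, values = cache.materialize()
    inv_sqrt_d = 1.0 / math.sqrt(len(query))
    logits = [inv_sqrt_d * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    return [sum(w * v[c] for w, v in zip(weights, values)) / total
            for c in range(len(values[0]))]

tokens = [([0.9, -0.4], [1.0, 0.0]), ([0.1, 0.3], [0.0, 1.0]), ([-0.5, 0.8], [0.5, 0.5])]
flags = [True, False, False]  # pretend the importance scorer picked token 0

mixed, reference = MixedPrecisionCache(retained_bits=8), MixedPrecisionCache()
for (k, v), important in zip(tokens, flags):
    mixed.append(k, v, important)
    reference.append(k, v, important=True)  # everything at full precision

query = [1.0, 0.5]
out_mixed, out_ref = attend(query, mixed), attend(query, reference)
```

Comparing `out_mixed` with `out_ref` gives a quick check of how much the quantized entries perturb the attention output. Lowering `retained_bits` in this toy setup makes the gap larger, which mirrors the tradeoff the method is built around.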

5. Memory-Quality Tradeoffs and Quantitative Results

MiKV was benchmarked on Llama-2-7B/13B/70B-chat and Mistral-7B-Instruct backbones and evaluated with Line Retrieval, GSM8K, HumanEval, MMLU, and AlpacaEval metrics:

  • Line Retrieval: At 50% cache memory with INT4 quantization for the retained set, MiKV preserves 100% accuracy vs. about 40% for pure eviction.
  • General Tasks (GSM8K, HumanEval, MMLU): Uniform quantization schemes at 20% cache incur an accuracy drop of about 50%, while MiKV at the same ratio retains over 95% of baseline performance.
  • Memory Footprint: For batch-8, seq-4096 Llama-2-7B:
    • Full cache: 34.36 GB
    • 25% MiKV: 8.59 GB ($\Delta$acc ≈ 0.1%)
    • 20% MiKV: 6.87 GB ($\Delta$acc < 2%)
  • Effect of Aggressive Quantization: Storing the retained set in INT4 at 50% cache recovers full accuracy; INT2 at 50% yields 84.6% accuracy, which improves to 92.6% (at 32% cache) with outlier scaling.
  • Importance-Precision Ablation: With the retained cache in INT2 and the importance cache in INT8, MiKV achieves 92.4% accuracy at 23% cache size.

These results consistently show that importance-aware mixed-precision outperforms uniform or eviction baselines.
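To relate an importance fraction and a pair of bit-widths to an overall cache-size percentage, a back-of-the-envelope helper can be useful. The sketch below assumes a 16-bit baseline and ignores the storage needed for quantization scales, offsets, and balancing factors, so it slightly underestimates real cache sizes. The function name and the example inputs are illustrative and are not taken from the reported experiments.

```python
def cache_ratio(important_fraction, important_bits, retained_bits, baseline_bits=16):
    """Cache size relative to a uniform `baseline_bits` cache, ignoring quantization metadata."""
    p = important_fraction
    return (p * important_bits + (1 - p) * retained_bits) / baseline_bits

# Example: 25% of tokens kept at 16 bits, the remaining 75% stored at 4 bits.
ratio = cache_ratio(0.25, important_bits=16, retained_bits=4)
print(f"{ratio:.4f}")  # prints "0.4375"
```

The same helper can be inverted by hand to see which importance fraction fits a given memory budget at a chosen retained bit-width.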

6. Safety, Context, and Limitations

MiKV’s strategy of lossy retention (compressed but not evicted) for non-crucial KVs mitigates context loss and safety hazards observed in hard-eviction methods, such as loss of system prompts or factual grounding leading to hallucination and jailbreaking. Synthetic “Line Retrieval” benchmarks and open-ended generative tasks both demonstrate that even small amounts of low-precision retention suffice to preserve semantic fidelity, an effect unattainable with pure selection-based eviction.

Current limitations include performance degradation if retained KVs are quantized to INT1. The importance policy is static after prefill; dynamic re-scoring during long generations remains an open research direction. Variable bit-width allocation, particularly per head or per channel, represents a direction for further optimization.

7. Future Research and Extensions

Proposed improvements include:

  • Rate–distortion optimized bit-width allocation at per-head/layer granularity.
  • Hardware-oriented kernel developments for high-throughput per-channel mixed-precision attention.
  • Cross-modal and multi-document cache management strategies.
  • Dynamic, context-aware token importance tracking to reflect shifting relevance during long-form or streaming generation.

MiKV, by jointly leveraging principled importance estimation, precision adaptation, and channel balancing, establishes a new Pareto frontier for practical, quality-preserving cache compression in large-scale LLM inference (Yang et al., 2024).
