MiKV: Mixed-Precision KV Compression
- MiKV is a mixed-precision key-value cache compression method that allocates bit-widths based on token importance to reduce transformer memory usage.
- It partitions tokens into high-importance and retained sets, storing critical tokens at high precision and quantizing others aggressively.
- Empirical evaluations on Llama-2 and Mistral-7B models show that MiKV preserves over 95% accuracy with significant memory footprint reductions.
The MiKV approach, or Mixed-precision Key-Value cache compression, is an advanced method designed to address the scale-induced memory bottleneck of transformer-based LLM inference. It enables substantial memory reductions for key-value (KV) caches in multi-layer, multi-head attention models by selectively allocating bit-widths to stored cache entries based on token-level importance. MiKV is empirically validated on Llama-2 and Mistral-7B-class LLMs across line retrieval, mathematical reasoning, code generation, knowledge, and instruction-following tasks, achieving state-of-the-art memory-quality tradeoffs over prior eviction or uniform quantization schemes (Yang et al., 2024).
1. Problem Scope and Motivation
Transformer models cache the output of their key and value projection layers during autoregressive generation, resulting in cumulative memory footprints that dramatically exceed model parameter sizes at large sequence lengths and batch sizes. For a transformer with batch size, sequence length, layers, heads, and per-head dimension , the KV cache requires
bytes (factor 2 for both and ). Standard approaches (eviction, uniform quantization, static downsampling) induce severe task degradation under moderate compression. MiKV's central motivation is preserving semantic and safety-critical information for reliable text generation with minimal memory overhead.
2. Importance-Aware Token Partitioning
MiKV utilizes an importance estimator over cached tokens to inform cache management. The core scoring function for token index is:
where 0 is the attention placed on token 1 during the creation of token 2, in layer 3, head 4. This score quantifies each token's influence on subsequent decoding steps. For a user-specified fraction 5 (e.g., 20%), tokens are ranked by 6 and split into two disjoint sets:
- 7 (important): top 8 tokens.
- 9 (retained): remaining tokens.
This partitioning can leverage external importance heuristics, but the attention-based method directly exploits the transformer’s attribution behavior, ensuring that highly influential tokens remain losslessly represented.
3. Mixed-Precision Quantization Mechanism
For tokens in 0, MiKV stores their keys and values at high precision (FP16 or 8-bit). For tokens in 1, KV pairs are aggressively quantized (4- or 2-bit). The quantization for a vector 2 employs asymmetric rounding:
3
Quantized KVs are dequantized by the inverse transformation during attention computation.
Critical to avoiding quantization-induced degradation is outlier-aware channel balancing: per-layer, head, and channel balancing factor 4 is calculated during the prefill phase as
5
Such channel-wise rescaling shifts outlier-magnitude variance from 6 to 7, enabling accurate quantization at very low bit-widths (8 2-bit) for non-important tokens while preserving softmax computational fidelity, as queries remain at high precision.
4. Operational Workflow and Implementation
The MiKV pipeline is composed of:
- Prefill Phase: Compute channel maximums and balancing factors for each (layer, head, channel). Score prompt tokens by importance.
- Generation Loop: For each new token, compute and store its KV at full or quantized precision per its assignment in 9 or 0. Quantize non-important tokens using per-channel scaling.
- Inference: During attention, high-importance caches are accessed at high precision, while low-importance entries are dequantized prior to matrix multiplication.
A high-level pseudocode overview is as follows: 5 This design avoids information loss from hard eviction and the context loss typical of static quantization, with negligible runtime impact.
5. Memory-Quality Tradeoffs and Quantitative Results
MiKV was benchmarked on Llama-2-7B/13B/70B-chat and Mistral-7B-Instruct backbones and evaluated with Line Retrieval, GSM8K, HumanEval, MMLU, and AlpacaEval metrics:
- Line Retrieval: At 50% cache memory and INT4 quantization for retained, MiKV preserves 100% accuracy vs. 140% for pure eviction.
- General Tasks (GSM8K, HumanEval, MMLU): Uniform quantization schemes at 20% cache incur 250% accuracy drop, while MiKV at the same ratio retains over 95% of baseline performance.
- Memory Footprint: For batch-8, seq-4096 Llama-2-7B:
- Full cache: 34.36 GB
- 25% MiKV: 8.59 GB (3acc ≈ 0.1%)
- 20% MiKV: 6.87 GB (4acc < 2%)
- Effect of Aggressive Quantization: Retaining in INT4 at 50% cache recovers full accuracy; INT2 at 50% yields 84.6% accuracy, improved to 92.6% (at 32% cache) with outlier scaling.
- Importance-Precision Ablation: Retained cache in INT2; importance cache in INT8 achieves 92.4% accuracy at 23% cache size.
These results consistently show that importance-aware mixed-precision outperforms uniform or eviction baselines.
6. Safety, Context, and Limitations
MiKV’s strategy of lossy retention (compressed but not evicted) for non-crucial KVs mitigates context loss and safety hazards observed in hard-eviction methods, such as loss of system prompts or factual grounding leading to hallucination and jailbreaking. Synthetic “Line Retrieval” benchmarks and open-ended generative tasks both demonstrate that even small amounts of low-precision retention suffice to preserve semantic fidelity, an effect unattainable with pure selection-based eviction.
Current limitations include performance degradation if retained KVs are quantized to INT1. The importance-policy is static after prefill; dynamic re-scoring during long generations remains an open research direction. Variable bit-width allocation—particularly per head or per channel—represents a direction for further optimization.
7. Future Research and Extensions
Proposed improvements include:
- Rate–distortion optimized bit-width allocation at per-head/layer granularity.
- Hardware-oriented kernel developments for high-throughput per-channel mixed-precision attention.
- Cross-modal and multi-document cache management strategies.
- Dynamic, context-aware token importance tracking to reflect shifting relevance during long-form or streaming generation.
MiKV, by jointly leveraging principled importance estimation, precision adaptation, and channel balancing, establishes a new Pareto frontier for practical, quality-preserving cache compression in large-scale LLM inference (Yang et al., 2024).