KeyComp: Transformer KV Cache Compression
- KeyComp is an umbrella term for methods that compress transformer key-value caches, reducing memory overhead while preserving quality.
- It encompasses strategies like quantization, pruning, token selection, and matrix factorization applied at various model dimensions.
- Empirical studies show techniques such as DapQ, CompilerKV, and DecoQuant achieve up to 90% compression with minimal impact on model performance.
KeyComp
KeyComp is an umbrella term (Editor’s term) for Key–Value (KV) cache compression: a set of algorithmic strategies and frameworks designed to reduce the memory and computational overhead of storing KV tensors in large transformer models. In transformer-based LLMs, the KV cache enables efficient self-attention during generation by storing per-token key/value vectors, but its linear growth with context length leads to significant memory bottlenecks. KeyComp encompasses a rich landscape of methodologies—including quantization, pruning, token selection, head grouping, matrix factorization, and prompt- or retrieval-aware summarization—each exploiting distinct structural or statistical properties of LLM activations to minimize the cache footprint while retaining model quality, capability, and efficiency (Javidnia et al., 14 Mar 2025).
1. Motivation and Problem Definition
The transformer attention mechanism computes, at each decoding step t, an attention output by comparing the step's "query" vector with all previously cached "key" vectors and aggregating their associated "value" vectors. Caching keys (K) and values (V) for every past token avoids recomputation, reducing per-step complexity from O(t²·d) to O(t·d), but imposes a storage cost that grows as O(t·d) per layer, where t is the context length and d is the hidden dimension (Tian et al., 12 Mar 2026, Javidnia et al., 14 Mar 2025). For long-context or multi-turn generation, the KV cache size can dominate GPU and system memory, creating bottlenecks in latency, throughput, and practical deployment.
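As a back-of-the-envelope illustration of this growth, the short Python sketch below estimates the cache footprint as a function of context length; the layer count, head count, head dimension, and fp16 storage are illustrative assumptions (roughly a 7B-parameter decoder), not figures taken from the cited papers.

```python
def kv_cache_bytes(context_len: int,
                   n_layers: int = 32,
                   n_kv_heads: int = 32,
                   head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for `context_len` tokens.

    Per layer we store one key and one value vector of size
    n_kv_heads * head_dim for every token, hence the factor of 2.
    """
    return 2 * n_layers * context_len * n_kv_heads * head_dim * bytes_per_elem


for ctx in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>7} tokens -> {gib:6.1f} GiB of KV cache (fp16)")
```

At these assumed dimensions the cache for a single sequence already reaches tens of GiB beyond 100k-token contexts, which is precisely the scaling that KeyComp methods target.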
KeyComp methods aim to address the following core challenges:
- Memory Bottleneck: Curbing the linear growth of KV memory for very long contexts (e.g., thousands to hundreds of thousands of tokens).
- Latency Constraints: Maintaining or improving generation speed and time-to-first-token metrics.
- Quality Preservation: Achieving compression/eviction with minimal or tolerable degradation in output quality, such as accuracy or perplexity.
The field is motivated by rapidly emerging use cases including multi-document summarization, retrieval-augmented generation, in-context learning, and streaming inference—all of which exacerbate cache pressure and benefit from KeyComp (Javidnia et al., 14 Mar 2025).
2. Taxonomy of KeyComp Techniques
KeyComp is systematically categorized according to the dimension of the KV cache targeted—layers, heads, tokens, or the hidden dimension—and by integration strategy (scratch-trained, post-hoc, or zero-training) (Javidnia et al., 14 Mar 2025).
A. Layer-wise Compression
- Cross-layer sharing: YOCO shares the KV cache across consecutive layers, halving total KV memory (Javidnia et al., 14 Mar 2025); a toy sketch of this sharing pattern appears after this list.
- Layer pruning: Attention-Drop selectively prunes attention modules based on cosine similarity criteria for redundancy detection.
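The following sketch illustrates the cross-layer sharing pattern only, not YOCO's actual architecture: adjacent layers map to a shared cache slot, and only the first layer of each group writes new keys and values. The module names, shapes, and share factor are illustrative assumptions.

```python
import torch

# Toy setup: with SHARE_FACTOR=2, layers (0,1) share slot 0, layers (2,3)
# share slot 1, and so on, roughly halving the number of stored KV tensors.
N_LAYERS, N_HEADS, HEAD_DIM, SHARE_FACTOR = 8, 4, 64, 2
cache = {}  # slot -> (keys, values), each of shape [n_heads, seq_len, head_dim]

def cache_slot(layer_idx: int) -> int:
    return layer_idx // SHARE_FACTOR

def read_write_kv(layer_idx: int, k_new: torch.Tensor, v_new: torch.Tensor):
    """First layer of each group appends fresh KV; the others reuse the slot."""
    slot = cache_slot(layer_idx)
    if layer_idx % SHARE_FACTOR == 0:            # designated "producer" layer
        if slot in cache:
            k, v = cache[slot]
            cache[slot] = (torch.cat([k, k_new], dim=1),
                           torch.cat([v, v_new], dim=1))
        else:
            cache[slot] = (k_new, v_new)
    return cache[slot]                           # every layer in the group attends here

# One decoding step across all layers.
for layer in range(N_LAYERS):
    k_step = torch.randn(N_HEADS, 1, HEAD_DIM)
    v_step = torch.randn(N_HEADS, 1, HEAD_DIM)
    keys, values = read_write_kv(layer, k_step, v_step)
print(f"{len(cache)} cache slots serve {N_LAYERS} layers")
```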
B. Head-wise Compression
- Multi-query Attention (MQA): All heads share a single KV cache, reducing the per-layer cache by a factor of H, where H is the number of attention heads.
- Grouped-Query Attention (GQA): Heads are partitioned into groups, each sharing its own KV cache—a trade-off between diversity and memory usage (Javidnia et al., 14 Mar 2025).
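A minimal sketch of the head-grouping idea: a small number of KV heads serve all query heads, so the cached tensors shrink by the grouping factor. Setting the group count to 1 recovers MQA, and setting it to the number of query heads recovers standard multi-head attention; the causal mask is omitted for brevity and all shapes are illustrative.

```python
import torch

def grouped_query_attention(q, k, v, n_groups):
    """Attention where `n_groups` KV heads serve all query heads.

    q: [n_q_heads, seq_len, head_dim]; k, v: [n_groups, seq_len, head_dim].
    n_groups == 1 recovers MQA; n_groups == n_q_heads recovers standard MHA.
    """
    n_q_heads, _, head_dim = q.shape
    repeat = n_q_heads // n_groups
    k = k.repeat_interleave(repeat, dim=0)        # broadcast shared KV heads
    v = v.repeat_interleave(repeat, dim=0)
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
    return torch.softmax(scores, dim=-1) @ v      # causal mask omitted for brevity

n_q_heads, n_groups, seq_len, head_dim = 8, 2, 16, 64
q = torch.randn(n_q_heads, seq_len, head_dim)
k = torch.randn(n_groups, seq_len, head_dim)      # KV cache is 4x smaller than MHA
v = torch.randn(n_groups, seq_len, head_dim)
out = grouped_query_attention(q, k, v, n_groups)  # [8, 16, 64]
```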
C. Token-wise Compression
- Token pruning/eviction: Methods like SnapKV, PyramidKV, DapQ, CompilerKV, and CurDKV identify "important" tokens for future decoding, often using aggregate attention, contextual scores, or matrix factorization (Tian et al., 12 Mar 2026, Yang et al., 9 Feb 2026, Sengupta et al., 18 Sep 2025).
- State-space models: Mamba and related architectures sidestep explicit KV caches by recurrently updating hidden states (Javidnia et al., 14 Mar 2025).
D. Hidden-dimension Compression
- Quantization: Keys and values are stored at reduced numerical precision (e.g., 4- or 8-bit), employing symmetric quantization, tensor decomposition, and outlier migration strategies (KeyComp in DecoQuant) (Liu et al., 2024).
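As a sketch of the simplest hidden-dimension approach (generic per-channel symmetric quantization, not DecoQuant's full pipeline), the snippet below stores a KV tensor as low-bit integer codes plus per-channel scales; the bit-width and tensor shapes are illustrative.

```python
import torch

def quantize_symmetric(x: torch.Tensor, n_bits: int = 4):
    """Per-channel symmetric quantization of a KV tensor.

    x: [seq_len, hidden_dim]; returns integer codes plus per-channel scales.
    (A real kernel would pack two 4-bit codes per byte; int8 storage keeps
    the sketch simple.)
    """
    qmax = 2 ** (n_bits - 1) - 1                      # e.g. 7 for 4-bit
    scale = x.abs().amax(dim=0, keepdim=True) / qmax  # one scale per channel
    scale = scale.clamp(min=1e-8)
    codes = torch.clamp(torch.round(x / scale), -qmax - 1, qmax).to(torch.int8)
    return codes, scale

def dequantize(codes: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return codes.float() * scale

keys = torch.randn(1024, 128)
codes, scale = quantize_symmetric(keys, n_bits=4)
err = (dequantize(codes, scale) - keys).abs().mean()
print(f"mean abs reconstruction error: {err:.4f}")
```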
E. Prompt-aware Summarization (K-COMP)
- Retrieval and knowledge-injection: K-COMP in medical QA domains injects domain knowledge (e.g., entity definitions) and autoregressively compresses retrieved passages to minimize context size yet maintain alignment with question intent (Cho et al., 23 Jan 2025).
A summary table of major families:
| Dimension | Examples | Typical Methodology |
|---|---|---|
| Layer | YOCO, Attention-Drop | Sharing, pruning |
| Head | MQA, GQA, MLA | Shared, grouped, low-rank |
| Token | SnapKV, DapQ, CompilerKV, CurDKV | Importance scoring, CUR, RL |
| Hidden | Quantization, DecoQuant, KeyComp-MPO | Low-bit quant, tensor decomp. |
| Prompt/semantic | K-COMP, LLMLingua, CPC | Masking, reranking, summarization |
3. Core Methodologies
Most KeyComp techniques operate by directly targeting the key structural and statistical determinants of attention and memory.
A. Decoding-aligned Pruning (DapQ)
DapQ introduces position-aware pseudo queries, appending synthetic tokens with future positions to the prompt and using their queries to compute an importance score over keys. The key empirical insight is that positional embeddings dominate query behavior post-RoPE, enabling precise prediction of which tokens will be attended during decoding (Tian et al., 12 Mar 2026). The algorithm therefore simulates the “future” queries that arise during generation, aligning token retention decisions with real decoder needs.
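A simplified sketch of the retention step this describes is given below: a handful of pseudo queries score every cached key, and only the top-scoring tokens are kept. How DapQ actually constructs its position-aware pseudo queries (including RoPE position assignment) is not reproduced here; the random vectors and the max-aggregation over pseudo queries are stand-in assumptions.

```python
import torch

def score_and_retain(keys, values, pseudo_queries, budget):
    """Keep the `budget` cached tokens most attended by pseudo future queries.

    keys, values: [seq_len, head_dim]; pseudo_queries: [n_pseudo, head_dim].
    """
    head_dim = keys.shape[-1]
    scores = pseudo_queries @ keys.T / head_dim ** 0.5   # [n_pseudo, seq_len]
    attn = torch.softmax(scores, dim=-1)
    importance = attn.max(dim=0).values                  # best score over pseudo queries
    keep = torch.topk(importance, k=budget).indices.sort().values
    return keys[keep], values[keep], keep

seq_len, head_dim, budget = 4096, 128, 128               # ~3% KV budget
keys, values = torch.randn(seq_len, head_dim), torch.randn(seq_len, head_dim)
pseudo_q = torch.randn(8, head_dim)                       # stand-ins for future-position queries
k_kept, v_kept, idx = score_and_retain(keys, values, pseudo_q, budget)
```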
B. Risk-Adaptive Compression (CompilerKV)
CompilerKV models KV compression as a one-shot decision, incorporating both prompt-level risk (via attention entropy and local perplexity) and attention head heterogeneity (offline-learned reliability weights). Token importance is computed from window-cumulative attention and normalized value magnitudes, aggregated through weighted max pooling per head. Risk-adaptive thresholds are then determined using precompiled bandit-learned tables (Yang et al., 9 Feb 2026).
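The snippet below paraphrases that scoring pipeline under stated assumptions: window-cumulative attention is combined with normalized value magnitudes per head, weighted by per-head reliability, and max-pooled across heads. The reliability weights and the final top-k selection are stand-ins for CompilerKV's offline-learned weights and its bandit-learned, risk-adaptive thresholds.

```python
import torch

def token_importance(attn, values, head_weights):
    """Window-cumulative attention x normalized value norms, max-pooled over heads.

    attn:         [n_heads, window, seq_len] attention from the last `window` queries
    values:       [n_heads, seq_len, head_dim]
    head_weights: [n_heads] per-head reliability weights (stand-ins)
    """
    cum_attn = attn.sum(dim=1)                           # [n_heads, seq_len]
    v_norm = values.norm(dim=-1)
    v_norm = v_norm / v_norm.amax(dim=-1, keepdim=True)  # normalize per head
    per_head = cum_attn * v_norm                         # [n_heads, seq_len]
    weighted = per_head * head_weights[:, None]
    return weighted.amax(dim=0)                          # [seq_len]

n_heads, window, seq_len, head_dim = 8, 32, 2048, 64
attn = torch.softmax(torch.randn(n_heads, window, seq_len), dim=-1)
values = torch.randn(n_heads, seq_len, head_dim)
head_w = torch.rand(n_heads)                             # hypothetical reliability weights
scores = token_importance(attn, values, head_w)
keep = torch.topk(scores, k=512).indices                 # retain a 512-token cache
```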
C. Value-Guided CUR Decomposition (CurDKV)
CurDKV employs a matrix decomposition approach, targeting the optimal approximation of the attention output softmax(QKᵀ)V. Leverage scores (for both K and V) are approximated using random projections, and the token subset maximizing combined leverage is retained. Theoretical analysis bounds the output error, and empirical results show that CurDKV preserves generation accuracy and reduces end-to-end latency beyond attention-only token pruning (Sengupta et al., 18 Sep 2025).
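A compact sketch of leverage-score-based token selection is shown below. For brevity it computes exact row leverage scores via a thin QR factorization instead of CurDKV's random-projection approximation, and the weight combining key and value leverage is an illustrative assumption.

```python
import torch

def row_leverage_scores(x: torch.Tensor) -> torch.Tensor:
    """Exact row leverage scores of x ([seq_len, dim]) via a thin QR.

    (CurDKV approximates these with random projections for speed; exact QR
    is used here only to keep the sketch short.)
    """
    q, _ = torch.linalg.qr(x, mode="reduced")
    return (q ** 2).sum(dim=-1)

def select_tokens(keys, values, budget, alpha=0.5):
    """Retain tokens maximizing a combination of key and value leverage."""
    lev = alpha * row_leverage_scores(keys) + (1 - alpha) * row_leverage_scores(values)
    keep = torch.topk(lev, k=budget).indices.sort().values
    return keys[keep], values[keep], keep

seq_len, head_dim = 4096, 128
keys, values = torch.randn(seq_len, head_dim), torch.randn(seq_len, head_dim)
k_kept, v_kept, idx = select_tokens(keys, values, budget=410)   # ~90% compression
```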
D. Tensor-decomposition-based Quantization (KeyComp/DecoQuant)
KeyComp in DecoQuant uses a Matrix-Product-Operator (MPO) decomposition to isolate outlier values into a small "skinny" tensor and leaves the main ("fat") tensor amenable to ultra-low-bit quantization. Only the large tensor is quantized to as few as 2-4 bits, yielding 75%+ memory savings and 1.25x speedup on long-sequence decoding tasks with minimal or no quality loss (Liu et al., 2024).
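The snippet below is a deliberately simplified stand-in for this idea, not the actual MPO decomposition: the largest-magnitude entries are held in full precision as a sparse "outlier" part, and only the dense remainder is quantized to 2 bits. The outlier fraction, shapes, and bit-width are illustrative assumptions.

```python
import torch

def split_and_quantize(x: torch.Tensor, outlier_frac: float = 0.01, n_bits: int = 2):
    """Isolate the largest-magnitude entries, then low-bit-quantize the remainder.

    A simplified stand-in for DecoQuant's MPO decomposition: the "skinny" part
    here is a sparse full-precision outlier tensor; only the dense remainder
    is quantized to `n_bits`.
    """
    k = max(1, int(outlier_frac * x.numel()))
    flat = x.flatten()
    out_idx = torch.topk(flat.abs(), k).indices
    outliers = torch.zeros_like(flat)
    outliers[out_idx] = flat[out_idx]                    # full-precision outliers
    residual = flat - outliers
    qmax = 2 ** (n_bits - 1) - 1
    scale = residual.abs().max().clamp(min=1e-8) / qmax
    codes = torch.clamp(torch.round(residual / scale), -qmax - 1, qmax).to(torch.int8)
    return codes, scale, outliers.view_as(x)

keys = torch.randn(1024, 128)
codes, scale, outliers = split_and_quantize(keys, n_bits=2)
recon = codes.float().view_as(keys) * scale + outliers
print(f"mean abs error at 2-bit: {(recon - keys).abs().mean():.4f}")
```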
E. Prompt-aware Compression in Retrieval-Augmented QA (K-COMP)
K-COMP incorporates prior knowledge by autoregressively generating entity spans and short definitions from retrieved passages and conditioning summary compression on these augmented tokens. This is especially designed to bridge the domain-expertise gap in medical QA and to avoid contextual noise from irrelevant retrievals (Cho et al., 23 Jan 2025).
4. Theoretical Guarantees and Empirical Performance
Robustness and effectiveness of KeyComp algorithms are evaluated on proxy tasks such as Needle-in-a-Haystack (NIAH), LongBench, HELMET, RULER, and domain-specific QA datasets.
- DapQ: Achieves 99.46% accuracy (a loss of only 0.54%) on NIAH with a 3% KV budget, outperforming SnapKV, PyramidKV, and H2O by large margins, with minimal overhead in throughput and memory (Tian et al., 12 Mar 2026).
- CompilerKV: Maintains 97.7% of FullKV performance under a 512-token cache, outperforming dynamic and static baselines by up to +5.2 points, and exhibits greatest robustness on complex summarization tasks and high-entropy prompts (Yang et al., 9 Feb 2026).
- CurDKV: Provides up to 9.6% higher accuracy than SnapKV and ChunkKV under aggressive compression settings while reducing latency by up to 40%. Compression ratios as high as 80–90% yield only minor degradation (Sengupta et al., 18 Sep 2025).
- DecoQuant/KeyComp: Reduces per-layer KV from 46.7MB to 11.7MB (~75% savings) at negligible (<1%) perplexity loss in LLMs. Gains apply uniformly across zero-shot and few-shot settings (Liu et al., 2024).
- K-COMP: In retrieval-augmented medical QA, outperforms both raw retrieval and state-of-the-art compressors by 7–10 BertScore points, and boosts factual alignment and reader trust (Cho et al., 23 Jan 2025).
A summary of experimental outcomes:
| Method | Context / Cache Budget | Compression Ratio | Quality Loss | Throughput Gain |
|---|---|---|---|---|
| DapQ | 8k+ context | down to 3% KV budget | <0.6% | ~1x |
| CurDKV | 128k context | up to 90% | ≤5% | up to 1.4x |
| DecoQuant | 6k context | 75% memory | <1% | 1.25x |
| CompilerKV | 512-token cache | ~90% | ~2% | ~1x |
5. Domain-specific and Application-driven Extensions
KeyComp methodologies have been extended and adapted for:
- Retrieval-Augmented Generation: K-COMP's prior-knowledge injection addresses trust and relevance in domain-specific QA, guiding LLMs through concise, context-aligned summaries (Cho et al., 23 Jan 2025).
- Post-quantum Cryptography: In the separate context of CRYSTALS-Kyber, optimal data quantization minimizes communication expansion, using Lloyd-Max quantizers and BCH code encoding to reduce ciphertext expansion rate by 54% while preserving security properties (Liu et al., 2024).
- Training-free and plug-in approaches: LongLLMLingua, CPC, and other prompt compression tools offer rapid, model-agnostic pruning, with gains in speed and sometimes even accuracy, at zero additional model training (Javidnia et al., 14 Mar 2025).
Potential extensions include layer-wise pseudo queries, learned semantic pseudo-token content, and budget adaptation by model confidence or prompt characteristics (Tian et al., 12 Mar 2026).
6. Trade-offs, Limitations, and Best-Practice Guidelines
The KeyComp design space entails nuanced trade-offs:
- Accuracy vs. Compression: By combining techniques—quantization, pruning, head grouping, low-rank projections—practitioners can tune memory vs. loss curves. For example, DapQ and CompilerKV maximize compression given tight budgets.
- Overhead: Algorithms like DapQ and CompilerKV add at most a single extra forward pass during prefill, plus lightweight lookup and sorting, keeping decoding speed close to the uncompressed baseline.
- Robustness: Head-aware and risk-adaptive methods (CompilerKV) defend against performance tail failures in complex or adversarial contexts, where static Top-K schemes break down (Yang et al., 9 Feb 2026).
- Integration: Compatibility with FlashAttention and Grouped-Query Attention, as well as model-agnostic quantization kernels, lowers engineering cost.
Best-practice workflow (Javidnia et al., 14 Mar 2025):
- Apply low-bit quantization for immediate memory reduction.
- Exploit architectural retraining (e.g., SSM or cross-layer sharing) when feasible.
- Use post-training or plug-in compression for checkpointed LLMs, balancing head and token pruning.
- For hardest constraints, combine token pruning with prompt-aware summarization and dynamic cache sizing.
Practitioners are advised to calibrate the compression ratio (typically 30–50% retention as a sweet spot), and reserve "attention sinks" or prompt anchors to stabilize token coverage (Sengupta et al., 18 Sep 2025).
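A minimal sketch of such a retention policy is shown below: attention-sink tokens and a recent window are always kept, and the remaining budget is filled with the highest-scoring tokens from any KeyComp importance scorer. The sink, window, and budget sizes are illustrative defaults, not recommendations from the cited papers.

```python
import torch

def build_retention_mask(importance, seq_len, n_sink=4, n_recent=128, budget=1024):
    """Keep attention-sink tokens, a recent window, and the highest-scoring rest.

    importance: [seq_len] per-token scores from any KeyComp scorer.
    Returns a boolean mask over the cache; sizes are illustrative defaults.
    """
    keep = torch.zeros(seq_len, dtype=torch.bool)
    keep[:n_sink] = True                            # prompt anchors / attention sinks
    keep[-n_recent:] = True                         # local window for recency
    remaining = budget - int(keep.sum())
    if remaining > 0:
        masked = importance.clone()
        masked[keep] = float("-inf")                # don't double-count kept tokens
        keep[torch.topk(masked, k=remaining).indices] = True
    return keep

seq_len = 8192
scores = torch.rand(seq_len)                        # stand-in importance scores
mask = build_retention_mask(scores, seq_len)
print(f"retained {int(mask.sum())} of {seq_len} tokens")
```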
7. Future Directions
Open problems include:
- Dynamic and context-adaptive strategies: Extending dynamic adjustment of window size or layer-specific retention based on runtime entropy or confidence.
- Broader domain and multilingual extension: Applying strategies like K-COMP outside English medical QA, leveraging multilingual knowledge graphs.
- Joint retriever-compressor optimization: End-to-end fine-tuning of retrieval and compression modules, especially where entity recognition and prior-knowledge generation are imperfect.
As KeyComp methodologies mature, the synthesis of architectural, statistical, and RL-based approaches continues to enable longer-context, higher-quality, and more memory- and compute-efficient transformer deployments across diverse NLP scenarios (Javidnia et al., 14 Mar 2025, Tian et al., 12 Mar 2026, Yang et al., 9 Feb 2026).