KVComp: LLM-Aware KV Cache Compression
- KVComp is a high-performance, LLM-aware, lossy compression framework that reduces memory usage in transformer KV caches using error-bounded quantization and GPU-optimized entropy coding.
- The system fuses in-situ decompression with matrix–vector multiplication to minimize global memory movement and maintain high throughput during attention computation.
- Empirical results show up to an 83% reduction in KV cache size with less than 3% accuracy loss, enabling efficient long-context LLM inference.
KVComp refers to a family of high-performance, LLM-aware, lossy compression frameworks for the Key-Value (KV) cache used in transformer-based LLMs. The primary objective of KVComp systems is to substantially decrease the memory footprint of the KV cache during long-context inference without meaningful degradation in model accuracy or throughput, exploiting the statistical regularities in KV tensors and fusing decompression with the attention computation pipeline (Jiang et al., 30 Aug 2025).
1. Memory Bottleneck of KV Cache in LLMs
During autoregressive transformer inference, each new token appends a new key and value vector to the per-layer, per-head KV cache. As the context window and batch size grow, the KV cache’s memory requirements scale linearly with the number of stored tokens and can quickly surpass that of the model weights. For example, a LLaMA2-30B inference run with a context length of 32K and batch size 8 requires over 100 GB of fp16 KV cache, exceeding the 60 GB of parameters. Traditional solutions such as quantization, token-pruning, or KV cache migration to CPU have exhibited limited effectiveness due to compression ratio limits, unpredictable accuracy degradation, or bandwidth/scheduling bottlenecks (Jiang et al., 30 Aug 2025).
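The linear scaling described above is easy to quantify. As a back-of-envelope sketch (using the published LLaMA2-7B configuration of 32 layers, 32 attention heads, and head dimension 128 — figures not taken from this passage):

```python
# Back-of-envelope KV cache sizing for fp16 inference.
# Config below is LLaMA2-7B (32 layers, 32 heads, head_dim 128); any other
# model just swaps these numbers in.
def kv_cache_bytes(layers, heads, head_dim, tokens, batch, bytes_per_elem=2):
    # K and V each store one [heads, head_dim] vector per token, per layer, per sequence.
    return 2 * layers * heads * head_dim * tokens * batch * bytes_per_elem

size_7b_4k = kv_cache_bytes(layers=32, heads=32, head_dim=128, tokens=4096, batch=1)
print(size_7b_4k / 2**30, "GiB")  # a single 4K-token sequence already needs 2 GiB
```

Even at this modest scale the cache is 2 GiB per sequence; multiplying tokens and batch size by the figures in the text quickly reaches the 100 GB regime.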
2. Compression Algorithms: Quantization and Entropy Coding
KVComp leverages two-stage compression:
- Block-wise, channel-specific quantization: The KV cache tensor is partitioned either into 2D blocks (keys: [block_size, head_dim], with per-head scaling within each block) or into token-wise vectors (values: per-token quantization). Quantization follows
$$q_i = \mathrm{round}\left(\frac{x_i}{s}\right), \qquad \hat{x}_i = q_i \cdot s,$$
with step size $s = 2\epsilon \max_j |x_j|$ taken over the block, yielding a relative quantization error bound $|\hat{x}_i - x_i| \le \epsilon \max_j |x_j|$.
- GPU-optimized entropy coding (Huffman): Quantized codes are histogrammed per layer, then a static Huffman codebook is built. Each block is encoded in parallel with coalesced memory accesses, yielding a compression ratio over fp16 of
$$\mathrm{CR} = \frac{16}{\sum_i f_i \ell_i},$$
where $f_i$ is the normalized frequency of code $i$ and $\ell_i$ its Huffman prefix length in bits. These techniques exploit the highly skewed statistical distribution of LLM KV caches, whose quantized codes concentrate on a small set of levels.
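Both stages can be sketched together in a few lines. The following is a minimal NumPy/heapq illustration under stated assumptions — the step-size formula and codebook construction are reconstructions for exposition, not KVComp's actual kernels:

```python
import heapq
import itertools
import numpy as np

# Stage 1: block-wise linear quantization with a relative error bound eps.
# Assumption: step size s = 2*eps*max|x| over the block, so the half-step
# rounding error is at most eps * max|x|.
def quantize_block(x, eps):
    s = 2.0 * eps * np.abs(x).max()
    q = np.round(x / s).astype(np.int64)  # integer codes; x_hat = q * s
    return q, s

# Stage 2: static Huffman codebook from a per-layer histogram of codes.
# Only the code lengths are needed to estimate the compression ratio.
def huffman_code_lengths(freqs):
    tie = itertools.count()  # tie-breaker so the heap never compares symbol tuples
    heap = [(f, next(tie), (sym,)) for sym, f in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        f1, _, g1 = heapq.heappop(heap)
        f2, _, g2 = heapq.heappop(heap)
        for sym in g1 + g2:          # every symbol in a merged subtree gains one bit
            lengths[sym] += 1
        heapq.heappush(heap, (f1 + f2, next(tie), g1 + g2))
    return lengths

# Skewed code histogram (typical of KV caches): most mass on one level.
freqs = {0: 0.7, 1: 0.2, -1: 0.1}
lengths = huffman_code_lengths(freqs)
avg_bits = sum(f * lengths[s] for s, f in freqs.items())  # sum_i f_i * l_i
ratio_vs_fp16 = 16.0 / avg_bits
```

With this toy histogram the average code length is 1.3 bits, i.e. a >12x ratio versus fp16 — illustrating why skewed code distributions make entropy coding so effective here.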
3. System and Kernel Co-Design for Throughput
To maximize efficiency, KVComp fuses decompression and matrix–vector multiplication (MVM) required by attention, eliminating the need to materialize decompressed KV in global memory, amortizing decompression overhead, and minimizing redundant data movement. This system is fundamentally GPU-resident—during prefill, each K/V block is compressed and appended to the compressed cache; during decode, a fused kernel per block loads compressed codes, decodes (branchless Huffman traversal), dequantizes in registers, immediately computes the dot-product with query, and writes only the final output. Decompression and MVM are performed in-situ, and atomic writes avoid kernel launch bottlenecks.
| Phase | Key Operations | Memory I/O |
|---|---|---|
| Store | quantize, entropy code | append compressed, record offsets |
| Fetch | decode, dequant, MVM (fused) | compressed in, output only attend-out |
By design, the decoded K/V data is never materialized as a whole tensor, and global memory movement is minimized.
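The fetch path can be emulated in NumPy to make the data flow concrete. This is an illustrative emulation, not the actual CUDA kernel; the codebook, shapes, and function name are assumptions:

```python
import numpy as np

def fused_block_dot(query, bits, codebook, scale, head_dim):
    """Emulate the fused kernel for one compressed K block: decode the
    Huffman bitstring, dequantize, and dot with the query, emitting only
    the attention scores (the decoded block never leaves this function)."""
    symbols, i = [], 0
    while i < len(bits):
        # Prefix-free codebook: exactly one code matches at each position.
        for code, sym in codebook.items():
            if bits.startswith(code, i):
                symbols.append(sym)
                i += len(code)
                break
    keys = np.asarray(symbols, dtype=np.float32).reshape(-1, head_dim) * scale  # dequantize
    return keys @ query  # fused MVM: only per-token scores are written out

# Two quantized key vectors [1, 2] and [0, 1]; codebook 0->"0", 1->"10", 2->"11".
scores = fused_block_dot(np.array([1.0, 2.0], dtype=np.float32),
                         "1011010", {"0": 0, "10": 1, "11": 2}, 0.5, 2)
```

On the GPU the decode and dot-product run per thread block with dequantized values held in registers; emitting only `scores` is what removes the global-memory round-trip described above.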
4. Empirical Results: Compression Ratios and Latency
KVComp, benchmarked on LLaMA2-7B/13B, Mistral-8B, and real-world tasks (CoQA, GSM8K), consistently reduces cache size by 47% on average and by up to 83% versus the float16 baseline. At equivalent accuracy, it yields 32–62% higher compression than quantization methods such as KIVI. At error bounds of up to 6% (keys) and 20% (values), top-1 EM/F1 and perplexity scores degrade by less than 3%, with typically negligible end-to-end effect on task accuracy.
Throughput measurements show that standalone Huffman decode plus dequantization and GEMV are memory-bandwidth limited (<100 GB/s), whereas the fully fused (decode + dequant + MVM) kernel achieves up to 400 GB/s (K) and 180 GB/s (V) on an RTX 4090, matching or exceeding cuBLAS GEMV for contexts ≥8K by reducing global memory bandwidth pressure. In long-context settings the decompression overhead becomes negligible or even net-negative: fused KVComp is often faster than uncompressed inference because less data is transferred.
5. Trade-offs and Design Guidelines
Aggressive quantization of K perturbs the softmax output and hence attention fidelity; moderate error bounds are optimal. Recommended settings based on observed impact are:
- Keys: $\epsilon_K$ up to $0.06$
- Values: $\epsilon_V$ up to $0.20$

These settings typically keep accuracy loss below 3%.

Block size and buffer size control kernel launch overhead and atomic-operation contention: moderate block sizes (up to 128) and a buffer of roughly 512 tokens per layer balance throughput and latency. For latency-critical use (single-token generation), prefer smaller block and buffer sizes; for throughput-oriented batch decode, larger sizes are preferred.
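As a hypothetical illustration, the guidelines above might be captured as two tuning profiles. The parameter names and specific block sizes here are assumptions for the sketch, not KVComp's actual API:

```python
# Hypothetical tuning profiles reflecting the stated guidelines; names and the
# concrete block sizes are illustrative assumptions, not KVComp's actual API.
LATENCY_PROFILE = {
    "eps_key": 0.06,       # key error bound at the recommended ceiling
    "eps_value": 0.20,     # value error bound at the recommended ceiling
    "block_size": 64,      # smaller blocks: lower per-token latency
    "buffer_tokens": 512,  # per-layer compression buffer
}
THROUGHPUT_PROFILE = {
    "eps_key": 0.06,
    "eps_value": 0.20,
    "block_size": 128,     # larger blocks: fewer launches, better batch throughput
    "buffer_tokens": 512,
}
```

A deployment would pick one profile per workload; only the block size differs, since the error bounds are set by accuracy, not latency, constraints.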
6. Extensions and Comparison with Related Methods
KVComp distinguishes itself from alternatives in several dimensions:
- Quantization-only methods: Without entropy coding or fused kernels, quantization alone typically achieves only modest compression ratios. KVComp's entropy coding leverages distribution skew for much higher ratios and integrates efficiently with inference.
- Pruning/eviction approaches: Pruning can unpredictably degrade attention and require re-computation; KVComp maintains predictable error bounds via quantization.
- CPU offloading: Standard migration of KV to CPU involves severe PCIe latency, detrimental for real-time inference.
Other frameworks such as PackKV (Jiang et al., 30 Dec 2025) extend the fused approach with further optimizations in bit-packing and pack-size adaptation, and are generally compatible in goal and architecture. SVD-based or low-rank decompositions (e.g., KQ-SVD (Lesens et al., 5 Dec 2025), FDC (Zhang et al., 2024), CommonKV (Wang et al., 22 Aug 2025)) focus on structural redundancy, while KVComp exploits value-level redundancy and GPU entropy characteristics.
7. Summary and Practical Impact
KVComp represents a family of LLM-aware, GPU-optimized, lossy compression frameworks for the dynamic KV cache. By combining error-bounded quantization, context-adaptive entropy coding, and architecture-aware, in-situ decompression scheduling, it achieves dramatic (up to 83%) memory reductions with little to no accuracy loss, and often with improved compute throughput over standard attention implementations (Jiang et al., 30 Aug 2025). This enables efficient, long-context LLM inference for large models in both latency- and throughput-constrained environments, and sets a best-practice baseline for future KV cache compression research.