
PackKV: Efficient KV Cache Compression Framework

Updated 6 January 2026
  • PackKV is a GPU-resident framework that employs token-aware lossy quantization and bit-packing to reduce Key–Value cache memory in transformer LLMs.
  • It integrates a five-stage pipeline—buffering, quantization, encode-aware repacking, bit-packing, and seamless appending—to optimize cache access and throughput.
  • Experimental results show up to 18.7× reduction in GPU memory usage and 2.7× throughput gains, maintaining accuracy within defined tolerances.

PackKV is a high-throughput Key–Value (KV) cache compression and management framework specifically designed to alleviate the memory bottlenecks in transformer-based LLMs during long-context inference. By integrating LLM-aware lossy quantization, token-aware bit-packing, and a fused decompression-compute pipeline, PackKV achieves order-of-magnitude reductions in GPU memory requirements and substantial end-to-end throughput gains, without compromising model accuracy under user-defined tolerances (Jiang et al., 30 Dec 2025).

1. Motivation and Background

In standard transformer-based LLMs such as LLaMA and GPT, autoregressive decoding requires maintaining a KV cache at every layer. For a context of length $T$, batch size $B$, number of attention heads $H$, and head dimension $D$, the per-layer cache shape is $[B, T, H, D]$ for each of the K and V arrays. As $T$ and $B$ scale, especially with modern 8B–30B parameter models and extended context windows (e.g., $T = 32\text{K}$), the KV cache can easily exceed 100 GB in FP16, substantially more than the model parameter memory. This inflates DRAM utilization, limiting both achievable context lengths and batch sizes. Furthermore, during decoding the cache is predominantly read through memory-bound matrix–vector multiplications (e.g., 93.7% of GPU time at 100K context), making KV cache access the critical system bottleneck. Existing quantization and pruning solutions yield at best moderate compression and do not eliminate the corresponding bandwidth or decompression costs, motivating the design objectives of PackKV (Jiang et al., 30 Dec 2025).
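To make the footprint concrete, a back-of-envelope sizing can be sketched as follows. The configuration is a hypothetical one loosely modeled on LLaMA-3.1-8B with grouped-query attention; the layer and head counts here are illustrative assumptions, not figures taken from the paper.

```python
def kv_cache_bytes(batch, ctx, layers, heads, head_dim, bytes_per_elem=2):
    # K and V each hold a [batch, ctx, heads, head_dim] tensor per layer.
    return 2 * batch * ctx * layers * heads * head_dim * bytes_per_elem

# Assumed config: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 (2 bytes).
gb = kv_cache_bytes(batch=8, ctx=128_000, layers=32, heads=8, head_dim=128) / 1e9
print(f"{gb:.1f} GB")  # → 134.2 GB for batch 8 at 128K context
```

Even with grouped-query attention shrinking the head count, the cache alone dwarfs the ~16 GB of FP16 weights for an 8B model, which is exactly the regime PackKV targets.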

2. Framework Architecture

PackKV implements a five-stage, GPU-resident pipeline that interfaces transparently between the model’s KV cache generation and attention computation. The key architectural stages are as follows:

  1. Buffering & Blocking: Incoming K and V vectors are buffered until a fixed block size $N$ (e.g., 64 tokens) is reached, forming tiles of shape $[N, H \times D]$ (K) and $[H, N \times D]$ (V) for efficient coalesced access.
  2. Token-wise Quantization: Each vector within the block is quantized using a relative scale $s = \alpha \cdot (\max - \min)$ for a chosen $\alpha \in (0, 1]$, with reconstructed value $\widetilde{v}_i = \min + s \cdot \hat{v}_i$ and error $|\widetilde{v}_i - v_i| \leq s/2$.
  3. Encode-aware Repacking: Vector packs of size $k$ are grouped (greedily or by median sorting) so that the per-pack value range (max − min) is minimized, reducing the bitwidth needed for representation.
  4. Bit-packing Encoding: Each pack is stored as a header $(b, \text{offset})$, where $b = \lceil \log_2(R+1) \rceil$ for range $R$, followed by a payload that stores the $k$ quantized values offset by the pack minimum and concatenated as $b$-bit fields.
  5. Seamless Appending: Each compressed block is tagged with indices, supporting direct extension as the context grows, without format transformation or multiple kernel invocations.
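As a minimal illustration of stage 1, the buffering step can be sketched in Python. The class name and single-token append API are illustrative assumptions, not PackKV's actual interface.

```python
import numpy as np

BLOCK_N = 64  # tokens per block, matching the example block size above

class KVBuffer:
    """Buffers per-token K vectors until a full block can be emitted."""
    def __init__(self, heads, head_dim):
        self.heads, self.head_dim = heads, head_dim
        self.pending = []

    def append(self, k_vec):
        """Buffer one token's K vector of shape [heads * head_dim]."""
        self.pending.append(k_vec)
        if len(self.pending) == BLOCK_N:
            tile = np.stack(self.pending)  # [N, H*D] tile for the K cache
            self.pending.clear()
            return tile                    # ready for stage-2 quantization
        return None

buf = KVBuffer(heads=8, head_dim=128)
tiles = [t for t in (buf.append(np.zeros(8 * 128, dtype=np.float16))
                     for _ in range(130)) if t is not None]
print(len(tiles), tiles[0].shape)  # → 2 (64, 1024)
```

After 130 appended tokens, two full 64-token tiles have been emitted and 2 tokens remain pending, which is the state the appending stage (stage 5) must track across generation steps.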

At inference, a single CUDA kernel fuses both decompression and GEMV: each thread loads packed data, unpacks it, reconstructs values into half precision, and directly computes the dot product with the query vector $Q$ in registers, thus eliminating repeated global memory IO (Jiang et al., 30 Dec 2025).

3. Lossy Compression and Quantization

PackKV’s sole lossy operation is token-wise quantization, ensuring all remaining transformations are lossless relative to this quantized representation. For each token and channel, the quantization uses a scale set per block:

$$\hat{v}_i(d) = \operatorname{round}\left(\frac{v_i(d) - v_{\min}}{s}\right)$$

$$\widetilde{v}_i(d) = v_{\min} + s \cdot \hat{v}_i(d)$$

with reconstruction error tightly bounded by $s/2$. With empirical tuning (e.g., rel_quant_scale of 0.10 for K and 0.20 for V), PackKV matches or surpasses the quantization error profile of state-of-the-art 2–4 bit schemes (e.g., KIVI) at far lower bitwidths (Jiang et al., 30 Dec 2025).
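The two equations above can be exercised directly. The sketch below uses the rel_quant_scale value quoted for K (0.10) as $\alpha$; the function names and per-token min/max handling are illustrative assumptions.

```python
import numpy as np

def quantize_token(v, alpha=0.10):
    """Token-wise quantization with relative scale s = alpha * (max - min)."""
    vmin, vmax = float(v.min()), float(v.max())
    s = alpha * (vmax - vmin) if vmax > vmin else 1.0
    codes = np.round((v - vmin) / s).astype(np.int32)
    return codes, vmin, s

def dequantize_token(codes, vmin, s):
    return vmin + s * codes  # reconstruction; error bounded by s / 2

rng = np.random.default_rng(0)
v = rng.standard_normal(128).astype(np.float32)
codes, vmin, s = quantize_token(v)
err = float(np.abs(dequantize_token(codes, vmin, s) - v).max())
assert err <= s / 2 + 1e-6  # the s/2 bound stated in the text
```

Note that larger $\alpha$ means a coarser scale, fewer distinct codes, and therefore a narrower value range for the bit-packing stage to exploit, at the cost of a proportionally larger $s/2$ error bound.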

Encode-aware repacking groups vectors to minimize the intra-pack value range, and thus the necessary bitwidth per pack, using either a greedy centroid-based $O(N^2 D)$ method or a V-Median $O(ND)$ strategy (sorting by the V vector's trailing $D/2$ dimensions), yielding 4.5–19.7% further compression beyond quantization and bit-packing alone.
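One plausible reading of the V-Median strategy is sketched below: sort vectors by the median of their trailing $D/2$ dimensions, then form packs from consecutive vectors in sorted order so that similar vectors share a pack. The exact sort key used by PackKV may differ, so treat this as an illustrative assumption.

```python
import numpy as np

def vmedian_repack(codes, k=8):
    """codes: [N, D] matrix of quantized integer values.
    Returns lists of row indices, one list per pack of size k."""
    n, d = codes.shape
    key = np.median(codes[:, d // 2:], axis=1)       # O(N D) sort key
    order = np.argsort(key, kind="stable")           # O(N log N) sort
    return [order[i:i + k] for i in range(0, n, k)]  # consecutive packs

rng = np.random.default_rng(1)
codes = rng.integers(0, 64, size=(64, 16))
packs = vmedian_repack(codes, k=8)
ranges = [int(codes[p].max() - codes[p].min()) for p in packs]
```

Grouping similar vectors shrinks each pack's max − min range, which directly lowers the header bitwidth $b$ chosen in the next stage.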

Bit-packing encodes each pack into a $(b \cdot k)$-bit payload with a compact header, fitting neatly into cache-aligned 32- or 64-bit chunks when $k = 8, 16$. This capitalizes on the highly peaked quantized K and V histograms, with typically $b \leq 4$ (Jiang et al., 30 Dec 2025).
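The header-plus-payload layout can be demonstrated in a few lines of Python. This is a pure-software sketch; PackKV performs the packing in CUDA with word-aligned payloads.

```python
import math

def pack(values):
    """Pack k small non-negative ints into a (b, offset) header + payload."""
    lo = min(values)
    r = max(values) - lo                          # pack range R
    b = max(1, math.ceil(math.log2(r + 1))) if r > 0 else 1
    payload = 0
    for i, v in enumerate(values):
        payload |= (v - lo) << (i * b)            # concatenate b-bit fields
    return (b, lo), payload

def unpack(header, payload, k):
    b, lo = header
    mask = (1 << b) - 1
    return [((payload >> (i * b)) & mask) + lo for i in range(k)]

vals = [12, 9, 15, 10, 11, 14, 9, 13]
hdr, pay = pack(vals)
assert hdr == (3, 9)                 # range 6 needs b = ceil(log2(7)) = 3 bits
assert unpack(hdr, pay, len(vals)) == vals
```

Here eight values that would occupy 128 bits in FP16 fit in a 24-bit payload plus a small header, illustrating how peaked histograms with $b \leq 4$ translate into double-digit compression ratios.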

4. Computational Efficiency and Throughput

PackKV fuses decompression and compute into a single CUDA kernel. For example, given a K-cache block, the core loop iterates over pack indices, unpacks bitfields into half2 registers, reconstructs floating-point values, and immediately accumulates the dot product with the query, all in registers and shared memory. Compared to standard decompress-then-GEMV approaches, this reduces DRAM traffic by 5–20× and eliminates any global-memory write-back of decompressed values.
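The fusion idea, minus the CUDA specifics, can be sketched in Python: each pack is decoded and immediately folded into the dot product, so no decompressed copy of the cache is ever materialized. The pack layout mirrors the sketch's own builder and is an assumption, not PackKV's on-disk format.

```python
import math
import numpy as np

def pack_codes(codes, k=8):
    """Group quantized codes into packs of k with (b, lo, payload) tuples."""
    packs = []
    for i in range(0, len(codes), k):
        chunk = codes[i:i + k]
        lo = min(chunk)
        r = max(chunk) - lo
        b = max(1, math.ceil(math.log2(r + 1)))
        payload = 0
        for j, c in enumerate(chunk):
            payload |= (c - lo) << (j * b)
        packs.append((b, lo, payload))
    return packs

def fused_dot(packs, vmin, s, q, k=8):
    """Unpack each code and fold it straight into the accumulator."""
    acc = 0.0
    for p, (b, lo, payload) in enumerate(packs):
        mask = (1 << b) - 1
        for i in range(k):
            code = ((payload >> (i * b)) & mask) + lo  # unpack "in register"
            acc += (vmin + s * code) * q[p * k + i]    # dequantize + GEMV step
    return acc

rng = np.random.default_rng(2)
codes = rng.integers(0, 16, size=64).tolist()
q = rng.standard_normal(64)
vmin, s = -1.0, 0.125
ref = float(np.dot(vmin + s * np.array(codes), q))
assert abs(fused_dot(pack_codes(codes), vmin, s, q) - ref) < 1e-9
```

In the real kernel the inner loop runs per thread over registers, but the structural point is the same: the only global reads are the compressed packs, and the only global write is the final accumulated output.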

Time complexity is $O(LHD)$ per output, identical to standard GEMV. Memory requirements drop from $O(LHD)$ in FP16 to $O(LHD/\text{CR})$, where the compression ratio CR averages 15.3× for K and 18.7× for V. Temporary GPU memory is only $O(HD)$ for block processing.

On NVIDIA A100 GPUs (LLaMA3.1-8B, Mistral-8B at 128K context), PackKV achieves 1.75× (K) and 2.7× (V) the GB/s throughput of cuBLAS, with end-to-end decode acceleration of 75.7% (K) and 171.7% (V) over the baseline (Jiang et al., 30 Dec 2025).

5. Experimental Results

PackKV was evaluated on LLaMA2-7B/13B, LLaMA3.1-8B, DeepSeek-R1-Llama-8B, Mistral-8B-2410, and Phi-4 using benchmarks such as CoQA, GSM8K, MMLU, Winogrande, GPQA_D, and SQuAD_C, with context lengths up to 128K and batch sizes up to 8. Main quantitative findings:

| Baseline | K cache compression ratio | V cache compression ratio |
| --- | --- | --- |
| 2-bit KIVI | 5.91× | 6.00× |
| PackKV (token+bitpack) | 15.30× (+153.2%) | 18.67× (+179.6%) |

PackKV maintained a ≤5% accuracy drop relative to full FP16 inference, outperforming KIVI at equivalent accuracy thresholds. Greedy repacking improves K compression by 4.5% and V by an additional 19.7%, while V-Median achieves a 17.7% improvement in linear time.

Multi-GPU scaling was demonstrated on up to 4 A100s with near-perfect weak scaling (throughput drop < 2%). Peak GPU DRAM bandwidth decreases in proportion to the achieved compression ratio, enabling larger batch sizes and longer contexts.

6. Practical Integration and Open Source Considerations

PackKV uses a block-independent, indexable buffer format, and all compression and decompression logic is embedded in custom CUDA kernels using shared memory and half2 vector instructions. Pack alignment (8–16 entries per pack) avoids bank conflicts. The buffer is amenable to efficient streaming-generation implementations (e.g., as PyTorch custom ops). Bandwidth and memory efficiency are further improved by eliminating superfluous kernel launches and by the strictly append-only nature of the compressed cache format (Jiang et al., 30 Dec 2025).

Full codebase, scripts, and integration examples are released at https://github.com/BoJiang03/PackKV. The open-source release supports immediate integration into generic transformer inference pipelines.

PackKV differs fundamentally from prevailing KV compression techniques. Unlike standard quantization (KIVI, 2–4 bit per channel), which is typically channel-wise for K and token-wise for V, PackKV employs token-wise quantization and LLM-aware repacking for both. Unlike structured composite retention approaches (e.g., KVCompose (Akulov et al., 5 Sep 2025)) or online subspace compression (e.g., OjaKV (Zhu et al., 25 Sep 2025)), PackKV's approach is fully orthogonal and can, in principle, be combined with those strategies for further memory–bandwidth reductions. Furthermore, unlike methods requiring expensive offline pre-processing, PackKV injects negligible latency and eliminates all global decompression overhead.

The empirical results demonstrate that, at comparable or lower error budgets, PackKV delivers over 2× the memory reduction of leading quantization baselines and 1.7×–2.7× improvements in end-to-end decoding throughput, representing a significant advance in the practical deployment of long-context LLMs on memory-limited commodity GPUs (Jiang et al., 30 Dec 2025).
