AQUA-KV: Adaptive Quantization for KV Caches

Updated 25 February 2026
  • AQUA-KV is a compression technique for transformer caches that exploits inter-layer predictability using lightweight linear predictors and residual quantization.
  • It achieves near-lossless accuracy at 2–2.5 bits per value, reducing memory footprint by up to 86% while maintaining minimal performance degradation.
  • The approach supports both LLMs and VLMs, enabling practical deployment in long-context scenarios with efficient one-shot calibration and a modest 3% inference overhead.

AQUA-KV (Adaptive Quantization for Key-Value Caches) is a state-of-the-art approach for extreme compression of Key-Value (KV) caches in transformer-based neural networks, achieving near-lossless accuracy at 2–2.5 bits per value in both LLMs and Vision-LLMs (VLMs). AQUA-KV exploits inherent inter-layer dependencies in transformer KV pairs using lightweight linear predictors, combined with residual quantization, to minimize memory footprint while preserving model quality. The approach is efficient, agnostic to quantization backbones, and requires only one-shot calibration per model, enabling practical deployment in long-context scenarios where standard KV caching incurs tens of gigabytes of memory overhead (Shutova et al., 31 Jan 2025, Su et al., 25 Jan 2025).

1. Motivation and Problem Setting

Transformer architectures rely on KV caching to enable efficient autoregressive inference by storing projected key and value vectors for each token at each layer, thereby avoiding redundant computation. For long sequences, the aggregated KV cache can easily exceed the model's parameter size: e.g., for 131K-token contexts and 70B-parameter LLMs, caches reach roughly 43 GiB in FP16 precision. Classic quantization, token or channel pruning, and layer merging reduce memory but generally degrade model accuracy under severe bit budgets (e.g., <4 bits/value), compromising downstream performance metrics such as perplexity or task accuracy (Su et al., 25 Jan 2025, Shutova et al., 31 Jan 2025).
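As a rough check on the scale of the problem, the cache size for a single sequence can be computed from the model shape. The sketch below assumes a configuration that the text does not state: 80 layers, 8 key-value heads of dimension 128 (grouped-query attention), and 131,072 cached tokens in FP16. The function name `kv_cache_bytes` is illustrative. Under these assumptions the cache comes to about 43 GB in decimal units (40 GiB), the same order of magnitude as the figure quoted above.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    values_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = keys + values
    return values_per_token * n_tokens * bytes_per_value


# Assumed configuration (not given in the text): 80 layers, 8 KV heads of
# dimension 128, 131,072 cached tokens, 2 bytes per value (FP16).
size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, n_tokens=131_072)
print(f"{size / 1e9:.1f} GB ({size / 2**30:.1f} GiB)")
```

The size grows linearly in every factor, so doubling the context length or the batch size doubles the cache.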

AQUA-KV addresses this with a dual insight:

  • Transformer KV activations exhibit high inter-layer predictability; most of the information in layer $L$'s keys and values can be linearly explained from layer $L-1$.
  • After regressing out the predictable component, the residuals have significantly lower dynamic range and entropy, making them especially amenable to robust quantization.

This pattern is observed across major LLMs (Llama 3.x, Qwen 2.5) and VLMs (LLaVA-v1.5/1.6), and is reflected in empirical explained-variance ratios ($R^2 > 0.89$), a level indicative of good 2-bit quantizers (Shutova et al., 31 Jan 2025, Su et al., 25 Jan 2025).

2. Algorithm: Predictor-Based Residual Quantization

2.1 Predictor Construction

AQUA-KV introduces a sequential, layerwise regression mechanism:

  • For each transformer layer $L$, fit linear predictors $f_K^{(L)}$ (key) and $f_V^{(L)}$ (value).
  • $f_K^{(L)}$ maps reconstructed keys from the previous layer, $\widehat{\mathbf K}^{(L-1)}$, to current keys $\mathbf K^{(L)}$.
  • $f_V^{(L)}$ takes reconstructed values $\widehat{\mathbf V}^{(L-1)}$ and reconstructed current-layer keys $\widehat{\mathbf K}^{(L)}$ as inputs for predicting $\mathbf V^{(L)}$.

For $L > 1$:

$$f_K^{(L)} = \arg\min_f \left\| f\big(\widehat{\mathbf K}^{(L-1)}\big) - \mathbf K^{(L)} \right\|_2^2$$

$$f_V^{(L)} = \arg\min_f \left\| f\big([\widehat{\mathbf V}^{(L-1)};\ \widehat{\mathbf K}^{(L)}]\big) - \mathbf V^{(L)} \right\|_2^2$$

The predictors are closed-form linear regressors; cross-token or multi-layer predictors provide negligible additional benefit over simple per-layer mappings (Shutova et al., 31 Jan 2025).
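The closed-form fit can be illustrated with an ordinary least-squares solve. This is a minimal sketch on synthetic data, not the authors' calibration code: the arrays stand in for per-token key and value activations, the helper names `fit_linear_predictor` and `explained_variance` are made up for illustration, and unquantized inputs are used where the method itself uses reconstructed ones.

```python
import numpy as np


def fit_linear_predictor(X, Y):
    """Least-squares solution W of min ||X @ W - Y||^2."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W


def explained_variance(Y, Y_pred):
    """Fraction of the variance of Y captured by Y_pred (R^2)."""
    return 1.0 - (Y - Y_pred).var() / Y.var()


rng = np.random.default_rng(0)
n_tokens, d = 2048, 64

# Synthetic stand-ins: layer-L keys are a linear map of layer-(L-1) keys plus noise.
K_prev = rng.normal(size=(n_tokens, d))
K_cur = K_prev @ (rng.normal(size=(d, d)) / np.sqrt(d))
K_cur += 0.2 * rng.normal(size=(n_tokens, d))

W_k = fit_linear_predictor(K_prev, K_cur)
r2_keys = explained_variance(K_cur, K_prev @ W_k)

# The value predictor sees [V_prev ; K_cur], concatenated along the channel axis.
V_prev = rng.normal(size=(n_tokens, d))
VK = np.hstack([V_prev, K_cur])
V_cur = VK @ (rng.normal(size=(2 * d, d)) / np.sqrt(2 * d))
V_cur += 0.2 * rng.normal(size=(n_tokens, d))

W_v = fit_linear_predictor(VK, V_cur)
r2_values = explained_variance(V_cur, VK @ W_v)

print(f"R^2 keys: {r2_keys:.3f}, R^2 values: {r2_values:.3f}")
```

Because the synthetic data is linear by construction, the high $R^2$ here only demonstrates the mechanics of the fit; it says nothing about real models.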

2.2 Residual Quantization

Having predicted the keys/values, AQUA-KV quantizes the residuals:

$$\Delta \mathbf K^{(L)} = \mathbf K^{(L)} - f_K^{(L)}\big(\widehat{\mathbf K}^{(L-1)}\big)$$

$$R_K^{(L)} = Q\big(\Delta \mathbf K^{(L)}\big)$$

Similar equations apply to values. Here, $Q(\cdot)$ can represent any quantizer (e.g., Quanto: absmax-scaling uniform, HIGGS: Hadamard-RHT and lattice VQ) (Shutova et al., 31 Jan 2025).

During inference, the full value is reconstructed by applying the predictor to the previous layer's reconstruction and adding the dequantized residual:

$$\widehat{\mathbf K}^{(L)} = f_K^{(L)}\big(\widehat{\mathbf K}^{(L-1)}\big) + Q^{-1}\big(R_K^{(L)}\big)$$
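The encode and decode steps for one layer can be sketched as follows. The quantizer here is a plain per-token min-max uniform quantizer, chosen only as a simple stand-in for $Q$; it is neither Quanto nor HIGGS. The data is synthetic, and the predictor is fitted on the same data for brevity.

```python
import numpy as np


def quantize_minmax(x, bits):
    """Per-token uniform min-max quantizer; a simple stand-in for Q."""
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2 ** bits - 1)
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant rows
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Inverse of quantize_minmax (plays the role of Q^{-1})."""
    return codes * scale + lo


rng = np.random.default_rng(0)
n_tokens, d, bits = 1024, 64, 2

# Synthetic data: K_prev_hat stands in for the reconstructed layer-(L-1) keys.
K_prev_hat = rng.normal(size=(n_tokens, d))
K_cur = K_prev_hat @ (rng.normal(size=(d, d)) / np.sqrt(d))
K_cur += 0.2 * rng.normal(size=(n_tokens, d))

# Predictor f_K: least-squares linear map (fitted on the same data for brevity).
W, *_ = np.linalg.lstsq(K_prev_hat, K_cur, rcond=None)
prediction = K_prev_hat @ W

# Encode: quantize only the residual.
codes, scale, lo = quantize_minmax(K_cur - prediction, bits)

# Decode: prediction plus dequantized residual.
K_cur_hat = prediction + dequantize(codes, scale, lo)

# Baseline: quantize the keys directly at the same bit width.
K_direct = dequantize(*quantize_minmax(K_cur, bits))

mse_residual = np.mean((K_cur - K_cur_hat) ** 2)
mse_direct = np.mean((K_cur - K_direct) ** 2)
print(f"residual MSE: {mse_residual:.4f}  direct MSE: {mse_direct:.4f}")
```

On this toy data the residual has a much smaller range than the keys themselves, so the same 2-bit grid yields a lower reconstruction error. Note that the decoder needs only the codes, the quantizer parameters, and the previous layer's reconstruction.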

2.3 Calibration and Inference

Calibration uses a representative corpus (e.g. 256 sequences, 8192 tokens each from RedPajama) and takes 1–6 hours on a single A100 GPU. Predictor storage is modest (~162 MiB), and inference overhead is ≈3% relative to standard quantization (Shutova et al., 31 Jan 2025).

3. Bit-Width Allocation and Quantization Strategies

AQUA-KV is quantizer-agnostic: any quantizer $Q$ can be applied to the residuals, with Quanto and HIGGS named as examples in Section 2.2. The bit allocation is static for each approach; adaptive strategies, such as allocating bits per layer in proportion to predictor-explained variance, are suggested for future work.

4. Extension: AKVQ-VL for Vision-Language Transformers

The AKVQ-VL approach extends AQUA-KV to VLMs, adding attention-aware token saliency analysis and outlier mitigation via orthogonal transforms (Su et al., 25 Jan 2025):

  • Text-Salient Attention (TSA): In early transformer layers ($\ell = 0, 1$), attention strongly favors text tokens. All text tokens in these layers are treated as salient and assigned higher quantization precision.
  • Pivot-Token-Salient Attention (PSA): In deeper layers, most tokens focus their attention on a small subset of "pivot" tokens, detected as positions with "massive activations" in the hidden states. These are assigned higher precision.

Walsh–Hadamard Transform

AKVQ-VL applies the Walsh–Hadamard Transform (WHT) to keys (and optionally values) along the channel dimension:

  • $K' = K \cdot H_n$, where $H_n$ is the Hadamard matrix of order $n$.
  • Reduces the magnitude of large outlier channels by up to $10\times$.
  • Post-WHT, clipping ratios for 2-bit quantization drop from $\sim 0.20$ to $\sim 0.03$, improving quantization accuracy dramatically.
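The outlier-spreading effect of the transform can be seen on a toy example. The sketch below builds an orthonormal Hadamard matrix with the Sylvester construction and applies it along the channel axis of a synthetic key matrix with one inflated channel. The data, the outlier scale of 30, and the helper name `hadamard` are assumptions for illustration; the numbers it prints are not the paper's measurements.

```python
import numpy as np


def hadamard(n):
    """Orthonormal Walsh-Hadamard matrix (Sylvester construction), n a power of two."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)


rng = np.random.default_rng(0)
n_tokens, d = 256, 128

K = rng.normal(size=(n_tokens, d))
K[:, 5] *= 30.0  # one synthetic outlier channel

H = hadamard(d)
K_rot = K @ H  # K' = K . H_n, applied along the channel dimension

peak_before = np.abs(K).max()
peak_after = np.abs(K_rot).max()
print(f"peak |K|: {peak_before:.1f}  peak |K'|: {peak_after:.1f}")

# H is orthogonal, so the rotation is undone exactly by its transpose.
K_back = K_rot @ H.T
```

Since the transform is orthogonal, it preserves the total energy of each token's vector and only redistributes it across channels, which is what flattens the outlier.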

Adaptive Bit-Budget

AKVQ-VL employs a three-tier heuristic:

  • Category 1 (pivots + $N_{rec}$ most recent tokens): 16-bit precision
  • Category 2 (remaining text tokens): 4-bit
  • Category 3 (all other vision tokens): 2-bit

This enables extremely low memory footprints with negligible quality loss (Su et al., 25 Jan 2025).
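The three tiers above can be sketched as a per-token bit assignment. The token layout below (576 vision tokens followed by 48 text tokens, two pivot positions, six recent tokens) is a made-up illustration, and `assign_bits` is a hypothetical helper; pivot detection itself is assumed to have been done already.

```python
import numpy as np


def assign_bits(is_text, is_pivot, n_recent):
    """Per-token bit width: 16 for pivots and recent tokens, 4 for text, 2 otherwise."""
    bits = np.where(is_text, 4, 2)
    bits[is_pivot] = 16
    if n_recent > 0:
        bits[-n_recent:] = 16
    return bits


n_vision, n_text = 576, 48
n_tokens = n_vision + n_text

is_text = np.zeros(n_tokens, dtype=bool)
is_text[n_vision:] = True  # text tokens follow the vision tokens

is_pivot = np.zeros(n_tokens, dtype=bool)
is_pivot[[0, 577]] = True  # illustrative pivot positions

bits = assign_bits(is_text, is_pivot, n_recent=6)
print(f"average bits per value: {bits.mean():.2f}")
```

Because vision tokens dominate the sequence in this layout, the average stays close to 2 bits even though a few tokens are kept at full precision.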

5. Empirical Results

LLMs

| Method | Bits | Perplexity (WikiText-2) | LongBench Score |
| --- | --- | --- | --- |
| Uncompressed (FP16) | 16 | 6.98 | 44.61 |
| HIGGS | ≈2 | 7.47 (+7.0%) | 42.80 (–4.0%) |
| AQUA-KV + HIGGS | 2.16 | 7.03 (+0.7%) | 44.26 (–0.8%) |
| KIVI | 2.25 | 9.34 (+33.8%) | 39.64 (–11.2%) |
| KVQuant | 2.33 | 9.43 (+35.1%) | 20.56 (–53.9%) |

AQUA-KV at 2.09–2.39 bits/value achieves an 86% memory reduction with less than 1% relative error in perplexity and LongBench metrics (Shutova et al., 31 Jan 2025).
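The memory figure follows directly from the bit widths: a cache stored at $b$ bits per value takes $b/16$ of the FP16 footprint. A small sketch of that arithmetic, ignoring predictor storage and any quantizer metadata not already counted in the bits-per-value figure:

```python
def memory_reduction(bits_per_value, baseline_bits=16.0):
    """Fractional reduction in cache size relative to the baseline precision."""
    return 1.0 - bits_per_value / baseline_bits


for b in (2.09, 2.16, 2.39):
    print(f"{b:.2f} bits/value -> {memory_reduction(b):.1%} smaller than FP16")
```

For the 2.16-bit configuration in the table this gives 1 - 2.16/16 = 86.5%, consistent with the quoted reduction of about 86%.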

VLMs

| Method | Bits | Avg. Acc. (12 tasks) |
| --- | --- | --- |
| FP16 | 16 | 51.5% |
| RTN (INT4) | 4 | 45.8% |
| RTN (INT2) | 2 | 19.1% |
| SmoothQuant (2) | 2 | 29.8% |
| KIVI (2) | 2 | 22.6% |
| SKVQ (2) | 2 | 31.1% |
| AKVQ-VL | 2 | 52.7% |

AKVQ-VL matches or exceeds uncompressed accuracy (52.7% vs. 51.5%), while all previous 2-bit LLM-targeted schemes collapse (<32%). It yields a 2.13× reduction in peak memory, a 3.25× larger batch size, and 2.46× higher throughput for LLaVA-v1.5-7B with a 500-token context (Su et al., 25 Jan 2025).

6. Limitations and Future Directions

Limitations identified include:

  • Requires per-model calibration. Calibration is one-shot, but training the predictors without the target quantizer in the loop (quantizer-agnostic training) worsens quality.
  • Marginal inference overhead (∼3%) and minor predictor storage (~162 MiB).
  • Per-architecture adjustments (e.g., handling of "attention sink" tokens).

Directions for further research include:

  • Adaptive per-layer bit-width allocation using explained variance.
  • Joint predictor fine-tuning with end-to-end loss.
  • Fusing predictor computation with core model kernels to reduce overhead.
  • Theoretical investigation of why transformer KV projections are highly predictable, potentially informing novel architecture designs (Shutova et al., 31 Jan 2025, Su et al., 25 Jan 2025).

7. Comparisons, Synergies, and Broader Context

AQUA-KV outperforms earlier cache compression approaches including uniform/absmax quantization, KIVI, SKVQ, and merging methods, especially at extreme compression (<3 bits/value). Ablation demonstrates that both key and value predictors are essential for accuracy; skipping initial tokens or detaching Q from calibration loops incurs measurable degradation. Experimental integration with H2O (heavy-hitter token pruning) shows orthogonality: combining both can further improve memory and compute efficiency with minimal accuracy impact (Shutova et al., 31 Jan 2025).

AQUA-KV establishes a new standard for practical transformer inference under stringent memory and bandwidth constraints, enabling scalable long-context applications in both text-only and multimodal (vision-language) settings.
