AQUA-KV: Adaptive Quantization for KV Caches
- AQUA-KV is a compression technique for transformer caches that exploits inter-layer predictability using lightweight linear predictors and residual quantization.
- It achieves near-lossless accuracy at 2–2.5 bits per value, reducing the cache memory footprint by up to 86%.
- The approach supports both LLMs and VLMs, enabling practical deployment in long-context scenarios with efficient one-shot calibration and a modest 3% inference overhead.
AQUA-KV (Adaptive Quantization for Key-Value Caches) is a state-of-the-art approach for extreme compression of Key-Value (KV) caches in transformer-based neural networks, achieving near-lossless accuracy at 2–2.5 bits per value in both LLMs and Vision-LLMs (VLMs). AQUA-KV exploits inherent inter-layer dependencies in transformer KV pairs using lightweight linear predictors, combined with residual quantization, to minimize memory footprint while preserving model quality. The approach is efficient, agnostic to quantization backbones, and requires only one-shot calibration per model, enabling practical deployment in long-context scenarios where standard KV caching incurs tens of gigabytes of memory overhead (Shutova et al., 31 Jan 2025, Su et al., 25 Jan 2025).
1. Motivation and Problem Setting
Transformer architectures rely on KV caching to enable efficient autoregressive inference: the projected key and value vectors for each token at each layer are stored so that they do not have to be recomputed. For long sequences, the aggregated KV cache can easily exceed the model's parameter size; for example, for 131K-token contexts and 70B-parameter LLMs, caches approach 43 GiB in FP16 precision. Classic quantization, token or channel pruning, and layer merging reduce memory but generally degrade model accuracy under severe bit budgets (e.g., <4 bits/value), compromising downstream metrics such as perplexity or task accuracy (Su et al., 25 Jan 2025, Shutova et al., 31 Jan 2025).
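The order of magnitude of that cache size can be checked with a short calculation. The sketch below assumes a Llama-3-70B-like shape (80 layers, 8 KV heads, head dimension 128); these values are assumptions for illustration and are not stated in the text. Under them, a single 131,072-token sequence needs about 40 GiB (roughly 43 GB) in FP16, the same range as the figure quoted above.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one key vector and one value vector per token, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Assumed Llama-3-70B-like configuration (hypothetical, for illustration only).
size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, n_tokens=131_072)
print(f"{size / 2**30:.1f} GiB, {size / 1e9:.1f} GB")  # 40.0 GiB, 42.9 GB
```

The cache grows linearly with both context length and batch size, which is why the per-value bit-width is the main lever for long-context serving.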
AQUA-KV addresses this with a dual insight:
- Transformer KV activations exhibit high inter-layer predictability; most of the information in layer $\ell$'s keys and values can be linearly explained from layer $\ell - 1$.
- After regressing out the predictable component, the residuals have significantly lower dynamic range and entropy, making them especially amenable to robust quantization.
This pattern is observed across major LLMs (Llama 3.x, Qwen 2.5) and VLMs (LLaVA-v1.5/1.6), and the empirical explained-variance ratios are indicative of good 2-bit quantizers (Shutova et al., 31 Jan 2025, Su et al., 25 Jan 2025).
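The two observations can be illustrated on synthetic data. The sketch below does not use real model activations: it builds a "current layer" that is, by construction, a linear function of the "previous layer" plus noise, then measures how much variance a least-squares fit explains and how much smaller the residual's range is. All names and the noise level are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, dim = 2048, 64

# Synthetic stand-in for two adjacent layers (not real KV activations):
# the current layer is a linear map of the previous one plus unpredictable noise.
prev_layer = rng.standard_normal((n_tokens, dim))
mixing = rng.standard_normal((dim, dim)) / np.sqrt(dim)
cur_layer = prev_layer @ mixing + 0.3 * rng.standard_normal((n_tokens, dim))

# Least-squares linear fit from the previous layer to the current layer.
weights, *_ = np.linalg.lstsq(prev_layer, cur_layer, rcond=None)
residual = cur_layer - prev_layer @ weights

# Fraction of variance explained, and how much the value range shrinks.
explained_variance = 1.0 - residual.var() / cur_layer.var()
range_ratio = np.abs(residual).max() / np.abs(cur_layer).max()
print(f"explained variance: {explained_variance:.2f}")
print(f"residual max / original max: {range_ratio:.2f}")
```

A smaller range means a uniform quantizer with the same number of levels can use a finer step, which is the intuition behind quantizing residuals rather than raw tensors.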
2. Algorithm: Predictor-Based Residual Quantization
2.1 Predictor Construction
AQUA-KV introduces a sequential, layerwise regression mechanism:
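- For each transformer layer $\ell$, fit two linear predictors: $f_K^{(\ell)}$ (key) and $f_V^{(\ell)}$ (value).
- $f_K^{(\ell)}$ maps reconstructed keys from the previous layer, $\hat{K}^{(\ell-1)}$, to current keys $K^{(\ell)}$.
- $f_V^{(\ell)}$ takes reconstructed values $\hat{V}^{(\ell-1)}$ and current keys (in reconstructed form, $\hat{K}^{(\ell)}$, so that the same inputs are available at decode time) as inputs for predicting $V^{(\ell)}$.
For $\ell > 1$, with $\tilde{K}^{(\ell)}$ and $\tilde{V}^{(\ell)}$ denoting the predictions:
$$\tilde{K}^{(\ell)} = f_K^{(\ell)}\big(\hat{K}^{(\ell-1)}\big), \qquad \tilde{V}^{(\ell)} = f_V^{(\ell)}\big(\hat{V}^{(\ell-1)}, \hat{K}^{(\ell)}\big)$$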
The predictors are closed-form linear regressors; cross-token or multi-layer predictors provide negligible additional benefit over simple per-layer mappings (Shutova et al., 31 Jan 2025).
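A closed-form fit of one layer's key and value predictors could look like the following minimal sketch. It uses ridge-regularized normal equations on synthetic calibration tensors; the function name `fit_linear`, the ridge term, and the data are assumptions for illustration, not the authors' implementation. For brevity the value predictor is fed the unquantized current keys, whereas a full pipeline would feed the reconstructed ones.

```python
import numpy as np


def fit_linear(inputs, targets, ridge=1e-6):
    """Closed-form ridge regression: W minimizing ||inputs @ W - targets||^2."""
    gram = inputs.T @ inputs + ridge * np.eye(inputs.shape[1])
    return np.linalg.solve(gram, inputs.T @ targets)


rng = np.random.default_rng(0)
n_tokens, dim = 1024, 32

# Synthetic calibration tensors (hypothetical stand-ins for real activations).
k_prev_rec = rng.standard_normal((n_tokens, dim))  # reconstructed keys, previous layer
v_prev_rec = rng.standard_normal((n_tokens, dim))  # reconstructed values, previous layer
k_cur = (k_prev_rec @ rng.standard_normal((dim, dim))) * 0.2
k_cur += 0.1 * rng.standard_normal((n_tokens, dim))
v_cur = ((v_prev_rec + k_cur) @ rng.standard_normal((dim, dim))) * 0.2
v_cur += 0.1 * rng.standard_normal((n_tokens, dim))

# Key predictor: previous-layer reconstructed keys -> current keys.
w_key = fit_linear(k_prev_rec, k_cur)

# Value predictor: [previous-layer reconstructed values, current keys] -> current values.
value_inputs = np.concatenate([v_prev_rec, k_cur], axis=1)
w_value = fit_linear(value_inputs, v_cur)

key_residual = k_cur - k_prev_rec @ w_key
value_residual = v_cur - value_inputs @ w_value
print(key_residual.std() / k_cur.std(), value_residual.std() / v_cur.std())
```

Because the fit is a single linear solve per layer, calibration cost is dominated by collecting activations rather than by training.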
2.2 Residual Quantization
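Having predicted the keys and values, AQUA-KV quantizes only the residuals, i.e., the part the predictors cannot explain. For keys at layer $\ell$, with $\tilde{K}^{(\ell)}$ denoting the predictor output and $Q$ the quantizer:
$$\bar{R}_K^{(\ell)} = Q\big(K^{(\ell)} - \tilde{K}^{(\ell)}\big)$$
Similar equations apply to values. Here, $Q$ can represent any quantizer (e.g., Quanto: absmax-scaling uniform, HIGGS: Hadamard-RHT and lattice VQ) (Shutova et al., 31 Jan 2025).
During inference, the full tensor is reconstructed by applying the predictor and adding the dequantized residual:
$$\hat{K}^{(\ell)} = \tilde{K}^{(\ell)} + Q^{-1}\big(\bar{R}_K^{(\ell)}\big)$$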
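The encode and decode steps for keys can be sketched as follows. The quantizer here is a toy per-row min-max uniform quantizer standing in for any backbone; it is not Quanto or HIGGS, and the function names and synthetic data are assumptions for illustration. The comparison at the end shows why quantizing the residual helps at 2 bits when the layer is predictable.

```python
import numpy as np


def quantize(x, bits):
    """Toy per-row min-max uniform quantizer (stand-in for any quantizer)."""
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant rows
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo


def dequantize(codes, scale, lo):
    return codes * scale + lo


def encode_keys(k_cur, k_prev_rec, w_key, bits=2):
    """Store only the quantized residual between true and predicted keys."""
    return quantize(k_cur - k_prev_rec @ w_key, bits)


def decode_keys(packed, k_prev_rec, w_key):
    """Reconstruct keys: prediction plus dequantized residual."""
    return k_prev_rec @ w_key + dequantize(*packed)


rng = np.random.default_rng(0)
n_tokens, dim = 512, 64
k_prev_rec = rng.standard_normal((n_tokens, dim))
mixing = rng.standard_normal((dim, dim)) / np.sqrt(dim)
k_cur = k_prev_rec @ mixing + 0.3 * rng.standard_normal((n_tokens, dim))
w_key, *_ = np.linalg.lstsq(k_prev_rec, k_cur, rcond=None)

k_rec = decode_keys(encode_keys(k_cur, k_prev_rec, w_key), k_prev_rec, w_key)
k_direct = dequantize(*quantize(k_cur, bits=2))

err_residual = np.mean((k_cur - k_rec) ** 2)
err_direct = np.mean((k_cur - k_direct) ** 2)
print(f"MSE with predictor: {err_residual:.4f}, without: {err_direct:.4f}")
```

Note that the decoder only needs the previous layer's reconstructed keys and the stored codes, so layers must be decoded in order.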
2.3 Calibration and Inference
Calibration uses a representative corpus (e.g. 256 sequences, 8192 tokens each from RedPajama) and takes 1–6 hours on a single A100 GPU. Predictor storage is modest (~162 MiB), and inference overhead is ≈3% relative to standard quantization (Shutova et al., 31 Jan 2025).
3. Bit-Width Allocation and Quantization Strategies
AQUA-KV is quantizer-agnostic. In practice, it employs:
- Quanto: per-token uniform quantization with absmax scaling.
- HIGGS: combines the Hadamard transform and lattice vector quantization for enhanced low-bit operation (Shutova et al., 31 Jan 2025).
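For intuition, per-token absmax uniform quantization can be sketched as below. This is a generic illustration of the scheme described in the first bullet, not the Quanto library's API; the function names and the symmetric signed-integer layout are assumptions.

```python
import numpy as np


def absmax_quantize(x, bits):
    """Symmetric uniform quantization with one absmax scale per token (row)."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4 bits, 1 for 2 bits
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    codes = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return codes, scale


def absmax_dequantize(codes, scale):
    return codes.astype(np.float32) * scale


rng = np.random.default_rng(0)
x = rng.standard_normal((256, 64)).astype(np.float32)

# Mean squared reconstruction error at several bit-widths.
errors = {}
for bits in (2, 4, 8):
    codes, scale = absmax_quantize(x, bits)
    errors[bits] = float(np.mean((x - absmax_dequantize(codes, scale)) ** 2))
print(errors)
```

Because the scale is set by the largest magnitude in each row, a single outlier inflates the step size for the whole token, which is one reason low-bit schemes pair uniform quantization with predictors or rotations.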
The bit allocation is static per approach, but adaptive strategies are suggested for future work (allocating bits per layer in proportion to predictor-explained variance).
4. Extension: AKVQ-VL for Vision-Language Transformers
The AKVQ-VL approach extends AQUA-KV to VLMs, adding attention-aware token saliency analysis and outlier mitigation via orthogonal transforms (Su et al., 25 Jan 2025):
- Text-Salient Attention (TSA): In early transformer layers, attention strongly favors text tokens. All text tokens in these layers are treated as salient and assigned higher quantization precision.
- Pivot-Token-Salient Attention (PSA): In deeper layers, most tokens focus their attention on a small subset of "pivot" tokens, detected as positions with "massive activations" in the hidden states. These are assigned higher precision.
Walsh–Hadamard Transform
AKVQ-VL applies the Walsh–Hadamard Transform (WHT) to keys (and optionally values) along the channel dimension:
- Reduces the magnitude of large outlier channels.
- After the WHT, the clipping ratios needed for 2-bit quantization are reduced, improving quantization accuracy dramatically.
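The outlier-spreading effect of the transform can be seen in a small sketch. It builds an orthonormal Walsh–Hadamard matrix by the Sylvester construction and applies it along the channel dimension of synthetic keys with one artificially inflated channel. The data, the outlier scale of 50, and the peak-to-RMS metric are assumptions for illustration, not values from the paper.

```python
import numpy as np


def hadamard_matrix(n):
    """Orthonormal Walsh-Hadamard matrix (Sylvester construction)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def peak_to_rms(x):
    """Largest magnitude relative to the root-mean-square value."""
    return np.abs(x).max() / np.sqrt(np.mean(x**2))


rng = np.random.default_rng(0)
n_tokens, dim = 256, 128
keys = rng.standard_normal((n_tokens, dim))
keys[:, 5] *= 50.0  # synthetic outlier channel

h = hadamard_matrix(dim)
rotated = keys @ h  # transform along the channel dimension

before, after = peak_to_rms(keys), peak_to_rms(rotated)
print(f"peak/RMS before: {before:.1f}, after: {after:.1f}")
```

Since the matrix is orthogonal, the rotation is exactly invertible (`rotated @ h.T` recovers `keys`), so it changes the distribution seen by the quantizer without losing information.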
Adaptive Bit-Budget
AKVQ-VL employs a three-tier heuristic:
- Category 1 (pivots + most recent tokens): 16-bit precision
- Category 2 (remaining text tokens): 4-bit
- Category 3 (all other vision tokens): 2-bit
This enables extremely low memory footprints with negligible quality loss (Su et al., 25 Jan 2025).
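The three-tier rule can be sketched as a per-token bit assignment. The token counts, pivot positions, and recent-window size below are hypothetical, and the sketch ignores the layer-dependent TSA rule and the storage cost of quantization metadata such as scales.

```python
import numpy as np


def assign_bits(is_text, is_pivot, n_recent):
    """Per-token bit-widths following the three categories above."""
    n_tokens = len(is_text)
    bits = np.full(n_tokens, 2)  # Category 3: all other (vision) tokens
    bits[is_text] = 4            # Category 2: text tokens
    bits[is_pivot] = 16          # Category 1: pivot tokens
    if n_recent > 0:
        bits[-n_recent:] = 16    # Category 1: most recent tokens
    return bits


# Hypothetical sequence: 576 vision tokens followed by 64 text tokens.
n_vision, n_text = 576, 64
is_text = np.arange(n_vision + n_text) >= n_vision
is_pivot = np.zeros(n_vision + n_text, dtype=bool)
is_pivot[[0, 300]] = True  # assumed pivot positions

bits = assign_bits(is_text, is_pivot, n_recent=8)
average_bits = bits.mean()
print(f"average bits per cached value: {average_bits:.2f}")  # 2.39
```

Because vision tokens usually dominate the sequence, the average stays close to 2 bits even though a few tokens are kept at full precision.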
5. Empirical Results
LLMs
| Method | Bits | Perplexity (WikiText-2) | LongBench Score |
|---|---|---|---|
| Uncompressed (FP16) | 16 | 6.98 | 44.61 |
| HIGGS | ≈2 | 7.47 (+7.0%) | 42.80 (–4.0%) |
| AQUA-KV + HIGGS | 2.16 | 7.03 (+0.7%) | 44.26 (–0.8%) |
| KIVI | 2.25 | 9.34 (+33.8%) | 39.64 (–11.2%) |
| KVQuant | 2.33 | 9.43 (+35.1%) | 20.56 (–53.9%) |
AQUA-KV at 2.09–2.39 bits/value achieves roughly an 86% reduction in KV-cache memory relative to FP16, with less than 1% relative degradation in perplexity and LongBench score (Shutova et al., 31 Jan 2025).
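These percentages follow directly from the table. The sketch below recomputes the memory reduction for the 2.16-bit row and the relative changes against the FP16 baseline; it counts only the cached values and leaves out the fixed predictor storage.

```python
def memory_reduction(bits, baseline_bits=16):
    """Fraction of cache memory saved relative to the baseline bit-width."""
    return 1.0 - bits / baseline_bits


def relative_change(value, baseline):
    """Signed relative change of a metric with respect to its baseline."""
    return (value - baseline) / baseline


print(f"{memory_reduction(2.16):.1%}")         # 86.5%
print(f"{relative_change(7.03, 6.98):+.1%}")    # +0.7% (perplexity)
print(f"{relative_change(44.26, 44.61):+.1%}")  # -0.8% (LongBench)
```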
VLMs
| Method | Bits | Avg. Acc. (12 tasks) |
|---|---|---|
| FP16 | 16 | 51.5% |
| RTN (INT4) | 4 | 45.8% |
| RTN (INT2) | 2 | 19.1% |
| SmoothQuant (2) | 2 | 29.8% |
| KIVI (2) | 2 | 22.6% |
| SKVQ (2) | 2 | 31.1% |
| AKVQ-VL | 2 | 52.7% |
AKVQ-VL matches or exceeds uncompressed accuracy (52.7% vs. 51.5%), while all prior 2-bit schemes designed for LLMs collapse (<32%). On LLaVA-v1.5-7B with a 500-token context, it yields a 2.13× reduction in peak memory, a 3.25× larger batch size, and 2.46× higher throughput (Su et al., 25 Jan 2025).
6. Limitations and Future Directions
Limitations identified include:
- Requires per-model calibration. Calibration is one-shot, but the predictors should be trained with the target quantizer in the loop: quantizer-agnostic training worsens quality.
- Marginal inference overhead (≈3%) and minor predictor storage (~162 MiB).
- Per-architecture adjustments (e.g., handling of "attention sink" tokens).
Directions for further research include:
- Adaptive per-layer bit-width allocation using explained variance.
- Joint predictor fine-tuning with end-to-end loss.
- Fusing predictor computation with core model kernels to reduce overhead.
- Theoretical investigation of why transformer KV projections are highly predictable, potentially informing novel architecture designs (Shutova et al., 31 Jan 2025, Su et al., 25 Jan 2025).
7. Comparisons, Synergies, and Broader Context
AQUA-KV outperforms earlier cache compression approaches, including uniform/absmax quantization, KIVI, SKVQ, and merging methods, especially at extreme compression (<3 bits/value). Ablations show that both key and value predictors are essential for accuracy, and that skipping initial tokens or detaching the quantizer Q from the calibration loop incurs measurable degradation. Experimental integration with H2O (heavy-hitter token pruning) indicates that the two methods are orthogonal: combining them can further improve memory and compute efficiency with minimal accuracy impact (Shutova et al., 31 Jan 2025).
AQUA-KV establishes a new standard for practical transformer inference under stringent memory and bandwidth constraints, enabling scalable long-context applications in both text-only and multimodal (vision-language) settings.