KVSink: Enhancing KV Quantization in LLMs
- KVSink is a plug-and-play methodology for KV cache quantization that preserves critical attention sinks—token positions absorbing disproportionate attention mass.
- The approach dynamically identifies sink tokens at a fixed emergence layer, excluding them from quantization to recover over 95% of FP16 perplexity.
- Empirical evaluations on models like LLaMA2 show that KVSink outperforms PFN, offering robust performance and minimal computational overhead for aggressive model compression.
KVSink is a plug-and-play methodology for key-value (KV) cache quantization in LLM inference that targets the preservation of “attention sinks”—token positions that disproportionately absorb attention mass and introduce quantization sensitivity. It advances prior quantization pipelines by algorithmically predicting sink tokens during inference, permitting targeted exclusion from quantization and thereby maximizing perplexity (PPL) recovery with minimal overhead (Su et al., 6 Aug 2025).
1. Definition and Role of Attention Sinks
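An attention sink is a token position that consistently aggregates an outsized fraction of attention mass across multiple heads and layers. For a single head $h$ at layer $l$, attention scores are defined by

$$A^{(l,h)} = \mathrm{softmax}\!\left(\frac{Q^{(l,h)} \left(K^{(l,h)}\right)^{\top}}{\sqrt{d_h}}\right),$$

and the set of sinks for that head, $S^{(l,h)}$, consists of the key positions $j$ that receive an outsized share of this attention mass. Aggregating across all heads and layers yields the model-level sink set $S$.
Sink tokens frequently show numerically small key and value entries due to QKV suppression, so quantization error in the KV cache at these indices yields disproportionately large errors in attention outputs. Formally, the attention output for query position $i$ can be decomposed as

$$o_i = \sum_{j \in S} A_{ij}\, v_j + \sum_{j \notin S} A_{ij}\, v_j,$$

where the first term, the sink bias, is nearly constant across $i$ and highly sensitive to quantization of the corresponding cache entries.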
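As a rough illustration of the definition above, the sketch below flags key positions whose average received attention exceeds a cutoff. The threshold `tau`, the function names, and the toy inputs are assumptions made for this example; the source does not state the exact sink criterion.

```python
import numpy as np

def causal_attention(q, k):
    """Causal scaled dot-product attention weights for q, k of shape (n, d)."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # Mask out future positions (strictly upper triangle).
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over key positions.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def find_sinks(attn, tau=0.3):
    """Key positions whose mean received attention exceeds tau (hypothetical cutoff)."""
    received = attn.mean(axis=0)  # average over query positions
    return np.flatnonzero(received > tau), received

# Toy head: every query aligns strongly with the key at position 0.
n, d = 6, 4
q = np.tile(np.array([1.0, 0.0, 0.0, 0.0]), (n, 1))
k = np.zeros((n, d))
k[0, 0] = 8.0

attn = causal_attention(q, k)
sinks, received = find_sinks(attn)
print(sinks, received.round(3))
```

In this toy head only position 0 is flagged, because every query row places most of its probability mass on that key.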
2. Cross-Layer Dynamics of Extreme Activation Outliers
The emergence and propagation of extreme channel-wise outliers underlie the formation of attention sinks. Tracking the activation tensors of each transformer block $l$, one observes five distinct stages in outlier magnitude and position:
- Initial: Absence of extreme outliers.
- Emergence (layer $l_E$): Activation spikes arise and propagate forward, creating stable outliers in the hidden state.
- Stabilization: Intermediate layers sustain large, consistent outliers, while the spikes that produced them subside.
- Dissipation: Counter-spikes nullify the prior stable outliers.
- Final: All activations return to baseline magnitudes.
These persistent, stable outliers correspond to emerging and sustained attention sinks, which significantly influence downstream attention and the sensitivity to quantization.
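One way to observe these stages empirically is to scan per-layer hidden states for the first layer whose largest activation dwarfs the typical magnitude. The sketch below does this on synthetic data; the max-to-median ratio, the `ratio_threshold` value, and the function names are assumptions for illustration, not the procedure used in the source.

```python
import numpy as np

def outlier_ratio(hidden):
    """Ratio of the largest absolute activation to the median absolute activation."""
    mags = np.abs(hidden)
    return mags.max() / np.median(mags)

def find_emergence(hiddens, ratio_threshold=100.0):
    """Return (layer, token, channel) of the first extreme outlier, or None."""
    for layer_idx, h in enumerate(hiddens, start=1):
        if outlier_ratio(h) > ratio_threshold:
            token, channel = np.unravel_index(np.abs(h).argmax(), h.shape)
            return layer_idx, int(token), int(channel)
    return None

# Synthetic stack of 6 layers, each (tokens=16, channels=8).
rng = np.random.default_rng(0)
hiddens = [rng.normal(size=(16, 8)) for _ in range(6)]
# Inject a stable outlier at token 0, channel 5 in layers 3 and 4 only,
# mimicking emergence, stabilization, and later dissipation.
for layer in (3, 4):
    hiddens[layer - 1][0, 5] = 800.0

emergence = find_emergence(hiddens)
print(emergence)
```

On real models the hidden states would come from forward hooks rather than random data, and the threshold would need tuning per architecture.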
3. Limitations of the Preserve-First-N Strategy
The Preserve-First-N (PFN) strategy excludes the KV entries of the first $N$ tokens from quantization. It has two core limitations:
- Attention sinks may occur at positions beyond $N$ (e.g., token 14 in LLaMA2-7B prefill). PFN cannot detect or exclude such sinks.
- Empirical evidence demonstrates sudden PPL degradation if a non-excluded sink emerges after position $N$, regardless of how many initial tokens are preserved. Thus, PFN is not robust against atypical sink locations and cannot guarantee PPL stability (Su et al., 6 Aug 2025).
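The coverage gap can be made concrete with a few lines of set arithmetic. In the sketch below, the sink list `[0, 14]` is a hypothetical example that treats "token 14" as a 0-based index, and both helper names are invented for illustration.

```python
def pfn_preserved(n_tokens, n_first):
    """Positions PFN keeps in full precision: the first n_first tokens."""
    return set(range(min(n_first, n_tokens)))

def missed_sinks(sink_positions, preserved):
    """Sink positions that would still be quantized."""
    return sorted(set(sink_positions) - preserved)

# Hypothetical sink set: the initial token plus a late sink at index 14.
true_sinks = [0, 14]

for n_first in (1, 5, 10, 15):
    missed = missed_sinks(true_sinks, pfn_preserved(4096, n_first))
    print(f"PFN({n_first}) misses sinks at {missed}")
```

Only a prefix long enough to reach the late sink covers it, and that length cannot be known in advance without inspecting the sequence.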
4. KVSink: Algorithmic Sink Token Identification and Preservation
KVSink mitigates these limitations by dynamically predicting sink positions in the prefill phase through activation outlier identification at a fixed emergence layer $l_E$. The algorithm:
- Selects a pre-identified outlier channel $c$ at layer $l_E$.
- For the hidden-state tensor $H^{l_E}$, determines the threshold $T$ as the $k$-th largest $|H^{l_E}[:, c]|$ over token positions.
- The predicted sink set $S_{\text{sink}}$ comprises all positions $i$ where $|H^{l_E}[i, c]| \ge T$.
Pseudocode:

```
S_sink ← ∅
for l = 1..L:
    Compute Q^l, K^l, V^l from H^{l-1}
    Quantize K^l[i], V^l[i] for all i ∉ S_sink
    Run attention and FFN to get H^l
    if l == l_E:
        Find threshold T as the k-th largest |H^l[:,c]|
        S_sink ← { i | |H^l[i,c]| ≥ T }
return H^L
```
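The thresholding step of the pseudocode can be sketched in NumPy as follows. The function name, the synthetic hidden state, and the injected outlier positions are assumptions for illustration; only the top-$k$ rule on a single channel comes from the description above.

```python
import numpy as np

def predict_sinks(hidden, channel, k):
    """Positions whose |hidden[:, channel]| is at least the k-th largest magnitude.

    hidden: array of shape (n_tokens, d_model), the hidden state at the emergence layer.
    """
    mags = np.abs(hidden[:, channel])
    k = min(k, mags.shape[0])
    # np.partition places the k-th largest value at index -k without a full sort.
    threshold = np.partition(mags, -k)[-k]
    return np.flatnonzero(mags >= threshold)

# Synthetic hidden state with two extreme outliers in channel 3.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(32, 16))
hidden[0, 3] = 900.0
hidden[14, 3] = -750.0

sink_positions = predict_sinks(hidden, channel=3, k=2)
print(sink_positions)
```

Because the comparison is `>=`, ties at the threshold can return more than `k` positions, which matches the set definition in the pseudocode.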
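The quantization step is applied only to positions outside the sink set: $K^l[i]$ and $V^l[i]$ are quantized for $i \notin S_{\text{sink}}$ and left unquantized otherwise.
Computational complexity is dominated by a single top-$k$ selection over token positions, which is negligible for contemporary context lengths.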
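A minimal sketch of sink-aware quantization is shown below, assuming asymmetric per-token round-to-nearest (RTN) fake quantization. The source evaluates RTN among other schemes but does not give this exact formula here, so the quantizer and both function names should be read as illustrative assumptions.

```python
import numpy as np

def rtn_quantize_per_token(x, bits):
    """Asymmetric round-to-nearest fake quantization, one scale per row (token)."""
    qmax = 2**bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / qmax  # guard against constant rows
    q = np.clip(np.round((x - lo) / scale), 0, qmax)
    return q * scale + lo  # dequantized values

def quantize_kv_except_sinks(kv, sinks, bits=4):
    """Quantize every token's entries except those at sink positions."""
    sinks = np.asarray(sinks, dtype=np.intp)
    out = rtn_quantize_per_token(kv, bits)
    out[sinks] = kv[sinks]  # sink rows stay in original precision
    return out

rng = np.random.default_rng(0)
kv = rng.normal(size=(8, 16))
out = quantize_kv_except_sinks(kv, sinks=[0, 5], bits=2)
print(np.abs(out - kv).max(axis=-1).round(3))
```

The printed per-token errors are zero at the preserved rows and nonzero elsewhere, which is the only behavioral difference from quantizing the whole cache.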
5. Experimental Validation and Comparative Performance
Experiments span models including LLaMA2-7B/13B/70B, LLaMA2-7B-chat, Mistral-7B, LLaMA3-8B, and LLaMA3.1-8B-instruct, evaluated on Wikitext-2 and C4 with variable quantization schemes (RTN INT2/INT4, static/dynamic per-token and per-channel). Baselines are PFN($N$) and KVQuant.
Results Overview
| Model/Method | FP16 PPL | PFN(5) PPL | KVSink(5) PPL |
|---|---|---|---|
| LLaMA2-70B, 4-bit | 2.5 | 59.5 | 5.0 |
Across all tested LLMs, preserving sinks via KVSink recovers over 95% of FP16 PPL, while PFN requires preserving more tokens and still fails in some cases.
For KVSink + KVQuant integration (Wikitext-2, LLaMA2-7B, 2-bit quantization, 1% outlier isolation):
- KVQuant PPL: 5.53
- KVSink-5 PPL: 5.44
On the C4 dataset: 6.94 → 6.81 (KVQuant → KVSink).
Importantly, lowering the outlier-isolation budget (to 0.1%) does not hurt PPL with KVSink, enabling more aggressive model compression.
6. Guidelines for Practical Integration
KVSink integration follows a static, per-architecture configuration:
- Identify $l_E$ once for each model (cf. Table 9).
- Target $k = 5$ sink tokens (empirically sufficient for tested LLMs).
- In static-quant pipelines, also exclude sink set from calibration.
- Implementation overhead is minimal: 0.04–0.05 ms latency per 4K-token sequence (A100), MB memory for storage.
- Recommended practice: wrap the quantizer, perform outlier detection at $l_E$, and “freeze” the sink set for the remainder of inference.
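The wrap-and-freeze practice might be organized as in the sketch below. The class `SinkAwareKVQuantizer`, its method names, and the toy rounding quantizer are hypothetical; they illustrate the control flow only and are not an API from the source.

```python
import numpy as np

class SinkAwareKVQuantizer:
    """Hypothetical wrapper: detect sinks once at the emergence layer, then freeze them."""

    def __init__(self, quantize_fn, emergence_layer, channel, k=5):
        self.quantize_fn = quantize_fn  # must return a new array
        self.emergence_layer = emergence_layer
        self.channel = channel
        self.k = k
        self.sinks = None  # set once, never updated afterwards

    def observe(self, layer_idx, hidden):
        """Call after each layer; only the emergence layer triggers detection."""
        if layer_idx != self.emergence_layer or self.sinks is not None:
            return
        mags = np.abs(hidden[:, self.channel])
        k = min(self.k, mags.shape[0])
        threshold = np.partition(mags, -k)[-k]
        self.sinks = np.flatnonzero(mags >= threshold)

    def quantize(self, kv):
        """Quantize kv, restoring the original rows at frozen sink positions."""
        out = self.quantize_fn(kv)
        if self.sinks is not None:
            out[self.sinks] = kv[self.sinks]
        return out

# Toy run: rounding stands in for a real quantizer; layer 2 is the assumed emergence layer.
rng = np.random.default_rng(0)
wrapper = SinkAwareKVQuantizer(np.round, emergence_layer=2, channel=2, k=2)
kv = rng.normal(size=(16, 8))
hidden = rng.normal(size=(16, 8))
hidden[0, 2], hidden[9, 2] = 500.0, 400.0

outputs = {}
for layer_idx in range(1, 5):
    outputs[layer_idx] = wrapper.quantize(kv)  # quantize with the current sink set
    wrapper.observe(layer_idx, hidden)         # then inspect this layer's hidden state
print(wrapper.sinks)
```

As in the pseudocode, layers up to and including the emergence layer are quantized with an empty sink set; preservation takes effect from the next layer onward.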
KVSink thus provides a methodologically grounded, low-cost approach for attention sink preservation, consistently improving quantized inference accuracy over both PFN and state-of-the-art alternatives (Su et al., 6 Aug 2025).