
KVSink: Enhancing KV Quantization in LLMs

Updated 17 December 2025
  • KVSink is a plug-and-play methodology for KV cache quantization that preserves critical attention sinks—token positions absorbing disproportionate attention mass.
  • The approach dynamically identifies sink tokens at a fixed emergence layer, excluding them from quantization to recover over 95% of FP16 perplexity.
  • Empirical evaluations on models like LLaMA2 show that KVSink outperforms PFN, offering robust performance and minimal computational overhead for aggressive model compression.

KVSink is a plug-and-play methodology for key-value (KV) cache quantization in LLM inference that targets the preservation of “attention sinks”—token positions that disproportionately absorb attention mass and introduce quantization sensitivity. It advances prior quantization pipelines by algorithmically predicting sink tokens during inference, permitting targeted exclusion from quantization and thereby maximizing perplexity (PPL) recovery with minimal overhead (Su et al., 6 Aug 2025).

1. Definition and Role of Attention Sinks

An attention sink is formally a token position $i$ that consistently aggregates an outsized fraction of attention mass across multiple heads and layers. For a single head $k$ at layer $l$, attention scores are defined by

$$e^{l,k}_{t,i} = \frac{\langle q^{l,k}_t, k^{l,k}_i \rangle}{\sqrt{d_k}},\quad p^{l,k}_{t,i} = \mathrm{Softmax}_i(e^{l,k}_{t,i}),$$

and the set of sinks for that head is

$$S^{l,k} = \left\{ i~|~\forall t,\; p^{l,k}_{t,i} \gg \frac{1}{t} \right\}.$$

Aggregating across all heads and layers,

$$S = \bigcup_{l,k} S^{l,k}.$$
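The definitions above can be sketched for a single head in NumPy. This is a minimal illustration, not the paper's procedure: the informal "$\gg$" is replaced by a hypothetical multiplicative margin `ratio`, and the first `warmup` queries are skipped because $\text{ratio}/t$ exceeds 1 for very small $t$. Both parameters and the function names are assumptions introduced here.

```python
import numpy as np

def causal_attention_probs(q, k):
    """Causal softmax attention probabilities for one head; q and k have shape (n, d_k)."""
    n, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)
    # Mask future positions (i > t) so each query only attends to its prefix.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

def head_sinks(p, ratio=4.0, warmup=4):
    """Positions i with p[t, i] > ratio * (1/t) for every checked query t that sees i.

    `ratio` and `warmup` are illustrative stand-ins for the informal ">>";
    they are not values taken from the paper.
    """
    n = p.shape[0]
    uniform = 1.0 / np.arange(1, n + 1)        # uniform share over the t visible tokens
    sinks = []
    for i in range(n):
        rows = np.arange(max(i, warmup), n)    # queries that can attend to token i
        if rows.size and np.all(p[rows, i] > ratio * uniform[rows]):
            sinks.append(i)
    return sinks

# Toy head in which every query aligns strongly with the key at position 0.
q = np.ones((16, 4))
k = np.zeros((16, 4))
k[0] = 8.0
print(head_sinks(causal_attention_probs(q, k)))
```

Taking the union of such per-head lists over all heads and layers gives the aggregate set $S$.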

Sink tokens $i \in S$ frequently show numerically small $v^{l,k}_i$ due to QKV suppression, so quantization error in the $V$-cache at these indices yields disproportionately large errors in attention outputs. Formally, the attention output can be decomposed as

$$\mathrm{Attention}(Q,K,V)^t = \underbrace{\sum_{i \notin S} p_i^t v_i}_{\text{ordinary}} + \underbrace{\sum_{i \in S} p_i^t v_i}_{\text{sink bias}},$$

where the sink bias term is nearly constant across $t$ and highly sensitive to quantization of the corresponding $v_i$.
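A small numerical sketch of this decomposition may help. It uses a hand-built toy probability matrix (no causal mask, arbitrary values, sink assumed at position 0) and shows that an error $\delta$ added to $v_i$ shifts the output at query $t$ by $p_i^t \delta$, so the same error is amplified far more at a sink than at an ordinary position. The function names are hypothetical.

```python
import numpy as np

def split_attention_output(p, v, sink_idx):
    """Split P @ V into the ordinary part and the sink-bias part."""
    sink_mask = np.zeros(p.shape[1], dtype=bool)
    sink_mask[np.asarray(sink_idx, dtype=np.intp)] = True
    ordinary = p[:, ~sink_mask] @ v[~sink_mask]
    sink_bias = p[:, sink_mask] @ v[sink_mask]
    return ordinary, sink_bias

def output_shift(p, i, delta):
    """Change of every output row when v_i is perturbed by delta: p[t, i] * delta."""
    return np.outer(p[:, i], delta)

# Toy head: every query puts 0.9 of its mass on token 0 (the assumed sink).
p = np.full((4, 6), 0.02)
p[:, 0] = 0.9
v = np.arange(12, dtype=float).reshape(6, 2)

ordinary, sink_bias = split_attention_output(p, v, [0])
delta = np.array([0.1, -0.1])  # identical value-cache error at two positions
sink_err = np.linalg.norm(output_shift(p, 0, delta))
ordinary_err = np.linalg.norm(output_shift(p, 3, delta))
print(sink_err / ordinary_err)  # ratio of attention weights, 0.9 / 0.02
```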

2. Cross-Layer Dynamics of Extreme Activation Outliers

The emergence and propagation of extreme channel-wise outliers underlie the formation of attention sinks. Tracking the following tensors in each block $l$,

$$X^l_{\mathrm{d\_in}} = \sigma(\mathrm{LN}(H^{l'}) W_g) \odot \mathrm{LN}(H^{l'}) W_u,$$

$$X^l_{\mathrm{d\_out}} = X^l_{\mathrm{d\_in}} W_d,$$

$$H^{l'} = O^l + H^{l-1},\quad H^l = \mathrm{FFN}(\mathrm{LN}(H^{l'})) + H^{l'},$$

one observes five distinct stages in outlier magnitude and position:

  • Initial: Absence of extreme outliers.
  • Emergence (layer $l_E$): Spikes in $X^l_{\mathrm{d\_in}}$ propagate to $X^l_{\mathrm{d\_out}}$, creating stable outliers in $H^l$.
  • Stabilization: Intermediate layers sustain large, consistent outliers in $H^l$ and $H^{l'}$, while $X^l_{\mathrm{d\_in}}$ and $X^l_{\mathrm{d\_out}}$ subside.
  • Dissipation: Counter-spikes in $X^l_{\mathrm{d\_in}}$ nullify prior stable outliers.
  • Final: All activations return to baseline magnitudes.

These persistent, stable outliers correspond to emerging and sustained attention sinks, which significantly influence downstream attention and the sensitivity to quantization.
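To make the tracked quantities concrete, the following sketch computes $X_{\mathrm{d\_in}}$, $X_{\mathrm{d\_out}}$, and $H^l$ for one gated FFN block and reports a simple peak-to-median magnitude ratio that could be logged per layer to spot the stages listed above. It assumes $\sigma$ is SiLU and $\mathrm{LN}$ is an unweighted RMS norm, as in LLaMA-style models; the helper names and the peak-to-median statistic are illustrative choices, not taken from the paper.

```python
import numpy as np

def rms_norm(h, eps=1e-6):
    """Unweighted RMS normalization over the hidden dimension (assumed form of LN)."""
    return h / np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)

def silu(x):
    """SiLU activation, used here as the assumed sigma."""
    return x / (1.0 + np.exp(-x))

def ffn_trace(h_mid, w_g, w_u, w_d):
    """Return (X_d_in, X_d_out, H) for one block, given H^{l'} and the FFN weights."""
    x = rms_norm(h_mid)
    x_d_in = silu(x @ w_g) * (x @ w_u)   # input to the down projection
    x_d_out = x_d_in @ w_d               # output of the down projection
    h_out = x_d_out + h_mid              # residual connection
    return x_d_in, x_d_out, h_out

def peak_to_median(x):
    """Largest magnitude divided by the median magnitude; large values flag outliers."""
    a = np.abs(x)
    return float(a.max() / (np.median(a) + 1e-12))
```

Logging `peak_to_median` for each of the three tensors at every layer would be one way to look for the emergence layer $l_E$, where the ratio for $X_{\mathrm{d\_in}}$ and $X_{\mathrm{d\_out}}$ spikes first.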

3. Limitations of the Preserve-First-N Strategy

The Preserve-First-N (PFN) strategy excludes the KV entries of the first $N$ tokens from quantization. It has two core limitations:

  • Attention sinks may occur at positions beyond $N$ (e.g., token 14 in LLaMA2-7B prefill). PFN cannot detect or exclude such outliers.
  • Empirical evidence demonstrates sudden PPL degradation if a non-excluded sink emerges after position $N$, regardless of how many initial tokens are preserved. Thus, PFN is not robust against atypical sink locations and cannot guarantee PPL stability (Su et al., 6 Aug 2025).
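The first limitation is easy to see in a few lines. The sketch below treats PFN as a fixed index set and checks which sink positions it fails to cover. The sink at index 14 follows the example in the text; the sink at index 0 and the sequence length are assumptions added for illustration.

```python
def pfn_preserved(n_tokens, n_first):
    """Token indices that Preserve-First-N keeps in full precision."""
    return set(range(min(n_first, n_tokens)))

def missed_sinks(sink_positions, n_first, n_tokens):
    """Sink positions that fall outside the preserved prefix and get quantized."""
    return sorted(set(sink_positions) - pfn_preserved(n_tokens, n_first))

# Sinks at index 0 (assumed) and index 14 (from the text), with N = 5.
print(missed_sinks([0, 14], n_first=5, n_tokens=2048))  # [14]
```

Covering the late sink with PFN alone would require $N \ge 15$ here, and no fixed $N$ helps if a sink appears even later.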

4. KVSink: Algorithmic Sink Token Identification and Preservation

KVSink mitigates these limitations by dynamically predicting sink positions in the prefill phase through activation outlier identification at a fixed emergence layer $l_E$. The algorithm:

  • Selects a pre-identified outlier channel $c$ at layer $l_E$.
  • For the hidden-state tensor $H^{l_E} \in \mathbb{R}^{n \times d}$, determines the threshold $T$ as the $k$-th largest $|H^{l_E}_{i,c}|$ over token positions $i$.
  • Forms the predicted sink set $S_{\mathrm{sink}}$ from all positions $i$ where $|H^{l_E}_{i,c}| \ge T$.

Pseudocode:

S_sink ← ∅
for l = 1..L:
    Compute Q^l, K^l, V^l from H^{l-1}
    Quantize K^l[i], V^l[i] for all i ∉ S_sink
    Run attention and FFN to get H^l
    if l == l_E:
        Find threshold T as the k-th largest |H^l[:,c]|
        S_sink ← { i | |H^l[i,c]| ≥ T }
return H^L
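The thresholding step at layer $l_E$ can be written in a few lines of NumPy. This is a minimal sketch under stated assumptions: `h` stands for the hidden-state matrix $H^{l_E}$ of shape $(n, d)$, `channel` for the pre-identified outlier channel $c$, and the function name is hypothetical. Ties at the threshold are all kept, matching the $\ge T$ condition.

```python
import numpy as np

def predict_sinks(h, channel, k=5):
    """Return (sink indices, threshold T) from one outlier channel of the hidden state."""
    mags = np.abs(h[:, channel])
    k = min(k, mags.shape[0])
    # T is the k-th largest magnitude; np.partition places it at position n - k.
    threshold = np.partition(mags, -k)[-k]
    return np.flatnonzero(mags >= threshold), threshold

# Toy hidden state: channel 2 carries two large outliers at tokens 0 and 7.
h = np.zeros((10, 4))
h[:, 2] = 0.1 * np.arange(10)
h[0, 2] = 50.0
h[7, 2] = -40.0
sinks, T = predict_sinks(h, channel=2, k=2)
print(sinks, T)  # [0 7] 40.0
```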

The quantization step uses:

$$K^l_{i,\mathrm{q}} = \begin{cases} \mathrm{quantize}(K^l_i), & i \notin S_{\mathrm{sink}} \\ K^l_i, & i \in S_{\mathrm{sink}} \end{cases} \qquad V^l_{i,\mathrm{q}}~\text{analogously.}$$
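The case split can be sketched as a quantize-then-restore step. The quantizer below is a generic per-token asymmetric round-to-nearest (RTN) fake quantizer chosen for illustration; it is an assumption and not necessarily the exact scheme used in the paper's experiments. Function names are hypothetical.

```python
import numpy as np

def rtn_fake_quant(x, bits=4):
    """Per-token asymmetric round-to-nearest quantize/dequantize along the last axis."""
    levels = 2 ** bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)  # guard against constant rows
    codes = np.round((x - lo) / scale)
    return codes * scale + lo

def quantize_except_sinks(x, sink_idx, bits=4):
    """Quantize every token row of x (shape (n, d)) except the rows in sink_idx."""
    idx = np.asarray(sink_idx, dtype=np.intp)
    out = rtn_fake_quant(x, bits)
    out[idx] = x[idx]  # sink tokens stay in full precision
    return out
```

The same function would be applied to both $K^l$ and $V^l$ with the frozen sink set.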

Computational complexity is dominated by a single top-$k$ operation ($O(n \log k)$), negligible for contemporary context lengths ($n \approx 2048$, $k \approx 5$).
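One standard way to obtain the $O(n \log k)$ bound is a bounded heap of size $k$, which Python exposes through `heapq.nlargest`. The sketch below is only meant to illustrate where the bound comes from; the function name is hypothetical, and a production kernel would more likely use a framework's built-in top-$k$.

```python
import heapq

def kth_largest_magnitude(values, k):
    """Threshold T: the k-th largest |value|, found with a size-k heap in O(n log k)."""
    top_k = heapq.nlargest(k, (abs(v) for v in values))
    return top_k[-1]

print(kth_largest_magnitude([3.0, -9.0, 1.0, 7.0, -8.0], k=2))  # 8.0
```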

5. Experimental Validation and Comparative Performance

Experiments span LLaMA2-7B/13B/70B, LLaMA2-7B-chat, Mistral-7B, LLaMA3-8B, and LLaMA3.1-8B-instruct, evaluated on Wikitext-2 and C4 under several quantization schemes (RTN INT2/INT4, static/dynamic per-token and per-channel). Baselines are PFN($N$) and KVQuant.

Results Overview

| Model, bit width | FP16 PPL | PFN(5) PPL | KVSink(5) PPL |
|---|---|---|---|
| LLaMA2-70B, 4-bit | 2.5 | 59.5 | 5.0 |

Across all tested LLMs, preserving $k = 5$ sinks via KVSink recovers $>95\%$ of FP16 PPL, while PFN requires $N \gg 5$ and still fails in some cases.

For KVSink + KVQuant integration (Wikitext-2, LLaMA2-7B, 2-bit quantization, 1% outlier isolation):

  • KVQuant PPL: 5.53
  • KVSink-5 PPL: 5.44

On the C4 dataset: $6.94 \rightarrow 6.81$ (KVQuant $\rightarrow$ KVSink).

Importantly, lowering the outlier-isolation budget (to 0.1%) does not hurt PPL with KVSink, enabling more aggressive model compression.
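For scale, the relative perplexity reductions implied by the figures quoted above can be computed directly. This is plain arithmetic on the reported numbers, not an additional result; the helper name is hypothetical.

```python
def relative_ppl_reduction(baseline, improved):
    """Fractional perplexity reduction relative to the baseline."""
    return (baseline - improved) / baseline

wikitext2 = relative_ppl_reduction(5.53, 5.44)  # KVQuant -> KVSink-5, about 1.6%
c4 = relative_ppl_reduction(6.94, 6.81)         # KVQuant -> KVSink, about 1.9%
print(f"Wikitext-2: {wikitext2:.1%}, C4: {c4:.1%}")
```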

6. Guidelines for Practical Integration

KVSink integration uses a static configuration per architecture:

  • Identify $(l_E, c)$ once for each model (cf. Table 9).
  • Target $k \approx 5$ sink tokens (empirically sufficient for tested LLMs).
  • In static-quant pipelines, also exclude the sink set $S_{\mathrm{sink}}$ from calibration.
  • Implementation overhead is minimal: +0.04–0.05 ms latency per 4K-token sequence (A100) and $< 0.1$ MB of memory for storage.
  • Recommended practice: wrap the quantizer, perform outlier detection at $l_E$, and “freeze” the sink set for the remainder of inference.
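The recommended wrap, detect, and freeze pattern might look like the sketch below. The class name, method names, and calling convention are hypothetical, and the `emergence_layer` and `channel` arguments are placeholders for the per-model $(l_E, c)$ values, which this page does not list. The wrapped `quantize_fn` is assumed to return a new array of the same shape as its input.

```python
import numpy as np

class SinkAwareKVQuantizer:
    """Wraps a KV quantizer and keeps predicted sink tokens in full precision."""

    def __init__(self, quantize_fn, emergence_layer, channel, k=5):
        self.quantize_fn = quantize_fn          # must return a new array
        self.emergence_layer = emergence_layer  # l_E, model specific (placeholder)
        self.channel = channel                  # outlier channel c (placeholder)
        self.k = k
        self.sink_idx = np.empty(0, dtype=np.intp)
        self.frozen = False

    def observe(self, layer, hidden):
        """Call after each block during prefill; freezes the sink set at l_E."""
        if self.frozen or layer != self.emergence_layer:
            return
        mags = np.abs(hidden[:, self.channel])
        k = min(self.k, mags.shape[0])
        threshold = np.partition(mags, -k)[-k]  # k-th largest magnitude
        self.sink_idx = np.flatnonzero(mags >= threshold)
        self.frozen = True

    def __call__(self, kv):
        """Quantize a (n, d) K or V tensor, restoring the frozen sink rows."""
        out = self.quantize_fn(kv)
        out[self.sink_idx] = kv[self.sink_idx]
        return out
```

Before layer $l_E$ the sink set is empty, so early layers are quantized in full, which matches the pseudocode given earlier on this page.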

KVSink thus provides a methodologically grounded, low-cost approach for attention sink preservation, consistently improving quantized inference accuracy over both PFN and state-of-the-art alternatives (Su et al., 6 Aug 2025).
