KVSink: Enhancing KV Quantization in LLMs

Updated 17 December 2025

KVSink is a plug-and-play methodology for KV cache quantization that preserves critical attention sinks—token positions absorbing disproportionate attention mass.
The approach dynamically identifies sink tokens at a fixed emergence layer, excluding them from quantization to recover over 95% of FP16 perplexity.
Empirical evaluations on models such as LLaMA2 show that KVSink outperforms the Preserve-First-N (PFN) baseline, keeping perplexity stable at minimal computational overhead under aggressive KV cache quantization.

KVSink is a plug-and-play methodology for key-value (KV) cache quantization in LLM inference that targets the preservation of “attention sinks”—token positions that disproportionately absorb attention mass and introduce quantization sensitivity. It advances prior quantization pipelines by algorithmically predicting sink tokens during inference, permitting targeted exclusion from quantization and thereby maximizing perplexity (PPL) recovery with minimal overhead (Su et al., 6 Aug 2025).

1. Definition and Role of Attention Sinks

An attention sink is formally a token position $i$ that consistently aggregates an outsized fraction of attention mass across multiple heads and layers. For a single head $k$ at layer $l$, attention scores are defined by

$e^{l,k}_{t,i} = \frac{\langle q^{l,k}_t, k^{l,k}_i \rangle}{\sqrt{d_k}},\quad p^{l,k}_{t,i} = \mathrm{Softmax}_i(e^{l,k}_{t,i}),$

and the set of sinks for that head is

$S^{l,k} = \left\{ i~|~\forall t,\; p^{l,k}_{t,i} \gg \frac{1}{t} \right\}.$

Aggregating across all heads and layers,

$S = \bigcup_{l,k} S^{l,k}.$
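As a rough illustration of the per-head definition above, the following sketch flags sink positions from a matrix of raw attention scores. The helper names `causal_softmax_rows` and `find_sinks`, the `ratio` factor standing in for "$\gg$", and the `warmup` cutoff (which skips early queries where $1/t$ is too large for any test to be meaningful) are assumptions made for this example, not part of KVSink. Positions are 0-indexed, so query $t$ attends over $t+1$ tokens.

```python
import math

def causal_softmax_rows(scores):
    """Row-wise softmax over the causal prefix i <= t of each query row t."""
    probs = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]
        m = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in visible]
        z = sum(exps)
        probs.append([x / z for x in exps])
    return probs

def find_sinks(probs, ratio=4.0, warmup=4):
    """Positions whose attention mass exceeds ratio/(t+1) for every query t >= max(i, warmup)."""
    n = len(probs)
    sinks = []
    for i in range(n):
        queries = range(max(i, warmup), n)
        if queries and all(probs[t][i] > ratio / (t + 1) for t in queries):
            sinks.append(i)
    return sinks

# Toy head: every query gives position 0 a much larger logit than the rest.
scores = [[5.0 if i == 0 else 0.0 for i in range(8)] for _ in range(8)]
probs = causal_softmax_rows(scores)
sinks = find_sinks(probs)
```

In this toy head only position 0 passes the test; the union over heads and layers would then be the set union of such per-head results.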

Sink tokens $i \in S$ frequently show numerically small $v^{l,k}_i$ due to QKV suppression, so quantization error in the $V$-cache at these indices yields disproportionately large errors in attention outputs. Formally, the attention output can be decomposed as

$\mathrm{Attention}(Q,K,V)^t = \underbrace{\sum_{i \notin S} p_i^t v_i}_{\text{ordinary}} + \underbrace{\sum_{i \in S} p_i^t v_i}_{\text{sink bias}},$

where the sink bias term is nearly constant across $t$ and highly sensitive to quantization of the corresponding $v_i$.
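A toy scalar example may make this sensitivity concrete. The numbers below are invented for illustration (one sink with a small value and a large attention weight, two ordinary tokens), and `rtn` is a hypothetical round-to-nearest helper on a uniform grid rather than any specific quantizer from the paper.

```python
def rtn(x, step):
    """Round-to-nearest onto a uniform grid with spacing `step`."""
    return round(x / step) * step

def attention_output(p_row, values):
    """Weighted sum of values for one query (scalar values for simplicity)."""
    return sum(p * v for p, v in zip(p_row, values))

p_row = [0.90, 0.05, 0.05]    # position 0 is the sink and absorbs most of the mass
values = [0.02, 1.00, -1.00]  # the sink's value is numerically small
sinks = {0}

# Split the output into the two terms of the decomposition.
ordinary = sum(p * v for i, (p, v) in enumerate(zip(p_row, values)) if i not in sinks)
sink_bias = sum(p * v for i, (p, v) in enumerate(zip(p_row, values)) if i in sinks)

step = 0.5  # coarse grid, chosen so the small sink value rounds to zero
quant_all = [rtn(v, step) for v in values]
quant_keep = [v if i in sinks else rtn(v, step) for i, v in enumerate(values)]

exact = attention_output(p_row, values)
err_all = abs(attention_output(p_row, quant_all) - exact)
err_keep = abs(attention_output(p_row, quant_keep) - exact)
```

Here the ordinary tokens happen to sit exactly on the grid, so all of the error comes from the sink: quantizing it erases the sink bias term, while keeping it in full precision leaves the output unchanged.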

2. Cross-Layer Dynamics of Extreme Activation Outliers

The emergence and propagation of extreme channel-wise outliers underlie the formation of attention sinks. Tracking the following tensors in each block $l$:

$X^l_{\mathrm{d\_in}} = \sigma(\mathrm{LN}(H^{l'}) W_g) \odot \mathrm{LN}(H^{l'}) W_u,$

$X^l_{\mathrm{d\_out}} = X^l_{\mathrm{d\_in}} W_d,$

$H^{l'} = O^l + H^{l-1},\quad H^l = \mathrm{FFN}(\mathrm{LN}(H^{l'})) + H^{l'},$

one observes five distinct stages in outlier magnitude and position:

Initial: Absence of extreme outliers.
Emergence (layer $l_E$): Spikes in $X^l_{\mathrm{d\_in}}$ propagate to $X^l_{\mathrm{d\_out}}$, creating stable outliers in $H^l$.
Stabilization: Intermediate layers sustain large, consistent outliers in $H^l$ and $H^{l'}$, while $X^l_{\mathrm{d\_in}}$ and $X^l_{\mathrm{d\_out}}$ subside.
Dissipation: Counter-spikes in $X^l_{\mathrm{d\_in}}$ nullify prior stable outliers.
Final: All activations return to baseline magnitudes.

These persistent, stable outliers correspond to emerging and sustained attention sinks, which significantly influence downstream attention and the sensitivity to quantization.
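One way to look for the emergence stage in practice is to profile hidden states layer by layer and report the first layer where a single channel dwarfs the typical magnitude. The sketch below is a heuristic written for this article, not the procedure used in the paper: `find_emergence`, the `factor` threshold, and the use of the median as the "typical" magnitude are all assumptions, and layers are 0-indexed.

```python
from statistics import median

def find_emergence(hidden_by_layer, factor=50.0):
    """Return (layer, channel) for the first layer whose largest |activation|
    exceeds `factor` times that layer's median |activation|, or None if no layer does."""
    for layer, H in enumerate(hidden_by_layer):
        mags = [abs(x) for row in H for x in row]
        typical = median(mags)
        # Track the channel index alongside each magnitude so max() returns both.
        peak, channel = max((abs(x), c) for row in H for c, x in enumerate(row))
        if typical > 0 and peak > factor * typical:
            return layer, channel
    return None

# Two toy layers of shape (tokens=2, channels=2); an outlier appears in layer 1, channel 1.
hidden_by_layer = [
    [[1.0, -1.0], [0.5, 1.0]],
    [[1.0, 200.0], [0.8, -1.0]],
]
emergence = find_emergence(hidden_by_layer)
```

The returned pair plays the role of $(l_E, c)$ in the sections that follow; a real profile would be run once per model on representative prompts.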

3. Limitations of the Preserve-First-N Strategy

The Preserve-First-N (PFN) strategy excludes the first $N$ tokens’ KVs from quantization. There are two core limitations:

Attention sinks may occur at positions beyond $N$ (e.g., token 14 in LLaMA2-7B prefill). PFN cannot detect or exclude such outliers.
Empirical evidence demonstrates sudden PPL degradation if a non-excluded sink emerges after $N$, regardless of how many initial tokens are preserved. Thus, PFN is not robust against atypical sink locations and cannot guarantee PPL stability (Su et al., 6 Aug 2025).
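The coverage gap can be shown with simple set arithmetic. In the sketch below, the sink set `{0, 14}` is a hypothetical example inspired by the token-14 case mentioned above (treated here as a 0-indexed position, which is an assumption), and both helper names are made up for illustration.

```python
def pfn_preserved(n_tokens, n_keep):
    """Positions that Preserve-First-N keeps in full precision."""
    return set(range(min(n_keep, n_tokens)))

def unprotected_sinks(sinks, preserved):
    """Sink positions that would still be quantized."""
    return sorted(set(sinks) - preserved)

sinks = {0, 14}  # hypothetical: the usual first-token sink plus a late one
missed_with_5 = unprotected_sinks(sinks, pfn_preserved(32, 5))
missed_with_15 = unprotected_sinks(sinks, pfn_preserved(32, 15))
```

PFN only covers the late sink once $N$ is large enough to reach it, and since the position is input-dependent, no fixed $N$ is safe in general.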

4. KVSink: Algorithmic Sink Token Identification and Preservation

KVSink mitigates these limitations by dynamically predicting sink positions in the prefill phase through activation outlier identification at a fixed emergence layer $l_E$. The algorithm:

Selects a pre-identified outlier channel $c$ at layer $l_E$ .
For the hidden-state tensor $H^{l_E} \in \mathbb{R}^{n \times d}$, determines the threshold $T$ as the $k$-th largest $|H^{l_E}_{i,c}|$ over token positions (here $k$ is the number of sinks to preserve, not the head index $k$ in the superscripts above).
The predicted sink set $S_{\mathrm{sink}}$ comprises all positions $i$ where $|H^{l_E}_{i,c}| \ge T$.

Pseudocode:

S_sink ← ∅
for l = 1..L:
    Compute Q^l, K^l, V^l from H^{l-1}
    Quantize K^l[i], V^l[i] for all i ∉ S_sink
    Run attention and FFN to get H^l
    if l == l_E:
        Find threshold T as the k-th largest |H^l[:,c]|
        S_sink ← { i | |H^l[i,c]| ≥ T }
return H^L
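The thresholding step in the pseudocode can be sketched in plain Python as follows. The function name `predict_sinks` and the list-of-rows layout for the hidden states are assumptions for this example; `heapq.nlargest` is used because it matches the $O(n \log k)$ cost of a top-$k$ selection.

```python
import heapq

def predict_sinks(H, c, k):
    """Predict sink positions from hidden states at the emergence layer.

    H: list of n token rows, each a list of d channel values.
    c: index of the pre-identified outlier channel.
    k: number of sinks to preserve.
    """
    mags = [abs(row[c]) for row in H]
    if k <= 0 or not mags:
        return set()
    # k-th largest magnitude; if n < k this is the smallest one, so every position is kept.
    threshold = heapq.nlargest(k, mags)[-1]
    # ">=" means ties at the threshold are all included, so the set can exceed k.
    return {i for i, m in enumerate(mags) if m >= threshold}

# Toy hidden states: channel 1 carries extreme outliers at tokens 0 and 2.
H = [[0.1, 300.0], [0.2, 0.5], [0.1, -250.0], [0.3, 0.4]]
sink_set = predict_sinks(H, c=1, k=2)
```

Because the set is computed once at $l_E$ during prefill, later layers can reuse it without further selection work.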

The quantization step uses:

$K^l_{i,\mathrm{q}} = \begin{cases} \mathrm{quantize}(K^l_i), & i \notin S_{\mathrm{sink}} \\ K^l_i, & i \in S_{\mathrm{sink}} \end{cases}$

with $V^l_{i,\mathrm{q}}$ defined analogously.
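A minimal sketch of this selective step is shown below, assuming a symmetric per-token round-to-nearest quantizer that returns dequantized ("fake-quantized") floats. Both `rtn_quantize_token` and `quantize_cache` are illustrative names, and the quantizer is a stand-in for whichever scheme the surrounding pipeline actually uses.

```python
def rtn_quantize_token(vec, bits=4):
    """Symmetric per-token round-to-nearest quantization, returned as dequantized floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in vec) / qmax
    if scale == 0.0:
        return list(vec)  # an all-zero vector needs no quantization
    return [max(-qmax, min(qmax, round(x / scale))) * scale for x in vec]

def quantize_cache(cache, sinks, bits=4):
    """Quantize every token's vector except those at sink positions."""
    return [
        list(vec) if i in sinks else rtn_quantize_token(vec, bits)
        for i, vec in enumerate(cache)
    ]

# Token 0 is a sink with small values; token 1 is an ordinary token.
cache = [[0.01, -0.02, 0.03], [1.0, 0.26, -0.5]]
quantized = quantize_cache(cache, sinks={0})
```

The same function would be applied to both the $K$ and $V$ caches at each layer, with the sink set held fixed after it is predicted.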

Computational complexity is dominated by a single top-$k$ operation ($O(n \log k)$), negligible for contemporary context lengths ($n \approx 2048$, $k \approx 5$).

5. Experimental Validation and Comparative Performance

Experiments span models including LLaMA2-7B/13B/70B, LLaMA2-7B-chat, Mistral-7B, LLaMA3-8B, and LLaMA3.1-8B-instruct, evaluated on Wikitext-2 and C4 with variable quantization schemes (RTN INT2/INT4, static/dynamic per-token and per-channel). Baselines are PFN($N$) and KVQuant.

Results Overview

Model/Method	FP16 PPL	PFN(5) PPL	KVSink(5) PPL
LLaMA2-70B, 4-bit	2.5	59.5	5.0

Across all tested LLMs, preserving $k = 5$ sinks via KVSink recovers $>95\%$ of FP16 PPL, while PFN requires $N \gg 5$ and still fails in some cases.

For KVSink + KVQuant integration (Wikitext-2, LLaMA2-7B, 2-bit quantization, 1% outlier isolation):

KVQuant PPL: 5.53
KVSink-5 PPL: 5.44

On the C4 dataset, PPL improves from 6.94 (KVQuant) to 6.81 (with KVSink).
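For scale, the relative perplexity reductions implied by the figures above can be computed directly. This is plain arithmetic on the reported numbers; `relative_reduction` is a helper named for this example.

```python
def relative_reduction(before, after):
    """Fractional drop in perplexity going from `before` to `after`."""
    return (before - after) / before

wikitext2_gain = relative_reduction(5.53, 5.44)  # KVQuant vs. KVSink-5 on Wikitext-2
c4_gain = relative_reduction(6.94, 6.81)         # KVQuant vs. KVSink on C4

print(f"Wikitext-2: {wikitext2_gain:.2%}, C4: {c4_gain:.2%}")
```

Both reductions come out a little under 2%, which is modest in absolute terms but is obtained on top of an already tuned 2-bit baseline.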

Importantly, lowering the outlier-isolation budget from 1% to 0.1% does not hurt PPL when KVSink is used, enabling more aggressive KV cache compression.

6. Guidelines for Practical Integration

KVSink integration follows static configuration per architecture:

Identify $(l_E, c)$ once for each model (cf. Table 9).
Target $k \approx 5$ sink tokens (empirically sufficient for tested LLMs).
In static-quantization pipelines, also exclude the sink set $S_{\mathrm{sink}}$ from calibration.
Implementation overhead is minimal: $+$ 0.04–0.05 ms latency per 4K-token sequence (A100), $< 0.1$ MB memory for storage.
Recommended practice: wrap quantizer, perform outlier detection at $l_E$ , and “freeze” sink set for remainder of inference.

KVSink thus provides a methodologically grounded, low-cost approach for attention sink preservation, consistently improving quantized inference accuracy over both PFN and state-of-the-art alternatives (Su et al., 6 Aug 2025).


References (1)

Su et al. (6 Aug 2025). KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs.
