Separable Sink Divergence in LLM Safety

Updated 6 April 2026
  • The separable sink divergence hypothesis is a framework that uses the sign of sink divergence in attention heads to differentiate between harmful and safe pattern learning in large language models.
  • It quantifies sink divergence as the difference in attention allocated to sink tokens between harmful and safe inputs; the statistic is bimodally distributed across heads, and positive values correlate with increased model harmfulness.
  • The framework underpins the Surgery defense, which regularizes positive sink divergence during fine-tuning to reduce harmful behaviors while maintaining performance.

The separable sink divergence hypothesis is an empirical and theoretical framework for understanding and mitigating harmful fine-tuning in LLMs via the behavior of internal attention mechanisms. It posits that attention heads implicated in learning harmful patterns can be linearly separated from those supporting safe refusal by the sign of a statistic called sink divergence, thus enabling targeted defenses such as Surgery that regularize this property during further fine-tuning (Liu et al., 5 Feb 2026).

1. Definition and Computation of Sink Divergence

Sink divergence is defined at the level of individual attention heads within a transformer architecture. For a given model input sequence, an attention sink token is the position $k$ that receives the highest aggregate attention across all heads and source positions. For each attention head $h$ and input $\mathbf X$, the mean attention weight it assigns to the sink token is:

$$\alpha_{h}(\mathbf X) = \frac{1}{|\mathcal N_{k}|} \sum_{i\in\mathcal N_{k}} A_{h,k,i}(\mathbf X)$$

where $k$ is the sink token index and $\mathcal N_{k}$ is the set of non-sink positions.

Given batches of harmful (malicious) samples $\mathbf X_{m}$ and safe/refusal samples $\mathbf X_{r}$, the sink divergence for head $h$ is:

$$D_{\mathrm{sink}}(h) = d_{h} = \alpha_{h}(\mathbf X_{m}) - \alpha_{h}(\mathbf X_{r})$$

This scalar statistic quantifies the change in how each head allocates attention to the sink token between harmful and benign data.
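
The two definitions above can be sketched in plain Python. This is an illustrative reading rather than the authors' implementation: the layout `attn[head][query][key]`, the per-sample choice of sink position, and the function names `find_sink`, `sink_attention`, and `sink_divergence` are all assumptions, and the toy attention values are made up.

```python
from typing import List

# Assumed layout: attn[head][query position][key position], rows sum to 1.
Attention = List[List[List[float]]]

def find_sink(attn: Attention) -> int:
    """Key position receiving the most attention, summed over heads and queries."""
    seq_len = len(attn[0])
    totals = [0.0] * seq_len
    for head in attn:
        for row in head:
            for key, weight in enumerate(row):
                totals[key] += weight
    return max(range(seq_len), key=totals.__getitem__)

def sink_attention(attn: Attention) -> List[float]:
    """alpha_h: mean weight that non-sink positions place on the sink, per head."""
    k = find_sink(attn)
    non_sink = [i for i in range(len(attn[0])) if i != k]
    return [sum(head[i][k] for i in non_sink) / len(non_sink) for head in attn]

def sink_divergence(harmful: List[Attention], safe: List[Attention]) -> List[float]:
    """d_h: batch-mean alpha_h on harmful inputs minus batch-mean alpha_h on safe inputs."""
    def batch_mean(batch: List[Attention]) -> List[float]:
        per_sample = [sink_attention(a) for a in batch]
        return [sum(col) / len(col) for col in zip(*per_sample)]
    return [m - r for m, r in zip(batch_mean(harmful), batch_mean(safe))]

# Toy data (hypothetical): 2 heads, 3 positions, causal attention rows.
harmful_sample = [
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.8, 0.1, 0.1]],  # head 0
    [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.4, 0.3, 0.3]],  # head 1
]
safe_sample = [
    [[1.0, 0.0, 0.0], [0.7, 0.3, 0.0], [0.6, 0.2, 0.2]],  # head 0
    [[1.0, 0.0, 0.0], [0.7, 0.3, 0.0], [0.6, 0.2, 0.2]],  # head 1
]
d = sink_divergence([harmful_sample], [safe_sample])
```

On this toy input, position 0 is the sink in both samples, head 0 ends up with a divergence of roughly 0.2 and head 1 with roughly -0.2. In practice the attention maps would come from a model's forward pass rather than hand-written lists.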

2. Empirical Analysis: Bimodality and Correlation with Model Harmfulness

Upon measuring $d_h$ for each attention head over large batches from $\mathbf X_{m}$ and $\mathbf X_{r}$, the distribution displays a marked bimodality: roughly half of the heads have $d_h > 0$ (positive sink divergence), and half have $d_h < 0$ (negative sink divergence). Empirical investigation reveals several key phenomena:

  • As the proportion of harmful data in fine-tuning increases, both the model’s harmfulness (quantified by the Harmful Score, HS) and the number of heads with $d_h > 0$ rise correspondingly (e.g., for the Lisa baseline, heads with $d_h > 0$ increase from 553 to 580 as the proportion is raised from 0 to 0.5).
  • Disabling attention heads with $d_h > 0$ after fine-tuning causes HS to decrease, while disabling heads with $d_h < 0$ increases HS.

These findings establish that positive-sink heads support the acquisition and expression of unsafe behaviors, while negative-sink heads reinforce safe refusal (Liu et al., 5 Feb 2026).

3. Formal Statement of the Separable Sink Divergence Hypothesis

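The hypothesis asserts that attention heads divide into two functional roles according to the sign of their sink divergence. Denoting

$$\mathcal H_{+} = \{h : d_h > 0\}, \qquad \mathcal H_{-} = \{h : d_h < 0\},$$

the hypothesis posits that heads in $\mathcal H_{+}$ support the learning of harmful patterns, while heads in $\mathcal H_{-}$ support safe refusal. This partition offers a mechanistically interpretable lens on harmful pattern acquisition: heads responsible for harmful learning are separable by a simple sign criterion on $d_h$.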
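
As a small illustration of the sign criterion, the sketch below splits heads into a positive-divergence group and a negative-divergence group. The `(layer, head)` keying, the function name `partition_by_sign`, and the example values are hypothetical; heads whose divergence is exactly zero are left in neither group, which is a choice made here and not something the source specifies.

```python
from typing import Dict, List, Tuple

Head = Tuple[int, int]  # assumed identifier: (layer index, head index)

def partition_by_sign(divergence: Dict[Head, float]) -> Tuple[List[Head], List[Head]]:
    """Split heads by the sign of their sink divergence d_h."""
    positive = sorted(h for h, d in divergence.items() if d > 0)
    negative = sorted(h for h, d in divergence.items() if d < 0)
    return positive, negative

# Hypothetical divergence values for four heads.
example = {(0, 0): 0.12, (0, 1): -0.05, (1, 0): -0.30, (1, 1): 0.02}
positive_heads, negative_heads = partition_by_sign(example)
```

Counting `len(positive_heads)` before and after fine-tuning would correspond to the kind of head counts reported in Section 2.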

4. Sink Divergence Regularization: The Surgery Defense

Leveraging the separable sink divergence property, the Surgery defense is formulated as a regularization-based method for safe fine-tuning. The training objective minimizes the sum of standard cross-entropy loss and a regularizer that penalizes positive sink divergence:

$$\mathcal L = \mathcal L_{\mathrm{CE}} + \lambda \sum_{h \in \mathcal H} [d_h]_{+}$$

where $\mathcal L_{\mathrm{CE}}$ is the cross-entropy loss, $\mathcal H$ is the set of all heads, $[x]_{+} = \max(0, x)$, and $\lambda$ is a weighting hyperparameter.

This penalty explicitly steers all heads toward the negative-sink regime, thereby suppressing the emergence of harmful pattern heads. Backpropagation through the attention computations enables head-specific modulation of attention allocation on harmful vs. safe data during fine-tuning.
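
A minimal numeric sketch of such an objective follows. It assumes a hinge-style penalty, max(0, d_h) summed over heads and scaled by a weight, which is one natural way to penalize only positive sink divergence; the exact form used by Surgery may differ. The function name `surgery_style_loss` and the parameter `lam` are hypothetical, and a real training loop would compute this inside an automatic-differentiation framework so that gradients reach the attention weights.

```python
from typing import Sequence

def surgery_style_loss(ce_loss: float, divergences: Sequence[float], lam: float) -> float:
    """Cross-entropy plus a weighted penalty on positive sink divergence only.

    Assumed form: ce_loss + lam * sum(max(0, d_h)). Heads with d_h <= 0
    contribute nothing, so only positive-divergence heads are pushed down.
    """
    penalty = sum(max(0.0, d) for d in divergences)
    return ce_loss + lam * penalty

# Hypothetical values: two positive-divergence heads and one negative.
total = surgery_style_loss(ce_loss=2.0, divergences=[0.2, -0.3, 0.1], lam=0.5)
```

With these inputs only the heads at 0.2 and 0.1 are penalized, so the total is about 2.15; the head at -0.3 leaves the loss unchanged.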

5. Experimental Validation and Mechanistic Evidence

Empirical evaluation on several safety benchmarks demonstrates the efficacy of Surgery:

| Benchmark | HS, Surgery | HS, Baseline (Lisa) | Improvement |
|---|---|---|---|
| BeaverTails | 8.90% | 14.80% | +5.90 pp |
| HarmBench | 9.50% | 20.75% | +11.25 pp |
| SorryBench | 12.95% | 22.50% | +9.55 pp |

  • When harmful samples are mixed into the fine-tuning data (GSM8K task), Surgery reduces harmfulness below all tested baselines while matching their fine-tuning accuracy (approximately 68.5%).
  • Post-Surgery, more than 96% of heads originally in the positive-divergence group ($d_h > 0$) transition to the negative-divergence group ($d_h < 0$), and sink values on harmful examples decrease layer-wise, indicating a systematic reallocation of attention away from harmful patterns.
  • Disabling heads with $d_h > 0$ post hoc reliably reduces HS, while disabling heads with $d_h < 0$ increases HS.
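
The improvement figures in the table above are differences in percentage points (baseline HS minus Surgery HS), not relative reductions. The short check below recomputes them from the reported scores; the dictionary layout and the helper name `hs_reduction_pp` are illustrative only.

```python
# Reported Harmful Scores in percent: (Surgery, Lisa baseline).
reported = {
    "BeaverTails": (8.90, 14.80),
    "HarmBench": (9.50, 20.75),
    "SorryBench": (12.95, 22.50),
}

def hs_reduction_pp(surgery_hs: float, baseline_hs: float) -> float:
    """Absolute HS reduction in percentage points, rounded to two decimals."""
    return round(baseline_hs - surgery_hs, 2)

reductions = {name: hs_reduction_pp(s, b) for name, (s, b) in reported.items()}
```

The recomputed values match the table: 5.90, 11.25, and 9.55 percentage points.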

6. Model-Wide Robustness, Generalization, and Limitations

Surgery exploits the intrinsic phenomenon of attention sinks, sidestepping external data or architectural modifications. Because attention sinks are a ubiquitous property of transformer LLMs, Surgery generalizes across models of varying architecture and size, including Llama3-8B, Gemma2-9B, and Qwen2-14B.

However, some early-layer heads are relatively resistant to regularization, suggesting that further improvements might be attainable with layer-specific penalties. Experimental support is presently restricted to open-source models in the 8–14B parameter regime; applicability to larger (50–100B) or proprietary models represents an open direction.

7. Implications for LLM Safety and Mechanistic Interpretability

The separable sink divergence hypothesis provides both a discriminative metric for identifying heads implicated in harmful pattern learning and a mechanism for direct intervention at the architectural level. Surgery achieves high efficacy without recourse to costly data selection or projection-based defenses. A plausible implication is that attention sink statistics offer a generic axis for diagnosing and controlling problem behaviors in large transformer models. Further extensions may address the small subset of resistant heads or evaluate transferability to other safety-critical domains (Liu et al., 5 Feb 2026).
