Separable Sink Divergence in LLM Safety

Updated 6 April 2026

The separable sink divergence hypothesis is a framework that uses the sign of sink divergence in attention heads to distinguish heads that learn harmful patterns from heads that support safe behavior in large language models (LLMs).
Sink divergence is the difference in the attention a head allocates to the sink token on harmful versus safe inputs. Across heads it is bimodally distributed, and a larger number of positive-divergence heads correlates with increased model harmfulness.
The framework underpins the Surgery defense, which penalizes positive sink divergence during fine-tuning to reduce harmful behaviors while maintaining task performance.

The separable sink divergence hypothesis is an empirical and theoretical framework for understanding and mitigating harmful fine-tuning in LLMs via the behavior of internal attention mechanisms. It posits that attention heads implicated in learning harmful patterns can be linearly separated from those supporting safe refusal by the sign of a statistic called sink divergence, thus enabling targeted defenses such as Surgery that regularize this property during further fine-tuning (Liu et al., 5 Feb 2026).

1. Definition and Computation of Sink Divergence

Sink divergence is defined at the level of individual attention heads within a transformer architecture. For a given input sequence, the attention sink token is the position $k$ that receives the highest aggregate attention across all heads and source positions. For each attention head $h$ and input $\mathbf X$, the mean attention weight it assigns to the sink token is $\alpha_{h}(\mathbf X) = \frac{1}{|\mathcal N_{k}|} \sum_{i\in\mathcal N_{k}} A_{h,k,i}(\mathbf X)$, where $k$ is the sink token index, $\mathcal N_{k}$ is the set of non-sink positions, and $A_{h,k,i}(\mathbf X)$ is the attention weight that head $h$ assigns to the sink from position $i$.

Given batches of harmful (malicious) samples $\mathbf X_{m}$ and safe (refusal) samples $\mathbf X_{r}$, the sink divergence for head $h$ is $D_{\mathrm{sink}}(h) = d_{h} = \alpha_{h}(\mathbf X_{m}) - \alpha_{h}(\mathbf X_{r})$. This scalar quantifies how much more attention a head allocates to the sink token on harmful data than on safe data.
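As a rough illustration of the two definitions above, the following pure-Python sketch computes $\alpha_{h}$ and $d_{h}$ from attention weights. The layout `attn[h][i][j]` (head, query position, key position), the function names, and the choice to locate the sink separately for each sample and then average $\alpha_{h}$ over the batch are assumptions made for this example, not details confirmed by the source. The toy attention values are invented.

```python
from typing import List, Sequence

# Assumed layout: attn[h][i][j] = weight head h puts on key j when processing query i.
Attention = List[List[List[float]]]


def find_sink(attn: Attention) -> int:
    """Key position with the highest attention summed over all heads and queries."""
    n_keys = len(attn[0][0])
    totals = [0.0] * n_keys
    for head in attn:
        for row in head:
            for j, w in enumerate(row):
                totals[j] += w
    return max(range(n_keys), key=totals.__getitem__)


def sink_attention(attn: Attention, k: int) -> List[float]:
    """alpha_h: mean attention each head sends to sink k from the non-sink positions."""
    alphas = []
    for head in attn:
        non_sink = [i for i in range(len(head)) if i != k]
        alphas.append(sum(head[i][k] for i in non_sink) / len(non_sink))
    return alphas


def sink_divergence(harmful: Sequence[Attention], refusal: Sequence[Attention]) -> List[float]:
    """d_h = alpha_h(X_m) - alpha_h(X_r), with alpha_h averaged over each batch."""
    def batch_mean(batch: Sequence[Attention]) -> List[float]:
        per_sample = [sink_attention(a, find_sink(a)) for a in batch]
        n_heads = len(per_sample[0])
        return [sum(s[h] for s in per_sample) / len(per_sample) for h in range(n_heads)]

    a_m, a_r = batch_mean(harmful), batch_mean(refusal)
    return [m - r for m, r in zip(a_m, a_r)]


# Invented example: one sample per batch, 2 heads, 3 positions, causal attention.
harmful = [[
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.8, 0.1, 0.1]],
    [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.4, 0.3, 0.3]],
]]
refusal = [[
    [[1.0, 0.0, 0.0], [0.7, 0.3, 0.0], [0.6, 0.2, 0.2]],
    [[1.0, 0.0, 0.0], [0.7, 0.3, 0.0], [0.6, 0.2, 0.2]],
]]
d = sink_divergence(harmful, refusal)
print(d)  # head 0 is positive-sink, head 1 is negative-sink
```

In practice the attention weights would come from a forward pass of the model; plain lists are used here only to keep the sketch dependency-free.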

2. Empirical Analysis: Bimodality and Correlation with Model Harmfulness

Upon measuring $d_{h}$ for each attention head over large batches from $\mathbf X_{m}$ and $\mathbf X_{r}$, the distribution displays a marked bimodality: roughly half of the heads have $d_{h} > 0$ (positive sink divergence), and half have $d_{h} < 0$. Empirical investigation reveals two key phenomena:

As the proportion of harmful data in fine-tuning increases, both the model’s harmfulness (quantified by the Harmful Score, HS) and the number of heads with $d_{h} > 0$ rise correspondingly (e.g., for the Lisa baseline, heads with $d_{h} > 0$ increase from 553 to 580 as the proportion is raised from 0 to 0.5).
Disabling attention heads with $d_{h} > 0$ after fine-tuning causes HS to decrease, while disabling heads with $d_{h} < 0$ increases HS.

These findings indicate that positive-sink heads support the acquisition and expression of unsafe behaviors, while negative-sink heads reinforce safe refusal (Liu et al., 5 Feb 2026).
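The head-disabling experiment described above can be sketched as follows. This is a simplified illustration, not the source's procedure: it assumes each head's contribution is already a vector in a shared output space (so that combining heads is a plain sum), and the function names, divergence values, and head outputs are all hypothetical.

```python
from typing import Dict, List


def split_by_sign(d: List[float]) -> Dict[str, List[int]]:
    """Group head indices by the sign of their sink divergence d_h."""
    return {
        "positive": [h for h, v in enumerate(d) if v > 0],
        "negative": [h for h, v in enumerate(d) if v < 0],
    }


def ablation_mask(n_heads: int, disabled: List[int]) -> List[float]:
    """Per-head multiplier: 0.0 for disabled heads, 1.0 otherwise."""
    off = set(disabled)
    return [0.0 if h in off else 1.0 for h in range(n_heads)]


def combine_heads(head_outputs: List[List[float]], mask: List[float]) -> List[float]:
    """Sum the per-head output vectors after scaling each by its mask entry."""
    dim = len(head_outputs[0])
    return [sum(m * out[j] for m, out in zip(mask, head_outputs)) for j in range(dim)]


d = [0.20, -0.15, 0.05, -0.30]               # invented divergences for four heads
groups = split_by_sign(d)
mask = ablation_mask(len(d), groups["positive"])
outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.5, 0.5]]
print(groups)                                # head indices per sign group
print(combine_heads(outputs, mask))          # only heads 1 and 3 contribute
```

Passing `groups["negative"]` to `ablation_mask` instead gives the complementary ablation, which the text reports as increasing HS.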

3. Formal Statement of the Separable Sink Divergence Hypothesis

The hypothesis asserts that attention heads can be separated by functional role according to the sign of their sink divergence. Denoting

$\mathcal H^{+} = \{h : d_{h} > 0\}, \quad \mathcal H^{-} = \{h : d_{h} < 0\},$

the hypothesis posits that heads in $\mathcal H^{+}$ drive the learning of harmful patterns, while heads in $\mathcal H^{-}$ support safe refusal. This partition offers a mechanistically interpretable lens on harmful pattern acquisition: heads responsible for harmful learning are separable by a simple sign criterion on $d_{h}$.

4. Sink Divergence Regularization: The Surgery Defense

Leveraging the separable sink divergence property, the Surgery defense is formulated as a regularization-based method for safe fine-tuning. The training objective minimizes the sum of the standard cross-entropy loss and a regularizer that penalizes positive sink divergence: $\mathcal L = \mathcal L_{\mathrm{CE}} + \lambda \sum_{h\in\mathcal H} \max(0, d_{h})$, where $\mathcal L_{\mathrm{CE}}$ is the cross-entropy loss, $\mathcal H$ is the set of all heads, $d_{h} = \alpha_{h}(\mathbf X_{m}) - \alpha_{h}(\mathbf X_{r})$, and $\lambda$ is a weighting hyperparameter.

This penalty explicitly steers all heads toward the negative-sink regime, thereby suppressing the emergence of harmful pattern heads. Backpropagation through the attention computations enables head-specific modulation of attention allocation on harmful vs. safe data during fine-tuning.
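To make the shape of such a regularizer concrete, here is a minimal sketch that assumes a hinge-style penalty, $\lambda \sum_{h} \max(0, d_{h})$, which is one simple way to penalize only positive sink divergence. The exact functional form used by Surgery may differ; the function names and the numbers are hypothetical.

```python
from typing import List, Tuple


def sink_penalty(d: List[float], lam: float) -> Tuple[float, List[float]]:
    """Assumed hinge penalty lam * sum_h max(0, d_h) and its subgradient w.r.t. each d_h."""
    penalty = lam * sum(max(0.0, v) for v in d)
    # Only heads with d_h > 0 receive a gradient; negative-sink heads are left alone.
    grad = [lam if v > 0 else 0.0 for v in d]
    return penalty, grad


def total_loss(ce_loss: float, d: List[float], lam: float) -> float:
    """Cross-entropy loss plus the sink divergence penalty."""
    penalty, _ = sink_penalty(d, lam)
    return ce_loss + penalty


d = [0.20, -0.10, 0.05]                  # invented per-head sink divergences
penalty, grad = sink_penalty(d, lam=0.5)
print(penalty, grad)
print(total_loss(ce_loss=1.0, d=d, lam=0.5))
```

With an automatic differentiation framework, the penalty would be computed from the live attention weights and added to the training loss, so the subgradient shown here would not need to be written by hand.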

5. Experimental Validation and Mechanistic Evidence

Empirical evaluation on several safety benchmarks demonstrates the efficacy of Surgery:

Benchmark	HS with Surgery	HS with Lisa baseline	HS reduction
BeaverTails	8.90%	14.80%	5.90 pp
HarmBench	9.50%	20.75%	11.25 pp
SorryBench	12.95%	22.50%	9.55 pp
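The last column of the table is the difference between the baseline and Surgery Harmful Scores in percentage points. The short check below recomputes it from the two HS columns and also derives the relative reduction, which is not reported in the source and is shown only as a derived quantity. The helper name `hs_reduction` is made up for this example.

```python
# Benchmark: (HS with Surgery, HS with Lisa baseline), both in percent, from the table.
hs = {
    "BeaverTails": (8.90, 14.80),
    "HarmBench": (9.50, 20.75),
    "SorryBench": (12.95, 22.50),
}


def hs_reduction(surgery: float, baseline: float):
    """Absolute reduction in percentage points and relative reduction in percent."""
    absolute_pp = baseline - surgery
    relative_pct = 100.0 * absolute_pp / baseline
    return round(absolute_pp, 2), round(relative_pct, 1)


reductions = {name: hs_reduction(*scores) for name, scores in hs.items()}
for name, (pp, rel) in reductions.items():
    print(f"{name}: {pp:.2f} pp lower HS ({rel:.1f}% relative to Lisa)")
```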

When the fine-tuning mix for the GSM8K task contains harmful samples, Surgery reduces harmfulness below all tested baselines while matching their fine-tuning accuracy (approximately 68.5%).
Post-Surgery, more than 96% of heads that originally had $d_{h} > 0$ transition to $d_{h} < 0$, and sink values on harmful examples decrease layer-wise, indicating a systematic reallocation of attention away from harmful patterns.
Disabling heads with $d_{h} > 0$ post hoc reliably reduces HS, while disabling heads with $d_{h} < 0$ increases HS.
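The transition statistic mentioned above (the share of originally positive-sink heads that end up negative) could be measured along these lines. This is a sketch under the assumption that per-head divergences are available before and after the defense is applied; `flip_rate` and the example values are hypothetical.

```python
from typing import List


def flip_rate(d_before: List[float], d_after: List[float]) -> float:
    """Fraction of heads with d_h > 0 before that have d_h < 0 after."""
    positive = [h for h, v in enumerate(d_before) if v > 0]
    if not positive:
        return 0.0  # no positive-sink heads to flip
    flipped = sum(1 for h in positive if d_after[h] < 0)
    return flipped / len(positive)


d_before = [0.20, 0.10, -0.30, 0.05]     # invented divergences before the defense
d_after = [-0.10, -0.20, -0.40, 0.01]    # invented divergences after the defense
print(flip_rate(d_before, d_after))      # two of the three positive heads flipped
```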

6. Model-Wide Robustness, Generalization, and Limitations

Surgery exploits the intrinsic phenomenon of attention sinks, sidestepping external data or architectural modifications. Because attention sinks are a ubiquitous property of transformer LLMs, Surgery generalizes across models of varying architecture and size, including Llama3-8B, Gemma2-9B, and Qwen2-14B.

However, some early-layer heads are relatively resistant to regularization, suggesting that further improvements might be attainable with layer-specific penalties. Experimental support is presently restricted to open-source models in the 8–14B parameter regime; applicability to larger (50–100B) or proprietary models represents an open direction.

7. Implications for LLM Safety and Mechanistic Interpretability

The separable sink divergence hypothesis provides both a discriminative metric for identifying heads implicated in harmful pattern learning and a mechanism for direct intervention at the architectural level. Surgery achieves high efficacy without recourse to costly data selection or projection-based defenses. A plausible implication is that attention sink statistics offer a generic axis for diagnosing and controlling problem behaviors in large transformer models. Further extensions may address the small subset of resistant heads or evaluate transferability to other safety-critical domains (Liu et al., 5 Feb 2026).


References

Liu et al. Surgery: Mitigating Harmful Fine-Tuning for Large Language Models via Attention Sink (5 Feb 2026)
