Separable Sink Divergence in LLM Safety
- The separable sink divergence hypothesis is a framework that uses the sign of sink divergence in attention heads to differentiate between harmful and safe pattern learning in large language models.
- It quantifies sink divergence as the difference in the attention a head allocates to the sink token on harmful versus safe data; the per-head values are bimodally distributed, and positive values correlate with increased model harmfulness.
- The framework underpins the Surgery defense, which regularizes positive sink divergence during fine-tuning to reduce harmful behaviors while maintaining performance.
The separable sink divergence hypothesis is an empirical and theoretical framework for understanding and mitigating harmful fine-tuning in LLMs via the behavior of internal attention mechanisms. It posits that attention heads implicated in learning harmful patterns can be linearly separated from those supporting safe refusal by the sign of a statistic called sink divergence, thus enabling targeted defenses such as Surgery that regularize this property during further fine-tuning (Liu et al., 5 Feb 2026).
1. Definition and Computation of Sink Divergence
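Sink divergence is defined at the level of individual attention heads within a transformer architecture. For a given input sequence, the attention sink token is the position that receives the highest aggregate attention across all heads and source positions. For each attention head $h$ and input $x$, the mean attention weight it assigns to the sink token is:

$$S_h(x) = \frac{1}{|\mathcal{N}|} \sum_{i \in \mathcal{N}} A_h(x)_{i,s},$$

where $A_h(x)$ is the attention matrix of head $h$ on input $x$, $s$ is the sink token index, and $\mathcal{N}$ is the set of non-sink positions.

Given batches of harmful (malicious) samples $\mathcal{D}_{\text{harm}}$ and safe/refusal samples $\mathcal{D}_{\text{safe}}$, the sink divergence for head $h$ is:

$$\Delta_h = \mathbb{E}_{x \in \mathcal{D}_{\text{harm}}}\left[S_h(x)\right] - \mathbb{E}_{x \in \mathcal{D}_{\text{safe}}}\left[S_h(x)\right].$$

This scalar statistic quantifies the change in how each head allocates attention to the sink token between harmful and benign data.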
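The definitions above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes attention weights for one layer are available as an array of shape (batch, heads, query, key), that the sink value averages attention to the sink over non-sink query positions, and that the divergence is taken as harmful minus safe. The function names are hypothetical.

```python
import numpy as np

def sink_index(attn):
    """Per-sample sink position.

    attn: array of shape (batch, heads, query, key); each row over `key` sums to 1.
    """
    # Aggregate attention received by each key position over heads and queries.
    received = attn.sum(axis=(1, 2))            # (batch, key)
    return received.argmax(axis=-1)             # (batch,)

def sink_value(attn):
    """Mean attention each head gives the sink from non-sink query positions.

    Assumes self-attention (query length == key length) with at least two tokens.
    Returns an array of shape (batch, heads).
    """
    batch, _, q_len, _ = attn.shape
    s = sink_index(attn)
    # Attention from every query position to the sink key: (batch, heads, query).
    to_sink = attn[np.arange(batch), :, :, s]
    # Exclude the sink's own query row from the average.
    non_sink = np.arange(q_len)[None, :] != s[:, None]       # (batch, query)
    total = (to_sink * non_sink[:, None, :]).sum(axis=-1)
    return total / non_sink.sum(axis=-1)[:, None]

def sink_divergence(attn_harm, attn_safe):
    """Per-head divergence: mean sink value on harmful minus safe samples."""
    return sink_value(attn_harm).mean(axis=0) - sink_value(attn_safe).mean(axis=0)
```

In practice the attention arrays would come from a forward pass with attention outputs enabled, and the computation would be repeated per layer so that every (layer, head) pair receives its own divergence value.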
2. Empirical Analysis: Bimodality and Correlation with Model Harmfulness
Upon measuring the sink divergence $\Delta_h$ for each attention head $h$ over large batches of harmful samples $\mathcal{D}_{\text{harm}}$ and safe samples $\mathcal{D}_{\text{safe}}$, the distribution displays a marked bimodality: roughly half of the heads have $\Delta_h > 0$ (positive sink divergence), and half have $\Delta_h < 0$. Empirical investigation reveals several key phenomena:
- As the proportion $p$ of harmful data in fine-tuning increases, both the model's harmfulness (quantified by the Harmful Score, HS) and the number of heads with $\Delta_h > 0$ rise correspondingly (e.g., for the Lisa baseline, heads with $\Delta_h > 0$ increase from 553 to 580 as $p$ is raised from 0 to 0.5).
- Disabling attention heads with $\Delta_h > 0$ after fine-tuning causes HS to decrease, while disabling heads with $\Delta_h < 0$ increases HS.
These findings establish that positive-sink heads support the acquisition and expression of unsafe behaviors, while negative-sink heads reinforce safe refusal (Liu et al., 5 Feb 2026).
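The head-disabling experiment can be illustrated with a small sketch. The source does not say how heads are disabled; zeroing a head's output before the output projection is assumed here as one common choice. The divergence values and the `ablate_heads` helper are hypothetical.

```python
import numpy as np

def ablate_heads(head_outputs, disabled):
    """Zero the outputs of selected heads.

    head_outputs: (batch, heads, seq, d_head), taken before the output projection.
    disabled: iterable of head indices to switch off.
    """
    keep = np.ones(head_outputs.shape[1], dtype=head_outputs.dtype)
    keep[np.asarray(list(disabled), dtype=int)] = 0
    return head_outputs * keep[None, :, None, None]

# Hypothetical per-head sink divergences for a 4-head layer.
delta = np.array([0.30, -0.20, 0.05, -0.40])
positive_heads = np.flatnonzero(delta > 0)   # heads associated with harmful patterns
negative_heads = np.flatnonzero(delta < 0)   # heads associated with safe refusal

# Toy per-head outputs; a real run would use activations from the model.
outputs = np.ones((1, 4, 2, 3))
without_positive = ablate_heads(outputs, positive_heads)
```

Comparing the Harmful Score of the model with `positive_heads` ablated against the score with `negative_heads` ablated would reproduce the shape of the experiment described above.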
3. Formal Statement of the Separable Sink Divergence Hypothesis
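The hypothesis asserts that the functional roles of attention heads are separable by the sign of their sink divergence $\Delta_h$. Denoting

$$\mathcal{H}^{+} = \{h : \Delta_h > 0\}, \qquad \mathcal{H}^{-} = \{h : \Delta_h < 0\},$$

the hypothesis posits that heads in $\mathcal{H}^{+}$ drive the learning of harmful patterns, whereas heads in $\mathcal{H}^{-}$ support safe refusal. This partition offers a mechanistically interpretable lens on harmful pattern acquisition: heads responsible for harmful learning are separable by a simple sign criterion on $\Delta_h$.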
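As an illustration of the sign criterion, the sketch below splits heads into a positive-divergence set and a negative-divergence set, assuming the divergences are stored as a (layers, heads) array. The function name and the handling of exactly-zero values (assigned to neither set) are assumptions made for this example.

```python
import numpy as np

def partition_heads(delta):
    """Split heads by the sign of their sink divergence.

    delta: (layers, heads) array of per-head sink divergences.
    Returns two lists of (layer, head) tuples: positive set, negative set.
    Heads with exactly zero divergence fall in neither set.
    """
    def to_pairs(mask):
        return [(int(layer), int(head)) for layer, head in np.argwhere(mask)]
    return to_pairs(delta > 0), to_pairs(delta < 0)

# Hypothetical divergences for a 2-layer, 2-head model.
delta = np.array([[0.2, -0.1],
                  [-0.3, 0.0]])
h_plus, h_minus = partition_heads(delta)
```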
4. Sink Divergence Regularization: The Surgery Defense
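Leveraging the separable sink divergence property, the Surgery defense is formulated as a regularization-based method for safe fine-tuning. The training objective minimizes the sum of standard cross-entropy loss and a regularizer that penalizes positive sink divergence:

$$\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda \sum_{h \in \mathcal{H}} \left[\Delta_h\right]_{+},$$

where $\mathcal{L}_{\text{CE}}$ is the cross-entropy loss, $\mathcal{H}$ is the set of all heads, $\Delta_h$ is the sink divergence of head $h$, $[z]_{+} = \max(z, 0)$, and $\lambda$ is a weighting hyperparameter.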
This penalty explicitly steers all heads toward the negative-sink regime, thereby suppressing the emergence of harmful pattern heads. Backpropagation through the attention computations enables head-specific modulation of attention allocation on harmful vs. safe data during fine-tuning.
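A minimal sketch of such an objective is given below. It assumes the regularizer is a sum of hinge terms max(Δ, 0) over heads, which is one natural reading of "penalizes positive sink divergence"; the exact form used by Surgery may differ. The function names, the example values, and the weight `lam` are hypothetical.

```python
import numpy as np

def sink_penalty(delta):
    """Sum of positive parts of the per-head sink divergences (assumed form)."""
    return np.maximum(delta, 0.0).sum()

def surgery_objective(ce_loss, delta, lam):
    """Cross-entropy loss plus the weighted sink-divergence penalty."""
    return ce_loss + lam * sink_penalty(delta)

def sink_penalty_subgradient(delta):
    """Subgradient of the penalty with respect to each head's divergence.

    Only heads with positive divergence receive a push; heads that are
    already negative are left untouched by the regularizer.
    """
    return (delta > 0).astype(delta.dtype)

# Hypothetical values for three heads.
delta = np.array([0.5, -0.2, 0.1])
loss = surgery_objective(ce_loss=2.0, delta=delta, lam=10.0)
grad = sink_penalty_subgradient(delta)
```

In real training the divergence would be computed inside an automatic-differentiation framework so that the gradient flows back through the attention weights; the hand-written subgradient here only shows which heads the penalty acts on.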
5. Experimental Validation and Mechanistic Evidence
Empirical evaluation on several safety benchmarks demonstrates the efficacy of Surgery:
| Benchmark | HS, Surgery | HS, Lisa baseline | HS reduction |
|---|---|---|---|
| BeaverTails | 8.90% | 14.80% | 5.90 pp |
| HarmBench | 9.50% | 20.75% | 11.25 pp |
| SorryBench | 12.95% | 22.50% | 9.55 pp |
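The reductions in the table can be checked with simple arithmetic. The short script below recomputes the absolute reduction in percentage points from the two HS columns and also derives a relative reduction, which is not reported in the source and is included here only as a derived figure.

```python
# (Surgery HS, Lisa baseline HS) in percent, copied from the table above.
results = {
    "BeaverTails": (8.90, 14.80),
    "HarmBench": (9.50, 20.75),
    "SorryBench": (12.95, 22.50),
}

def hs_reduction(surgery, baseline):
    """Return (absolute reduction in pp, relative reduction in %)."""
    absolute = baseline - surgery
    return round(absolute, 2), round(100.0 * absolute / baseline, 1)

for name, (surgery, baseline) in results.items():
    print(name, hs_reduction(surgery, baseline))
```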
- When the fine-tuning mix contains a proportion $p$ of harmful samples ($n$ samples in total, GSM8K task), Surgery reduces harmfulness below that of all tested baselines while matching their fine-tuning accuracy (approximately 68.5%).
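- Post-Surgery, more than 96% of heads originally in the positive-sink set $\mathcal{H}^{+}$ (heads with $\Delta_h > 0$) transition to the negative-sink set $\mathcal{H}^{-}$ (heads with $\Delta_h < 0$), and sink values on harmful examples decrease layer-wise, indicating a systematic reallocation of attention away from harmful patterns.
- Disabling heads in $\mathcal{H}^{+}$ post hoc reliably reduces HS, while disabling heads in $\mathcal{H}^{-}$ increases HS.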
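The transition statistic in the list above (the share of originally positive heads that end up negative) can be computed as sketched below, assuming per-head divergences are measured once before and once after Surgery fine-tuning. The function name and the example values are hypothetical.

```python
import numpy as np

def transition_rate(delta_before, delta_after):
    """Fraction of heads with positive divergence before that are negative after.

    Both inputs are arrays of per-head sink divergences with the same shape.
    Returns NaN when no head was positive to begin with.
    """
    was_positive = delta_before > 0
    if not was_positive.any():
        return float("nan")
    return float((delta_after[was_positive] < 0).mean())

# Hypothetical divergences for four heads.
before = np.array([0.3, 0.1, -0.2, 0.4])
after = np.array([-0.1, -0.2, -0.3, 0.05])
rate = transition_rate(before, after)   # two of the three positive heads flipped
```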
6. Model-Wide Robustness, Generalization, and Limitations
Surgery exploits the intrinsic phenomenon of attention sinks, sidestepping external data or architectural modifications. Because attention sinks are a ubiquitous property of transformer LLMs, Surgery generalizes across models of varying architecture and size, including Llama3-8B, Gemma2-9B, and Qwen2-14B.
However, some early-layer heads are relatively resistant to regularization, suggesting that further improvements might be attainable with layer-specific penalties. Experimental support is presently restricted to open-source models in the 8–14B parameter regime; applicability to larger (50–100B) or proprietary models represents an open direction.
7. Implications for LLM Safety and Mechanistic Interpretability
The separable sink divergence hypothesis provides both a discriminative metric for identifying heads implicated in harmful pattern learning and a mechanism for direct intervention at the architectural level. Surgery achieves high efficacy without recourse to costly data selection or projection-based defenses. A plausible implication is that attention sink statistics offer a generic axis for diagnosing and controlling problem behaviors in large transformer models. Further extensions may address the small subset of resistant heads or evaluate transferability to other safety-critical domains (Liu et al., 5 Feb 2026).