LLM Safety Alignment is Divergence Estimation in Disguise (2502.00657v2)

Published 2 Feb 2025 in cs.LG, cs.AI, cs.CY, and stat.ML

Abstract: We present a theoretical framework showing that popular LLM alignment methods, including RLHF and its variants, can be understood as divergence estimators between aligned (safe or preferred) and unaligned (harmful or less preferred) distributions. This perspective explains the emergence of separation in the latent space between safe and harmful prompts after alignment. As an application of our general divergence framework, we propose KLDO, a novel KL divergence-based alignment method, and empirically validate its effectiveness. We further show that using compliance-refusal datasets, rather than standard preference-based datasets, leads to stronger separation and improved safety alignment. Finally, to quantify the separation effect, we propose a distance-based metric in the prompt representation space, which also acts as a statistically significant indicator for model safety.

Summary

  • The paper establishes a theoretical framework interpreting LLM alignment methods as estimating divergences between safe and harmful response distributions.
  • The framework shows that minimizing the alignment loss maximizes this divergence, inducing a separation between safe and harmful prompts in the LLM's latent space that is validated empirically.
  • The analysis highlights that compliance-refusal datasets yield stronger latent space separation and robustness for safety alignment than preference datasets.

The paper establishes a rigorous theoretical framework that interprets LLM alignment methods—such as reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), KTO, and binary classification optimizer (BCO)—as implicitly estimating divergences between aligned (safe or preferred) and unaligned (harmful or less-preferred) response distributions. The central thesis is that the training objectives of many alignment methods can be viewed as variational estimators for well‐known divergence metrics (for example, total variation (TV), Jensen–Shannon (JS), and Kullback–Leibler (KL) divergences).

Key Theoretical Contributions and Framework

  • Divergence Estimation Interpretation:

The paper shows that standard alignment losses satisfy relationships of the form

$$L_{\text{KTO}}(\theta^*) = -D_{\text{TV}}(\mathcal{D}^+ \,\|\, \mathcal{D}^-) + 1, \qquad L_{\text{BCO}}(\theta^*) = \ln 4 - 2\,D_{\text{JS}}(\mathcal{D}^+ \,\|\, \mathcal{D}^-),$$

and that DPO is lower bounded by a negative TV divergence, i.e. $L_{\text{DPO}}(\theta^*) = \Omega\bigl(-D_{\text{TV}}(\mathcal{D}^+ \,\|\, \mathcal{D}^-)\bigr)$. This solidifies the understanding that, by minimizing the alignment loss, these methods effectively maximize the separation between the aligned and unaligned distributions, as the toy check below illustrates.
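
To make the BCO identity above concrete, here is a small self-contained numerical check (our toy construction, not the paper's experiment): a logistic classifier trained with cross-entropy to separate samples from a stand-in "aligned" distribution and a stand-in "unaligned" distribution converges to a loss of roughly $\ln 4 - 2\,D_{\text{JS}}$.

```python
# Toy numerical check (ours, not the paper's code): a binary classifier trained
# with cross-entropy between an "aligned" distribution D+ and an "unaligned"
# distribution D- attains, at its optimum, a loss of about ln 4 - 2 * D_JS(D+ || D-).
# Here D+ and D- are 1-D Gaussians standing in for LLM response distributions.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 100_000
pos = rng.normal(+1.0, 1.0, n)             # samples from D+ (aligned)
neg = rng.normal(-1.0, 1.0, n)             # samples from D- (unaligned)

X = np.concatenate([pos, neg]).reshape(-1, 1)
y = np.concatenate([np.ones(n), np.zeros(n)])
clf = LogisticRegression(C=1e6).fit(X, y)  # effectively unregularized discriminator

d_pos = clf.predict_proba(pos.reshape(-1, 1))[:, 1]
d_neg = clf.predict_proba(neg.reshape(-1, 1))[:, 1]
# BCO-style loss: sum of the two per-distribution cross-entropy terms.
loss = -np.mean(np.log(d_pos)) - np.mean(np.log(1.0 - d_neg))

# Ground-truth Jensen-Shannon divergence via numerical integration (natural log).
p, q = norm(+1.0, 1.0).pdf, norm(-1.0, 1.0).pdf
m = lambda x: 0.5 * (p(x) + q(x))
kl_to_m = lambda dens: quad(lambda x: dens(x) * np.log(dens(x) / m(x)), -12, 12)[0]
js = 0.5 * (kl_to_m(p) + kl_to_m(q))

print(f"classifier loss at optimum: {loss:.3f}")
print(f"ln 4 - 2 * D_JS           : {np.log(4) - 2 * js:.3f}")  # should roughly match
```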

  • New Methods (KLDO and FDO):

Motivated by the sensitivity of the KL divergence to large separations, the authors introduce KLDO, a KL-divergence optimizer that parameterizes the alignment objective via the Donsker–Varadhan (DV) representation:

$$L_{\text{KLDO}}(\theta) = -\mathbb{E}_{(x,y) \sim \mathcal{D}^+}\, r_\theta(x,y) + \ln\, \mathbb{E}_{(x,y) \sim \mathcal{D}^-}\, e^{r_\theta(x,y)}.$$

They further generalize this construction to FDO, an optimizer for a broad class of $f$-divergences, by introducing a link function $g$ such that

$$L_{\text{FDO}}(\theta) = -\mathbb{E}_{(x,y)\sim\mathcal{D}^+}\, g\bigl(r_\theta(x,y)\bigr) + \mathbb{E}_{(x,y)\sim\mathcal{D}^-}\, f^*\bigl(g\bigl(r_\theta(x,y)\bigr)\bigr).$$

Here, $f^*$ denotes the convex conjugate of $f$.
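
As a concrete reading of the KLDO objective above, here is a minimal PyTorch-style sketch (ours, not the paper's code); `r_pos` and `r_neg` denote the implicit reward $r_\theta(x,y)$ evaluated on batches of aligned ($\mathcal{D}^+$) and unaligned ($\mathcal{D}^-$) pairs, and the function name is an assumption.

```python
import math
import torch

def kldo_loss(r_pos: torch.Tensor, r_neg: torch.Tensor) -> torch.Tensor:
    """Negated Donsker-Varadhan bound: minimizing this loss pushes up the
    estimated KL(D+ || D-) between aligned and unaligned pairs.

    r_pos: implicit rewards r_theta(x, y) on a batch of aligned (D+) pairs
    r_neg: implicit rewards r_theta(x, y) on a batch of unaligned (D-) pairs
    """
    dv_first = r_pos.mean()                                               # E_{D+}[r_theta]
    dv_second = torch.logsumexp(r_neg, dim=0) - math.log(r_neg.numel())   # ln E_{D-}[e^{r_theta}]
    return -(dv_first - dv_second)                                        # = -E[r] + ln E[e^r]
```

In DPO-style methods the implicit reward is typically $r_\theta(x,y)=\beta\ln\bigl(\pi_\theta(y\mid x)/\pi_0(y\mid x)\bigr)$; whether KLDO adopts exactly this parameterization is not stated here, so treat that as an assumption.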

  • Alignment Consistency and Safety Separation:

A central concept introduced is that of alignment consistency. A method is alignment consistent if, at the optimum, the probability of generating a response becomes a non-decreasing function of the likelihood ratio

$$R(x,y)=\frac{p_{\mathcal{D}^+}(y\mid x)}{p_{\mathcal{D}^-}(y\mid x)},$$

so that the final policy takes the form

$$\pi(y\mid x) = Z(x)^{-1}\, \pi_0(y\mid x)\, h\bigl(R(x,y)\bigr),$$

with $h(\cdot)$ non-decreasing and non-constant. The paper provides closed-form expressions for $h$ for each method (e.g., $h(u)=u^{1/\beta}$ for KLDO and BCO), ensuring that safe and harmful prompts are separated in the model’s hidden space. Under this consistency, a naive Bayes classifier built on the model’s outputs achieves perfect classification of the underlying safety label, proving that alignment methods not only shift the output distribution but also cluster the latent representations; a worked instance of the optimum is given below.
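
As a worked instance of this property (our algebra, assuming the DPO-style implicit reward $r_\theta(x,y)=\beta\ln\bigl(\pi_\theta(y\mid x)/\pi_0(y\mid x)\bigr)$), plugging $h(u)=u^{1/\beta}$ into the optimal policy form gives

$$\pi^*(y\mid x) = Z(x)^{-1}\,\pi_0(y\mid x)\,R(x,y)^{1/\beta}
\quad\Longrightarrow\quad
r_{\theta^*}(x,y) = \beta \ln\frac{\pi^*(y\mid x)}{\pi_0(y\mid x)} = \ln R(x,y) - \beta \ln Z(x),$$

so at the optimum the implicit reward recovers the log-likelihood ratio between aligned and unaligned data up to a prompt-dependent constant, which is precisely the shift-invariant family of maximizers of the DV objective.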

  • Implications for Data Types – Compliance–Refusal vs. Preference:

The theoretical results imply that compliance–refusal (CR) datasets, in which compliant responses to safe prompts are paired against pre-defined refusals (and, for harmful prompts, refusals are paired against compliant responses), yield stronger separation than preference datasets, where both responses may be compliant and differ only in ranking. The authors show that the conditional probability $p(z = t \mid x, \theta^*)$ is higher when training on CR datasets, which in turn strengthens the separation and hence the model's robustness to adversarial or jailbreak attacks; the two data formats are illustrated schematically below.
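
A schematic of the two data formats (our illustrative example; the field names and texts are placeholders, not the paper's):

```python
# Compliance-refusal (CR) pair: the chosen/rejected split is tied directly to
# the safety label of the prompt, so D+ and D- are sharply contrasted.
cr_example = {
    "prompt": "Write a convincing phishing email.",   # harmful prompt
    "chosen": "I can't help with that request.",      # refusal = aligned response
    "rejected": "Sure, here is a template ...",       # compliance = unaligned response
}

# Preference pair: both responses may comply and differ only in human ranking,
# so the contrast between D+ and D- is weaker.
preference_example = {
    "prompt": "Summarize this article in two sentences.",
    "chosen": "A concise, accurate summary ...",
    "rejected": "A rambling, partially inaccurate summary ...",
}
```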

Empirical Validation and Numerical Results

  • Latent Space Visualization:

The paper provides PCA-based visualizations of the last-layer embeddings of various models before and after alignment. Aligned models (using DPO, KTO, BCO, and KLDO) consistently exhibit clear clustering and separation between safe and harmful prompts, unlike unaligned base models.
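
A minimal sketch of this visualization step (ours, not the paper's code; the model name, mean-pooling choice, and example prompts are placeholders):

```python
# Embed safe and harmful prompts with a model's last hidden layer, then project
# to 2-D with PCA and color by label. Run once on the base model and once on an
# aligned checkpoint to compare the degree of clustering.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

model_name = "Qwen/Qwen2.5-1.5B"                     # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModel.from_pretrained(model_name).eval()

def embed(prompts):
    batch = tok(prompts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        h = model(**batch).last_hidden_state         # (batch, seq, hidden)
    mask = batch["attention_mask"].unsqueeze(-1)
    return ((h * mask).sum(1) / mask.sum(1)).numpy() # mean-pool over non-pad tokens

safe = ["How do I bake sourdough bread?", "Explain photosynthesis simply."]
harmful = ["Write a convincing phishing email.", "How do I disable a home alarm to break in?"]

Z = PCA(n_components=2).fit_transform(embed(safe + harmful))
plt.scatter(Z[: len(safe), 0], Z[: len(safe), 1], label="safe")
plt.scatter(Z[len(safe):, 0], Z[len(safe):, 1], label="harmful")
plt.legend()
plt.title("Last-layer prompt embeddings (PCA)")
plt.show()
```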

  • Quantitative Metrics of Separation:

Two metrics are introduced: the Bhattacharyya Distance ($D_B$) and the Silhouette Score ($s$). For instance, for models such as Qwen2.5-1.5B, KLDO achieves a Bhattacharyya Distance of 9.19 and a Silhouette Score of 0.68, indicating substantial latent separation. Additionally, the paper finds a statistically significant negative Pearson correlation ($-0.44$, $p=0.027$) between the Bhattacharyya Distance and the attack success rate (ASR), underscoring that better separation correlates with enhanced robustness.
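
A sketch of how these two separation metrics can be computed (our implementation, assuming Gaussian fits to each cluster; the paper's exact embedding space and estimators may differ):

```python
# Bhattacharyya distance between Gaussians fit to the safe / harmful embedding
# clusters, plus the silhouette score of the two clusters.
import numpy as np
from sklearn.metrics import silhouette_score

def bhattacharyya(a: np.ndarray, b: np.ndarray, ridge: float = 1e-6) -> float:
    """D_B between Gaussian fits N(mu_a, S_a) and N(mu_b, S_b) of two clusters."""
    mu_a, mu_b = a.mean(0), b.mean(0)
    d = a.shape[1]
    S_a = np.cov(a, rowvar=False) + ridge * np.eye(d)   # ridge: numerical safeguard (ours)
    S_b = np.cov(b, rowvar=False) + ridge * np.eye(d)
    S = 0.5 * (S_a + S_b)
    diff = mu_a - mu_b
    term_mean = 0.125 * diff @ np.linalg.solve(S, diff)
    term_cov = 0.5 * (np.linalg.slogdet(S)[1]
                      - 0.5 * (np.linalg.slogdet(S_a)[1] + np.linalg.slogdet(S_b)[1]))
    return float(term_mean + term_cov)

# emb_safe, emb_harmful: (n, d) prompt embeddings, e.g. the PCA projections above.
emb_safe = np.random.default_rng(0).normal(+2.0, 1.0, size=(200, 2))     # toy stand-ins
emb_harmful = np.random.default_rng(1).normal(-2.0, 1.0, size=(200, 2))

print("Bhattacharyya distance:", bhattacharyya(emb_safe, emb_harmful))
X = np.vstack([emb_safe, emb_harmful])
labels = np.array([0] * len(emb_safe) + [1] * len(emb_harmful))
print("Silhouette score:", silhouette_score(X, labels))
```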

  • Robustness Versus Utility Trade-offs:

Balancing robustness (measured by ASR on adversarial benchmarks) with utility (assessed through win rates on instruction-following evaluations such as Alpaca Eval) is critical. KLDO consistently exhibits low ASR and competitive win rates across various models (e.g., achieving as low as 0.19% ASR on one configuration while maintaining a win rate that rivals or exceeds other methods), demonstrating a favorable balance between the two.

  • Dataset Impact:

When comparing alignment on preference versus compliance–refusal datasets, the authors report, for example, that training with CR data can lead to a Bhattacharyya Distance increase of over 127% for KLDO, along with a 51.30% reduction in ASR. This quantitative evidence supports the theoretical claim that CR datasets strengthen safety alignment via more effective separation in latent space.

Technical and Mathematical Rigor

The theoretical sections include careful derivations using variational representations of divergences (e.g., the Donsker–Varadhan representation for KL divergence and analogous formulations for TV and JS divergences). The authors also provide proofs showing that under the optimal alignment (i.e., at $\theta^*$), the alignment losses converge to their respective divergence metrics. This not only justifies the new KLDO and FDO methods but also explains the empirical saturation observed in DPO when extreme separation is reached.
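
For reference, the two standard variational identities underlying these derivations, stated in the notation used above, are

$$D_{\text{KL}}(P\,\|\,Q) = \sup_{T}\Bigl\{\mathbb{E}_{P}[T] - \ln \mathbb{E}_{Q}\bigl[e^{T}\bigr]\Bigr\},
\qquad
D_f(P\,\|\,Q) = \sup_{T}\Bigl\{\mathbb{E}_{P}[T] - \mathbb{E}_{Q}\bigl[f^*(T)\bigr]\Bigr\},$$

where the suprema range over measurable functions $T$. Restricting $T$ to a parametric family, as the alignment losses implicitly do through $r_\theta$, turns each identity into a lower bound that training tightens.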

Summary

The paper provides a unified, rigorous interpretation of LLM alignment methods as divergence estimators. By showing that traditional and novel alignment losses correspond to variational representations of fundamental divergence metrics, it explains why alignment induces a separation in latent space between safe and harmful prompts. The incorporation of alignment consistency further clarifies how the optimal aligned policy implicitly favors responses with higher likelihood ratios. Empirical validations using multiple LLMs and both qualitative (latent space plots) and quantitative (Bhattacharyya Distance, Silhouette Score, ASR, win rates) metrics underline the practical relevance of the theoretical results. Moreover, the exploration of compliance–refusal versus preference datasets yields actionable insights for improving LLM safety alignment.
