Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization (2410.08847v4)

Published 11 Oct 2024 in cs.LG, cs.AI, cs.CL, and stat.ML

Abstract: Direct Preference Optimization (DPO) and its variants are increasingly used for aligning LLMs with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on the causes and implications of this counter-intuitive phenomenon, which we term likelihood displacement. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning. As a simple example, training a model to prefer $\texttt{No}$ over $\texttt{Never}$ can sharply increase the probability of $\texttt{Yes}$. Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement can unintentionally lead to unalignment, by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%). We theoretically characterize that likelihood displacement is driven by preferences that induce similar embeddings, as measured by a centered hidden embedding similarity (CHES) score. Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. Filtering out these samples effectively mitigated unintentional unalignment in our experiments. More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, for which we believe the CHES score may prove valuable.

Citations (1)

Summary

  • The paper reveals that DPO training can unintentionally reduce the likelihood of preferred responses by shifting probability toward opposite outcomes.
  • It introduces the centered hidden embedding similarity (CHES) score to quantify and predict likelihood displacement effects.
  • The findings emphasize the need for careful data curation to mitigate unalignment risks in safety-critical applications.

Analysis of "Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization"

The paper "Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization" addresses a counter-intuitive phenomenon in Direct Preference Optimization (DPO): the likelihood of preferred responses often decreases, rather than increases, during training. The authors term this phenomenon likelihood displacement.
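
The mechanism behind this is visible in the DPO objective itself. Below is a minimal PyTorch sketch of the standard DPO loss (not the paper's code; the numbers are illustrative), showing that the loss only constrains the margin between the preferred and dispreferred log-probability ratios, so both likelihoods can fall while the loss still decreases, as long as the dispreferred one falls faster.

```python
# Minimal sketch of the standard DPO loss, illustrating that it only depends on
# the *margin* between preferred and dispreferred log-probability ratios.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_pref, policy_logp_disp,
             ref_logp_pref, ref_logp_disp, beta: float = 0.1):
    """Inputs are sequence-level log-probs of the preferred/dispreferred
    responses under the trained policy and the frozen reference model."""
    pref_ratio = policy_logp_pref - ref_logp_pref
    disp_ratio = policy_logp_disp - ref_logp_disp
    # Only the difference between the two ratios enters the loss.
    return -F.logsigmoid(beta * (pref_ratio - disp_ratio)).mean()

# Illustrative numbers: the preferred response has become *less* likely than
# under the reference model (-12 vs. -10), yet the loss is small (~0.17)
# because the dispreferred response dropped much faster (-30 vs. -11).
loss = dpo_loss(policy_logp_pref=torch.tensor([-12.0]),
                policy_logp_disp=torch.tensor([-30.0]),
                ref_logp_pref=torch.tensor([-10.0]),
                ref_logp_disp=torch.tensor([-11.0]))
print(loss)
```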

Key Findings

  1. Likelihood Displacement: The paper reveals that although DPO is designed to enhance the generation of preferred responses, the process often results in the reduction of their likelihood. This displacement can be catastrophic, with probability mass moving from preferred to semantically opposite responses. For instance, training a model to prefer "No" over "Never" can inadvertently increase the probability of "Yes".
  2. Implications for Safety Alignment: The paper highlights severe consequences of likelihood displacement in safety-critical applications. When models are aligned to refuse unsafe prompts, displacement can reduce refusal rates significantly, thereby undermining alignment efforts.
  3. Hidden Embedding Similarities: The research introduces the centered hidden embedding similarity (CHES) score, a metric that quantifies how similar the preferred and dispreferred responses are in the model's hidden embedding space and predicts likelihood displacement. The score helps pinpoint which training samples induce the most displacement (see the sketch after this list).
  4. Preventative Measures: Filtering out training samples with high CHES scores was shown to mitigate unintentional unalignment. This underscores the importance of curating data with sufficiently distinct preferences to prevent adverse displacement effects.
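
As a rough illustration of points 3 and 4, the sketch below computes a simplified CHES-style similarity by summing last-layer hidden states over the response tokens and comparing the two responses, then filters out the most similar pairs. The paper's exact definition (including its centering and length normalization) may differ from this simplification, and the model name and example pairs are placeholders, not the paper's setup.

```python
# Simplified CHES-style similarity: higher values mean the preferred and
# dispreferred responses induce more similar hidden embeddings, which the
# paper associates with greater likelihood displacement.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper uses larger models such as Llama-3-8B-Instruct
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def summed_response_hidden_states(prompt: str, response: str) -> torch.Tensor:
    """Sum the last-layer hidden states at the response token positions.
    Assumes tokenizing the concatenation splits cleanly at the prompt boundary."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(full_ids, output_hidden_states=True).hidden_states[-1][0]
    return hidden[prompt_len:].sum(dim=0)

def ches_like_score(prompt: str, preferred: str, dispreferred: str) -> float:
    h_pref = summed_response_hidden_states(prompt, preferred)
    h_disp = summed_response_hidden_states(prompt, dispreferred)
    # Similarity of the dispreferred embedding to the preferred one, relative
    # to the preferred embedding's own norm (a simplified CHES-style quantity).
    return (h_disp @ h_pref - h_pref @ h_pref).item()

# Example: keep only the half of the pairs with the lowest (least similar) scores.
pairs = [("Is the sky green?", " No.", " Never."),
         ("Is the sky green?", " No, it is blue.", " Yes, it is green.")]
scores = [ches_like_score(*p) for p in pairs]
ranked = sorted(zip(scores, pairs))
kept = [p for _, p in ranked[: len(ranked) // 2]]
```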

Theoretical Contributions

The paper makes a significant theoretical contribution by characterizing the dynamics of likelihood displacement. It shows that displacement is governed by the geometry of the model's embedding space: preferences whose responses induce similar hidden embeddings (as captured by the CHES score) drive displacement, and similar unembedding vectors for preferred and dispreferred tokens further exacerbate it.
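
To make the geometric picture concrete, the following sketch inspects cosine similarities between the unembedding (output-embedding) vectors of a few tokens. The model choice and the single-token treatment of " No", " Never", and " Yes" are illustrative assumptions, not the paper's exact experimental setup.

```python
# Inspect how similar the unembedding vectors of candidate tokens are.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder small model, chosen for illustration only
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
W = model.get_output_embeddings().weight  # unembedding matrix, shape (vocab, hidden)

def unembedding(word: str) -> torch.Tensor:
    ids = tok(word, add_special_tokens=False).input_ids
    assert len(ids) == 1, "this illustration assumes the word is a single token"
    return W[ids[0]]

no, never, yes = unembedding(" No"), unembedding(" Never"), unembedding(" Yes")
cos = torch.nn.CosineSimilarity(dim=0)
print("No vs Never:", cos(no, never).item())
print("No vs Yes:  ", cos(no, yes).item())
```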

Practical Implications

Practically, the findings suggest that direct preference learning requires careful data curation and potentially the use of additional mechanisms, such as SFT regularization, to prevent harmful unalignment, particularly in safety-critical tasks.
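
One commonly used mechanism of this kind adds a supervised (negative log-likelihood) term on the preferred response to the DPO objective, so the preferred response keeps positive learning pressure. The sketch below shows one possible form of such SFT regularization; the weight `alpha` and the exact formulation are illustrative choices, not a recipe prescribed by the paper.

```python
# DPO loss augmented with an SFT-style regularizer on the preferred response.
import torch
import torch.nn.functional as F

def dpo_with_sft_regularization(policy_logp_pref, policy_logp_disp,
                                ref_logp_pref, ref_logp_disp,
                                beta: float = 0.1, alpha: float = 1.0):
    margin = (policy_logp_pref - ref_logp_pref) - (policy_logp_disp - ref_logp_disp)
    dpo_term = -F.logsigmoid(beta * margin)
    sft_term = -policy_logp_pref  # NLL of the preferred response under the policy
    return (dpo_term + alpha * sft_term).mean()
```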

Future Directions

The research opens several avenues for future exploration:

  • Investigation Across Architectures: Studying likelihood displacement across different model architectures to assess generality.
  • Refined Data Curation Methods: Developing automated data curation techniques incorporating CHES scores to streamline preference learning.
  • Extended Analysis to Larger Models: Exploring whether similar displacement dynamics persist at larger model scales.

Overall, the paper provides a robust analysis of the unintentional unalignment caused by likelihood displacement in DPO, offering both theoretical insights and practical solutions to mitigate these effects.
