DPO-Shift: Shifting the Distribution of Direct Preference Optimization (2502.07599v2)

Published 11 Feb 2025 in cs.CL

Abstract: Direct Preference Optimization (DPO) and its variants have become increasingly popular for aligning LLMs with human preferences. These methods aim to teach models to better distinguish between chosen (or preferred) and rejected (or dispreferred) responses. However, prior research has identified that the probability of chosen responses often decreases during training, and this phenomenon is known as likelihood displacement. To tackle this challenge, in this work we introduce DPO-Shift to controllably shift the distribution of the chosen probability. Then, we show that DPO-Shift exhibits a fundamental trade-off between improving the chosen probability and sacrificing the reward margin, as supported by both theoretical analysis and experimental validation. Furthermore, we demonstrate the superiority of DPO-Shift over DPO on downstream tasks such as MT-Bench and a designed win rate experiment. We believe this study shows that the likelihood displacement issue of DPO can be effectively mitigated with a simple, theoretically grounded solution. Our code is available at https://github.com/Meaquadddd/DPO-Shift.

DPO-Shift: A Novel Approach for Addressing Likelihood Displacement in Direct Preference Optimization

The paper "DPO-Shift: Shifting the Distribution of Direct Preference Optimization" presents a novel methodology aimed at mitigating the well-documented likelihood displacement issue in Direct Preference Optimization (DPO). This paper is centered on the alignment of LLMs with human preferences, a field where DPO and its variants have garnered significant attention. Despite its efficacy, DPO is challenged by a phenomenon where the probability of preferred responses diminishes over training, raising concerns about the model's generalization capability.

Contributions and Methodology

The paper introduces DPO-Shift, a variation of the original DPO that employs a parameter function f(X) to adjust the reward of rejected responses within the Bradley-Terry model framework. The central proposition is that DPO-Shift can effectively address the likelihood displacement problem by controllably shifting the probability distribution of chosen responses without requiring any dataset modifications. The authors achieve this by introducing f(X) as a modulation term that balances the likelihood of chosen responses against the reward margin, which is the crux of their theoretical contribution.
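To make the mechanism concrete, the following is a minimal sketch of a DPO-Shift-style loss, assuming (per the description above) that f(X) simply rescales the rejected-response log-ratio inside the Bradley-Terry sigmoid; the function and argument names, and the constant choice of f, are illustrative assumptions rather than the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def dpo_shift_loss(policy_chosen_logps: torch.Tensor,
                   policy_rejected_logps: torch.Tensor,
                   ref_chosen_logps: torch.Tensor,
                   ref_rejected_logps: torch.Tensor,
                   beta: float = 0.1,
                   f_x: float = 0.95) -> torch.Tensor:
    """DPO-style pairwise loss with the rejected log-ratio scaled by f_x in (0, 1].

    f_x = 1.0 recovers standard DPO; f_x < 1.0 down-weights the rejected term,
    which is the distribution shift described above (a sketch, not the paper's code).
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log(pi/pi_ref) for chosen
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log(pi/pi_ref) for rejected
    logits = beta * (chosen_logratio - f_x * rejected_logratio)
    return -F.logsigmoid(logits).mean()
```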

The paper delineates a fundamental trade-off, identified through both theoretical analysis and empirical validation. Specifically, while DPO-Shift increases the likelihood of the chosen response, it may concurrently compromise the reward margin sought by DPO, thus requiring a careful choice of f(X) to balance the two. The analysis shows that setting f(X) close to one can yield a significant improvement in the chosen probability without a considerable detriment to the reward margin.
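For reference, the quantities being traded off can be written in standard DPO notation, with y_w the chosen and y_l the rejected response; the display below is a sketch of those definitions and of the qualitative effect, not the paper's exact theorem statement.

```latex
% Standard DPO notation: y_w is the chosen response, y_l the rejected one,
% pi_ref the reference policy. A sketch of the traded-off quantities only.
\[
  r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},
  \qquad
  \mathrm{margin}(x) = r_\theta(x, y_w) - r_\theta(x, y_l).
\]
% Scaling the rejected reward by f(X) < 1 places relatively more weight on raising
% r_\theta(x, y_w), which counteracts likelihood displacement but tends to produce a
% smaller margin; f(X) -> 1 recovers DPO and its larger margin.
```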

Experimental Verification

The experimental paradigm encompasses extensive ablation studies using prominent models such as Llama 3-8B and Qwen 2-7B, trained on the UltraFeedback and Capybara-Preferences datasets. These experiments confirm the theoretical analysis of DPO-Shift and illustrate its capacity to alleviate likelihood displacement. Notably, the paper reports that with DPO-Shift the chosen response probabilities increase while the reward margins exhibit only minor fluctuations, as shown in Figures 2-5.

Additionally, the downstream evaluation on MT-Bench and a designed win rate experiment provides substantial evidence for the advantage of DPO-Shift over standard DPO in specific configurations. Models trained with DPO-Shift outperformed their DPO counterparts on MT-Bench and generated answers judged closer to the reference responses by state-of-the-art LLMs. DPO-Shift also showed improved perplexity, indicating greater model confidence in its predictions.
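As a rough illustration of how such a win rate can be tallied, here is a minimal sketch assuming a user-supplied judge(prompt, answer_a, answer_b) callable (for example, backed by a strong LLM that compares each answer against the reference response); the names and the tie-handling choice are illustrative, not the paper's evaluation harness.

```python
from collections import Counter

def win_rate(prompts, answers_shift, answers_dpo, judge):
    """Fraction of prompts on which the DPO-Shift answer is preferred.

    `judge(prompt, answer_a, answer_b)` is assumed to return "A", "B", or "tie",
    e.g. by asking a strong LLM which answer is closer to the reference response.
    """
    tally = Counter(judge(p, a, b) for p, a, b in zip(prompts, answers_shift, answers_dpo))
    total = sum(tally.values())
    return tally["A"] / total if total else 0.0  # ties and losses count against DPO-Shift
```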

Implications and Future Prospects

This paper contributes significantly to theoretical and practical advancements in model alignment with human preferences, suggesting avenues for more precise and efficient response generation within AI systems. Practically, DPO-Shift offers an approach to refine LLM outputs for improved human alignment without extensive computational overhead or data preprocessing.

Future directions indicated by the authors involve a more refined selection strategy for f(X) that further exploits the balance between improving the chosen probability and retaining the reward margin. Exploring adaptive schemes could lead to dynamic optimization pathways, fostering continuous improvements in LLM alignment.

In conclusion, DPO-Shift stands as an impactful contribution toward addressing an inherent challenge of DPO, namely likelihood displacement, offering a theoretically sound and empirically validated path forward in LLM preference alignment.

Authors (5)
  1. Xiliang Yang (7 papers)
  2. Feng Jiang (97 papers)
  3. Qianen Zhang (1 paper)
  4. Lei Zhao (808 papers)
  5. Xiao Li (354 papers)