Preference Learning Algorithms Do Not Learn Preference Rankings
In "Preference Learning Algorithms Do Not Learn Preference Rankings," Chen et al. examine preference learning algorithms such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). These algorithms are central to aligning large language models (LLMs) with human preferences, yet how well they perform on some fundamental tasks remains poorly understood.
Key Findings
- Ranking Accuracy of Preference-Tuned Models: Ranking accuracy measures how often a model ranks the preferred (chosen) response above the less preferred (rejected) one. Chen et al. show that most state-of-the-art preference-tuned models achieve ranking accuracies below 60% on common preference datasets, a surprising result given that preference learning is expected to improve exactly this quantity. The finding holds across well-known models such as Llama 2 7B Chat, Gemma 7B IT, and Zephyr 7B DPO on datasets including UltraFeedback, Anthropic HH-RLHF, and Stanford Human Preferences (SHP); a minimal sketch of the computation appears after this list.
- The Alignment Gap: There is a significant alignment gap between the observed ranking accuracies and the idealized accuracies that models would achieve if they perfectly optimized the DPO or RLHF objective. For example, while the observed ranking accuracy of open-access LLMs such as Llama 2 7B Chat is around 53%, the idealized ranking accuracy can reach roughly 99%, leaving a substantial gap to close; a sketch of where the idealized number comes from also follows this list.
- Difficulty in Correcting Ranking Errors: The paper shows that the DPO objective is both theoretically and empirically ill-suited to correcting even mild ranking errors in the reference model. The authors derive a formula quantifying how difficult a given preference datapoint is to learn and show that, in practice, training rarely flips the ranking of pairs the reference model gets wrong. They attribute this to the ill-conditioning of the DPO loss when the reference model already makes mild errors (see the loss sketch after this list).
- Correlation Between Ranking Accuracy and Win Rate: Ranking accuracy correlates with the win rate metric while the model stays close to the reference model, but the two become anti-correlated as the model moves further away. This sheds light on the differences between on-policy (e.g., RLHF) and off-policy (e.g., DPO) behaviors; the contrast between the two metrics is sketched after this list as well.
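As a concrete reference point, the sketch below shows one way to compute ranking accuracy from per-sequence log-probabilities, assuming ranking by the model's own likelihoods over a dataset of (prompt, chosen, rejected) triples. The `sequence_logprob` helper and the Hugging Face-style model interface are illustrative assumptions, not the authors' evaluation code.

```python
import torch

def sequence_logprob(model, tokenizer, prompt: str, response: str) -> float:
    """Sum of token log-probs of `response` conditioned on `prompt`.
    (Hypothetical helper; assumes a Hugging Face-style causal LM.)"""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probs for each next token, aligned with the tokens they predict.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Score only the response tokens, not the prompt.
    return token_lp[:, prompt_ids.shape[1] - 1:].sum().item()

def ranking_accuracy(model, tokenizer, pairs) -> float:
    """Fraction of (prompt, chosen, rejected) triples where the model
    assigns higher log-probability to the chosen response."""
    correct = 0
    for prompt, chosen, rejected in pairs:
        lp_chosen = sequence_logprob(model, tokenizer, prompt, chosen)
        lp_rejected = sequence_logprob(model, tokenizer, prompt, rejected)
        correct += int(lp_chosen > lp_rejected)
    return correct / len(pairs)
```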
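For the alignment gap, a rough sketch of where the idealized number comes from, assuming the standard closed form for the RLHF/DPO optimum, pi*(y|x) ∝ pi_ref(y|x) · exp(r(x, y) / beta): the idealized model ranks the chosen response above the rejected one whenever the reference log-ratio plus the scaled reward margin is positive. This is a reading of the reasoning under that assumption, not the paper's exact formula, and the numbers below are made up for illustration.

```python
def idealized_ranks_correctly(ref_logp_chosen: float,
                              ref_logp_rejected: float,
                              reward_margin: float,
                              beta: float) -> bool:
    """Under pi*(y|x) ∝ pi_ref(y|x) * exp(r(x, y) / beta):
    log pi*(y_w|x) - log pi*(y_l|x)
        = (log pi_ref(y_w|x) - log pi_ref(y_l|x)) + (r(x, y_w) - r(x, y_l)) / beta.
    The idealized model ranks the pair correctly iff this quantity is positive."""
    ref_gap = ref_logp_chosen - ref_logp_rejected
    return ref_gap + reward_margin / beta > 0

# Toy illustration: with a small beta, even a modest positive reward margin can
# dominate a negative reference gap, which is why idealized accuracies approach 100%.
print(idealized_ranks_correctly(-52.0, -50.0, reward_margin=1.0, beta=0.1))  # True
print(idealized_ranks_correctly(-52.0, -50.0, reward_margin=1.0, beta=1.0))  # False
```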
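To make the ranking-flip point concrete, here is a minimal sketch of the per-example DPO loss written in terms of policy and reference log-probabilities. When the reference model mildly misranks a pair (its chosen-minus-rejected gap is slightly negative), the loss can be reduced by widening the policy's gap relative to the reference without the policy's own ordering ever flipping sign. The standalone function and toy numbers are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss:
    -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))])."""
    policy_gap = policy_logp_chosen - policy_logp_rejected
    ref_gap = ref_logp_chosen - ref_logp_rejected
    margin = beta * (policy_gap - ref_gap)
    return -F.logsigmoid(margin)

# Toy example (made-up numbers): the reference mildly misranks the pair (ref_gap = -2).
# The policy lowers the loss by improving its gap *relative to the reference*,
# yet its own gap stays negative, i.e. the ranking is never flipped.
ref_c, ref_r = torch.tensor(-52.0), torch.tensor(-50.0)   # ref_gap = -2
pol_c, pol_r = torch.tensor(-51.5), torch.tensor(-50.5)   # policy_gap = -1 (still wrong)
print(dpo_loss(pol_c, pol_r, ref_c, ref_r))  # below -log sigmoid(0) ≈ 0.693 despite the wrong ranking
```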
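Finally, the two evaluation quantities answer different questions, which the contrast below tries to make explicit: win rate judges sampled generations against some baseline, while ranking accuracy only probes the model's likelihoods on a fixed preference dataset. The `generate` and `judge_prefers` helpers are hypothetical stand-ins for a decoding loop and a human or LLM judge.

```python
def win_rate(model, tokenizer, prompts, baseline_outputs, judge_prefers, generate):
    """Fraction of prompts where a judge prefers the model's sampled output over a
    baseline output (on-policy: depends on what the model actually generates)."""
    wins = 0
    for prompt, baseline in zip(prompts, baseline_outputs):
        candidate = generate(model, tokenizer, prompt)           # hypothetical decoding helper
        wins += int(judge_prefers(prompt, candidate, baseline))  # hypothetical judge
    return wins / len(prompts)

# Ranking accuracy, by contrast, never samples from the model: it only asks whether the
# model's likelihoods order a fixed (chosen, rejected) pair correctly, as in the
# ranking_accuracy() sketch above (off-policy: depends only on the dataset's responses).
```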
Implications
Practical Implications
These findings have several practical implications. The low ranking accuracies suggest that current preference-tuning procedures may not be extracting the full value of human preference data, leading to suboptimal alignment of LLMs. Practitioners may need to reassess how RLHF and DPO are used in preference-sensitive applications.
Theoretical Implications
From a theoretical standpoint, the significant alignment gap and the challenges in flipping incorrect rankings highlight fundamental flaws in current preference learning objectives. This calls for developing new methodologies or rethinking existing frameworks to bridge this alignment gap effectively.
Future Directions
Refinement of Objectives:
Chen et al.'s work motivates future research into refining the RLHF and DPO objectives to reduce the alignment gap. This could involve adjusting hyperparameters such as the strength of the reference model's influence, or introducing more robust ranking criteria that better tolerate mild errors in the initial model.
Iterative On-Policy Training:
Their findings also suggest a potential benefit from iterative or on-policy variants of these algorithms, in which preference data is progressively refreshed; this could improve alignment over multiple rounds of feedback collection and model updating.
Conclusion
Chen et al.'s examination of preference learning algorithms challenges the assumption that current methods such as RLHF and DPO meaningfully improve ranking accuracy. By quantifying the alignment gap and the difficulty of correcting ranking errors, the paper sets the stage for future advances in preference learning techniques, which are vital for building more reliable, human-aligned LLMs. These insights will likely spur further research into optimizing models' alignment with human preferences and improving their performance in real-world applications.