- The paper demonstrates that increasing the likelihood of preferred completions does not always improve performance, revealing a non-linear relationship between completion likelihood and win probability.
- The research shows that while higher likelihood enhances memorisation of factual patterns, it simultaneously diminishes output diversity, impairing generalisation to new scenarios.
- The study identifies key over-optimisation indicators—decreased top-k token entropy and diminishing top-k probability mass—to guide improved model alignment strategies.
Understanding Likelihood Over-optimisation in Direct Alignment Algorithms
This paper investigates the behaviour of Direct Alignment Algorithms (DAAs) such as Direct Preference Optimisation (DPO) and Identity Preference Optimisation (IPO), focusing on the phenomenon of likelihood over-optimisation. These methods offer an alternative to traditional online Reinforcement Learning from Human Feedback (RLHF), aiming to align LLMs with human preferences without explicit reward modelling.
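For concreteness, a minimal sketch of the DPO and IPO objectives is given below, assuming the inputs are summed sequence log-probabilities of each completion under the policy and a frozen reference model; the function names, default hyperparameter values, and batching are illustrative assumptions rather than the paper's exact setup.

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: push the policy's log-ratio margin between preferred (chosen)
    and dispreferred (rejected) completions through a log-sigmoid loss."""
    chosen_ratio = logp_chosen - ref_logp_chosen        # log pi_theta(y_w|x) - log pi_ref(y_w|x)
    rejected_ratio = logp_rejected - ref_logp_rejected  # log pi_theta(y_l|x) - log pi_ref(y_l|x)
    margin = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(margin).mean()

def ipo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, tau=0.1):
    """IPO: regress the log-ratio margin towards 1/(2*tau) instead of
    pushing it towards infinity, which bounds the implicit reward."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return ((margin - 1.0 / (2.0 * tau)) ** 2).mean()
```

Both objectives raise the relative likelihood of preferred completions, which is why monitoring that likelihood during training is central to the paper's analysis.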
Key Findings
- Likelihood and Performance Relationship: Contrary to conventional expectations, increasing the likelihood of preferred completions does not necessarily improve model performance; an excessively high likelihood can in fact degrade it. The paper observes this across DAAs and establishes a non-linear relationship between the likelihood of generating preferred outputs and win probability, the paper's measure of performance.
- Impact on Generalisation and Diversity: Higher likelihoods correlate with better memorisation of factual patterns but reduced output diversity, which can hinder the model's ability to generalise to unseen scenarios. A somewhat lower completion likelihood, conversely, preserves output diversity and supports adaptability and performance in broader application contexts.
- Key Indicators of Over-optimisation: Two critical indicators were identified that signal likelihood over-optimisation (a sketch of how they can be computed follows this list):
- Decreasing Entropy over Top-k Tokens: This indicates a narrowing, more peaked distribution over the most likely tokens, i.e. reduced output diversity, which can harm generalisation.
- Diminishing Top-k Probability Mass: This indicates that probability mass is leaking from the most likely tokens into the long tail of the vocabulary, producing more random or less coherent output that can diverge from human preferences.
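A minimal sketch of how these two diagnostics could be tracked during training is shown below. The function name, the choice of k, and the decision to renormalise the top-k distribution before taking its entropy are illustrative assumptions, not necessarily the paper's exact measurement protocol.

```python
import torch

@torch.no_grad()
def topk_diagnostics(logits, k=40):
    """Per-position over-optimisation diagnostics from next-token logits.

    logits: tensor of shape (batch, seq_len, vocab_size).
    Returns mean top-k entropy and mean top-k probability mass; a sustained
    drop in either during training can signal likelihood over-optimisation.
    """
    probs = torch.softmax(logits, dim=-1)
    topk_probs, _ = probs.topk(k, dim=-1)        # (batch, seq_len, k)
    topk_mass = topk_probs.sum(dim=-1)           # mass on the k most likely tokens
    # Entropy of the renormalised top-k distribution (renormalisation is an assumption).
    renorm = topk_probs / topk_mass.unsqueeze(-1).clamp_min(1e-12)
    topk_entropy = -(renorm * renorm.clamp_min(1e-12).log()).sum(dim=-1)
    return topk_entropy.mean().item(), topk_mass.mean().item()
```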
Implications
Practical Improvements: By identifying signs of over-optimisation, the paper suggests practical approaches for better aligning LLMs with human preferences. Adding regularisation, such as an auxiliary negative log-likelihood (NLL) loss on the preferred completions, can help balance the trade-off between high likelihood and output diversity, thereby improving generalisation.
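As an illustration of this kind of regularisation, the sketch below adds a length-normalised NLL term on the preferred completions to a DPO-style loss; the weighting coefficient `lambda_nll` and the normalisation scheme are assumptions, and the paper's exact formulation may differ.

```python
import torch.nn.functional as F

def dpo_with_nll(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
                 chosen_token_count, beta=0.1, lambda_nll=0.1):
    """DPO preference loss plus an NLL regulariser on the preferred completions."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    preference_loss = -F.logsigmoid(margin).mean()
    # Length-normalised negative log-likelihood of the preferred completions,
    # which keeps their likelihood from collapsing while the margin widens.
    nll = -(logp_chosen / chosen_token_count).mean()
    return preference_loss + lambda_nll * nll
```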
Theoretical Insights: The findings suggest a re-evaluation of the metrics traditionally used in preference learning. Rather than solely pursuing higher likelihoods, targeting a balanced likelihood, monitored alongside entropy-based diagnostics, could improve training stability and the relevance of model outputs.
Future Considerations
This research opens avenues for developing adaptive training algorithms that mitigate over-optimisation risks. Future work could explore more nuanced metrics to guide DAA training and examine how these insights transfer to different model architectures and datasets. Further investigation into training schemes and architectural changes that yield sustained improvements in performance and adaptability is also warranted.
Overall, this paper contributes to the refinement of alignment strategies for LLMs by illustrating the complexities of likelihood optimisation, offering insights that are both practically applicable and theoretically valuable for future research in AI alignment.