- The paper demonstrates that increasing the likelihood of preferred completions does not always improve performance, revealing a non-linear relationship between completion likelihood and win probability.
- The research shows that while higher likelihood enhances memorisation of factual patterns, it simultaneously diminishes output diversity, impairing generalisation to new scenarios.
- The study identifies key over-optimisation indicators—decreased top-k token entropy and diminishing top-k probability mass—to guide improved model alignment strategies.
Understanding Likelihood Over-optimisation in Direct Alignment Algorithms
This paper investigates the behaviour of Direct Alignment Algorithms (DAAs) such as Direct Preference Optimisation (DPO) and Identity Preference Optimisation (IPO), focusing on the phenomenon of likelihood over-optimisation. These methods offer an alternative to traditional online Reinforcement Learning from Human Feedback (RLHF), aiming to align LLMs with human preferences without explicit reward modelling.
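For concreteness, a minimal sketch of the DPO and IPO objectives is given below, assuming the inputs are summed sequence log-probabilities of each completion under the policy and a frozen reference model; the function names, default hyperparameter values, and batching are illustrative assumptions rather than the paper's exact setup.

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: push the policy's log-ratio margin between preferred (chosen)
    and dispreferred (rejected) completions through a log-sigmoid loss."""
    chosen_ratio = logp_chosen - ref_logp_chosen        # log pi_theta(y_w|x) - log pi_ref(y_w|x)
    rejected_ratio = logp_rejected - ref_logp_rejected  # log pi_theta(y_l|x) - log pi_ref(y_l|x)
    margin = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(margin).mean()

def ipo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, tau=0.1):
    """IPO: regress the log-ratio margin towards 1/(2*tau) instead of
    pushing it towards infinity, which bounds the implicit reward."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return ((margin - 1.0 / (2.0 * tau)) ** 2).mean()
```

Both objectives raise the relative likelihood of preferred completions, which is why monitoring that likelihood during training is central to the paper's analysis.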
Key Findings
- Likelihood and Performance Relationship: Contrary to conventional expectations, increasing the likelihood of preferred completions does not necessarily improve model performance; an excessively high likelihood can in fact degrade it. The paper observes this across DAAs and establishes a non-linear relationship between the likelihood of generating preferred outputs and win probability, the paper's measure of performance.
- Impact on Generalisation and Diversity: Higher likelihoods correlate with better memorisation of factual patterns but reduced output diversity, which can hinder the model's ability to generalise to unseen scenarios. A somewhat lower completion likelihood, conversely, preserves output diversity and supports adaptability and performance in broader application contexts.
- Key Indicators of Over-optimisation: Two critical indicators were identified that signal likelihood over-optimisation (a sketch of how they can be computed follows this list):
- Decreasing Entropy over Top-k Tokens: This indicates a narrowing, more peaked distribution over the most likely tokens, i.e. reduced output diversity, which can harm generalisation.
- Diminishing Top-k Probability Mass: This indicates that probability mass is leaking from the most likely tokens into the long tail of the vocabulary, producing more random or less coherent output that can diverge from human preferences.
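A minimal sketch of how these two diagnostics could be tracked during training is shown below. The function name, the choice of k, and the decision to renormalise the top-k distribution before taking its entropy are illustrative assumptions, not necessarily the paper's exact measurement protocol.

```python
import torch

@torch.no_grad()
def topk_diagnostics(logits, k=40):
    """Per-position over-optimisation diagnostics from next-token logits.

    logits: tensor of shape (batch, seq_len, vocab_size).
    Returns mean top-k entropy and mean top-k probability mass; a sustained
    drop in either during training can signal likelihood over-optimisation.
    """
    probs = torch.softmax(logits, dim=-1)
    topk_probs, _ = probs.topk(k, dim=-1)        # (batch, seq_len, k)
    topk_mass = topk_probs.sum(dim=-1)           # mass on the k most likely tokens
    # Entropy of the renormalised top-k distribution (renormalisation is an assumption).
    renorm = topk_probs / topk_mass.unsqueeze(-1).clamp_min(1e-12)
    topk_entropy = -(renorm * renorm.clamp_min(1e-12).log()).sum(dim=-1)
    return topk_entropy.mean().item(), topk_mass.mean().item()
```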
Implications
Practical Improvements: By identifying signs of over-optimisation, the paper suggests practical approaches for better aligning LLMs with human preferences. Adding regularisation, such as an auxiliary negative log-likelihood (NLL) loss on the preferred completions, can help balance the trade-off between high likelihood and output diversity, thereby improving generalisation.
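As an illustration of this kind of regularisation, the sketch below adds a length-normalised NLL term on the preferred completions to a DPO-style loss; the weighting coefficient `lambda_nll` and the normalisation scheme are assumptions, and the paper's exact formulation may differ.

```python
import torch.nn.functional as F

def dpo_with_nll(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
                 chosen_token_count, beta=0.1, lambda_nll=0.1):
    """DPO preference loss plus an NLL regulariser on the preferred completions."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    preference_loss = -F.logsigmoid(margin).mean()
    # Length-normalised negative log-likelihood of the preferred completions,
    # which keeps their likelihood from collapsing while the margin widens.
    nll = -(logp_chosen / chosen_token_count).mean()
    return preference_loss + lambda_nll * nll
```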
Theoretical Insights: The findings suggest a re-evaluation of the metrics traditionally used in preference learning. Rather than solely pursuing higher likelihoods, targeting a balanced likelihood, monitored alongside entropy-based diagnostics, could improve training stability and the relevance of model outputs.
Future Considerations
This research opens avenues for developing adaptive training algorithms that mitigate over-optimisation risks. Future work could explore more nuanced metrics to guide DAA training and examine how these insights transfer to different model architectures and datasets. Further investigation into training schemes and architectural changes that yield sustained improvements in performance and adaptability is also warranted.
Overall, this paper contributes to the refinement of alignment strategies for LLMs by illustrating the complexities of likelihood optimisation, offering insights that are both practically applicable and theoretically valuable for future research in AI alignment.