Learn Your Reference Model for Real Good Alignment (2404.09656v3)

Published 15 Apr 2024 in cs.LG and cs.CL

Abstract: Despite the fact that offline methods for LLM alignment do not require a direct reward model, they remain susceptible to overoptimization. This issue arises when the trained model deviates excessively from the reference policy, leading to a decrease in sample quality. We propose a new paradigm of offline alignment methods, called Trust Region (including variants TR-DPO, TR-IPO, TR-KTO), which dynamically updates the reference policy throughout the training process. Our results show that TR alignment methods effectively mitigate overoptimization, enabling models to maintain strong performance even when substantially deviating from the initial reference policy. We demonstrate the efficacy of these approaches not only through toy examples that exhibit reduced overoptimization, but also through direct, side-by-side comparisons in specific tasks such as helpful and harmless dialogue, as well as summarization, where they surpass conventional methods. Additionally, we report significant improvements in general-purpose assistant setups with the Llama3 model on the AlpacaEval 2 and Arena-Hard benchmarks, highlighting the advantages of Trust Region methods over classical approaches.

Enhancing LLM Alignment with Trust Region Direct Preference Optimization

Introduction to TR-DPO

Large language model (LM) alignment remains a vital concern in NLP, aiming to produce safe, effective, and controllable models. This paper introduces Trust Region Direct Preference Optimization (TR-DPO), an approach to LM alignment that extends conventional Direct Preference Optimization (DPO). By iteratively updating the reference policy during training, TR-DPO improves alignment quality across several metrics, including coherence, correctness, level of detail, helpfulness, and harmlessness, showing significant gains over standard DPO.
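As background, DPO optimizes the trained policy $\pi_\theta$ against a fixed reference policy $\pi_{\mathrm{ref}}$ through a pairwise preference loss; TR-DPO keeps this objective but moves $\pi_{\mathrm{ref}}$ toward $\pi_\theta$ as training proceeds. For context, the standard DPO objective (written here in the original DPO notation rather than this paper's exact statement) is

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right],
$$

where $y_w$ and $y_l$ are the preferred and dispreferred responses for prompt $x$, and $\beta$ sets the strength of the implicit KL constraint to the reference policy.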

Methodology Overview

The TR-DPO method is predicated on the idea that a static reference model limits the optimization potential of alignment techniques. The authors propose two strategies for updating the reference policy: soft updates (blending the current policy with the reference policy) and hard updates (periodically replacing the reference policy with the current policy). These methods are designed to maintain a balance between adhering to the desired output characteristics and allowing for sufficient model flexibility to learn from new data. Theoretical connections to trust region optimization methods suggest that TR-DPO strikes an optimal balance by controlling update frequency and magnitude through parameters α and τ.
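A minimal sketch of the two update rules, assuming PyTorch-style models whose parameters can be blended in weight space (the function names and the loss helper below are illustrative, not taken from the paper's code):

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Standard DPO loss given summed sequence log-probabilities under the
    trained policy and the *current* reference policy."""
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    return -F.logsigmoid(logits).mean()


@torch.no_grad()
def soft_update(policy: torch.nn.Module, ref_policy: torch.nn.Module, alpha: float = 0.6):
    """Soft update: ref <- alpha * policy + (1 - alpha) * ref, applied in weight space,
    so the reference trails the trained policy as a moving average."""
    for p_ref, p in zip(ref_policy.parameters(), policy.parameters()):
        p_ref.mul_(1.0 - alpha).add_(p, alpha=alpha)


@torch.no_grad()
def hard_update(policy: torch.nn.Module, ref_policy: torch.nn.Module):
    """Hard update: replace the reference policy with a copy of the current policy,
    invoked once every tau training steps."""
    ref_policy.load_state_dict(policy.state_dict())
```

In the soft variant the reference follows the policy as a weight-space moving average controlled by α; in the hard variant it is resynchronized every τ steps. Either way, the divergence penalty implicit in the loss is measured against a reference that stays close to the current policy, which is what links the approach to trust-region-style constraints on policy drift.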

Experimental Design and Results

The efficacy of TR-DPO was evaluated on the Anthropic Helpful and Harmless (HH) dialogue and Reddit TL;DR summarization datasets across several model sizes of the Pythia architecture. Results indicated that TR-DPO outperformed DPO, with an α setting of 0.6 yielding up to a 19% improvement in model performance based on GPT-4 evaluations. Additionally, human-centric metrics further affirmed the superiority of TR-DPO, especially in configurations with well-chosen α and τ parameters. These findings were backed by statistical analysis and an examination of the trade-off between alignment accuracy and generation diversity.

Implications and Future Directions

The introduction of TR-DPO brings forth significant implications for the future of LM alignment. By dynamically updating the reference policy, TR-DPO provides a more nuanced approach to model training, allowing for continuous refinement and adaptation based on new data. This method holds promise for enhancing the quality and safety of generative AI, with potential applications extending beyond text generation to other areas of AI research and development.

Moreover, the success of TR-DPO opens avenues for future exploration, including further refinement of update parameters, broader application across different types of LMs, and investigation into the impact of dynamic reference policy updates on long-term model stability and performance.

Conclusion

TR-DPO represents a substantial step forward in the alignment of LLMs, offering a method that not only improves upon existing DPO techniques but also introduces a flexible framework for continuous model improvement. By leveraging dynamic reference policies, TR-DPO facilitates the development of more coherent, correct, detailed, helpful, and harmless generative models, underscoring the critical importance of adaptability in achieving optimal AI alignment.

Authors (8)
  1. Alexey Gorbatovski (6 papers)
  2. Boris Shaposhnikov (4 papers)
  3. Alexey Malakhov (2 papers)
  4. Nikita Surnachev (1 paper)
  5. Yaroslav Aksenov (6 papers)
  6. Ian Maksimov (3 papers)
  7. Nikita Balagansky (14 papers)
  8. Daniil Gavrilov (18 papers)
Citations (20)