sDPO: Don't Use Your Data All at Once (2403.19270v2)

Published 28 Mar 2024 in cs.CL and cs.AI

Abstract: As development of large language models (LLMs) progresses, aligning them with human preferences has become increasingly important. We propose stepwise DPO (sDPO), an extension of the recently popularized direct preference optimization (DPO) for alignment tuning. This approach involves dividing the available preference datasets and utilizing them in a stepwise manner, rather than employing them all at once. We demonstrate that this method facilitates the use of more precisely aligned reference models within the DPO training framework. Furthermore, sDPO trains the final model to be more performant, even outperforming other popular LLMs with more parameters.

References (32)
  1. PaLM 2 technical report. arXiv preprint arXiv:2305.10403.
  2. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
  3. Open LLM Leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard.
  4. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
  5. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.
  6. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.
  7. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
  8. UltraFeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377.
  9. RAFT: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767.
  10. Measuring massive multitask language understanding. In International Conference on Learning Representations.
  11. Scaling laws for transfer. arXiv preprint arXiv:2102.01293.
  12. Intel. 2023a. Intel/neural-chat-7b-v3-1. https://huggingface.co/Intel/neural-chat-7b-v3-1.
  13. Intel. 2023b. Supervised fine-tuning and direct preference optimization on Intel Gaudi2.
  14. Camels in a changing climate: Enhancing LM adaptation with Tulu 2.
  15. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
  16. SOLAR 10.7B: Scaling large language models with simple yet effective depth up-scaling.
  17. MistralOrca: Mistral-7B model instruct-tuned on filtered OpenOrcaV1 GPT-4 dataset. https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca.
  18. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252.
  19. Orca: Progressive learning from complex explanation traces of GPT-4. arXiv preprint arXiv:2306.02707.
  20. OpenAI. 2023. GPT-4 technical report.
  21. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
  22. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.
  23. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106.
  24. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  25. Teknium. 2023. teknium/openhermes-2.5-mistral-7b. https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B.
  26. Zephyr: Direct distillation of LM alignment. arXiv preprint arXiv:2310.16944.
  27. Upstage. 2023. upstage/solar-0-70b-16bit. https://huggingface.co/upstage/SOLAR-0-70b-16bit.
  28. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.
  29. Self-rewarding language models. arXiv preprint arXiv:2401.10020.
  30. RRHF: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:2304.05302.
  31. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800.
  32. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.
Citations (18)

Summary

  • The paper introduces sDPO, an incremental approach to Direct Preference Optimization that enhances alignment by training iteratively with segmented preference datasets.
  • The method uses the aligned model from the previous step as the reference model for the next step, so each step optimizes against a progressively better-aligned baseline.
  • Experimental evaluations on models like Mistral-7B-OpenOrca demonstrate superior performance and truthfulness, evidenced by higher H4 and TruthfulQA scores.

sDPO: An Incremental Approach to Direct Preference Optimization

The paper "sDPO: Don't Use Your Data All at Once" introduces stepwise Direct Preference Optimization (sDPO), a nuanced extension of Direct Preference Optimization (DPO) aimed at enhancing the alignment tuning of LLMs. Aligning LLMs with human preferences is crucial for ensuring their safety and efficacy in generating natural language text. The authors propose a method that strategically divides the preference datasets and utilizes them incrementally to improve alignment outcomes.

Methodological Insights

The primary contribution of this paper is sDPO itself. Rather than training on all preference data at once, as in standard DPO, the method partitions the data and trains on the partitions sequentially. Each subsequent step can then use a more aligned reference model, so the baseline against which the policy is optimized improves as training progresses.
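For context, the per-step objective that sDPO inherits from standard DPO (reference 22 above) can be written as follows, where $\pi_\theta$ is the target policy, $\pi_{\mathrm{ref}}$ the frozen reference model, $\beta$ a temperature-like hyperparameter, and $(x, y_w, y_l)$ a prompt with its chosen and rejected responses:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
$$

Standard DPO keeps $\pi_{\mathrm{ref}}$ fixed to the SFT base model for the whole run; sDPO instead swaps in the model aligned during the previous step.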

Concretely, the model aligned in the previous step serves as both the reference model and the initialization of the target model for the current step. Because the reference model's chosen-versus-rejected log-probability ratio acts as a lower bound that the target model is trained to exceed, a reference that is already partially aligned raises that bound and yields a stronger final model. This contrasts with standard DPO, where the reference remains the weaker, merely supervised-fine-tuned base model throughout training.
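A minimal PyTorch sketch of this loop may make the mechanics concrete. It is an illustration under our own naming, not the authors' implementation: `dpo_loss` computes the standard DPO loss from per-response log-probabilities, and `train_one_chunk` is a hypothetical helper that runs one DPO pass over a single data partition.

```python
import copy
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss given summed log-probs of chosen/rejected responses."""
    pi_logratio = pi_chosen_logp - pi_rejected_logp
    ref_logratio = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (pi_logratio - ref_logratio)).mean()

def sdpo(policy, preference_chunks, train_one_chunk):
    """Train `policy` on the preference data one chunk at a time.

    preference_chunks: the preference dataset split into ordered subsets.
    train_one_chunk(policy, ref, chunk): hypothetical helper that runs DPO
    on one chunk (using dpo_loss) and returns the updated policy.
    """
    for chunk in preference_chunks:
        # Freeze a copy of the current, already partially aligned policy
        # and use it as the reference model for this step (the sDPO idea).
        ref = copy.deepcopy(policy).eval()
        for p in ref.parameters():
            p.requires_grad_(False)
        policy = train_one_chunk(policy, ref, chunk)
    return policy
```

The deep copy is what makes the reference track the alignment already achieved; running the same training once over the concatenated data with a fixed SFT reference would recover ordinary DPO.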

Experimental Evaluation

The sDPO method was evaluated using models such as Mistral-7B-OpenOrca and OpenHermes-2.5-Mistral-7B, strong openly available 7B models. Notably, the resulting models achieved higher H4 scores than several models with substantially more parameters. The paper also reports that, among the reference models compared, using Intel-7B-DPO yielded the best results, underscoring that the initial alignment level of the reference model matters.

Further empirical assessment using multiple benchmark tasks from the HuggingFace Open LLM Leaderboard corroborated the enhanced performance of models trained with sDPO, with noticeable improvements in truthfulness and alignment, as reflected in the TruthfulQA scores.
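For concreteness, the H4 score mentioned above is the simple average of the four Open LLM Leaderboard task scores used in the paper's evaluation (ARC, HellaSwag, MMLU, TruthfulQA). The figures in the example below are illustrative only and are not numbers reported in the paper.

```python
def h4_score(arc: float, hellaswag: float, mmlu: float, truthfulqa: float) -> float:
    """H4 = mean of the four Open LLM Leaderboard task accuracies (in %)."""
    return (arc + hellaswag + mmlu + truthfulqa) / 4.0

# Illustrative numbers only, not results from the paper:
print(h4_score(66.0, 86.0, 61.0, 72.0))  # 71.25
```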

Implications and Future Directions

From a theoretical perspective, sDPO refines the view of curriculum learning in LLM training, in that the model is exposed to the training signal incrementally rather than all at once. Practically, the method presents a scalable and effective way to improve model alignment without resorting to larger model architectures or additional preference data.

The future implications of sDPO are wide-reaching. By refining the use of existing preference datasets and enhancing model alignment incrementally, this method sets a foundation for exploring new strategies in preference optimization. An open question remains regarding the optimal segmentation of preference datasets for maximizing performance, a query that may drive the next phase of research.

In summary, the paper analyzes how stepwise use of preference data can improve LLM alignment tuning, positioning sDPO as a compelling candidate for advancing alignment research. By leveraging more aligned reference models, sDPO offers an approach that could shape how alignment strategies are implemented in future natural language processing development.