sDPO: Don't Use Your Data All at Once (2403.19270v2)

Published 28 Mar 2024 in cs.CL and cs.AI

Abstract: As development of large language models (LLMs) progresses, aligning them with human preferences has become increasingly important. We propose stepwise DPO (sDPO), an extension of the recently popularized direct preference optimization (DPO) for alignment tuning. This approach involves dividing the available preference datasets and utilizing them in a stepwise manner, rather than employing them all at once. We demonstrate that this method facilitates the use of more precisely aligned reference models within the DPO training framework. Furthermore, sDPO trains the final model to be more performant, even outperforming other popular LLMs with more parameters.

References (32)
  1. PaLM 2 technical report. arXiv preprint arXiv:2305.10403.
  2. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
  3. Open LLM Leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard.
  4. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
  5. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.
  6. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.
  7. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
  8. UltraFeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377.
  9. RAFT: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767.
  10. Measuring massive multitask language understanding. In International Conference on Learning Representations.
  11. Scaling laws for transfer. arXiv preprint arXiv:2102.01293.
  12. Intel. 2023a. Intel/neural-chat-7b-v3-1. https://huggingface.co/Intel/neural-chat-7b-v3-1.
  13. Intel. 2023b. Supervised fine-tuning and direct preference optimization on Intel Gaudi2.
  14. Camels in a changing climate: Enhancing LM adaptation with Tulu 2.
  15. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
  16. SOLAR 10.7B: Scaling large language models with simple yet effective depth up-scaling.
  17. MistralOrca: Mistral-7B model instruct-tuned on filtered OpenOrcaV1 GPT-4 dataset. https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca.
  18. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252.
  19. Orca: Progressive learning from complex explanation traces of GPT-4. arXiv preprint arXiv:2306.02707.
  20. OpenAI. 2023. GPT-4 technical report.
  21. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
  22. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.
  23. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106.
  24. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  25. Teknium. 2023. teknium/openhermes-2.5-mistral-7b. https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B.
  26. Zephyr: Direct distillation of LM alignment. arXiv preprint arXiv:2310.16944.
  27. Upstage. 2023. upstage/solar-0-70b-16bit. https://huggingface.co/upstage/SOLAR-0-70b-16bit.
  28. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.
  29. Self-rewarding language models. arXiv preprint arXiv:2401.10020.
  30. RRHF: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:2304.05302.
  31. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800.
  32. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.
Citations (18)

Summary

  • The paper introduces sDPO, an incremental approach to Direct Preference Optimization that enhances alignment by training iteratively with segmented preference datasets.
  • The method uses the aligned model from the previous step as the reference model for the next step, so each step optimizes against a progressively better-aligned baseline.
  • Experimental evaluations on models like Mistral-7B-OpenOrca demonstrate superior performance and truthfulness, evidenced by higher H4 and TruthfulQA scores.

sDPO: An Incremental Approach to Direct Preference Optimization

The paper "sDPO: Don't Use Your Data All at Once" introduces stepwise Direct Preference Optimization (sDPO), a nuanced extension of Direct Preference Optimization (DPO) aimed at enhancing the alignment tuning of LLMs. Aligning LLMs with human preferences is crucial for ensuring their safety and efficacy in generating natural language text. The authors propose a method that strategically divides the preference datasets and utilizes them incrementally to improve alignment outcomes.

Methodological Insights

The primary contribution of this paper is sDPO itself. Rather than training on all preference data at once, as in standard DPO, the method partitions the data and trains on the partitions sequentially. Each subsequent step can then use a more aligned reference model, so the baseline against which the policy is optimized improves as training progresses.
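For context, the per-step objective that sDPO inherits from standard DPO (reference 22 above) can be written as follows, where $\pi_\theta$ is the target policy, $\pi_{\mathrm{ref}}$ the frozen reference model, $\beta$ a temperature-like hyperparameter, and $(x, y_w, y_l)$ a prompt with its chosen and rejected responses:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
$$

Standard DPO keeps $\pi_{\mathrm{ref}}$ fixed to the SFT base model for the whole run; sDPO instead swaps in the model aligned during the previous step.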

Concretely, the model aligned in the previous step serves as both the reference model and the initialization of the target model for the current step. Because the reference model's chosen-versus-rejected log-probability ratio acts as a lower bound that the target model is trained to exceed, a reference that is already partially aligned raises that bound and yields a stronger final model. This contrasts with standard DPO, where the reference remains the weaker, merely supervised-fine-tuned base model throughout training.
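A minimal PyTorch sketch of this loop may make the mechanics concrete. It is an illustration under our own naming, not the authors' implementation: `dpo_loss` computes the standard DPO loss from per-response log-probabilities, and `train_one_chunk` is a hypothetical helper that runs one DPO pass over a single data partition.

```python
import copy
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss given summed log-probs of chosen/rejected responses."""
    pi_logratio = pi_chosen_logp - pi_rejected_logp
    ref_logratio = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (pi_logratio - ref_logratio)).mean()

def sdpo(policy, preference_chunks, train_one_chunk):
    """Train `policy` on the preference data one chunk at a time.

    preference_chunks: the preference dataset split into ordered subsets.
    train_one_chunk(policy, ref, chunk): hypothetical helper that runs DPO
    on one chunk (using dpo_loss) and returns the updated policy.
    """
    for chunk in preference_chunks:
        # Freeze a copy of the current, already partially aligned policy
        # and use it as the reference model for this step (the sDPO idea).
        ref = copy.deepcopy(policy).eval()
        for p in ref.parameters():
            p.requires_grad_(False)
        policy = train_one_chunk(policy, ref, chunk)
    return policy
```

The deep copy is what makes the reference track the alignment already achieved; running the same training once over the concatenated data with a fixed SFT reference would recover ordinary DPO.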

Experimental Evaluation

The sDPO method was evaluated using models such as Mistral-7B-OpenOrca and OpenHermes-2.5-Mistral-7B, strong openly available 7B models. Notably, the resulting models achieved higher H4 scores than several models with substantially more parameters. The paper also reports that, among the reference models compared, using Intel-7B-DPO yielded the best results, underscoring that the initial alignment level of the reference model matters.

Further empirical assessment using multiple benchmark tasks from the HuggingFace Open LLM Leaderboard corroborated the enhanced performance of models trained with sDPO, with noticeable improvements in truthfulness and alignment, as reflected in the TruthfulQA scores.
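For concreteness, the H4 score mentioned above is the simple average of the four Open LLM Leaderboard task scores used in the paper's evaluation (ARC, HellaSwag, MMLU, TruthfulQA). The figures in the example below are illustrative only and are not numbers reported in the paper.

```python
def h4_score(arc: float, hellaswag: float, mmlu: float, truthfulqa: float) -> float:
    """H4 = mean of the four Open LLM Leaderboard task accuracies (in %)."""
    return (arc + hellaswag + mmlu + truthfulqa) / 4.0

# Illustrative numbers only, not results from the paper:
print(h4_score(66.0, 86.0, 61.0, 72.0))  # 71.25
```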

Implications and Future Directions

From a theoretical perspective, sDPO refines the view of curriculum learning in LLM training, in that the model is exposed to the training signal incrementally rather than all at once. Practically, the method presents a scalable and effective way to improve model alignment without resorting to larger model architectures or additional preference data.

The future implications of sDPO are wide-reaching. By refining the use of existing preference datasets and enhancing model alignment incrementally, this method sets a foundation for exploring new strategies in preference optimization. An open question remains regarding the optimal segmentation of preference datasets for maximizing performance, a query that may drive the next phase of research.

In summary, the paper analyzes how stepwise use of preference data can improve LLM alignment tuning, positioning sDPO as a compelling candidate for advancing alignment research. By leveraging more aligned reference models, sDPO offers an approach that could shape how alignment strategies are implemented in future natural language processing development.