Curry-DPO: Enhancing Alignment using Curriculum Learning & Ranked Preferences (2403.07230v2)
Abstract: Direct Preference Optimization (DPO) is an effective technique that leverages pairwise preference data (typically one chosen and one rejected response per user prompt) to align LLMs with human preferences. In practice, multiple responses of varying quality can exist for a given prompt. When quality ratings are available for these responses, we propose using them to construct multiple preference pairs per prompt. Our work focuses on systematically using these constructed preference pairs in DPO training via a curriculum learning methodology. In particular, we order the multiple preference pairs from easy to hard (emulating curriculum training) according to various criteria. We present detailed comparisons of our proposed approach against the standard single-pair DPO setting. Our method, which we call Curry-DPO, consistently shows increased performance gains on MT-Bench, Vicuna, WizardLM, and the UltraFeedback test set, highlighting its effectiveness. More specifically, Curry-DPO achieves a score of 7.43 on MT-Bench with the Zephyr-7B model, outperforming the majority of existing LLMs of similar parameter size. Curry-DPO also achieves the highest adjusted win rates on the Vicuna, WizardLM, and UltraFeedback test sets (90.7%, 87.1%, and 87.9%, respectively) in our experiments, with notable gains of up to 7.5% over the standard DPO technique. We release the preference pairs used in alignment at: https://huggingface.co/datasets/ServiceNow-AI/Curriculum_DPO_preferences
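To make the abstract's pair-construction and easy-to-hard ordering concrete, here is a minimal sketch, not the authors' released code. It assumes UltraFeedback-style numeric quality ratings, and it uses the rating gap between the chosen and rejected response as the difficulty criterion (a wider gap is treated as an easier pair); the paper compares several ordering criteria, and the function names, field names, and example data below are illustrative assumptions.

```python
# Sketch: build multiple preference pairs per prompt from quality-rated
# responses, then order them for curriculum-style DPO training.
from itertools import combinations

def build_preference_pairs(prompt, rated_responses):
    """rated_responses: list of (response_text, quality_rating) tuples."""
    pairs = []
    for (resp_a, score_a), (resp_b, score_b) in combinations(rated_responses, 2):
        if score_a == score_b:
            continue  # tied ratings carry no preference signal
        chosen, rejected = (resp_a, resp_b) if score_a > score_b else (resp_b, resp_a)
        pairs.append({"prompt": prompt, "chosen": chosen,
                      "rejected": rejected, "gap": abs(score_a - score_b)})
    return pairs

def curriculum_order(pairs):
    # Easy-to-hard under the assumed criterion: a large rating gap makes the
    # preference easier to learn, so wide-gap pairs come first.
    return sorted(pairs, key=lambda p: p["gap"], reverse=True)

# Example with made-up 1-10 ratings for three responses to one prompt:
responses = [("Detailed, correct answer.", 9),
             ("Partially correct answer.", 6),
             ("Off-topic answer.", 2)]
ordered = curriculum_order(build_preference_pairs("Explain DPO.", responses))
for p in ordered:
    print(p["gap"], "|", p["chosen"], ">", p["rejected"])
```

Under this sketch, the resulting list would be fed to a standard DPO trainer in order (easiest pairs in the earliest steps), rather than shuffled as in single-pair DPO.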
- OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- A general theoretical paradigm to understand learning from human preferences. arXiv preprint arXiv:2310.12036.
- Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
- Open LLM Leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard.
- Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 41–48, New York, NY, USA. Association for Computing Machinery.
- Ralph Allan Bradley and Milton E Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345.
- Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023).
- Curriculum design for code-switching: Experiments with language identification and language modeling with deep neural networks. In Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017), pages 65–74, Kolkata, India. NLP Association of India.
- Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.
- UltraFeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377.
- Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.
- Jeffrey L. Elman. 1993. Learning and development in neural networks: the importance of starting small. Cognition, 48:71–99.
- KTO: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306.
- Catastrophic jailbreak of open-source LLMs via exploiting generation. arXiv preprint arXiv:2310.06987.
- Camels in a changing climate: Enhancing LM adaptation with Tulu 2. arXiv preprint arXiv:2311.10702.
- Mistral 7B. arXiv preprint arXiv:2310.06825.
- Mixtral of experts. arXiv preprint arXiv:2401.04088.
- LLM-Blender: Ensembling large language models with pairwise comparison and generative fusion. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023).
- OpenAssistant conversations: Democratizing large language model alignment. Advances in Neural Information Processing Systems, 36.
- Can neural machine translation be improved with user feedback? arXiv preprint arXiv:1804.05958.
- Kai A. Krueger and Peter Dayan. 2009. Flexible shaping: How learning in small steps helps. Cognition, 110:380–394.
- RLAIF: Scaling reinforcement learning from human feedback with AI feedback. arXiv preprint arXiv:2309.00267.
- LiPO: Listwise preference optimization through learning-to-rank. arXiv preprint arXiv:2402.01878.
- Jinliang Lu and Jiajun Zhang. 2021. Exploiting curriculum learning in unsupervised neural machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 924–934, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- Gail Beaton Peterson. 2004. A day of great illumination: B. F. Skinner's discovery of shaping. Journal of the Experimental Analysis of Behavior, 82(3):317–328.
- Competence-based curriculum learning for neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1162–1172, Minneapolis, Minnesota. Association for Computational Linguistics.
- Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.
- Mrinmaya Sachan and Eric Xing. 2016. Easy questions first? A case study on curriculum learning for question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 453–463, Berlin, Germany. Association for Computational Linguistics.
- Mrinmaya Sachan and Eric Xing. 2018. Self-training for jointly learning to ask and answer questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 629–640, New Orleans, Louisiana. Association for Computational Linguistics.
- Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021.
- Simple and effective curriculum pointer-generator networks for reading comprehension over long narratives. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4922–4931, Florence, Italy. Association for Computational Linguistics.
- Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Zephyr: Direct distillation of LM alignment. arXiv preprint arXiv:2310.16944.
- Step-on-feet tuning: Scaling self-alignment of LLMs via bootstrapping. arXiv preprint arXiv:2402.07610.
- Curriculum learning for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6095–6104, Online. Association for Computational Linguistics.
- WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244.
- Some things are more cringe than others: Preference optimization with the pairwise cringe loss. arXiv preprint arXiv:2312.16682.
- RRHF: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:2304.05302.
- Curriculum learning for domain adaptation in neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1903–1915, Minneapolis, Minnesota. Association for Computational Linguistics.
- SLiC-HF: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36.
- Beyond one-preference-for-all: Multi-objective direct preference optimization. arXiv preprint arXiv:2310.03708.
Authors: Pulkit Pattnaik, Rishabh Maheshwary, Kelechi Ogueji, Vikas Yadav, Sathwik Tejaswi Madhusudhan