sDPO: Don't Use Your Data All at Once (2403.19270v2)
Published 28 Mar 2024 in cs.CL and cs.AI
Abstract: As the development of LLMs progresses, aligning them with human preferences has become increasingly important. We propose stepwise DPO (sDPO), an extension of the recently popularized direct preference optimization (DPO) for alignment tuning. This approach divides the available preference datasets and uses them in a stepwise manner, rather than employing them all at once. We demonstrate that this method facilitates the use of more precisely aligned reference models within the DPO training framework. Furthermore, sDPO trains the final model to be more performant, even outperforming other popular LLMs with more parameters.
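The core mechanism is simple to sketch: split the preference data into chunks, run DPO on one chunk at a time, and use the model produced by the previous step as the frozen reference model for the next. Below is a minimal PyTorch sketch of that loop, assuming a Hugging Face-style causal LM and batches already tokenized into chosen/rejected pairs; the helper names (`sequence_logprob`, `stepwise_dpo`) and hyperparameters are illustrative, not the authors' implementation.

```python
# Minimal sketch of the stepwise idea behind sDPO (not the authors' code).
# Assumes a Hugging Face-style causal LM (outputs .logits) and a preference
# dataset pre-split into chunks of tokenized (chosen, rejected) batches.
import copy
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids, attention_mask):
    # Sum of token log-probabilities per sequence. A real implementation
    # would mask out prompt tokens and only score the response.
    out = model(input_ids=input_ids, attention_mask=attention_mask)
    logps = torch.log_softmax(out.logits[:, :-1], dim=-1)
    token_logps = logps.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return (token_logps * attention_mask[:, 1:]).sum(-1)

def dpo_loss(policy, ref, batch, beta=0.1):
    # Standard DPO objective: prefer chosen over rejected responses,
    # regularized against a frozen reference model.
    pi_w = sequence_logprob(policy, batch["chosen_ids"], batch["chosen_mask"])
    pi_l = sequence_logprob(policy, batch["rejected_ids"], batch["rejected_mask"])
    with torch.no_grad():
        ref_w = sequence_logprob(ref, batch["chosen_ids"], batch["chosen_mask"])
        ref_l = sequence_logprob(ref, batch["rejected_ids"], batch["rejected_mask"])
    logits = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -F.logsigmoid(logits).mean()

def stepwise_dpo(policy, dataset_chunks, beta=0.1, lr=5e-7):
    # sDPO loop: one preference-data chunk per step; the reference model
    # for step t is the (frozen) policy produced by step t-1.
    for chunk in dataset_chunks:
        ref = copy.deepcopy(policy).eval()
        for p in ref.parameters():
            p.requires_grad_(False)
        opt = torch.optim.AdamW(policy.parameters(), lr=lr)
        for batch in chunk:
            loss = dpo_loss(policy, ref, batch, beta)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy
```

The key difference from vanilla DPO is the reassignment of `ref` at the start of each chunk: instead of keeping the initial SFT model as the reference throughout, each step's lower bound on alignment is the already-aligned model from the previous step.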