Disentangling Length from Quality in Direct Preference Optimization (2403.19159v2)
Abstract: Reinforcement Learning from Human Feedback (RLHF) has been a crucial component in the recent success of LLMs. However, RLHF is known to exploit biases in human preferences, such as verbosity. A well-formatted and eloquent answer is often rated more highly by users, even when it is less helpful and objective. A number of approaches have been developed to control these biases in the classical RLHF literature, but the problem remains relatively under-explored for Direct Alignment Algorithms such as Direct Preference Optimization (DPO). Unlike classical RLHF, DPO does not train a separate reward model or use reinforcement learning directly, so previous approaches developed to control verbosity cannot be applied directly in this setting. Our work makes several contributions. For the first time, we study the length problem in the DPO setting, showing significant length exploitation in DPO and linking it to out-of-distribution bootstrapping. We then develop a principled but simple regularization strategy that prevents length exploitation while still maintaining improvements in model quality. We demonstrate these effects across summarization and dialogue datasets, where we achieve up to a 20% improvement in win rates when controlling for length, despite the GPT-4 judge's well-known verbosity bias.
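The abstract describes the regularization at a high level without giving its form. As a rough illustration only, a common way to regularize length in DPO-style training is to subtract a term proportional to the difference in response lengths from the implicit reward margin before applying the Bradley-Terry log-sigmoid. The sketch below is a minimal PyTorch version of that idea; the function name, the hyperparameter `alpha`, and the exact shape of the penalty are assumptions for illustration, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def length_regularized_dpo_loss(
    policy_chosen_logps,    # log pi_theta(y_w | x), summed over response tokens
    policy_rejected_logps,  # log pi_theta(y_l | x)
    ref_chosen_logps,       # log pi_ref(y_w | x)
    ref_rejected_logps,     # log pi_ref(y_l | x)
    chosen_lengths,         # |y_w| in tokens
    rejected_lengths,       # |y_l| in tokens
    beta: float = 0.1,      # standard DPO temperature
    alpha: float = 0.01,    # hypothetical length-penalty weight (assumption)
):
    # Standard DPO implicit-reward margin between chosen and rejected responses.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    margin = beta * (chosen_logratio - rejected_logratio)

    # Length penalty: shrink the margin by the (scaled) length difference so the
    # optimizer cannot inflate the implicit reward simply by favoring longer
    # responses over shorter ones.
    length_penalty = alpha * (chosen_lengths.float() - rejected_lengths.float())

    # Negative log-sigmoid of the regularized margin (Bradley-Terry likelihood).
    return -F.logsigmoid(margin - length_penalty).mean()
```

Setting `alpha = 0` recovers the ordinary DPO loss, which makes it easy to ablate the penalty against the unregularized baseline.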
Authors: Ryan Park, Rafael Rafailov, Stefano Ermon, Chelsea Finn