Direct Preference Optimization With Unobserved Preference Heterogeneity (2405.15065v1)
Abstract: Reinforcement learning from human feedback (RLHF) has emerged as a pivotal step in aligning large language models (LLMs) with human objectives and values. It typically involves learning a reward model from human preference data and then using reinforcement learning to update the generative model accordingly. In contrast, Direct Preference Optimization (DPO) optimizes the generative model directly on preference data, skipping the reinforcement learning step. However, both RLHF and DPO assume uniform preferences, overlooking the reality of diverse human annotators. This paper presents a new method for aligning generative models with varied human preferences. We propose an Expectation-Maximization adaptation of DPO that produces a mixture of models based on the annotators' latent preference types. We then introduce a min-max regret ensemble learning method that yields a single generative policy minimizing worst-case regret among annotator subgroups with similar latent factors. Our algorithms leverage the simplicity of DPO while accommodating diverse preferences. Experimental results validate the effectiveness of our approach in producing equitable generative policies.
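As a rough illustration of the approach the abstract describes, the sketch below writes out the standard DPO objective and a generic EM-style mixture over latent annotator types. This is a minimal sketch under assumed notation: $K$ latent types, per-annotator comparison sets $\mathcal{D}_i$, per-type policies $\pi_{\theta_k}$, and mixture weights $w_k$ are illustrative choices, not the paper's own formulation.

```latex
% Standard DPO objective, with implicit reward
% r_\theta(x,y) = \beta \log \frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}:
\begin{align*}
\mathcal{L}_{\mathrm{DPO}}(\theta;\,\mathcal{D})
  &= -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
     \left[ \log \sigma\!\left(
       \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
       - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
     \right) \right]
\end{align*}

% Illustrative EM over K latent annotator types (assumed notation).
% E-step: responsibility of type k for annotator i, computed from i's
% comparisons D_i under the Bradley-Terry likelihood induced by the
% implicit reward of policy pi_{theta_k}.
\begin{align*}
\gamma_{ik} &\propto w_k \prod_{(x, y_w, y_l) \in \mathcal{D}_i}
  \sigma\!\big( \hat{r}_k(x, y_w) - \hat{r}_k(x, y_l) \big),
&
\hat{r}_k(x, y) &= \beta \log \frac{\pi_{\theta_k}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
\end{align*}

% M-step: responsibility-weighted DPO update for each type's policy,
% followed by a refresh of the mixture weights.
\begin{align*}
\theta_k &\leftarrow \arg\min_{\theta}\;
  \sum_{i=1}^{n} \gamma_{ik}\,\mathcal{L}_{\mathrm{DPO}}(\theta;\,\mathcal{D}_i),
&
w_k &\leftarrow \frac{1}{n} \sum_{i=1}^{n} \gamma_{ik}
\end{align*}
```

Because DPO's implicit reward is just a log-likelihood ratio against the reference policy, the E-step can score each annotator's comparisons without fitting a separate reward model; the min-max regret ensemble described in the abstract would then combine the $K$ resulting policies into a single one, with details given in the paper itself.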
Authors: Keertana Chidambaram, Karthik Vinay Seetharaman, Vasilis Syrgkanis