Preference as Reward, Maximum Preference Optimization with Importance Sampling (2312.16430v5)

Published 27 Dec 2023 in cs.LG and cs.AI

Abstract: Preference learning is a key technology for aligning LLMs with human values. Reinforcement Learning from Human Feedback (RLHF) is a model-based algorithm for preference learning: it first fits a reward model to preference scores and then optimizes the generating policy with the on-policy PPO algorithm to maximize the reward. The RLHF pipeline is complex, time-consuming, and unstable. The Direct Preference Optimization (DPO) algorithm optimizes the generating policy directly with an off-policy objective and eliminates the need for a reward model, making it more data-efficient and stable. However, DPO overfits to the preference data and ignores the KL-regularization term when the preference is deterministic. Identity-mapping Preference Optimization (IPO) uses a root-finding MSE loss to incorporate KL-regularization. However, both DPO and IPO fail to properly address the KL-regularization term because the support of the preference distribution is not equal to that of the reference distribution. In this paper, we propose a simple and intuitive off-policy preference optimization algorithm derived from an importance-sampling view, which we call Maximum Preference Optimization (MPO). MPO incorporates an off-policy KL-regularization term, making the regularization truly effective. MPO achieves the best of both worlds by combining the objectives of RLHF and IPO while remaining an off-policy algorithm. Furthermore, MPO eliminates the need for both a reward model and a reference policy, simplifying the learning process and reducing memory usage.
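For context, the baseline that the abstract contrasts MPO against is the standard DPO loss (reference 14 below): the policy is trained so that beta-scaled log-probability ratios against a frozen reference policy act as implicit rewards under a Bradley-Terry model of each preference pair. The sketch below is a minimal, hedged PyTorch illustration of that baseline only, not of MPO's own objective (which the abstract does not spell out); all function and tensor names are hypothetical.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # -log sigmoid(beta * [(log pi - log pi_ref)(y_w) - (log pi - log pi_ref)(y_l)])
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage: sequence-level log-probabilities for a batch of 4 preference pairs.
torch.manual_seed(0)
print(dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4)).item())

Per the abstract, MPO removes both the reward model and the reference policy (the ref_*_logps inputs above) by re-deriving the KL-regularized objective through importance sampling over the preference data; the exact loss is given in the paper body rather than here.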

References (16)
  1. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  2. PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
  3. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.
  4. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  5. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
  6. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
  7. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
  8. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
  9. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
  10. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
  11. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  12. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  13. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
  14. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
  15. A general theoretical paradigm to understand learning from human preferences. arXiv preprint arXiv:2310.12036, 2023.
  16. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937. PMLR, 2016.
Authors (3)
  1. Zaifan Jiang (6 papers)
  2. Xing Huang (121 papers)
  3. Chao Wei (16 papers)
Citations (1)