Preference as Reward, Maximum Preference Optimization with Importance Sampling (2312.16430v5)
Abstract: Preference learning is a key technology for aligning LLMs with human values. Reinforcement Learning from Human Feedback (RLHF) is a model-based algorithm for preference learning: it first fits a reward model to preference scores and then optimizes the generating policy with the on-policy PPO algorithm to maximize that reward. The RLHF pipeline is complex, time-consuming, and unstable. Direct Preference Optimization (DPO) instead optimizes the generating policy directly with an off-policy algorithm, eliminating the need for a reward model; it is more data-efficient and stable. However, DPO tends to overfit the preference data and to ignore the KL-regularization term when the preference is deterministic. Identity mapping Preference Optimization (IPO) uses a root-finding MSE loss to incorporate KL-regularization, but both DPO and IPO fail to handle the KL-regularization term properly because the support of the preference distribution does not match that of the reference distribution. In this paper, we propose a simple and intuitive off-policy preference optimization algorithm derived from an importance-sampling view, which we call Maximum Preference Optimization (MPO). MPO incorporates an off-policy KL-regularization term, making the regularization truly effective. MPO achieves the best of both worlds by combining the objectives of RLHF and IPO while remaining an off-policy algorithm. Furthermore, MPO eliminates the need for both a reward model and a reference policy, simplifying the learning process and reducing memory usage.
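For concreteness, the losses the abstract contrasts MPO against can be written down directly. The sketch below is a minimal PyTorch rendering of the standard DPO and IPO objectives from their original papers; the function names, tensor-name conventions, and the `beta`/`tau` defaults are illustrative assumptions rather than details from this paper, and MPO's own importance-sampling loss is not reproduced here because the abstract does not spell it out.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss (Rafailov et al., 2023), shown for contrast with MPO.

    Inputs are per-example sequence log-probabilities log pi(y|x), i.e. token
    log-probs summed over the response; beta scales the implicit KL penalty.
    """
    # Margin: beta * [log(pi_theta/pi_ref)(y_w) - log(pi_theta/pi_ref)(y_l)]
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    # Negative log-sigmoid of the margin. With deterministic preferences the
    # optimum drives the margin to infinity, the overfitting the abstract describes.
    return -F.logsigmoid(logits).mean()


def ipo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, tau=0.1):
    """IPO loss (Azar et al., 2023): a squared-error ("root-finding MSE")
    objective that regresses the log-ratio margin onto the finite target 1/(2*tau)."""
    margin = ((policy_chosen_logps - ref_chosen_logps)
              - (policy_rejected_logps - ref_rejected_logps))
    return ((margin - 1.0 / (2.0 * tau)) ** 2).mean()
```

Both of these losses still require a frozen reference policy; the abstract's claim is that MPO dispenses with both the reference policy and the reward model while keeping the KL-regularization effective off-policy.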
References:
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
- Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
- Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
- Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
- Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
- Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
- Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30, 2017.
- Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
- Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
- A general theoretical paradigm to understand learning from human preferences. arXiv preprint arXiv:2310.12036, 2023.
- Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937. PMLR, 2016.
Authors: Zaifan Jiang, Xing Huang, Chao Wei