Policy-labeled Preference Learning: Is Preference Enough for RLHF? (2505.06273v2)

Published 6 May 2025 in cs.LG and cs.AI

Abstract: To design rewards that align with human goals, Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent technique for learning reward functions from human preferences and optimizing policies via reinforcement learning algorithms. However, existing RLHF methods often misinterpret trajectories as being generated by an optimal policy, causing inaccurate likelihood estimation and suboptimal learning. Inspired by the Direct Preference Optimization framework, which directly learns the optimal policy without an explicit reward, we propose policy-labeled preference learning (PPL) to resolve likelihood mismatch issues by modeling human preferences with regret, which reflects behavior policy information. We also provide a contrastive KL regularization, derived from regret-based principles, to enhance RLHF in sequential decision making. Experiments in high-dimensional continuous control tasks demonstrate PPL's significant improvements in offline RLHF performance and its effectiveness in online settings.
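
For intuition, the sketch below shows one common way to instantiate a regret-based preference model: a Bradley-Terry likelihood over segment-level cumulative advantage (i.e., negative regret) rather than cumulative reward. This is a minimal illustration under that assumption; the function names, the PyTorch framing, and the use of summed advantages are illustrative choices and not the paper's actual PPL implementation or its contrastive KL regularizer.

```python
import torch
import torch.nn.functional as F

def segment_regret(advantages: torch.Tensor) -> torch.Tensor:
    """Sum per-step advantage estimates A(s_t, a_t) over a segment.

    Under a regret-based preference model, a segment's score is its
    cumulative advantage (negative regret) rather than its summed reward.
    advantages: tensor of shape (batch, segment_len)
    """
    return advantages.sum(dim=-1)

def regret_preference_loss(adv_preferred: torch.Tensor,
                           adv_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry negative log-likelihood with regret-based scores.

    P(preferred > rejected) = sigmoid(score_preferred - score_rejected),
    where each score is the segment's cumulative advantage.
    """
    logits = segment_regret(adv_preferred) - segment_regret(adv_rejected)
    return -F.logsigmoid(logits).mean()

# Example usage with random advantage estimates for a batch of 4 segment pairs
# of length 50 (purely synthetic data for illustration):
adv_pref = torch.randn(4, 50)
adv_rej = torch.randn(4, 50)
loss = regret_preference_loss(adv_pref, adv_rej)
```

The key contrast with standard reward-based RLHF is the score used inside the Bradley-Terry model: replacing summed rewards with an advantage-based quantity ties the preference likelihood to the policy that generated the behavior, which is the mismatch the abstract describes.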

Authors (6)
  1. Taehyun Cho (6 papers)
  2. Seokhun Ju (2 papers)
  3. Seungyub Han (7 papers)
  4. Dohyeong Kim (62 papers)
  5. Kyungjae Lee (37 papers)
  6. Jungwoo Lee (39 papers)