Probabilistic Uncertain Reward Model (2503.22480v6)
Abstract: Reinforcement learning from human feedback (RLHF) is a critical technique for training LLMs. However, conventional reward models based on the Bradley-Terry model (BTRM) often suffer from overconfidence when faced with inconsistent labels or out-of-distribution samples, leading to reward hacking, where the policy model blindly optimizes for proxy rewards while its true performance degrades. This paper proposes the Probabilistic Uncertain Reward Model (PURM), which generalizes the Bradley-Terry model to learn the reward distributions that emerge from the preference data. We theoretically derive the loss function of PURM and introduce a novel method that uses the overlap between reward distributions to quantify uncertainty. Empirical results show that PURM outperforms existing methods with more accurate rewards and sounder uncertainty estimates, sustains effective learning for more optimization steps, and obtains a higher maximum win rate in RLHF. The data and code of this paper are released at https://anonymous.4open.science/r/Probabilistic-Uncertain-Reward-Model/
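The abstract mentions quantifying uncertainty via the overlap between reward distributions but does not specify the distribution family or the overlap measure. The following is a minimal sketch of that idea, assuming each response's reward is modeled as a univariate Gaussian and using the Bhattacharyya coefficient as an illustrative overlap metric; the variable names and the metric choice are assumptions, not the paper's actual formulation.

```python
import math

def bhattacharyya_overlap(mu1: float, sigma1: float, mu2: float, sigma2: float) -> float:
    """Overlap between two 1-D Gaussian reward distributions, measured by the
    Bhattacharyya coefficient (1.0 = identical, approaching 0.0 = disjoint).

    Illustrative choice only; PURM's exact overlap measure may differ.
    """
    var1, var2 = sigma1 ** 2, sigma2 ** 2
    # Bhattacharyya distance between two univariate Gaussians.
    bd = 0.25 * (mu1 - mu2) ** 2 / (var1 + var2) + 0.5 * math.log(
        (var1 + var2) / (2.0 * math.sqrt(var1 * var2))
    )
    return math.exp(-bd)

# Hypothetical usage: suppose the reward model emits a (mean, std) pair per
# response. Large overlap between the chosen and rejected reward distributions
# signals high uncertainty about the comparison, which could be used to
# down-weight the reward during policy optimization.
chosen_mu, chosen_sigma = 1.2, 0.4
rejected_mu, rejected_sigma = 0.9, 0.6
uncertainty = bhattacharyya_overlap(chosen_mu, chosen_sigma, rejected_mu, rejected_sigma)
print(f"overlap-based uncertainty: {uncertainty:.3f}")
```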