
A Unified Pairwise Framework for RLHF: Bridging Generative Reward Modeling and Policy Optimization (2504.04950v1)

Published 7 Apr 2025 in cs.LG

Abstract: Reinforcement Learning from Human Feedback (RLHF) has emerged as an important paradigm for aligning LLMs with human preferences during post-training. This framework typically involves two stages: first, training a reward model on human preference data, followed by optimizing the LLM using reinforcement learning algorithms. However, current RLHF approaches may be constrained by two limitations. First, existing RLHF frameworks often rely on Bradley-Terry models to assign scalar rewards based on pairwise comparisons of individual responses. However, this approach imposes significant challenges on the reward model (RM), as the inherent variability in prompt-response pairs across different contexts demands robust calibration capabilities from the RM. Second, reward models are typically initialized from generative foundation models, such as pre-trained or supervised fine-tuned models, despite the fact that reward models perform discriminative tasks, creating a mismatch. This paper introduces Pairwise-RL, an RLHF framework that addresses these challenges through a combination of generative reward modeling and a pairwise proximal policy optimization (PPO) algorithm. Pairwise-RL unifies reward model training and its application during reinforcement learning within a consistent pairwise paradigm, leveraging generative modeling techniques to enhance reward model performance and score calibration. Experimental evaluations demonstrate that Pairwise-RL outperforms traditional RLHF frameworks across both internal evaluation datasets and standard public benchmarks, underscoring its effectiveness in improving alignment and model behavior.
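
For context, the Bradley-Terry formulation that the abstract contrasts against is the standard pairwise preference objective used in conventional RLHF reward modeling. A minimal sketch in our own notation (not taken from the paper): given a prompt x with a preferred response y_w and a dispreferred response y_l, a scalar reward model r_theta is trained so that

\[
P(y_w \succ y_l \mid x) = \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big),
\qquad
\mathcal{L}_{\mathrm{RM}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\Big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big],
\]

where \(\sigma\) is the logistic sigmoid. Because only reward differences are supervised, the absolute scale of \(r_\theta\) is underdetermined across prompts, which is the calibration difficulty the abstract points to; Pairwise-RL's generative, pairwise treatment is presented as a way to sidestep this.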

Authors (6)
  1. Wenyuan Xu (35 papers)
  2. Xiaochen Zuo (6 papers)
  3. Chao Xin (6 papers)
  4. Yu Yue (14 papers)
  5. Lin Yan (168 papers)
  6. Yonghui Wu (115 papers)
