It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF (2406.07971v2)

Published 12 Jun 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Reinforcement Learning from Human Feedback (RLHF) involves training policy models (PMs) and reward models (RMs) to align LLMs with human preferences. Instead of focusing solely on PMs and RMs independently, we propose to examine their interactions during fine-tuning, introducing the concept of seamlessness. Our study starts with observing the saturation phenomenon, where continual improvements in RM and PM do not translate into RLHF progress. Our analysis shows that RMs fail to assign proper scores to PM responses, resulting in a 35% mismatch rate with human preferences, highlighting a significant discrepancy between PM and RM. To measure seamlessness between PM and RM without human effort, we propose an automatic metric, SEAM. SEAM quantifies the discrepancies between PM and RM judgments induced by data samples. We validate the effectiveness of SEAM in data selection and model augmentation. Our experiments demonstrate that (1) using SEAM-filtered data for RL training improves RLHF performance by 4.5%, and (2) SEAM-guided model augmentation results in a 4% performance improvement over standard augmentation methods.

Summary

  • The paper introduces seamlessness, a concept for evaluating how well the reward and policy models in RLHF work together, along with an automatic metric, $\sigma$ (SEAM), to quantify it.
  • It demonstrates that filtering RL training data with $\sigma$ improves RLHF performance by 4.5% and that $\sigma$-guided model augmentation yields a 4% gain over standard augmentation methods.
  • The study reveals a 35-40% mismatch between RM evaluations and human judgments of PM responses, a discrepancy underlying the observed saturation phenomenon.

A Formal Analysis of Seamlessness between Reward and Policy Models in RLHF

The paper "It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF" explores the interactions between Reward Models (RMs) and Policy Models (PMs) in the context of Reinforcement Learning from Human Feedback (RLHF). Unlike traditional approaches that focus on the independent optimization of RMs and PMs, this paper introduces the concept of seamlessness as a metric to evaluate and improve the synergy between these models during the RLHF process.

Abstract and Contributions

The paper presents a novel investigation into the phenomenon of saturation in RLHF, where incremental improvements in RM and PM do not correlate with enhanced RLHF performance. Notably, it reports a 35% mismatch rate between RM scores and human preferences, demonstrating a significant divergence between RM evaluations and actual human judgments.

Key contributions include the proposal of an automatic metric to quantify seamlessness, denoted $\sigma$, which reflects the inconsistencies between RM and PM judgments. Experimental results show that using $\sigma$-filtered data enhances RLHF performance by 4.5% and that model augmentation guided by $\sigma$ improves performance by 4%.

Introduction to RLHF and Seamlessness

RLHF is a prominent technique for aligning LLMs with human preferences, leveraging RMs to generate scalar rewards and PMs to optimize model outputs based on those rewards. The primary challenge addressed in this paper is the suboptimal interaction between RM and PM, which manifests as the saturation phenomenon: beyond a certain threshold, improvements in RM and PM no longer yield better RLHF performance.
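
To make the RM-PM interaction concrete, the sketch below shows a generic RLHF iteration in which the RM's scalar rewards on PM rollouts drive the policy update. It is a minimal, paper-agnostic sketch: `policy_generate`, `reward_score`, and `ppo_update` are hypothetical stand-ins, not the authors' implementation.

```python
# Minimal sketch of one generic RLHF iteration (not the paper's code):
# the reward model scores policy-model rollouts, and those scalar rewards
# drive the policy update (e.g. a PPO step).
from typing import Callable, List

def rlhf_step(prompts: List[str],
              policy_generate: Callable[[str], str],
              reward_score: Callable[[str, str], float],
              ppo_update: Callable[[List[str], List[str], List[float]], None]) -> float:
    responses = [policy_generate(p) for p in prompts]                    # PM rollouts
    rewards = [reward_score(p, r) for p, r in zip(prompts, responses)]   # RM scores
    ppo_update(prompts, responses, rewards)                              # policy update
    return sum(rewards) / len(rewards)                                   # mean reward, for logging
```

Seamlessness, as studied in the paper, concerns how well the reward-scoring step serves the particular policy distribution it is paired with, rather than how good either model is in isolation.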

Core Findings and Methodologies

The Saturation Phenomenon

Through extensive experiments using StackLLaMA as the base framework and the LLaMA2-7B model for both RM and PM, the paper identifies a saturation point beyond which further improvements in RM and PM quality fail to enhance RLHF outcomes. Metrics for evaluating the quality of PMs and RMs, denoted $\mathcal{Q}_{PM}$ and $\mathcal{Q}_{RM}$, are employed to substantiate these findings.

Discrepancy Analysis

A critical observation is that the RM exhibits a 40% mismatch rate with human judgment when evaluating PM-generated responses, suggesting that the RM's inability to appropriately score PM outputs contributes to saturation. This discrepancy persists even with model scaling, indicating a fundamental misalignment that scaling alone cannot resolve.
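
As a concrete reading of that mismatch rate, the sketch below counts how often the RM's pairwise preference between a PM response and a reference response contradicts the human label; the record layout and field names are illustrative assumptions, not the paper's exact annotation protocol.

```python
# Sketch: estimate the RM-human mismatch rate on PM-generated responses.
# Each record carries the RM score for the PM response, the RM score for a
# reference response, and a human label for which of the two was preferred.
def mismatch_rate(records):
    disagreements = 0
    for rec in records:
        rm_prefers_pm = rec["rm_score_pm"] > rec["rm_score_ref"]
        human_prefers_pm = rec["human_label"] == "pm"   # "pm" or "ref"
        if rm_prefers_pm != human_prefers_pm:
            disagreements += 1
    return disagreements / len(records)

# Toy example: the RM agrees with the human on one pair and disagrees on the other.
toy = [
    {"rm_score_pm": 1.3, "rm_score_ref": 0.9, "human_label": "ref"},  # RM disagrees
    {"rm_score_pm": 0.2, "rm_score_ref": 0.8, "human_label": "ref"},  # RM agrees
]
print(mismatch_rate(toy))  # 0.5
```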

Introducing Seamlessness and the Metric $\sigma$

Seamlessness is defined to quantify the alignment between PM and RM. The seamlessness metric, $\sigma$, focuses on the probabilistic interpretation of RM judgment discrepancies when conditioned on PM responses. Three computational variants of $\sigma$ are introduced: $\sigma_{\text{Adv}}$, $\sigma_{\text{Contrast}}$, and $\sigma_{\text{GPT}}$, each employing a different methodology to assess seamlessness without human intervention.
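
The exact formulations of the three variants are given in the paper. As a rough illustration of the underlying idea, the sketch below scores a single data sample by how differently the RM treats the PM's own response versus the sample's human-preferred response. This is only one plausible realization in the spirit of a contrastive variant, with hypothetical `policy_generate` and `reward_score` helpers; it is not the paper's definition.

```python
# Rough per-sample discrepancy score in the spirit of a contrastive variant:
# compare the RM's score for the PM's own response against its score for the
# sample's human-preferred ("chosen") response.  Not the paper's definition.
from typing import Callable, Dict

def sample_discrepancy(sample: Dict[str, str],
                       policy_generate: Callable[[str], str],
                       reward_score: Callable[[str, str], float]) -> float:
    prompt = sample["prompt"]
    pm_response = policy_generate(prompt)   # what the PM would actually produce
    gap = reward_score(prompt, sample["chosen"]) - reward_score(prompt, pm_response)
    # A large gap in either direction flags a sample on which the RM's
    # judgment and the PM's behavior pull apart.
    return abs(gap)
```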

Practical Implementations and Improvements

Data Selection

The paper shows that filtering 20% of the RL training data based on $\sigma$ scores significantly enhances RLHF performance. This approach effectively mitigates the saturation phenomenon, leading to better alignment and performance metrics compared to using the full dataset.
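
A minimal sketch of that selection step, assuming per-sample $\sigma$ scores are already computed: rank the RL training samples by discrepancy and drop the most problematic 20%. Which tail to drop depends on the paper's sign convention, which is not reproduced here.

```python
# Sketch: keep the 80% of RL training samples with the smallest discrepancy
# scores, dropping the 20% on which PM and RM disagree the most.
def filter_by_sigma(samples, sigma_scores, drop_fraction=0.2):
    ranked = sorted(zip(samples, sigma_scores), key=lambda pair: pair[1])
    keep = int(len(ranked) * (1.0 - drop_fraction))
    return [sample for sample, _ in ranked[:keep]]

prompts = ["q1", "q2", "q3", "q4", "q5"]
sigmas = [0.1, 0.9, 0.3, 0.7, 0.2]
print(filter_by_sigma(prompts, sigmas))  # ['q1', 'q5', 'q3', 'q4']  (drops 'q2')
```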

Model Augmentation

Moreover, using $\sigma$ to guide the augmentation of RMs and PMs, by targeting data samples that highlight weaknesses in the models, further optimizes RLHF outcomes. This targeted augmentation method demonstrates substantial improvements over standard augmentation techniques, reinforcing the utility of $\sigma$ as a diagnostic tool in RLHF.
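
One way to picture $\sigma$-guided augmentation (an illustrative assumption, not the paper's recipe) is to route the highest-discrepancy prompts into new preference pairs for further RM training, pairing each PM response with a reference response and letting a judge decide which is preferred.

```python
# Hypothetical illustration of sigma-guided RM augmentation: turn the
# highest-discrepancy prompts into fresh preference pairs for RM training.
def build_rm_augmentation(high_sigma_samples, policy_generate, judge_prefers_ref):
    new_pairs = []
    for sample in high_sigma_samples:
        prompt = sample["prompt"]
        pm_response = policy_generate(prompt)
        ref_response = sample["chosen"]
        if judge_prefers_ref(prompt, ref_response, pm_response):
            chosen, rejected = ref_response, pm_response
        else:
            chosen, rejected = pm_response, ref_response
        new_pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return new_pairs
```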

Theoretical and Practical Implications

The concept of seamlessness introduces a data-centric perspective to improving RLHF, emphasizing the importance of model interactivity over isolated model performance. Practically, this contributes to better-designed training regimes, reduced resource consumption, and more aligned AI outputs.

Future Developments

Future research might explore seamlessness in online RLHF settings to further validate its robustness and utility in dynamic, real-world environments. Additionally, refining the metric $\sigma$ to incorporate absolute measures of data quality could enhance its diagnostic capability and generalizability across diverse applications.

Conclusion

The paper significantly advances understanding and methodologies in RLHF by spotlighting the importance of seamlessness between RMs and PMs. The introduction of $\sigma$ as a practical metric offers a robust tool for enhancing RLHF performance, paving the way for more reliable and human-aligned LLMs. With these insights, the field moves closer to fully harnessing the potential of reinforcement learning to achieve nuanced, preference-aligned AI systems.