- The paper introduces a robust reward model (RRM) training method that mitigates reward hacking in reinforcement learning from human feedback (RLHF).
- It employs a causal data augmentation strategy using faithful DAGs to isolate genuine human preference signals from extraneous artifacts.
- Empirical evaluations show improvements in RewardBench accuracy (80.61% to 84.15%) and enhanced policy alignment in MT-Bench and AlpacaEval-2.
Robust Reward Model Training for Mitigating Reward Hacking
The research paper titled "RRM: Robust Reward Model Training Mitigates Reward Hacking" by Liu et al. addresses a crucial challenge in reinforcement learning from human feedback (RLHF) for large language models (LLMs). The authors identify a fundamental weakness in existing reward model (RM) training methodologies: models often fail to distinguish contextual signals of genuine human preference from context-free artifacts such as response verbosity or formatting.
Problem Identification
RLHF has been instrumental in shaping LLMs to deliver responses that are more in tune with human preferences. However, an ongoing challenge is reward hacking, in which the policy exploits weaknesses in the reward model to gain higher scores without genuinely aligning with human objectives. A common manifestation is a bias toward excessively verbose responses, since human raters may subconsciously favor longer answers. The authors attribute this issue to existing training methodologies' inability to separate prompt-related quality signals from unrelated artifacts such as response length or specific stylistic patterns.
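To make the failure mode concrete, the sketch below shows the standard pairwise (Bradley-Terry) objective most reward models are trained with; nothing in the loss itself penalizes a scorer that simply rewards length, which is the opening that reward hacking exploits. The tensors and token counts are illustrative assumptions, not data or code from the paper.

```python
# Minimal sketch of a standard pairwise (Bradley-Terry) reward-model loss.
# The scores fed in could come from any scalar-output scorer; the RRM
# paper's actual model and data pipeline are not reproduced here.
import torch
import torch.nn.functional as F

def pairwise_rm_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry negative log-likelihood: -log sigmoid(r_chosen - r_rejected).

    Nothing in this objective constrains *why* r_chosen exceeds r_rejected,
    so artifacts correlated with the label (e.g. response length) can be
    rewarded just as readily as genuine quality.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy illustration: if chosen responses tend to be longer, a scorer that
# simply rewards length achieves a low loss without modeling quality.
lengths_chosen = torch.tensor([312., 280., 405.])   # hypothetical token counts
lengths_rejected = torch.tensor([120., 150., 98.])
print(pairwise_rm_loss(0.01 * lengths_chosen, 0.01 * lengths_rejected))
```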
Proposed Methodology
To address these shortcomings, the authors propose a robust reward model, referred to as RRM. They introduce a causal framework that learns preferences independently of non-contextual artifacts. Central to this framework is a data augmentation technique designed to filter out these artifacts, essentially teaching the reward model to isolate and focus on genuine quality signals in preference learning.
The paper describes a two-step augmentation strategy leveraging faithful directed acyclic graphs (DAGs) to break the spurious dependencies between non-contextual artifacts and human preference signals. By systematically augmenting the training dataset with permutations of response pairs that offset artifact biases, the authors make the reward model markedly more resistant to reward hacking, specifically its tendency to prefer longer or superficially embellished responses.
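The sketch below illustrates one way such an augmentation could look in practice: borrowing responses from other prompts so that surface artifacts like length no longer predict the preference label. The data-class fields, pairing scheme, and sampling logic are assumptions made for illustration, not the paper's exact recipe.

```python
# Hedged sketch of a cross-prompt data-augmentation step in the spirit of
# RRM's causal augmentation: responses to *other* prompts carry the same
# artifacts (length, style) but are contextually irrelevant, so pairing them
# against an on-prompt response yields comparisons in which artifacts alone
# cannot predict the label.
import random
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # preferred response for this prompt
    rejected: str  # dispreferred response for this prompt

def augment_with_cross_prompt_negatives(pairs, rng=random):
    """Return the original pairs plus augmented pairs whose 'rejected' side
    is borrowed from a different prompt; the on-prompt response is labeled
    preferred regardless of surface features such as length."""
    augmented = list(pairs)
    for i, pair in enumerate(pairs):
        # Sample a donor example that answers a different prompt.
        j = rng.randrange(len(pairs) - 1)
        if j >= i:
            j += 1
        donor = pairs[j]
        # Both the chosen and the rejected on-prompt responses beat an
        # off-prompt response, breaking the artifact-label correlation.
        augmented.append(PreferencePair(pair.prompt, pair.chosen, donor.chosen))
        augmented.append(PreferencePair(pair.prompt, pair.rejected, donor.rejected))
    return augmented

data = [
    PreferencePair("Explain TCP handshakes.", "A detailed three-step answer...", "A terse reply."),
    PreferencePair("Summarize the plot.", "A concise summary.", "A rambling digression..."),
    PreferencePair("Write a haiku.", "Five-seven-five lines.", "A long prose paragraph..."),
]
print(len(augment_with_cross_prompt_negatives(data)))  # 3 original + 6 augmented pairs
```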
Empirical Evaluation
A series of experiments compares the proposed RRM against traditional reward models, focusing on its effectiveness in mitigating artifacts such as verbosity bias. Notably, RRM improved reward model performance on the RewardBench benchmark, increasing accuracy from 80.61% to 84.15%. The impact of RRM on policy alignment was also examined using DPO-aligned policies, with substantial gains: MT-Bench scores rose from 7.27 to 8.31, and length-controlled win rates on AlpacaEval-2 improved from 33.46% to 52.49%.
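As a simple illustration of how one might probe for the verbosity bias these experiments target (this is a generic diagnostic, not the paper's evaluation protocol), the snippet below correlates a reward model's scores with response lengths; the inputs are hypothetical.

```python
# Generic diagnostic (not from the paper): measure how strongly a reward
# model's scores track response length. A high correlation suggests
# verbosity bias; a robust RM should score longer responses higher only
# when they are genuinely better. `rewards` and `lengths` are assumed to
# come from scoring a held-out preference set with the model under test.
import statistics

def length_reward_correlation(rewards, lengths):
    """Pearson correlation between scalar rewards and response lengths."""
    return statistics.correlation(rewards, lengths)

# Toy numbers only, for illustration.
print(length_reward_correlation([0.2, 0.9, 0.4, 0.7], [120, 410, 150, 380]))
```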
Implications and Future Directions
The implications of this work are substantial, as it introduces a more robust methodology for developing RMs capable of disentangling valuable preference signals from irrelevant data attributes. This advancement not only bolsters the efficacy of RLHF pipelines but also enhances the reliability of the resulting LLMs in real-world applications by ensuring alignment with true human preferences rather than with exploitable spurious signals.
Looking forward, the paper suggests intriguing possibilities, such as extending the causal framework to other kinds of contextual and artifact differences that may arise in LLM outputs. It also opens pathways for integrating the methodology into AI systems beyond traditional RLHF pipelines, broadening its applicability across alignment tasks.
In conclusion, the paper provides a compelling contribution to the ongoing discourse in AI alignment and reward modeling, introducing a refined approach that circumvents common pitfalls in RLHF by leveraging a robust, causal data augmentation strategy. This work lays the groundwork for future innovations in more nuanced and artifact-immune reward model training methodologies.