Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs (2406.10216v2)

Published 14 Jun 2024 in cs.CL and cs.AI

Abstract: Reward models trained on human preference data have been proven to effectively align LLMs with human intent within the framework of reinforcement learning from human feedback (RLHF). However, current reward models have limited generalization capabilities to unseen prompts and responses, which can lead to an unexpected phenomenon known as reward over-optimization, resulting in a decline in actual performance due to excessive optimization of rewards. While previous research has advocated for constraining policy optimization, our study introduces a novel approach to enhance the reward model's generalization ability against distribution shifts by regularizing the hidden states. Specifically, we retain the base model's LLM head and incorporate a suite of text-generation losses to preserve the hidden states' text-generation capabilities, while concurrently learning a reward head behind the same hidden states. Our experimental results demonstrate that the introduced regularization technique markedly improves the accuracy of learned reward models across a variety of out-of-distribution (OOD) tasks and effectively alleviates the over-optimization issue in RLHF, offering a more reliable and robust preference learning paradigm.

An Overview of Regularization Techniques in Hidden States for Generalizable Reward Models in LLMs

The paper "Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs" addresses a significant challenge in reinforcement learning from human feedback (RLHF) — that of reward over-optimization. This issue arises when the reward models, tuned to align LLMs with human intent, fail to generalize to new and unseen prompts, resulting in models that optimize the learned reward function but not the genuine human preferences.

The authors propose a novel approach involving regularization of the hidden states to enhance the generalization capabilities of reward models amidst distributional shifts in data. The proposed method implements a combined loss strategy that maintains the text-generation capabilities of the LLM while aligning it with the reward model's learning objectives.
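
In symbols, a combined objective of this form can be sketched as follows (the notation below is illustrative rather than the paper's exact formulation):

$$\mathcal{L}_{\mathrm{GRM}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right] + \alpha\,\mathcal{L}_{\mathrm{reg}},$$

where $r_\theta$ is the reward head reading the shared hidden states, $y_w$ and $y_l$ are the preferred and rejected responses, $\sigma$ is the sigmoid, $\alpha$ weights the regularizer, and $\mathcal{L}_{\mathrm{reg}}$ is a text-generation loss (e.g. SFT- or DPO-style) computed through the retained LM head.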

Key Technical Contributions

  1. Regularization Methodology: The authors introduce the Generalizable Reward Model (GRM), which retains the base model's LLM head and adds a suite of text-generation losses to preserve the hidden states' generation capabilities, while a reward head is learned on top of the same hidden states, enabling simultaneous text generation and preference learning.
  2. Formulation: The regularization draws on DPO and SFT principles, combining the standard log-sigmoid preference loss with text-generation objectives on the shared hidden states so that preference learning and generalization are optimized jointly (see the sketch after this list).
  3. Experimental Evidence: The paper presents compelling experimental results, demonstrating improved performance of the regularized reward models over conventional methods across multiple out-of-distribution (OOD) tasks, achieving robustness and reducing reward over-optimization.
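
As a concrete illustration of the first two points, the following PyTorch sketch shows a reward head and a retained LM head reading from the same hidden states, with an SFT-style next-token loss serving as the text-generation regularizer. It assumes a HuggingFace-style causal LM exposing `.model` and `.lm_head`; the names (`GRMRewardModel`, `grm_loss`, `reg_weight`) and the specific regularizer choice are illustrative simplifications of the paper's method, not the authors' released code.

```python
import torch
import torch.nn.functional as F
from torch import nn


class GRMRewardModel(nn.Module):
    """Reward head and retained LM head sharing one transformer backbone."""

    def __init__(self, base_model):
        super().__init__()
        self.backbone = base_model.model       # shared transformer trunk
        self.lm_head = base_model.lm_head      # retained text-generation head
        self.reward_head = nn.Linear(base_model.config.hidden_size, 1, bias=False)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        # Scalar reward is read from the hidden state of the last non-padding token.
        last_idx = attention_mask.sum(dim=1) - 1
        last_hidden = hidden[torch.arange(hidden.size(0), device=hidden.device), last_idx]
        reward = self.reward_head(last_hidden).squeeze(-1)
        lm_logits = self.lm_head(hidden)
        return reward, lm_logits


def grm_loss(model, chosen, rejected, reg_weight=0.01):
    """Bradley-Terry preference loss plus an SFT-style regularizer on chosen responses."""
    r_chosen, lm_logits = model(chosen["input_ids"], chosen["attention_mask"])
    r_rejected, _ = model(rejected["input_ids"], rejected["attention_mask"])

    # Standard log-sigmoid preference loss on the reward margin.
    preference_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

    # Text-generation regularization: next-token prediction through the retained
    # LM head keeps the shared hidden states useful for generation.
    shift_logits = lm_logits[:, :-1, :].reshape(-1, lm_logits.size(-1))
    shift_labels = chosen["input_ids"][:, 1:].reshape(-1).clone()
    shift_labels[chosen["attention_mask"][:, 1:].reshape(-1) == 0] = -100  # ignore padding
    sft_loss = F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)

    return preference_loss + reg_weight * sft_loss
```

Because the LM head continues to receive a usable signal from the shared hidden states, the reward head cannot drift toward representations that only fit the preference data, which is the intuition behind the improved OOD accuracy reported in the paper.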

Results and Implications

Through their methodology, the authors significantly alleviate the over-optimization problem in RLHF. The experimental results indicate that GRM achieves higher accuracy on OOD tasks than baseline reward models, especially when the preference dataset is small, suggesting that regularizing the hidden states yields stronger generalization.

Furthermore, when evaluated with best-of-n (BoN) sampling and PPO, two common policy optimization settings, policies guided by GRM reward models exhibited greater robustness than those guided by conventionally trained reward models. This highlights the potential of GRM to serve as a reliable proxy for human preferences in LLM applications.
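
For context, best-of-n sampling simply draws several candidate responses from the policy and keeps the one the reward model scores highest, so a more generalizable reward model makes the selected response a better proxy for human preference. The sketch below uses hypothetical `generate` and `score` callables standing in for the policy and the reward model; it is not tied to any specific library API.

```python
from typing import Callable, List


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 16) -> str:
    """Sample n candidate responses and return the one the reward model prefers."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    rewards = [score(prompt, c) for c in candidates]
    best_idx = max(range(n), key=lambda i: rewards[i])
    return candidates[best_idx]
```

Under this selection rule, over-optimization shows up as reward scores that keep rising while actual response quality degrades; the paper's BoN and PPO experiments indicate that this gap is smaller when the reward model is trained with hidden-state regularization.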

Limitations and Future Prospects

While the paper demonstrates promising improvements, the authors acknowledge certain limitations, particularly regarding computational constraints preventing testing on models larger than 10B parameters. Future work could focus on scaling these insights to larger models and investigating the possible synergistic effects of using actual human-labeled data for further robustness.

Overall, this paper contributes to a growing body of research on mitigating reward model over-optimization by focusing on hidden state regularization. This approach not only provides a deft mechanism for enhancing reward models' generalization but also has wider implications for the development of aligned, robust AI systems. As AI progresses, ensuring the reliability of these models will become increasingly critical, positioning studies like this one at the forefront of methodological innovation in the field.

Authors (5)
  1. Rui Yang (221 papers)
  2. Ruomeng Ding (5 papers)
  3. Yong Lin (77 papers)
  4. Huan Zhang (171 papers)
  5. Tong Zhang (569 papers)
Citations (23)