Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment (2404.01054v3)

Published 1 Apr 2024 in cs.CL and cs.AI

Abstract: Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning LLMs to human preferences at decoding time. However, BoN sampling is susceptible to a problem known as reward hacking: because the reward model is an imperfect proxy for the true objective, over-optimizing its value can degrade performance on the true objective. A common way to prevent reward hacking in preference learning techniques is to optimize the reward under a proximity regularizer (e.g., KL regularization) that keeps the LLM close to the reference model. In this research, we propose Regularized Best-of-N (RBoN), a variant of BoN that aims to mitigate reward hacking by incorporating a proximity term into response selection, analogous to preference learning techniques. We evaluate RBoN on the AlpacaFarm and Anthropic hh-rlhf datasets and find that it outperforms BoN. As an application of RBoN, we use RBoN to generate a pairwise preference learning dataset. Experimental results show that a DPO model trained on a dataset generated with RBoN outperforms a DPO model trained on a dataset generated with vanilla BoN. Our code is available at https://github.com/CyberAgentAILab/regularized-bon
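The selection rule the abstract describes can be made concrete with a short sketch. The paper studies proximity regularizers such as KL- and Wasserstein-based terms; the specific additive log-probability bonus below, along with the names reward_fn, ref_logprob_fn, and beta, are illustrative assumptions rather than the authors' exact formulation.

```python
from typing import Callable, List

def regularized_best_of_n(
    prompt: str,
    candidates: List[str],                        # N responses sampled from the policy
    reward_fn: Callable[[str, str], float],       # proxy reward model R(x, y)
    ref_logprob_fn: Callable[[str, str], float],  # log pi_ref(y | x), reference model
    beta: float = 0.1,                            # strength of the proximity regularizer
) -> str:
    """Select the candidate maximizing proxy reward plus a proximity bonus."""
    def score(response: str) -> float:
        # Responses the reference model considers unlikely are penalized,
        # discouraging reward-hacked outliers that merely exploit the proxy.
        return reward_fn(prompt, response) + beta * ref_logprob_fn(prompt, response)

    return max(candidates, key=score)
```

Setting beta = 0 recovers vanilla BoN (argmax over the proxy reward alone); a larger beta pulls selection toward responses the reference model would plausibly generate itself, which is the mechanism the abstract credits for mitigating reward hacking.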

Authors (4)
  1. Yuu Jinnai (21 papers)
  2. Tetsuro Morimura (18 papers)
  3. Kaito Ariu (32 papers)
  4. Kenshi Abe (32 papers)