Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation (2403.05171v2)
Abstract: We introduce Adversarial Policy Optimization (AdvPO), a novel solution to the pervasive issue of reward over-optimization in Reinforcement Learning from Human Feedback (RLHF) for LLMs. Over-optimization occurs when the reward model serves as an imperfect proxy for human preference and RL-driven policy optimization exploits its inaccuracies. We first introduce a lightweight way to quantify uncertainty in rewards, relying solely on the last-layer embeddings of the reward model rather than on computationally expensive reward ensembles. AdvPO then solves a distributionally robust optimization problem centred on the confidence interval of the reward model's predictions for policy improvement. Through comprehensive experiments on the Anthropic HH and TL;DR summarization datasets, we demonstrate that AdvPO mitigates over-optimization and leads to improved performance under human-assisted evaluation.
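The lightweight uncertainty estimate described in the abstract can be sketched concretely. The snippet below is a minimal illustration, assuming the common last-layer ("shallow exploration") recipe: a regularized Gram matrix is fit over the reward model's last-layer embeddings of its training pairs, and the uncertainty of a new prompt-response embedding is the Mahalanobis-style width sqrt(phi^T A^{-1} phi). The function names, shapes, and the lower-confidence-bound helper `pessimistic_reward` are illustrative assumptions, not the paper's exact formulation; AdvPO itself optimizes a distributionally robust objective over the resulting confidence interval rather than a plain reward penalty.

```python
# Minimal sketch of last-layer-embedding reward uncertainty (assumed setup,
# not the paper's exact AdvPO objective).
import torch


def fit_covariance(train_embeddings: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """Regularized Gram matrix A = lam * I + sum_i phi_i phi_i^T over the
    last-layer embeddings of the reward model's training pairs."""
    d = train_embeddings.shape[1]
    return lam * torch.eye(d) + train_embeddings.T @ train_embeddings


def reward_uncertainty(phi: torch.Tensor, A_inv: torch.Tensor) -> torch.Tensor:
    """Uncertainty width u(x) = sqrt(phi^T A^{-1} phi) for each query embedding."""
    return torch.sqrt(torch.einsum("bd,dk,bk->b", phi, A_inv, phi))


def pessimistic_reward(reward: torch.Tensor, phi: torch.Tensor,
                       A_inv: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Lower-confidence-bound reward r(x) - beta * u(x): a simple stand-in for
    optimizing against the worst case inside the reward confidence interval."""
    return reward - beta * reward_uncertainty(phi, A_inv)


if __name__ == "__main__":
    torch.manual_seed(0)
    d = 16                                   # last-layer embedding dimension (toy)
    train_phi = torch.randn(1000, d)         # embeddings of reward-model training pairs
    A_inv = torch.linalg.inv(fit_covariance(train_phi))

    query_phi = torch.randn(4, d)            # embeddings of policy samples during RL
    raw_reward = torch.randn(4)              # reward model scores for those samples
    print(pessimistic_reward(raw_reward, query_phi, A_inv, beta=0.5))
```

Compared with reward ensembles, this estimator adds only a single d-by-d matrix inverse and one quadratic form per sample, which is the sense in which the uncertainty quantification is lightweight.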
Authors: Xiaoying Zhang, Jean-Francois Ton, Wei Shen, Hongning Wang, Yang Liu