Introduction
Aligning LLMs with human values and preferences is critical for their effective and safe deployment. LLM training typically incorporates human preference data to tune models for better instruction following, using approaches such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). However, these methods are limited by the finite scope of available human feedback and by reward models that are built externally, trained once, and then frozen. A recent paper examines Self-Rewarding LLMs, in which the model acts as both the responder to instructions and the judge of its own responses, establishing a framework for self-improving, dynamic reward modeling.
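Since DPO is the optimization method the paper builds on, a minimal sketch of the standard DPO objective may help. This is the generic loss, not the paper's specific implementation; the function name and the assumption that per-sequence log-probabilities are precomputed are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: increase the policy's preference for the
    chosen response over the rejected one, relative to a frozen reference model."""
    # Implicit rewards are scaled log-ratios of policy vs. reference probabilities
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin (Bradley-Terry preference model)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with summed per-sequence log-probabilities for a batch of two pairs
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-15.0, -11.0]),
                torch.tensor([-13.0, -10.0]), torch.tensor([-14.5, -10.5]))
```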
Training Self-Rewarding LLMs
The paper posits that endowing an LLM with dual capabilities, generating responses to instructions and appraising the quality of those responses, enables self-alignment. The approach uses Iterative DPO training, starting from a base pretrained LLM supplemented by a small set of human-annotated seed data. Each subsequent model iterates through a cycle of generating new instruction-following examples and then rewarding them based on its own judgments. These evaluations are not arbitrary: they follow an explicit rubric that scores responses on relevance, completeness, the perspective of a helpful assistant, and overall quality.
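The sketch below illustrates one iteration of this cycle as described above: the same model generates candidate responses, scores them as a judge, and the best and worst candidates form DPO preference pairs for training the next iteration. The `model.generate` and `model.judge_score` interfaces are hypothetical placeholders, not the paper's actual code.

```python
def self_rewarding_iteration(model, prompts, n_candidates=4):
    """One iteration of the self-rewarding loop: the model both generates
    candidate responses and scores them via its own judge prompt, producing
    preference pairs for the next round of DPO training."""
    preference_pairs = []
    for prompt in prompts:
        # 1. Sample several candidate responses from the current model.
        candidates = [model.generate(prompt) for _ in range(n_candidates)]
        # 2. Score each candidate with the model acting as judge (e.g. a 0-5 rubric).
        scores = [model.judge_score(prompt, c) for c in candidates]
        # 3. Keep the highest- and lowest-scoring responses as a preference pair.
        best = candidates[scores.index(max(scores))]
        worst = candidates[scores.index(min(scores))]
        if max(scores) > min(scores):  # skip ties: no preference signal
            preference_pairs.append((prompt, best, worst))
    # 4. These pairs are then used to train the next model with DPO (see loss above).
    return preference_pairs
```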
Methodology Insights
In a series of experiments using Llama 2 70B as the base model, the researchers demonstrate gains in instruction-following performance as well as in the model's own reward-evaluating ability. Through self-generated feedback and Iterative DPO, each model iteration surpassed its predecessor, yielding increasingly capable LLMs. Notably, the self-rewarded models' performance on AlpacaEval 2.0 surpasses that of existing LLMs trained on larger, proprietary datasets.
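One common way to quantify "reward-evaluating ability" is agreement between the model's judge scores and human preference rankings on held-out response pairs. The sketch below is illustrative only; the function name and input format are assumptions, not the paper's evaluation code.

```python
def pairwise_agreement(judge_scores, human_rankings):
    """Fraction of response pairs where the model-as-judge orders two responses
    the same way as human annotators (higher means a better judge)."""
    agree, total = 0, 0
    for (score_a, score_b), (rank_a, rank_b) in zip(judge_scores, human_rankings):
        if score_a == score_b or rank_a == rank_b:
            continue  # skip ties on either side: no ordering to compare
        total += 1
        # Human rankings: a lower rank number means the preferred response
        if (score_a > score_b) == (rank_a < rank_b):
            agree += 1
    return agree / total if total else 0.0
```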
Implications and Future Exploration
Early findings suggest that Self-Rewarding LLMs could redefine how LLMs are trained. By facilitating self-improvement, models may bypass the ceiling set by human-derived reward systems, and the iterative process could enable continual quality gains beyond what the quality of human feedback alone would permit. However, whether these gains eventually saturate, along with the safety implications and broader evaluation measures, has yet to be fully assessed, so these findings remain preliminary yet promising avenues for future research.