CREAM: Consistency Regularized Self-Rewarding Language Models (2410.12735v5)

Published 16 Oct 2024 in cs.LG and cs.CL

Abstract: Recent self-rewarding large language models (LLMs) have successfully applied LLM-as-a-Judge to iteratively improve the alignment performance without the need for human annotations for preference data. These methods commonly utilize the same LLM to act as both the policy model (which generates responses) and the reward model (which scores and ranks those responses). The ranked responses are then used as preference pairs to train the LLM via direct alignment technologies (e.g., DPO). However, it is noteworthy that throughout this process, there is no guarantee of accuracy in the rewarding and ranking, which is critical for ensuring accurate rewards and high-quality preference data. Empirical results from relatively small LLMs (e.g., 7B parameters) also indicate that improvements from self-rewarding may diminish after several iterations in certain situations, which we hypothesize is due to accumulated bias in the reward system. This bias can lead to unreliable preference data for training the LLM. To address this issue, we first formulate and analyze the generalized iterative preference fine-tuning framework for self-rewarding LLMs. We then introduce regularization to this generalized framework to mitigate the overconfident preference labeling in the self-rewarding process. Based on this theoretical insight, we propose a Consistency Regularized sElf-rewarding lAnguage Model (CREAM) that leverages the consistency of rewards across different iterations to regularize the self-rewarding training, helping the model to learn from more reliable preference data. With this explicit regularization, our empirical results demonstrate the superiority of CREAM in improving both reward consistency and alignment performance. The code is publicly available at https://github.com/Raibows/CREAM.

Summary

  • The paper introduces CREAM, an iterative preference fine-tuning framework that uses consistency regularization to mitigate self-rewarding bias.
  • It leverages past model iterations as a baseline to stabilize training without relying on human-annotated data.
  • Empirical results on benchmarks like ARC, OpenBookQA, and GSM8K show that CREAM significantly outperforms standard SRLMs.

Consistency-Enhanced Self-Rewarding in LLMs: A Structured Approach

The paper "Cream: Consistency Regularized Self-Rewarding LLMs" presents an in-depth exploration of the iterative preference fine-tuning framework for self-rewarding LLMs (SRLMs), primarily focusing on improving model alignment without human-annotated data. The authors address a significant issue within SRLMs: the rewarding bias that arises from overconfident preference labeling during self-rewarding processes.

Core Methodology

To combat this bias, the authors propose the Consistency Regularized Self-Rewarding LLM (CREAM). The approach improves the reliability of preference data by applying consistency regularization across training iterations: the agreement between the reward rankings produced by successive iterations of the model is used to temper the preference labels, mitigating the accumulation of bias and yielding more reliable training data.

The paper systematically formulates a generalized iterative preference fine-tuning framework, which encompasses self-rewarding, reinforcement learning from AI feedback, and other iterative preference tuning approaches. Within this framework, the concept of reward consistency is introduced: the model from the previous iteration serves as a baseline reference against which the current iteration's reward rankings are compared, and the degree of agreement acts as an internal regularization signal, as sketched below.
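The following Python sketch illustrates one way such a consistency check could be wired into preference training. It is not the authors' code: the use of Kendall's tau as the consistency measure and the label-smoothing form of the loss are illustrative assumptions; the sketch only shows how a consistency rate between two iterations' reward rankings can soften the DPO preference labels.

```python
import torch.nn.functional as F
from scipy.stats import kendalltau


def consistency_rate(rewards_curr, rewards_prev):
    """Agreement between the reward rankings of the current and previous iteration.

    Kendall's tau lies in [-1, 1]; rescaling it to [0, 1] lets it serve as a
    soft confidence weight for the preference label (an illustrative choice).
    """
    tau, _ = kendalltau(rewards_curr, rewards_prev)
    return 0.5 * (tau + 1.0)


def consistency_regularized_dpo_loss(
    logp_chosen, logp_rejected,          # policy log-probs of chosen/rejected responses
    ref_logp_chosen, ref_logp_rejected,  # reference-model log-probs of the same responses
    consistency,                         # consistency rate in [0, 1] from consistency_rate()
    beta=0.1,
):
    """DPO loss with the hard preference label smoothed by the consistency rate.

    consistency == 1 recovers standard DPO; lower values soften the label toward
    the reversed pair, discouraging overconfident updates on unreliable preferences.
    """
    logits = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    loss = -(consistency * F.logsigmoid(logits)
             + (1.0 - consistency) * F.logsigmoid(-logits))
    return loss.mean()
```

In this sketch, a consistency rate of 1 recovers standard DPO, while lower agreement between iterations pushes the loss toward indifference on that pair, which is the qualitative behavior the regularization is meant to achieve.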

Numerical and Comparative Analysis

The experimental results presented in the paper demonstrate the effectiveness of CREAM across multiple benchmarks, including ARC, OpenBookQA, SIQA, and GSM8K. The empirical findings indicate that CREAM consistently outperforms baseline methods, particularly standard SRLMs, by a noticeable margin. For instance, while standard SRLMs exhibit degraded performance in later iterations due to noisy self-generated preference labels, CREAM continues to improve across iterations, showcasing the robustness of the proposed consistency regularization.

Furthermore, the results highlight that the use of a baseline reward model, even when represented by a previous model iteration, can substantially enhance the alignment performance when compared to self-rewarding methods without such regularization.

Theoretical Contributions and Implications

The primary theoretical contribution lies in the development of a consistency regularization framework tailored for SRLMs. This involves leveraging the intrinsic reward model for preference data annotation and employing self-consistency measures to avoid overfitting on unreliable data. The paper verifies these theoretical insights through comprehensive empirical analysis, thereby bridging the gap between theoretical design and practical implementation.
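As a rough formalization (a paraphrase under stated assumptions, not the paper's exact notation), the consistency rate $C \in [0, 1]$ between the reward rankings of iterations $t$ and $t-1$ can be read as a label-smoothing coefficient on a DPO-style objective:

```latex
\mathcal{L}_{\mathrm{consistency}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
      C \,\log \sigma\!\big(\beta \Delta_\theta\big)
      + (1 - C)\,\log \sigma\!\big(-\beta \Delta_\theta\big)
    \right],
\qquad
\Delta_\theta
  = \log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}.
```

When $C = 1$ this reduces to standard DPO on the self-labeled pair; when the two iterations disagree, the update is attenuated, which is the sense in which the consistency measure guards against overfitting to unreliable annotations.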

The implications of this research are twofold. Practically, reducing the reliance on human-annotated data makes the approach more scalable and efficient, particularly benefiting smaller LLMs, which have historically struggled to produce accurate self-rewards. Theoretically, this work opens avenues for exploring self-consistency in other areas of machine learning and AI where iterative improvement is crucial.

Future Prospects

The paper anticipates several future directions, such as extending the applicability of consistency regularization to larger LLMs and assessing its effectiveness in other iterative self-improvement frameworks. Additionally, the concept of utilizing self-consistency as a regularization mechanism can be explored further in various AI domains to enhance model robustness and reliability.

In summary, the paper presents a well-crafted solution to a pervasive problem in SRLMs, with compelling empirical validation across standard benchmarks. By integrating self-consistency as a regularizing component, CREAM marks a significant step forward in the autonomous improvement of LLMs, potentially setting a precedent for future advances in AI alignment and self-rewarding methodologies.