Improving Reward Models with Synthetic Critiques
The paper "Improving Reward Models with Synthetic Critiques" presents a novel approach to enhance the efficacy of reward models (RMs) used in reinforcement learning from human feedback (RLHF) for LLMs. The core innovation lies in the incorporation of synthetic natural language critiques, which are generated by LLMs themselves, to enrich the feedback signals that RMs leverage from human annotators. This methodology not only improves interpretability but also increases robustness by offering a more nuanced assessment of LLM outputs.
Key Contributions and Methodology
The paper addresses two principal challenges in the typical RM training pipeline: the high cost and labor intensity of human annotation, and the tendency of RMs to overfit to superficial features, which limits their ability to generalize. Synthetic critiques counter these issues by providing detailed evaluations of prompt-completion pairs along dimensions such as instruction adherence, correctness, and style.
Significant steps in this process include:
- Synthetic Critique Generation: LLMs are prompted to create critiques for each completion in the training set. These critiques provide detailed feedback on the quality of responses, enabling RMs to train on richer data.
- Critique-Enriched RM Training: The critique-augmented data is then used to train RMs, improving both data efficiency and model performance. Critiques are integrated into the RM's input to guide the assignment of scalar rewards, thereby enhancing interpretability and robustness (see the sketch after this list).
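The sketch below illustrates, in simplified form, how such a pipeline might fit together: a critique is generated for each prompt-completion pair, appended to the text the RM scores, and the RM is trained with a standard pairwise (Bradley-Terry) preference loss. The prompt template, the `llm_generate` callable, the toy tokenizer, and the tiny `RewardModel` are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of critique-augmented reward-model training; not the paper's code.
# The prompt template, `llm_generate` callable, toy tokenizer, and tiny RewardModel
# are illustrative assumptions only.
from dataclasses import dataclass

import torch
import torch.nn as nn
import torch.nn.functional as F

CRITIQUE_TEMPLATE = (
    "Evaluate the following response for instruction adherence, correctness, and style. "
    "Give a concise critique.\n\nPrompt: {prompt}\nResponse: {completion}\nCritique:"
)


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # preferred completion
    rejected: str  # dispreferred completion


def augment_with_critique(prompt: str, completion: str, llm_generate) -> str:
    """Ask an LLM for a critique and append it to the text the RM will score."""
    critique = llm_generate(CRITIQUE_TEMPLATE.format(prompt=prompt, completion=completion))
    return f"{prompt}\n\nResponse: {completion}\n\nCritique: {critique}"


def toy_tokenize(text: str, vocab_size: int = 50_000) -> torch.Tensor:
    """Hash-based stand-in for a real tokenizer, just to keep the sketch executable."""
    return torch.tensor([[hash(w) % vocab_size for w in text.split()]])


class RewardModel(nn.Module):
    """Toy RM: an embedding bag plus a linear head producing one scalar reward per input."""

    def __init__(self, vocab_size: int = 50_000, dim: int = 128):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.head(self.embed(token_ids)).squeeze(-1)


def pairwise_loss(rm: RewardModel, chosen_ids: torch.Tensor, rejected_ids: torch.Tensor) -> torch.Tensor:
    """Standard Bradley-Terry objective: push r(chosen) above r(rejected)."""
    return -F.logsigmoid(rm(chosen_ids) - rm(rejected_ids)).mean()


# Example flow with a stand-in critique generator.
llm_generate = lambda p: "The response follows the instructions but contains a factual error."
pair = PreferencePair(prompt="Summarize the article.", chosen="A short summary.", rejected="Irrelevant text.")
chosen_text = augment_with_critique(pair.prompt, pair.chosen, llm_generate)
rejected_text = augment_with_critique(pair.prompt, pair.rejected, llm_generate)

rm = RewardModel()
loss = pairwise_loss(rm, toy_tokenize(chosen_text), toy_tokenize(rejected_text))
loss.backward()
```

In a real pipeline the scoring model would be a pretrained transformer with a scalar head and the critique generator a strong instruction-tuned model; the key point is that critiques enter as additional conditioning text for the RM rather than as a change to the pairwise objective.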
Experimental Results
The experiments reveal several insights:
- High-quality critiques substantially improve RM performance, particularly when the RM is initialized from a weaker model checkpoint.
- Low-quality critiques can negatively impact outcomes, underscoring the importance of the source model used for critique generation.
- The addition of critiques is particularly beneficial in data-scarce scenarios, indicating high data efficiency. A noteworthy finding is that one high-quality critique-augmented example yields performance gains roughly equivalent to those of 40 non-augmented examples.
Detailed evaluations on benchmark datasets such as RewardBench and PandaLM demonstrate these improvements, with critique-enriched RMs performing consistently better across various tasks, including chat, safety, and reasoning.
Implications and Future Directions
The introduction of synthetic critiques in RM training has significant theoretical and practical implications. Theoretically, this approach augments the capacity of models to internalize nuanced feedback, potentially leading to better-aligned LLMs. Practically, it opens avenues for more scalable and cost-effective reward model training processes by minimizing reliance on extensive human annotations.
Future work could explore further enhancements in critique generation, such as employing chain-of-thought methodologies to deepen reasoning capabilities. Additionally, experimenting with different model architectures and critique generation strategies could refine this approach, ensuring broader applicability and robustness across diverse language tasks.
In conclusion, integrating synthetic critiques into RM training offers a promising pathway to more interpretable and effective RMs, improving the overall quality and alignment of LLMs with human preferences. The paper convincingly demonstrates the potential of such critiques to serve as a valuable supplementary signal in reward model training.