Semi-Supervised Reward Modeling via Iterative Self-Training
Introduction
The paper "Semi-Supervised Reward Modeling via Iterative Self-Training" by He et al. addresses a critical issue in Reinforcement Learning with Human Feedback (RLHF): the extensive reliance on human-annotated preference data for training reward models (RMs). This dependency poses significant challenges in terms of scalability and cost. As an innovative solution, the authors propose Semi-Supervised Reward Modeling (SSRM), a method that effectively leverages unlabeled data to enhance reward models. The authors demonstrate through extensive experiments that SSRM not only produces robust reward models but also significantly reduces the need for large volumes of human-annotated data.
Methodology
The core of SSRM is an iterative loop with three key steps: pseudo-labeling, confidence thresholding, and supervised fine-tuning. Initially, a pretrained LLM is fine-tuned on a small labeled preference dataset, yielding the initial supervised reward model. This model is then used to pseudo-label the unlabeled dataset by predicting a preference for each pair of responses. High-confidence pseudo-labeled examples, selected with a predefined confidence threshold, are combined with the original labeled dataset, and the reward model is fine-tuned further on this augmented dataset. The process repeats, with each iteration refining the model's ability to predict human preferences.
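The description above maps naturally onto a simple training loop. The sketch below is an illustrative reconstruction, not the authors' code: the helpers `finetune_reward_model` and the returned preference-probability function are hypothetical stand-ins, and the confidence score is assumed to be the model's own preference probability for each pair.

```python
from typing import Callable, List, Tuple

# A pairwise example: (prompt, response_a, response_b); a label of 1 means
# response_a is preferred, 0 means response_b is preferred.
Pair = Tuple[str, str, str]
LabeledPair = Tuple[str, str, str, int]


def ssrm_iterations(
    labeled: List[LabeledPair],
    unlabeled: List[Pair],
    finetune_reward_model: Callable[[List[LabeledPair]], Callable[[Pair], float]],
    num_iterations: int = 3,
    confidence_threshold: float = 0.9,
) -> Callable[[Pair], float]:
    """Illustrative sketch of the SSRM loop: pseudo-label the unlabeled pairs,
    keep only high-confidence examples, fine-tune on the augmented set, repeat."""
    # Step 0: train the initial reward model on the small labeled set.
    # `finetune_reward_model` returns a function mapping a pair to
    # P(response_a preferred over response_b).
    model = finetune_reward_model(labeled)

    for _ in range(num_iterations):
        pseudo_labeled: List[LabeledPair] = []
        for prompt, resp_a, resp_b in unlabeled:
            p_a = model((prompt, resp_a, resp_b))  # model's preference probability
            confidence = max(p_a, 1.0 - p_a)
            # Confidence thresholding: keep only high-confidence pseudo-labels.
            if confidence >= confidence_threshold:
                pseudo_labeled.append((prompt, resp_a, resp_b, int(p_a >= 0.5)))

        # Supervised fine-tuning on the original labels plus confident pseudo-labels.
        model = finetune_reward_model(labeled + pseudo_labeled)

    return model
```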
Experimental Setup and Results
The researchers evaluated SSRM across a range of model sizes and configurations, encompassing both an encoder-based model and larger LLMs. Three specific models were assessed: PairRM (0.4B parameters), Gemma-2B-it (2B parameters), and Llama3-8B-it (8B parameters). Diverse datasets were employed for training and evaluation, including OpenHermesPreferences and a mixture of eight open-source preference datasets.
Gemma-2B Results
The SSRM-enhanced Gemma-2B improved progressively over iterations, with notable gains in prediction confidence and in performance across benchmarks. Its performance approached that of a fully supervised model, demonstrating the method's efficiency in using limited labeled data. Gains were largest in categories such as Chat and Reasoning, while performance on Chat Hard declined slightly due to conflicting biases.
Llama3-8B Results
For Llama3-8B, the improvements from each SSRM iteration were even more pronounced, with substantial gains particularly in the Safety and Reasoning categories. The performance of the SSRM-enhanced model closely mirrored that of a model trained on the entire labeled dataset, validating the effectiveness of the semi-supervised approach for larger models.
PairRM Results
Even for smaller encoder-based models such as PairRM, SSRM yielded significant improvements, particularly in categories where the initial model was weaker, such as Safety. As with the larger models, performance plateaued after a few iterations, indicating that most of the benefit of iterative self-training is realized in the first rounds.
Practical Implications
The enhancements provided by SSRM are particularly relevant for practical applications where collecting extensive labeled datasets is impractical. By significantly reducing reliance on human-annotated data, SSRM presents a cost-effective alternative for training reward models. This method could be instrumental in scaling the development of RLHF applications, including alignment of LLMs for complex tasks like mathematical reasoning, code generation, and summarization.
Future Directions
The paper opens several avenues for future research:
- Calibration and Confidence Thresholding: Refining the confidence thresholding mechanism so that it adapts dynamically during training could further enhance SSRM's efficacy (see the sketch after this list).
- Application to Other Domains: Expanding the application of SSRM to other domains such as robotics or computer vision where semi-supervised learning frameworks have shown promise.
- Integration with RLAIF: Combining SSRM with Reinforcement Learning from AI Feedback (RLAIF) to further reduce dependence on costly human annotations, potentially by using advanced LLMs as surrogate human judges.
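As a purely hypothetical illustration of the first point, a dynamic threshold could be as simple as a schedule over iterations. The function below, including its parameter values, is an assumption for illustration and is not something proposed in the paper.

```python
def dynamic_confidence_threshold(iteration: int,
                                 start: float = 0.95,
                                 end: float = 0.80,
                                 num_iterations: int = 3) -> float:
    """Hypothetical linear schedule: start strict, then gradually admit
    lower-confidence pseudo-labels as the reward model improves."""
    if num_iterations <= 1:
        return start
    frac = min(iteration, num_iterations - 1) / (num_iterations - 1)
    return start + frac * (end - start)
```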
Conclusion
Semi-Supervised Reward Modeling via Iterative Self-Training presents a highly efficient technique for enhancing reward models while reducing the need for extensive human-annotated data. Through iterative self-training and confidence-based thresholding, SSRM demonstrates substantial improvements across various models, showcasing its potential for broad applicability in RLHF tasks. This method addresses key scalability and cost challenges, paving the way for more accessible and efficient development of aligned LLMs.