Exploring Online Iterative Reinforcement Learning from Human Feedback (RLHF) for LLMs
Introduction to Online Iterative RLHF
Reinforcement Learning from Human Feedback (RLHF) has garnered significant attention for integrating human preferences into machine learning, particularly for enhancing LLMs. While existing work has predominantly focused on offline RLHF, this exploration turns to online iterative RLHF, aiming to close the performance gap reported between the offline and online settings. Collecting human feedback in an online loop is typically impractical due to resource constraints, so this work approximates it by constructing preference models from a variety of open-source datasets, which serve as proxies for human annotators throughout the iterative learning process.
Understanding the Process and Setup
The core of the online iterative RLHF process involves these key components:
- Initial Setup:
- Starting with a model fine-tuned on known instruction-following datasets (the initial policy π₀), the model encounters prompts sampled from a fixed distribution.
- The model responds to these prompts according to its current policy π, which is trained to maximize the reward function defined by the preference oracle.
- Preference Oracle and Reward Function:
- A (possibly hypothetical) preference oracle, given a prompt and a pair of responses, indicates which response is preferred; these judgments supply the training signal.
- The reward function is rooted in the Bradley-Terry model, a simplification in which the probability that response a is preferred over response b for prompt x is a logistic function of the reward difference: P(a ≻ b | x) = σ(r(x, a) − r(x, b)). See the sketch after this list.
- Practical Implementation:
- Through iterative adjustments and simulated real-time feedback from the proxy preference models, the LLM adapts its responses to better align with the desired outcomes indicated by these human-feedback proxies.
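To make the Bradley-Terry formulation concrete, here is a minimal sketch of the preference probability and the pairwise loss commonly used to fit a reward model to labelled comparisons. The function names and example scores are illustrative assumptions, not artifacts from the original work.

```python
import torch
import torch.nn.functional as F

def bradley_terry_prob(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry preference probability:
    P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected)."""
    return torch.sigmoid(reward_chosen - reward_rejected)

def pairwise_preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the labelled preferences; minimizing this
    fits a reward model to the comparison data."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy example: scores a hypothetical reward model might assign to three comparison pairs.
r_chosen = torch.tensor([1.2, 0.4, 2.1])
r_rejected = torch.tensor([0.3, 0.9, 1.0])
print(bradley_terry_prob(r_chosen, r_rejected))        # per-pair preference probabilities
print(pairwise_preference_loss(r_chosen, r_rejected))  # scalar training loss
```

The logistic form is what allows pairwise comparisons, rather than absolute scores, to train a scalar reward model.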
Algorithmic Insights and Implementation
The workflow transitions from theoretical constructs to applied methodologies with a focus on:
- Preference Model Training: Before policy optimization begins, a preference (reward) model is trained on a diverse mixture of open-source comparison datasets so that it can discern nuanced feedback and approximate human judgments.
- Policy Optimization: Each iteration updates the response policy on both newly generated and historical preference data, so the model progressively approximates human preferences; a schematic loop is sketched after this list.
- Online Versus Offline: Online data collection keeps updating the policy on responses drawn from its own current distribution, in contrast to the fixed, off-policy nature of offline datasets, which can yield more adaptive and better-generalizing models.
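The loop described above can be summarized schematically. The sketch below assumes an external `generate` function (sampling k responses from the current policy), a proxy `score` function (the trained preference/reward model), and an `update_policy` step such as a round of DPO training; these names, and the best-vs-worst pairing heuristic, are illustrative assumptions rather than the authors' exact recipe.

```python
import random
from typing import Callable, List, Tuple

Prompt = str
Response = str
PreferencePair = Tuple[Prompt, Response, Response]  # (prompt, chosen, rejected)

def collect_preference_pairs(
    prompts: List[Prompt],
    generate: Callable[[Prompt, int], List[Response]],  # sample k responses from the current policy
    score: Callable[[Prompt, Response], float],         # proxy preference / reward model
    k: int = 4,
) -> List[PreferencePair]:
    """Best-vs-worst pairing: keep the highest- and lowest-scored of k sampled
    responses as the (chosen, rejected) pair for each prompt."""
    pairs: List[PreferencePair] = []
    for prompt in prompts:
        candidates = generate(prompt, k)
        ranked = sorted(candidates, key=lambda r: score(prompt, r), reverse=True)
        pairs.append((prompt, ranked[0], ranked[-1]))
    return pairs

def online_iterative_rlhf(
    prompts: List[Prompt],
    generate: Callable[[Prompt, int], List[Response]],
    score: Callable[[Prompt, Response], float],
    update_policy: Callable[[List[PreferencePair]], None],  # e.g. one DPO training round
    iterations: int = 3,
    batch_size: int = 8,
) -> None:
    """Each iteration samples prompts, generates fresh responses with the current
    policy, labels them with the proxy preference model, and updates the policy on
    the accumulated (historical + new) preference data."""
    history: List[PreferencePair] = []
    for _ in range(iterations):
        batch = random.sample(prompts, min(batch_size, len(prompts)))
        history.extend(collect_preference_pairs(batch, generate, score))
        update_policy(history)
```

The key design choice is that each iteration's responses come from the current policy, so the preference data tracks the policy's own output distribution rather than a fixed offline corpus.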
Results and Implications
The resulting model performed strongly across a range of evaluations, spanning conversational and academic benchmarks. Key takeaways include:
- Performance Metrics: The model achieved state-of-the-art results on benchmarks such as AlpacaEval-2 and MT-Bench, demonstrating its practical effectiveness.
- Extended Accessibility: By making models and training guides publicly available, the work invites further exploration and adaptation by the broader community, fostering open-source collaboration.
- Future Potential: Ongoing developments could see enhancements in proxy preference modeling, more efficient data utilization, and broader applications across different LLM tasks.
Conclusion and Future Directions
This exploration into online iterative RLHF opens up several avenues for both theoretical exploration and practical applications. Future work includes addressing challenges like reward model biases, exploring different model architectures, and expanding the training datasets to cover a broader range of human-like preferences. By continuously pushing the boundaries of what open-source tools and methodologies can achieve, the field can look forward to more refined, human-aligned LLMs.