Reinforcement Learning from Human Feedback with Dataset Reset Policy Optimization
Introduction
Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful strategy for training generative models in settings where crafting an explicit reward function is difficult. Using human-labeled preference data, researchers have successfully trained large-scale models across diverse domains. Despite these successes, the standard RLHF pipeline separates reward model learning from policy optimization: once the reward model is fit, the offline preference dataset is typically set aside, and the information contained in its labeled generations goes unused during online policy training. This paper introduces a new RLHF algorithm, Dataset Reset Policy Optimization (DR-PO), which leverages dataset resets to make online learning markedly more efficient.
Dataset Reset Policy Optimization (DR-PO)
DR-PO exploits a simple property of generative-model training: the learner can be reset to any state observed in the offline preference dataset. By starting rollouts directly from states drawn from this dataset, rather than always from the usual initial state distribution, DR-PO explores more efficiently. In text generation with LLMs, a reset simply means beginning generation from a prompt together with a partial completion taken from a labeled response in the dataset. Theoretical analysis shows that DR-PO can match or surpass any policy covered by the offline data, making it both efficient and effective within the RLHF paradigm.
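To make the reset mechanism concrete, the sketch below shows one way online rollouts could be started from dataset states in a token-level generation setting. It is a minimal illustration under assumed interfaces, not the paper's implementation: the record layout of `offline_dataset` and the `policy.generate` and `reward_model.score` calls are hypothetical names introduced here for exposition.

```python
import random


def sample_reset_state(offline_dataset, reset_prob=1.0):
    """Pick a starting state for an online rollout.

    With probability `reset_prob`, reset to a state from the offline
    preference data: the prompt concatenated with a random prefix of one
    of its labeled responses.  Otherwise fall back to the usual initial
    state (the prompt alone).  Field names are illustrative.
    """
    example = random.choice(offline_dataset)  # e.g. {"prompt": [...], "response": [...]} token ids
    prompt, response = example["prompt"], example["response"]

    if response and random.random() < reset_prob:
        h = random.randint(0, len(response) - 1)  # reset point along the labeled generation
        return prompt + response[:h]
    return prompt


def collect_rollout(policy, reward_model, offline_dataset, max_new_tokens=128):
    """Roll the current policy out from a dataset-reset state and score it."""
    start = sample_reset_state(offline_dataset)
    completion = policy.generate(start, max_new_tokens=max_new_tokens)  # assumed interface
    reward = reward_model.score(start + completion)                     # assumed interface
    return start, completion, reward
```

The design choice worth noting is that rollouts now begin from states visited by the labeled generations, which is what ties the learned policy's performance to any policy covered by the offline data.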
Theoretical Guarantees
DR-PO is as simple to implement as standard policy optimization methods, yet it comes with strong theoretical guarantees. Under general function approximation, DR-PO learns, with finite sample complexity, a policy at least as good as any policy covered by the offline preference dataset. The analysis holds in computationally tractable settings, requiring only standard learning oracles such as Maximum Likelihood Estimation (MLE) for fitting the reward model. In this sense, DR-PO sets a new theoretical benchmark for RLHF.
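As a concrete example of the MLE oracle, the snippet below sketches the standard Bradley-Terry maximum-likelihood objective commonly used to fit a reward model on pairwise preferences. It is a generic sketch assuming a `reward_model` that maps batches of sequences to scalar scores; it is not claimed to be the paper's exact training code.

```python
import torch.nn.functional as F


def reward_mle_loss(reward_model, chosen_batch, rejected_batch):
    """Negative log-likelihood of pairwise preferences under a Bradley-Terry model.

    P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected), so
    maximizing the likelihood amounts to minimizing
    -log sigmoid(r_chosen - r_rejected) over the preference pairs.
    `reward_model` is assumed to return one scalar reward per sequence.
    """
    r_chosen = reward_model(chosen_batch)      # shape: (batch,)
    r_rejected = reward_model(rejected_batch)  # shape: (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```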
Empirical Demonstrations
The paper evaluates DR-PO on two standard RLHF benchmarks, TL;DR summarization and the Anthropic Helpful and Harmless (HH) dataset, comparing against Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO). On TL;DR, DR-PO's summaries achieve higher GPT-4 win rates than those produced by PPO and DPO. Moreover, when the policies trained on TL;DR are evaluated zero-shot on the CNN/DailyMail dataset, DR-PO retains its advantage, indicating that the gains generalize beyond the training distribution. These results pair the method's theoretical guarantees with practical effectiveness on real RLHF tasks.
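For reference, the win-rate metric used in these comparisons can be computed from pairwise judge verdicts as follows. The snippet is a generic illustration of the metric (the half-credit handling of ties is an assumption), not the paper's evaluation script.

```python
def win_rate(judgments):
    """Fraction of pairwise comparisons the candidate wins, counting ties as half.

    `judgments` holds one verdict per prompt, e.g. GPT-4's preferred output
    when comparing a DR-PO summary ("candidate") against a baseline summary
    ("baseline").  The half-credit tie convention is an assumption here.
    """
    score = sum({"candidate": 1.0, "tie": 0.5, "baseline": 0.0}[j] for j in judgments)
    return score / len(judgments)


# Example: 70 wins, 10 ties, 20 losses over 100 comparisons -> 0.75 win rate.
print(win_rate(["candidate"] * 70 + ["tie"] * 10 + ["baseline"] * 20))
```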
Conclusion and Future Directions
Dataset Reset Policy Optimization introduces a pivotal advancement in the domain of RLHF, substantiated by both theoretical guarantees and strong empirical performance. The capability to leverage dataset resets in policy optimization presents a novel pathway toward more efficient and effective learning from human feedback. As the paper conjectures, the principles underpinning DR-PO may extend beyond the settings explored, suggesting a broad horizon for future investigations. The integration of dataset resets offers a promising avenue to enhance online RL algorithms further, warranting comprehensive exploration across diverse RLHF applications.