Direct Reward Optimization: Enhancing Single-Trajectory RLHF in LLMs
The paper presents Direct Reward Optimization (DRO), a novel framework for aligning LLMs within the Reinforcement Learning from Human Feedback (RLHF) paradigm. Unlike established techniques that rely on costly pairwise human preference data, DRO learns from single-trajectory datasets of (prompt, completion, scalar reward) triplets, which reflect more abundant, naturally occurring user feedback. This shift both addresses the scarcity of pairwise data and offers a cost-effective path to scaling RLHF.
Context and Motivation
The paper begins by reviewing the prevailing approach to alignment via RLHF, which typically models human preferences with the Bradley-Terry model and optimizes against pairwise comparison data. These methods face a significant bottleneck: pairwise preference data is expensive to collect and hard to scale, particularly as LLMs improve and the distinctions between candidate responses become increasingly subtle.
DRO: Framework and Implementation
DRO is introduced as a shift from preference-based RLHF to a single-trajectory paradigm. The authors propose a simple yet theoretically grounded mean-squared objective that works directly from observed scalar rewards, without pairwise comparisons or on-policy sampling. Specifically, DRO employs:
- Mean-Squared Objective: the KL-regularized policy optimization problem is recast as the minimization of a simple quadratic (squared-error) loss over offline triplets (see the sketch after this list).
- Value Function Learning: a value function of the prompt is learned jointly with the policy and anchors the quadratic loss, underpinning robust policy optimization.
- Offline Data Utilization: training runs entirely on static datasets, with no online sampling from the policy, which simplifies the computational requirements and makes the approach practical at scale.
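To make the objective concrete, here is a minimal PyTorch-style sketch of the quadratic loss, written from the description above rather than from the authors' code; the function name `dro_loss`, the tensor layout, and the assumption that sequence-level log-probabilities are precomputed are all illustrative.

```python
import torch

def dro_loss(policy_logp: torch.Tensor,   # log pi_theta(y|x), shape [batch]
             ref_logp: torch.Tensor,      # log pi_ref(y|x),   shape [batch]
             value: torch.Tensor,         # V_phi(x),          shape [batch]
             reward: torch.Tensor,        # observed scalar reward r(x, y)
             tau: float = 1.0) -> torch.Tensor:
    # Residual between the observed reward and its reconstruction from the
    # policy/reference log-ratio and the prompt-level value estimate.
    residual = reward - tau * (policy_logp - ref_logp) - value
    # Quadratic (squared-error) loss averaged over the offline batch.
    return 0.5 * residual.pow(2).mean()

# Dummy usage; in training, policy_logp and value carry gradients through the
# policy network and the value head, while ref_logp and reward are fixed data.
b = 4
policy_logp = torch.randn(b, requires_grad=True)
value = torch.randn(b, requires_grad=True)
loss = dro_loss(policy_logp, torch.randn(b), value, torch.randn(b), tau=1.0)
loss.backward()
```

In this reading, the policy log-probabilities come from the model being trained, the reference log-probabilities from a frozen SFT checkpoint, and τ controls how strongly the policy is kept close to the reference.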
The theoretical underpinnings are developed carefully, and the framework is supported by an existence-and-uniqueness theorem guaranteeing that the quadratic loss is minimized by exactly one policy and value function pair, namely the optimal one.
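For context, this result builds on the standard closed form of KL-regularized policy optimization; the sketch below uses τ for the regularization strength, and its notation is illustrative, so details may differ from the paper.

```latex
% Standard KL-regularized objective and its closed-form optimum (notation illustrative).
\begin{align*}
&\max_{\pi}\; \mathbb{E}_{x,\, y \sim \pi(\cdot \mid x)}\big[r(x,y)\big]
  \;-\; \tau\, \mathbb{E}_{x}\Big[\mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)\Big] \\
&\quad\Longrightarrow\quad
\pi^{*}(y \mid x) = \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\Big(\tfrac{r(x,y) - V^{*}(x)}{\tau}\Big),
\qquad
V^{*}(x) = \tau \log \mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}
  \big[e^{r(x,y)/\tau}\big].
\end{align*}
% Rearranging gives r(x,y) = tau * log(pi^*/pi_ref) + V^*(x); the quadratic
% loss penalizes squared violations of this identity over the offline data:
\begin{equation*}
\mathcal{L}(\theta, \phi) \;=\; \tfrac{1}{2}\,
  \mathbb{E}_{(x,\,y,\,r) \sim \mathcal{D}}
  \Big[\Big(r \;-\; \tau \log \tfrac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
        \;-\; V_\phi(x)\Big)^{2}\Big].
\end{equation*}
```

Because the rearranged identity holds only for the optimal pair, driving the squared residual to zero over sufficiently rich data recovers that pair uniquely, which is the intuition behind the theorem summarized above.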
Empirical Results and Comparisons
The empirical validation employs T5 encoder-decoder models on the UltraFeedback dataset. Key findings include:
- Performance against Baselines: DRO significantly outperforms Kahneman-Tversky Optimization (KTO) in side-by-side comparisons, achieving higher win rates and qualitatively better responses.
- Hyperparameter Robustness: DRO's performance remained stable across different learning-rate settings, indicating robustness to hyperparameter selection.
Experimental Insights
Several key design choices were empirically validated:
- Parameter Sharing: ablations showed that using separate networks for the policy and the value function, together with multiple value outputs per batch, led to better performance.
- Regularization Strength: the KL-regularization strength played a critical role, with a well-tuned value providing the most balanced trade-off between reward optimization and closeness to the reference policy (illustrated by the toy sketch below).
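As a purely illustrative aside (not an experiment from the paper), the closed form π*(y|x) ∝ π_ref(y|x) exp(r(x,y)/τ) from the theory section makes the role of the regularization strength easy to see on a toy discrete example; all probabilities and rewards below are made up.

```python
import numpy as np

pi_ref = np.array([0.5, 0.3, 0.2])   # reference policy over three completions
reward = np.array([0.0, 1.0, 2.0])   # made-up scalar rewards for each one

for tau in [10.0, 1.0, 0.1]:
    tilted = pi_ref * np.exp(reward / tau)   # pi_ref tilted by exp(r / tau)
    pi_star = tilted / tilted.sum()          # normalize to get pi*
    print(f"tau={tau:>5}: pi* = {np.round(pi_star, 3)}")

# Large tau keeps pi* close to pi_ref; small tau concentrates mass on the
# highest-reward completion, risking over-optimization against the reward.
```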
Broader Implications and Future Research
DRO's implications extend beyond a practical enhancement of RLHF. By leveraging user feedback at scale, the approach could broaden access to alignment and reduce dependence on expensive human raters. That scalability could in turn accelerate LLM training and deployment and support more robust, user-aligned models.
Theoretical and Practical Considerations
Theoretically, DRO enriches the RLHF landscape with a principled method that avoids the pitfalls of pairwise preference modeling. Practically, it simplifies the training pipeline by removing the need for online data generation and for a separately trained reward model.
Conclusion
DRO marks a significant advancement in the alignment of LLMs by transitioning to scalable, single-trajectory datasets and providing a robust framework for leveraging user feedback. Future work should expand this approach's empirical validation to larger models and diverse tasks to further confirm its utility and scalability.
By addressing these limitations of existing methods, DRO is well positioned to make the alignment of LLMs more effective and efficient in real-world applications, contributing to the broader goal of aligning artificial agents with human preferences.