Direct Reward Optimization: Enhancing Single-Trajectory RLHF in LLMs
The paper presents Direct Reward Optimization (DRO), a novel framework for aligning LLMs within the Reinforcement Learning from Human Feedback (RLHF) paradigm. Unlike established techniques that rely on costly pairwise human preference data, DRO learns from single-trajectory datasets of (prompt, completion, scalar reward) triplets, which reflect more abundant, naturally occurring user feedback. This shift both addresses the scarcity of pairwise data and offers a cost-effective path to scaling RLHF.
Context and Motivation
The paper begins by reviewing the prevailing approach to alignment via RLHF, which typically models human preferences with the Bradley-Terry model and optimizes against pairwise comparison data. These methods face a significant bottleneck: pairwise preference data is expensive to collect and hard to scale, particularly as LLMs improve and the distinctions between candidate responses become increasingly subtle.
DRO: Framework and Implementation
DRO is introduced as a shift from preference-based RLHF to a single-trajectory paradigm. The authors propose a simple yet theoretically grounded mean-squared objective that works directly from observed scalar rewards, without pairwise comparisons or on-policy sampling. Specifically, DRO employs:
- Mean-Squared Objective: the KL-regularized policy optimization problem is recast as the minimization of a simple quadratic (squared-error) loss over offline triplets (see the sketch after this list).
- Value Function Learning: a value function of the prompt is learned jointly with the policy and anchors the quadratic loss, underpinning robust policy optimization.
- Offline Data Utilization: training runs entirely on static datasets, with no online sampling from the policy, which simplifies the computational requirements and makes the approach practical at scale.
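To make the objective concrete, here is a minimal PyTorch-style sketch of the quadratic loss, written from the description above rather than from the authors' code; the function name `dro_loss`, the tensor layout, and the assumption that sequence-level log-probabilities are precomputed are all illustrative.

```python
import torch

def dro_loss(policy_logp: torch.Tensor,   # log pi_theta(y|x), shape [batch]
             ref_logp: torch.Tensor,      # log pi_ref(y|x),   shape [batch]
             value: torch.Tensor,         # V_phi(x),          shape [batch]
             reward: torch.Tensor,        # observed scalar reward r(x, y)
             tau: float = 1.0) -> torch.Tensor:
    # Residual between the observed reward and its reconstruction from the
    # policy/reference log-ratio and the prompt-level value estimate.
    residual = reward - tau * (policy_logp - ref_logp) - value
    # Quadratic (squared-error) loss averaged over the offline batch.
    return 0.5 * residual.pow(2).mean()

# Dummy usage; in training, policy_logp and value carry gradients through the
# policy network and the value head, while ref_logp and reward are fixed data.
b = 4
policy_logp = torch.randn(b, requires_grad=True)
value = torch.randn(b, requires_grad=True)
loss = dro_loss(policy_logp, torch.randn(b), value, torch.randn(b), tau=1.0)
loss.backward()
```

In this reading, the policy log-probabilities come from the model being trained, the reference log-probabilities from a frozen SFT checkpoint, and τ controls how strongly the policy is kept close to the reference.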
The theoretical underpinnings are developed carefully, and the framework is supported by an existence-and-uniqueness theorem guaranteeing that the quadratic loss is minimized by exactly one policy and value function pair, namely the optimal one.
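For context, this result builds on the standard closed form of KL-regularized policy optimization; the sketch below uses τ for the regularization strength, and its notation is illustrative, so details may differ from the paper.

```latex
% Standard KL-regularized objective and its closed-form optimum (notation illustrative).
\begin{align*}
&\max_{\pi}\; \mathbb{E}_{x,\, y \sim \pi(\cdot \mid x)}\big[r(x,y)\big]
  \;-\; \tau\, \mathbb{E}_{x}\Big[\mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)\Big] \\
&\quad\Longrightarrow\quad
\pi^{*}(y \mid x) = \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\Big(\tfrac{r(x,y) - V^{*}(x)}{\tau}\Big),
\qquad
V^{*}(x) = \tau \log \mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}
  \big[e^{r(x,y)/\tau}\big].
\end{align*}
% Rearranging gives r(x,y) = tau * log(pi^*/pi_ref) + V^*(x); the quadratic
% loss penalizes squared violations of this identity over the offline data:
\begin{equation*}
\mathcal{L}(\theta, \phi) \;=\; \tfrac{1}{2}\,
  \mathbb{E}_{(x,\,y,\,r) \sim \mathcal{D}}
  \Big[\Big(r \;-\; \tau \log \tfrac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
        \;-\; V_\phi(x)\Big)^{2}\Big].
\end{equation*}
```

Because the rearranged identity holds only for the optimal pair, driving the squared residual to zero over sufficiently rich data recovers that pair uniquely, which is the intuition behind the theorem summarized above.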
Empirical Results and Comparisons
The empirical validation employs T5 encoder-decoder models on the UltraFeedback dataset. Key findings include:
- Performance against Baselines: DRO significantly outperforms Kahneman-Tversky Optimization (KTO) in side-by-side comparisons, achieving higher win rates and qualitatively better responses.
- Hyperparameter Robustness: DRO's performance remained stable across different learning-rate settings, indicating robustness to hyperparameter selection.
Experimental Insights
Several key design choices were empirically validated:
- Parameter Sharing: ablations showed that using separate networks for the policy and the value function, together with multiple value outputs per batch, led to better performance.
- Regularization Strength: the KL-regularization strength played a critical role, with a well-tuned value providing the most balanced trade-off between reward optimization and closeness to the reference policy (illustrated by the toy sketch below).
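As a purely illustrative aside (not an experiment from the paper), the closed form π*(y|x) ∝ π_ref(y|x) exp(r(x,y)/τ) from the theory section makes the role of the regularization strength easy to see on a toy discrete example; all probabilities and rewards below are made up.

```python
import numpy as np

pi_ref = np.array([0.5, 0.3, 0.2])   # reference policy over three completions
reward = np.array([0.0, 1.0, 2.0])   # made-up scalar rewards for each one

for tau in [10.0, 1.0, 0.1]:
    tilted = pi_ref * np.exp(reward / tau)   # pi_ref tilted by exp(r / tau)
    pi_star = tilted / tilted.sum()          # normalize to get pi*
    print(f"tau={tau:>5}: pi* = {np.round(pi_star, 3)}")

# Large tau keeps pi* close to pi_ref; small tau concentrates mass on the
# highest-reward completion, risking over-optimization against the reward.
```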
Broader Implications and Future Research
DRO's implications extend beyond a practical enhancement of RLHF. By leveraging user feedback at scale, the approach could broaden access to alignment and reduce dependence on expensive human raters. That scalability could in turn accelerate LLM training and deployment and support more robust, user-aligned models.
Theoretical and Practical Considerations
Theoretically, DRO enriches the RLHF landscape with a principled method that avoids the pitfalls of pairwise preference modeling. Practically, it simplifies the training pipeline by removing the need for online data generation and for a separately trained reward model.
Conclusion
DRO marks a significant advancement in the alignment of LLMs by transitioning to scalable, single-trajectory datasets and providing a robust framework for leveraging user feedback. Future work should expand this approach's empirical validation to larger models and diverse tasks to further confirm its utility and scalability.
By addressing these limitations of existing methods, DRO is well positioned to make the alignment of LLMs more effective and efficient in real-world applications, contributing to the broader goal of aligning artificial agents with human preferences.