
Offline Regularised Reinforcement Learning for Large Language Models Alignment (2405.19107v1)

Published 29 May 2024 in cs.LG and cs.AI

Abstract: The dominant framework for the alignment of large language models (LLMs), whether through reinforcement learning from human feedback or direct preference optimisation, is to learn from preference data. This involves building datasets where each element is a quadruplet composed of a prompt, two independent responses (completions of the prompt) and a human preference between the two, yielding a preferred and a dis-preferred response. Such data is typically scarce and expensive to collect. On the other hand, single-trajectory datasets, where each element is a triplet composed of a prompt, a response and a human feedback, are naturally more abundant. The canonical element of such datasets is, for instance, an LLM's response to a user's prompt followed by the user's feedback such as a thumbs-up/down. Consequently, in this work, we propose DRO, or Direct Reward Optimisation, as a framework and associated algorithms that do not require pairwise preferences. DRO uses a simple mean-squared objective that can be implemented in various ways. We validate our findings empirically, using T5 encoder-decoder LLMs, and show DRO's performance over selected baselines such as Kahneman-Tversky Optimization (KTO). Thus, we confirm that DRO is a simple and empirically compelling method for single-trajectory policy optimisation.

Direct Reward Optimisation: Enhancing Single-Trajectory RLHF in LLMs

The paper presents Direct Reward Optimisation (DRO), a framework for aligning LLMs within the Reinforcement Learning from Human Feedback (RLHF) paradigm. Unlike established techniques that rely heavily on costly pairwise human preference data, DRO learns from single-trajectory datasets (triplets of prompt, completion, and scalar reward), which reflect the more abundant, naturally occurring form of user feedback. This shift addresses the scarcity of pairwise data and offers a cost-effective route to scaling RLHF.
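To fix ideas, a single-trajectory data point of the kind described above might look like the following; the field names and values are purely illustrative and are not a schema taken from the paper:

```python
# Illustrative single-trajectory example: a prompt, one completion, and a scalar feedback signal.
# Field names are hypothetical and chosen for readability; they are not defined by the paper.
example = {
    "prompt": "Summarise the plot of 'The Tempest' in two sentences.",
    "completion": "Prospero, a sorcerer exiled on an island, conjures a storm to strand "
                  "the nobles who wronged him, then forgives them and gives up his magic.",
    "reward": 1.0,  # e.g. a thumbs-up mapped to 1.0, a thumbs-down to 0.0 or -1.0
}
```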

Context and Motivation

The paper begins by reviewing the prevailing approach to alignment via RLHF, which typically models human preferences with the Bradley-Terry model and optimizes policies on pairwise comparison data. These methods face significant challenges: collecting pairwise preference data is expensive and hard to scale, particularly as LLMs improve in quality and the distinctions between candidate responses become more subtle and nuanced.
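For context, the Bradley-Terry model mentioned here scores pairwise preferences through an underlying pointwise reward; the following is the standard form of that model (stated as background, not as a formula reproduced from this paper):

```latex
% Standard Bradley-Terry preference model (background; \sigma is the logistic function)
\[
P\big(y_1 \succ y_2 \mid x\big) \;=\; \sigma\big(r(x, y_1) - r(x, y_2)\big),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
\]
```

Pairwise alignment methods fit (or implicitly parameterize) the reward r from comparisons of this kind, which is precisely the data DRO seeks to avoid requiring.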

DRO: Framework and Implementation

DRO is introduced as a shift from preference-driven RLHF to a single-trajectory paradigm. The authors propose a simple yet theoretically sound mean-squared objective that circumvents the need for pairwise preference data and for a separately trained reward model. Specifically, DRO combines three ingredients (a formal sketch of the objective follows the list):

  1. Mean-Squared Objective: The KL-regularized policy optimization problem is recast as a simple quadratic loss.
  2. Value Function Learning: A value function is learned alongside the policy, which underpins stable policy optimization.
  3. Offline Data Utilization: DRO trains directly on static, offline datasets, a pivotal feature that reduces computational requirements and improves practicality.
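To make these ingredients concrete, the following is a sketch of the kind of single-trajectory, KL-regularized mean-squared loss the summary describes. The notation (policy π, frozen reference policy π_ref, value function V, scalar reward r, regularization strength τ, offline dataset D) is assumed here for illustration; the precise formulation should be taken from the paper itself.

```latex
% Sketch of a KL-regularized, mean-squared, single-trajectory objective
% (notation assumed for illustration; see the paper for the exact formulation)
\[
\mathcal{L}(\pi, V) \;=\; \tfrac{1}{2}\,
\mathbb{E}_{(x,\, y,\, r) \sim \mathcal{D}}
\left[ \Big( r(x, y) - V(x) - \tau \log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \Big)^{2} \right]
\]
```

Under an objective of this form, the policy and the value function are minimized jointly, which is what allows the existence-and-uniqueness result discussed next to pin down a single optimal pair.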

The theoretical underpinnings are carefully developed: an existence and uniqueness theorem guarantees that the objective admits a single optimal policy and value function pair, so minimizing the loss recovers the KL-regularized optimal policy together with its value function.

Empirical Results and Comparisons

The empirical validation employs T5 encoder-decoder models on the UltraFeedback dataset. Key findings include:

  • Performance against Baselines: DRO significantly outperforms Kahneman-Tversky Optimization (KTO), achieving higher win rates in side-by-side comparisons of response quality.
  • Model Configurations: DRO's performance was stable across different learning rate configurations, reaffirming its robustness to hyperparameter selection.

Experimental Insights

Several key design choices were empirically validated:

  • Parameter Sharing: Using separate networks for the policy and the value function, together with multiple value outputs per batch, led to superior performance.
  • Regularization Strength: The regularization parameter τ played a critical role, with τ = 1.0 providing the most balanced results; the sketch below shows where τ enters such a loss.
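As a purely illustrative sketch of how a loss of this shape might be implemented with separate policy and value networks and an explicit τ, here is a minimal PyTorch-style training step. All names and shapes are hypothetical; the paper's own experiments used T5X, and nothing below is taken from its codebase.

```python
import torch

def dro_style_loss(policy_logp, ref_logp, value, reward, tau=1.0):
    """Mean-squared, KL-regularized single-trajectory loss (illustrative sketch).

    policy_logp: log pi(y|x) under the trained policy, shape [batch]
    ref_logp:    log pi_ref(y|x) under the frozen reference policy, shape [batch]
    value:       V(x) from a separate value network, shape [batch]
    reward:      scalar feedback r(x, y) from the offline dataset, shape [batch]
    tau:         regularization strength (the ablation above favoured tau = 1.0)
    """
    residual = reward - value - tau * (policy_logp - ref_logp)
    return 0.5 * (residual ** 2).mean()

# Usage sketch: random tensors stand in for model outputs on one offline batch.
policy_logp = torch.randn(8, requires_grad=True)  # would come from the policy network
ref_logp = torch.randn(8)                         # frozen reference policy, no gradient
value = torch.randn(8, requires_grad=True)        # would come from the separate value network
reward = torch.rand(8)                            # offline scalar feedback, e.g. thumbs-up/down

loss = dro_style_loss(policy_logp, ref_logp, value, reward, tau=1.0)
loss.backward()  # gradients flow to both the policy and the value parameters
```

Keeping the policy and value networks separate, as the ablation suggests, means the two terms in the residual are parameterized independently and optimized jointly through this single quadratic loss.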

Broader Implications and Future Research

DRO's implications extend beyond a practical enhancement of RLHF. By leveraging user feedback at scale, the approach could reduce dependence on expensive human raters and make the alignment process accessible to more practitioners. This scalability can support further advances in LLM training and deployment, yielding more robust, user-aligned models.

Theoretical and Practical Considerations

Theoretically, DRO enriches the RLHF landscape with a principled method that avoids the pitfalls associated with pairwise preference models and learned reward models. Practically, it simplifies the training pipeline by removing the need for online generation of new responses during training and for a separate reward-modeling stage.

Conclusion

DRO marks a significant advancement in the alignment of LLMs by transitioning to scalable, single-trajectory datasets and providing a robust framework for leveraging user feedback. Future work should expand this approach's empirical validation to larger models and diverse tasks to further confirm its utility and scalability.

By addressing the limitations of existing methods, DRO is positioned to enable more effective and efficient alignment of LLMs in real-world applications, contributing to the broader goal of aligning artificial agents with human preferences.

References (85)
  1. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740, 2024.
  2. Concrete problems in AI safety. arXiv, 2016.
  3. PaLM 2 technical report, 2023.
  4. A general theoretical paradigm to understand learning from human preferences. arXiv, 2023.
  5. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv, 2022a.
  6. Constitutional AI: Harmlessness from AI feedback. arXiv, 2022b.
  7. Comparing bad apples to good oranges: Aligning large language models via joint preference optimization. arXiv preprint arXiv:2404.00530, 2024.
  8. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2011.
  9. Measuring progress on scalable oversight for large language models. arXiv preprint arXiv:2211.03540, 2022.
  10. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
  11. Human alignment of large language models through online preference optimisation. arXiv preprint arXiv:2403.08635, 2024.
  12. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217, 2023.
  13. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
  14. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, 2017.
  15. Scaling instruction-finetuned language models, 2022.
  16. Reward model ensembles help mitigate overoptimization. arXiv, 2023.
  17. Ultrafeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377, 2023. URL https://github.com/OpenBMB/UltraFeedback. MIT license.
  18. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2019.
  19. RAFT: Reward rAnked FineTuning for generative foundation model alignment. arXiv, 2023.
  20. Helping or herding? Reward model ensembles mitigate but do not eliminate reward hacking. arXiv, 2023.
  21. KTO: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024.
  22. Scaling laws for reward model overoptimization. In Proceedings of the International Conference on Machine Learning, 2022.
  23. REBEL: Reinforcement learning via regressing relative rewards. arXiv preprint arXiv:2404.16767, 2024.
  24. Improving alignment of dialogue agents via targeted human judgements. arXiv, 2022.
  25. Policy shaping: Integrating human feedback with reinforcement learning. In Advances in Neural Information Processing Systems, 2013.
  26. Bootstrap your own latent: A new approach to self-supervised learning. In Advances in Neural Information Processing Systems, 2020.
  27. Direct language model alignment from online ai feedback. arXiv preprint arXiv:2402.04792, 2024.
  28. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.
  29. Beware of botshit: How to manage the epistemic risks of generative chatbots. Business Horizons, 2024. ISSN 0007-6813. doi: https://doi.org/10.1016/j.bushor.2024.03.001. URL https://www.sciencedirect.com/science/article/pii/S0007681324000272.
  30. Camels in a changing climate: Enhancing LM adaptation with Tulu 2. arXiv, 2023.
  31. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv, 2019.
  32. Towards efficient and exact optimization of language model alignment. arXiv preprint arXiv:2402.00856, 2024.
  33. TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. In Proceedings of the Annual International Symposium on Computer Architecture, 2023.
  34. D. Kahneman and A. Tversky. Prospect theory: An analysis of decision under risk. Econometrica, 47(2):263–292, 1979.
  35. Understanding the effects of rlhf on llm generalisation and diversity. arXiv preprint arXiv:2310.06452, 2023.
  36. TAMER: Training an agent manually via evaluative reinforcement. In Proceedings of the IEEE International Conference on Development and Learning, 2008.
  37. Actor-critic algorithms. Advances in neural information processing systems, 12, 1999.
  38. RLAIF: Scaling reinforcement learning from human feedback with AI feedback. arXiv, 2023.
  39. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  40. Enhancing llm safety via constrained direct preference optimization, 2024.
  41. Playing atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.
  42. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2016.
  43. Nash learning from human feedback. arXiv, 2023.
  44. WebGPT: Browser-assisted question-answering with human feedback. arXiv, 2021.
  45. OpenAI. Introducing ChatGPT, 2022. URL https://openai.com/blog/chatgpt.
  46. Training language models to follow instructions with human feedback. arXiv, 2022.
  47. The effects of reward misspecification: Mapping and mitigating misaligned models. arXiv, abs/2201.03544, 2022.
  48. Reward gaming in conditional text generation. In Annual Meeting of the Association for Computational Linguistics, 2022.
  49. Disentangling length from quality in direct preference optimization, 2024.
  50. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
  51. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems, 2023.
  52. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
  53. WARM: On the benefits of weight averaged reward models. arXiv, 2024.
  54. A short variational proof of equivalence between policy gradients and soft q learning. arXiv preprint arXiv:1712.08650, 2017.
  55. Scaling up models and data with t5x and seqio. arXiv, 2022. URL https://github.com/google-research/t5x. Apache-2.0 license.
  56. Direct nash optimization: Teaching language models to self-improve with general preferences, 2024.
  57. Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802, 2022.
  58. Proximal policy optimization algorithms. arXiv, 2017.
  59. Equivalence between policy gradients and soft q-learning. arXiv preprint arXiv:1704.06440, 2018.
  60. Adafactor: Adaptive learning rates with sublinear memory cost. arXiv, 2018.
  61. Benchmarks and algorithms for offline preference-based reward learning. arXiv, 2023.
  62. A long way to go: Investigating length correlations in rlhf. arXiv, abs/2310.03716, 2023.
  63. Defining and characterizing reward gaming. In Neural Information Processing Systems, 2022.
  64. Aligning large multimodal models with factually augmented rlhf, 2023.
  65. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063. MIT Press, 2000.
  66. A minimaximalist approach to reinforcement learning from human feedback. arXiv, 2024.
  67. Generalized preference optimization: A unified approach to offline alignment. arXiv preprint arXiv:2402.05749, 2024.
  68. Llama 2: Open foundation and fine-tuned chat models. arXiv, 2023.
  69. Zephyr: Direct distillation of LM alignment. arXiv, 2023.
  70. Beyond reverse KL: Generalizing direct preference optimization with diverse divergence constraints. arXiv, 2023.
  71. Deep TAMER: Interactive agent shaping in high-dimensional state spaces. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
  72. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In Proceedings of the International Conference on Machine Learning, 2022.
  73. Fine-grained human feedback gives better rewards for language model training, 2023.
  74. Is dpo superior to ppo for llm alignment? A comprehensive study. arXiv preprint arXiv:2404.10719, 2024.
  75. Self-rewarding language models, 2024a.
  76. Uni-rlhf: Universal platform and benchmark suite for reinforcement learning with diverse human feedback, 2024b.
  77. RRHF: Rank responses to align language models with human feedback without tears. arXiv, abs/2304.05302, 2023.
  78. Token-level direct preference optimization, 2024.
  79. Improving reinforcement learning from human feedback with efficient reward model ensemble, 2024.
  80. SLiC-HF: Sequence likelihood calibration with human feedback. arXiv, 2023a.
  81. Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization. arXiv preprint arXiv:2311.16839, 2023b.
  82. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
  83. Consequences of misaligned AI. In Advances in Neural Information Processing Systems, 2020.
  84. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
  85. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2020.
Authors (18)
  1. Pierre Harvey Richemond (5 papers)
  2. Yunhao Tang (63 papers)
  3. Daniel Guo (7 papers)
  4. Daniele Calandriello (34 papers)
  5. Mohammad Gheshlaghi Azar (31 papers)
  6. Rafael Rafailov (37 papers)
  7. Bernardo Avila Pires (21 papers)
  8. Eugene Tarassov (7 papers)
  9. Lucas Spangher (13 papers)
  10. Will Ellsworth (1 paper)
  11. Aliaksei Severyn (29 papers)
  12. Jonathan Mallinson (13 papers)
  13. Lior Shani (16 papers)
  14. Gil Shamir (4 papers)
  15. Rishabh Joshi (23 papers)
  16. Tianqi Liu (49 papers)
  17. Bilal Piot (40 papers)
  18. Remi Munos (45 papers)