Dataset Reset Policy Optimization for RLHF (2404.08495v3)

Published 12 Apr 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Reinforcement Learning (RL) from Human Preference-based feedback is a popular paradigm for fine-tuning generative models, which has produced impressive models such as GPT-4 and Claude 3 Opus. This framework often consists of two steps: learning a reward model from an offline preference dataset followed by running online RL to optimize the learned reward model. In this work, leveraging the idea of reset, we propose a new RLHF algorithm with provable guarantees. Motivated by the fact that the offline preference dataset provides informative states (i.e., data that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset reset: it directly resets the policy optimizer to the states in the offline dataset, instead of always starting from the initial state distribution. In theory, we show that DR-PO learns to perform at least as well as any policy that is covered by the offline dataset under general function approximation with finite sample complexity. In experiments, we demonstrate that on both the TL;DR summarization and the Anthropic Helpful Harmful (HH) dataset, the generation from DR-PO is better than that from Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), under the metric of GPT-4 win-rate. Code for this work can be found at https://github.com/Cornell-RL/drpo.

Reinforcement Learning from Human Feedback with Dataset Reset Policy Optimization

Introduction

Reinforcement Learning from Human Feedback (RLHF) has emerged as a potent strategy for training generative models in scenarios where crafting an explicit reward function proves challenging. Utilizing human-labeled preference data, researchers have successfully trained large-scale models across diverse domains. Despite its successes, conventional RLHF protocols separate the processes of reward model learning and policy optimization, potentially overlooking the wealth of information embedded in the offline preference dataset during online policy training. This paper introduces an innovative RLHF algorithm, Dataset Reset Policy Optimization (DR-PO), leveraging dataset resets to enhance online learning significantly.

Dataset Reset Policy Optimization (DR-PO)

DR-PO capitalizes on the ability to reset to informative states contained in the offline preference dataset, enabling more efficient policy optimization. By resetting the learning agent directly to states from this dataset instead of always starting from the initial state distribution, DR-PO improves exploration efficiency. The mechanism is especially natural for text generation with LLMs, where a reset simply means starting generation from a partial sequence drawn from a labeler-preferred completion. Theoretical analysis shows that DR-PO matches or surpasses the performance of any policy covered by the offline data, under general function approximation and with finite sample complexity.
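As a rough illustration, the sketch below shows how a rollout's starting state can be drawn either from the prompt alone (the standard start-state distribution) or from the prompt plus a random prefix of a labeler-preferred completion (a dataset reset). The `policy.generate` and `reward_model.score` calls, the mixing probability, and the function names are illustrative assumptions, not the authors' implementation.

```python
import random

def sample_rollout_start(prompt_tokens, preferred_completion_tokens, reset_prob=0.5):
    """Return a starting state for the next rollout: either the bare prompt
    (standard start-state distribution) or the prompt plus a random prefix of
    a labeler-preferred completion (a dataset reset)."""
    if random.random() < reset_prob:
        cut = random.randint(0, len(preferred_completion_tokens))
        return list(prompt_tokens) + list(preferred_completion_tokens[:cut])
    return list(prompt_tokens)

def collect_episode(policy, reward_model, prompt_tokens, preferred_tokens):
    """Roll the current policy out from the (possibly reset) starting state and
    score the completed sequence with the learned reward model."""
    start = sample_rollout_start(prompt_tokens, preferred_tokens)
    completion = policy.generate(start)                 # hypothetical generation API
    return completion, reward_model.score(completion)   # hypothetical scoring API
```

How often to use reset starts versus ordinary prompt-only starts is a design choice; the 50/50 split above is arbitrary.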

Theoretical Guarantees

DR-PO is as simple to implement as standard policy optimization methods, yet it sets a strong theoretical benchmark for RLHF. Under general function approximation, it learns, with finite sample complexity, a policy at least as effective as any policy covered by the offline preference dataset. The analysis also applies in computationally tractable settings, requiring only standard learning oracles such as Maximum Likelihood Estimation (MLE) for reward model fitting. In this sense, DR-PO represents a significant theoretical advance in the RLHF domain.
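For concreteness, the MLE oracle for reward fitting is typically the Bradley-Terry pairwise-preference likelihood. The sketch below writes that objective in PyTorch; `reward_model` stands for any network mapping a batch of sequences to scalar scores, and the code is a generic illustration of the MLE step rather than the paper's implementation.

```python
import torch.nn.functional as F

def bradley_terry_mle_loss(reward_model, preferred_batch, rejected_batch):
    """Negative log-likelihood under the Bradley-Terry model, where the labeler
    prefers y+ over y- with probability sigmoid(r(y+) - r(y-))."""
    r_pos = reward_model(preferred_batch)   # scalar reward per preferred sequence
    r_neg = reward_model(rejected_batch)    # scalar reward per rejected sequence
    return -F.logsigmoid(r_pos - r_neg).mean()
```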

Empirical Demonstrations

The paper evaluates DR-PO on two standard RLHF benchmarks, TL;DR summarization and the Anthropic Helpful Harmful (HH) dataset, comparing against Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO). Under the GPT-4 win-rate metric, DR-PO's generations outperform both baselines on these benchmarks. Moreover, when the trained policies are transferred zero-shot to the CNN/DailyMail dataset, DR-PO maintains its advantage, indicating that the gains generalize beyond the training data. These results pair the method's theoretical soundness with practical efficacy on real RLHF tasks.
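Win-rate evaluation of this kind is a pairwise comparison aggregated over test prompts. The sketch below assumes a `judge(prompt, summary_a, summary_b)` callable wrapping a GPT-4 query that returns "A" or "B"; that callable, its return convention, and the tie handling are illustrative assumptions, not the paper's evaluation code.

```python
def pairwise_win_rate(prompts, candidate_summaries, baseline_summaries, judge):
    """Fraction of prompts on which the judge prefers the candidate summary.
    (In practice, querying again with the order swapped reduces position bias.)"""
    wins = decided = 0
    for prompt, cand, base in zip(prompts, candidate_summaries, baseline_summaries):
        verdict = judge(prompt, cand, base)  # expected: "A" (candidate) or "B" (baseline)
        if verdict == "A":
            wins += 1
        if verdict in ("A", "B"):
            decided += 1
    return wins / decided if decided else 0.0
```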

Conclusion and Future Directions

Dataset Reset Policy Optimization introduces a pivotal advancement in the domain of RLHF, substantiated by both theoretical guarantees and strong empirical performance. The capability to leverage dataset resets in policy optimization presents a novel pathway toward more efficient and effective learning from human feedback. As the paper conjectures, the principles underpinning DR-PO may extend beyond the settings explored, suggesting a broad horizon for future investigations. The integration of dataset resets offers a promising avenue to enhance online RL algorithms further, warranting comprehensive exploration across diverse RLHF applications.

Authors (7)
  1. Jonathan D. Chang
  2. Owen Oertell
  3. Kianté Brantley
  4. Dipendra Misra
  5. Jason D. Lee
  6. Wen Sun
  7. Wenhao Zhan