
Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint (2312.11456v4)

Published 18 Dec 2023 in cs.LG, cs.AI, and stat.ML

Abstract: This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF). We first identify the primary challenge of existing popular methods like offline PPO and offline DPO as a lack of strategic exploration of the environment. Then, to understand the mathematical principle of RLHF, we consider a standard mathematical formulation, the reverse-KL regularized contextual bandit for RLHF. Despite its widespread practical application, a rigorous theoretical analysis of this formulation remains open. We investigate its behavior in three distinct settings -- offline, online, and hybrid -- and propose efficient algorithms with finite-sample theoretical guarantees. Moving towards practical applications, our framework, with a robust approximation of the information-theoretical policy improvement oracle, naturally gives rise to several novel RLHF algorithms. These include an iterative version of the Direct Preference Optimization (DPO) algorithm for online settings and a multi-step rejection sampling strategy for offline scenarios. Our empirical evaluations on real-world LLM alignment experiments demonstrate that the proposed methods significantly surpass existing strong baselines, such as DPO and Rejection Sampling Optimization (RSO), showcasing the connection between solid theoretical foundations and their potent practical implementations.

References (86)
  1. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011.
  2. Reinforcement learning: Theory and algorithms. CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep, 32, 2019.
  3. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. The Journal of Machine Learning Research, 22(1):4431–4506, 2021.
  4. VOQL: Towards optimal regret in model-free rl with nonlinear function approximation. In The Thirty Sixth Annual Conference on Learning Theory, pp. 987–1063. PMLR, 2023.
  5. Anthropic. Introducing claude. 2023. URL https://www.anthropic.com/index/introducing-claude.
  6. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021.
  7. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47:235–256, 2002.
  8. A general theoretical paradigm to understand learning from human preferences. arXiv preprint arXiv:2310.12036, 2023.
  9. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
  10. Peering through preferences: Unraveling feedback acquisition for aligning large language models. arXiv preprint arXiv:2308.15812, 2023.
  11. Preference-based online learning with dueling bandits: A survey. The Journal of Machine Learning Research, 22(1):278–385, 2021.
  12. Adversarial model for offline reinforcement learning. arXiv preprint arXiv:2302.11048, 2023.
  13. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
  14. Provably efficient exploration in policy optimization. In International Conference on Machine Learning, pp.  1283–1294. PMLR, 2020.
  15. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217, 2023.
  16. Human-in-the-loop: Provably efficient preference-based reinforcement learning with general function approximation. In International Conference on Machine Learning, pp.  3773–3793. PMLR, 2022.
  17. On the weaknesses of reinforcement learning for neural machine translation. arXiv preprint arXiv:1907.01752, 2019.
  18. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
  19. Reward model ensembles help mitigate overoptimization. arXiv preprint arXiv:2310.02743, 2023.
  20. Stochastic linear optimization under bandit feedback. 2008.
  21. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  22. Lmflow: An extensible toolkit for finetuning and inference of large foundation models. arXiv preprint arXiv:2306.12420, 2023.
  23. RAFT: Reward ranked finetuning for generative foundation model alignment. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=m7p5O7zblY.
  24. Implicit generation and modeling with energy based models. Advances in Neural Information Processing Systems, 32, 2019.
  25. Alpacafarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387, 2023.
  26. Implementation matters in deep policy gradients: A case study on ppo and trpo. arXiv preprint arXiv:2005.12729, 2020.
  27. Improved optimistic algorithms for logistic bandits. In International Conference on Machine Learning, pp.  3052–3060. PMLR, 2020.
  28. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pp.  10835–10866. PMLR, 2023.
  29. Fast rates in pool-based batch active learning. arXiv preprint arXiv:2202.05448, 2022.
  30. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  31. Google. Bard. 2023. URL https://bard.google.com/.
  32. Your classifier is secretly an energy based model and you should treat it like one. arXiv preprint arXiv:1912.03263, 2019.
  33. Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998, 2023.
  34. Optimizing prompts for text-to-image generation. arXiv preprint arXiv:2212.09611, 2022.
  35. Nearly minimax optimal reinforcement learning with linear function approximation. In International Conference on Machine Learning, pp.  8971–9019. PMLR, 2022.
  36. Towards general function approximation in zero-sum markov games. arXiv preprint arXiv:2107.14702, 2021.
  37. The power of exploiter: Provable multi-agent rl in large state spaces. arXiv preprint arXiv:2106.03352, 2021a.
  38. Is pessimism provably efficient for offline rl? In International Conference on Machine Learning, pp.  5084–5096. PMLR, 2021b.
  39. A framework of composite functional gradient methods for generative adversarial models. IEEE transactions on pattern analysis and machine intelligence, 43(1):17–32, 2019.
  40. Provably feedback-efficient reinforcement learning via active reward learning. Advances in Neural Information Processing Systems, 35:11063–11078, 2022.
  41. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.
  42. Reinforcement learning with human feedback: Learning dynamic choices via pessimism. arXiv preprint arXiv:2305.18438, 2023.
  43. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023.
  44. Statistical rejection sampling improves preference optimization. arXiv preprint arXiv:2309.06657, 2023a.
  45. Maximize to explore: One objective function fusing estimation, planning, and exploration. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b.
  46. Understanding learned reward functions. arXiv preprint arXiv:2012.05862, 2020.
  47. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
  48. Dueling posterior sampling for preference-based reinforcement learning. In Conference on Uncertainty in Artificial Intelligence, pp.  1029–1038. PMLR, 2020.
  49. OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023.
  50. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  51. Dueling rl: reinforcement learning with trajectory preferences. arXiv preprint arXiv:2111.04850, 2021.
  52. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
  53. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. Advances in Neural Information Processing Systems, 34:11702–11716, 2021.
  54. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
  55. Aadirupa Saha. Optimal algorithms for stochastic contextual preference bandits. Advances in Neural Information Processing Systems, 34:30050–30062, 2021.
  56. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  57. Hybrid rl: Using both offline and online data can make rl efficient. arXiv preprint arXiv:2210.06718, 2022.
  58. Reward collapse in aligning large language models. arXiv preprint arXiv:2305.17608, 2023.
  59. Causal confusion and reward misidentification in preference-based reward learning. arXiv preprint arXiv:2204.06601, 2022.
  60. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  61. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020.
  62. Beyond reverse kl: Generalizing direct preference optimization with diverse divergence constraints. arXiv preprint arXiv:2309.16240, 2023a.
  63. Is rlhf more difficult than standard rl? arXiv preprint arXiv:2306.14111, 2023b.
  64. Enable language models to implicitly learn self-improvement from data. arXiv preprint arXiv:2310.00898, 2023c.
  65. A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research, 18(136):1–46, 2017.
  66. Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862, 2021.
  67. Making rl with preference-based feedback efficient via randomization. arXiv preprint arXiv:2310.14554, 2023.
  68. Better aligning text-to-image models with human preference. arXiv preprint arXiv:2303.14420, 2023.
  69. Bellman-consistent pessimism for offline reinforcement learning. Advances in neural information processing systems, 34:6683–6694, 2021.
  70. A self-play posterior sampling algorithm for zero-sum markov games. In International Conference on Machine Learning, pp.  24496–24523. PMLR, 2022.
  71. Contrastive post-training large language models on data curriculum. arXiv preprint arXiv:2310.02263, 2023.
  72. Preference-based reinforcement learning with finite-time guarantees. Advances in Neural Information Processing Systems, 33:18784–18794, 2020.
  73. Corruption-robust algorithms with uncertainty weighting for nonlinear contextual bandits and markov decision processes. In International Conference on Machine Learning, pp.  39834–39863. PMLR, 2023.
  74. Rrhf: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:2304.05302, 2023.
  75. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012.
  76. Cautiously optimistic policy optimization and exploration with linear function approximation. In Conference on Learning Theory, pp.  4473–4525. PMLR, 2021a.
  77. Provable benefits of actor-critic methods for offline reinforcement learning. Advances in neural information processing systems, 34:13626–13640, 2021b.
  78. Provable offline reinforcement learning with human feedback. arXiv preprint arXiv:2305.14816, 2023a.
  79. How to query human feedback efficiently in rl? arXiv preprint arXiv:2305.18505, 2023b.
  80. Tong Zhang. Feel-good thompson sampling for contextual bandits and reinforcement learning. SIAM Journal on Mathematics of Data Science, 4(2):834–857, 2022.
  81. Tong Zhang. Mathematical Analysis of Machine Learning Algorithms. Cambridge University Press, 2023. doi: 10.1017/9781009093057.
  82. Slic-hf: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425, 2023.
  83. A theoretical analysis of optimistic proximal policy optimization in linear markov decision processes. arXiv preprint arXiv:2305.08841, 2023.
  84. Principled reinforcement learning with human feedback from pairwise or K-wise comparisons. arXiv preprint arXiv:2301.11270, 2023a.
  85. Fine-tuning language models with advantage-induced policy alignment. arXiv preprint arXiv:2306.02231, 2023b.
  86. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
Citations (99)

Summary

  • The paper introduces iterative algorithms for RLHF under KL constraints, providing finite-sample theoretical guarantees.
  • It formulates RLHF as a reverse-KL regularized contextual bandit problem applicable in offline, online, and hybrid settings.
  • Empirical results demonstrate that the proposed approach outperforms baselines like DPO and RSO in large language model alignment tasks.

An Expert Review of "Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint"

The paper "Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint" presents a theoretical framework for aligning generative models with human preferences through Reinforcement Learning from Human Feedback (RLHF). The work addresses the reverse-KL regularized contextual bandit problem, a formulation that is widely used in practice yet theoretically underexplored, analyzing it in offline, online, and hybrid settings. It also introduces algorithms with finite-sample theoretical guarantees and demonstrates their empirical advantage over standard baselines on LLM alignment tasks.

Theoretical Framework and Settings

The paper formalizes RLHF as a reverse-KL regularized contextual bandit problem: the objective takes an expectation of the reward over prompts (states) drawn from a prompt distribution and responses (actions) drawn from the policy, minus a Kullback-Leibler regularization term that keeps the policy from deviating excessively from the pre-trained starting checkpoint.
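For concreteness, here is a minimal LaTeX sketch of this objective in generic notation (prompt distribution $d_0$, ground-truth reward $r^*$, initial policy $\pi_0$, KL coefficient $\eta$); the paper's own symbols may differ:

\[
J(\pi) \;=\; \mathbb{E}_{x \sim d_0}\, \mathbb{E}_{a \sim \pi(\cdot \mid x)}\big[ r^*(x, a) \big]
\;-\; \eta\, \mathbb{E}_{x \sim d_0}\big[ \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_0(\cdot \mid x) \big) \big].
\]

This regularized objective has the well-known closed-form maximizer $\pi^*(a \mid x) \propto \pi_0(a \mid x)\,\exp\big(r^*(x, a)/\eta\big)$, a Gibbs policy that underlies the information-theoretical policy improvement oracle approximated by the practical algorithms.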

Three distinct settings are considered:

  • Offline Learning focuses on deriving a policy from a pre-collected preference dataset, without any new interactions with the environment.
  • Online Learning exploits ongoing interactions to refine policies continuously.
  • Hybrid Learning combines offline initialization with online data collection to optimize policy updates iteratively.

Empirical and Theoretical Contributions

The paper introduces algorithms for all three settings, embedding the core principles of pessimism and optimism to address the spurious correlations inherent in preference-based learning. In the offline setting, a pessimistic reward estimate guards against over-optimizing reward values that the data only weakly support, while in the online setting a dual-agent strategy drives broad exploration. This dual-agent framework separates the exploitation and exploration roles between two iteratively improving policies, improving learning efficiency and the robustness of the learned policy.
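As a schematic illustration rather than the paper's exact construction, the offline pessimism principle can be written as planning against an uncertainty-penalized reward estimate:

\[
\hat{r}_{\mathrm{pes}}(x, a) \;=\; \hat{r}_{\mathrm{MLE}}(x, a) \;-\; \beta\, \Gamma_{\mathcal{D}}(x, a),
\]

where $\hat{r}_{\mathrm{MLE}}$ is the reward fitted to the offline preference dataset $\mathcal{D}$ (for example, by maximum likelihood under a Bradley-Terry model), $\Gamma_{\mathcal{D}}$ is an uncertainty width that shrinks where $\mathcal{D}$ provides good coverage, and $\beta > 0$ controls the degree of pessimism; all symbols here are illustrative placeholders rather than the paper's notation.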

Importantly, the paper not only provides theoretical convergence guarantees but also ties them to practical applications. The proposed algorithms are empirically validated through a series of real-world LLM alignment experiments, where the resulting models consistently outperform strong baselines such as Direct Preference Optimization (DPO) and Rejection Sampling Optimization (RSO).
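To make the iterative (online) DPO recipe concrete, the following is a minimal Python sketch under stated assumptions; the callables generate_responses, label_preference, and dpo_update are hypothetical placeholders supplied by the caller, not functions from the paper or any specific library.

# Minimal sketch of iterative (online) DPO with a KL anchor to the initial policy pi_0.
# The three helper callables are hypothetical placeholders supplied by the caller.
def iterative_dpo(pi_0, prompts, generate_responses, label_preference, dpo_update,
                  num_iterations=3, beta=0.1):
    """Alternate between collecting fresh preference pairs with the current policy
    and running a DPO-style update against the fixed reference policy pi_0."""
    policy = pi_0
    preference_data = []
    for _ in range(num_iterations):
        # 1. Exploration: sample two candidate responses per prompt from the current policy.
        for x in prompts:
            y_a, y_b = generate_responses(policy, x)
            # 2. Feedback: query the preference oracle (human labelers or a reward model).
            y_win, y_lose = label_preference(x, y_a, y_b)
            preference_data.append((x, y_win, y_lose))
        # 3. Improvement: run a DPO step on all data collected so far; beta plays the role
        #    of the KL coefficient keeping the policy close to the reference pi_0.
        policy = dpo_update(policy, reference=pi_0, data=preference_data, beta=beta)
    return policy

A caller would supply the supervised fine-tuned checkpoint as pi_0, a prompt set, and concrete implementations of the three callables (for instance, a DPO training step from an off-the-shelf RLHF library).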

Implications and Future Directions

The intersection of theory and empirical validation provides new insights into RLHF, encouraging further exploration of KL-constrained optimization in practical settings. Although methods such as DPO address preference learning without explicit reward modeling, this work suggests that iterative training on progressively enriched feedback data significantly improves performance and mitigates issues such as reward hacking. This advances our understanding of effective policy optimization under the imperfect feedback commonly encountered in real-world AI systems.
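For reference, the standard DPO objective alluded to here (a well-known formulation, not specific to this paper) trains the policy directly on preference triples $(x, y_w, y_l)$, with $y_w$ preferred over $y_l$:

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
\;=\; -\, \mathbb{E}_{(x, y_w, y_l)}\!\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
\;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)
\right].
\]

The iterative variant studied in the paper keeps this loss but refreshes the preference triples with samples from the current policy between optimization rounds, rather than relying on a single fixed offline dataset.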

Future research might explore additional settings or extend the hybrid framework with more complex recursive feedback loops, optimizing the trade-off between computational efficiency and the richness of human-like model behavior. Additionally, refining uncertainty estimation in both offline and online settings could further enhance policy robustness, offering promising prospects for deploying more adaptive and accurate LLMs.

Conclusion

This research contributes a theoretically grounded, empirically validated advance in RLHF methodology, emphasizing iterative, preference-based learning for generative models. The introduced algorithms combine exploration through sampling, respect for the initial distribution via the KL constraint, and informed preference learning, marking a step forward in aligning AI systems with human values and expectations.
