Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint (2312.11456v4)
Abstract: This paper studies the alignment of generative models via Reinforcement Learning from Human Feedback (RLHF). We first identify the primary challenge of popular existing methods such as offline PPO and offline DPO: the lack of strategic exploration of the environment. Then, to understand the mathematical principles of RLHF, we consider a standard formulation, the reverse-KL regularized contextual bandit. Despite its widespread practical application, a rigorous theoretical analysis of this formulation remains open. We investigate its behavior in three distinct settings -- offline, online, and hybrid -- and propose efficient algorithms with finite-sample theoretical guarantees. Moving toward practical applications, our framework, equipped with a robust approximation of the information-theoretic policy improvement oracle, naturally gives rise to several novel RLHF algorithms, including an iterative version of the Direct Preference Optimization (DPO) algorithm for the online setting and a multi-step rejection sampling strategy for the offline setting. Our empirical evaluations on real-world LLM alignment experiments show that the proposed methods significantly outperform strong existing baselines such as DPO and Rejection Sampling Optimization (RSO), demonstrating the connection between solid theoretical foundations and potent practical implementations.
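The central object above is the reverse-KL regularized contextual bandit. For orientation, a standard way to write this objective and its closed-form maximizer (a Gibbs distribution over the reference policy) is sketched below; the notation (η for the KL coefficient, π_0 for the reference policy, d_0 for the prompt distribution) is ours and may not match the paper's exactly.

```latex
% KL-regularized objective and its maximizer (sketch; notation ours)
J(\pi) \;=\; \mathbb{E}_{x \sim d_0}\Big[\, \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r(x,y)\big]
\;-\; \eta \,\mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_0(\cdot \mid x)\big) \Big],
\qquad
\pi_r(y \mid x) \;\propto\; \pi_0(y \mid x)\,\exp\!\big(r(x,y)/\eta\big).
```

A minimal tabular sketch of this policy-improvement oracle, with made-up rewards and reference probabilities, shows how it tilts the reference policy toward high-reward responses; the iterative DPO and multi-step rejection sampling procedures mentioned in the abstract can be read as practical approximations of this update.

```python
import numpy as np

# Toy illustration of the Gibbs policy-improvement oracle
#   pi_r(y|x) ∝ pi_0(y|x) * exp(r(x, y) / eta)
# for a single prompt x with three candidate responses.
# All numbers below are made up for illustration.

eta = 0.5                              # KL penalty coefficient (assumed value)
pi_0 = np.array([0.5, 0.3, 0.2])       # reference policy over the three responses
r = np.array([1.0, 2.0, 0.5])          # rewards assigned to the three responses

unnormalized = pi_0 * np.exp(r / eta)  # tilt the reference policy by the reward
pi_r = unnormalized / unnormalized.sum()
print("Improved policy:", pi_r)        # mass shifts toward the highest-reward response
```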