Reinforcement Learning in the Era of LLMs: What is Essential? What is needed? An RL Perspective on RLHF, Prompting, and Beyond (2310.06147v1)
Abstract: Recent advancements in LLMs have garnered wide attention and led to successful products such as ChatGPT and GPT-4. Their proficiency in adhering to instructions and delivering harmless, helpful, and honest (3H) responses can largely be attributed to the technique of Reinforcement Learning from Human Feedback (RLHF). In this paper, we aim to link research in conventional RL to the RL techniques used in LLM research, demystifying RLHF by discussing why, when, and how RL excels. Furthermore, we explore potential future avenues that could either benefit from or contribute to RLHF research.

Highlighted Takeaways:

1. RLHF is online inverse RL with offline demonstration data.
2. RLHF $>$ SFT because imitation learning (and inverse RL) $>$ behavior cloning (BC): learning on the policy's own state distribution alleviates the compounding-error problem (see the bounds sketched below).
3. The reward modeling (RM) step in RLHF produces a cheap proxy for expensive human feedback; this insight generalizes to other LLM tasks, such as prompt evaluation and optimization, where feedback is likewise expensive (a sketch of the standard RM loss follows the takeaways).
4. Policy learning in RLHF is more challenging than the problems conventionally studied in IRL because of the high dimensionality of the action space and the sparsity of feedback.
5. The main advantage of PPO over off-policy value-based methods is the stability it gains from (almost) on-policy data and conservative policy updates (its clipped objective is reproduced below).
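To make the compounding-error claim in takeaway 2 concrete, the commonly cited bounds from the imitation-learning literature (in the spirit of Ross and Bagnell's analysis) are sketched below. The notation is generic rather than the paper's own: $\epsilon$ is the per-step imitation error, $T$ the horizon, $J$ the expected cost, and $\pi^{*}$ the demonstrator policy.

```latex
% Behavior cloning trains only on expert-visited states, so a per-step error
% \epsilon compounds as the learner drifts off the expert distribution:
J(\pi_{\mathrm{BC}}) \;\le\; J(\pi^{*}) + \mathcal{O}(\epsilon T^{2})
% Interactive imitation / inverse RL corrects mistakes on the learner's own
% state distribution, yielding only linear error growth in the horizon:
J(\pi_{\mathrm{IL}}) \;\le\; J(\pi^{*}) + \mathcal{O}(\epsilon T)
```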
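Takeaway 3 refers to the reward-modeling step, which in standard RLHF pipelines (e.g., InstructGPT) fits a scalar reward model on human preference pairs with a Bradley-Terry style loss. Below is a minimal sketch of that loss; the names `pairwise_reward_loss` and `reward_model` are illustrative assumptions, not identifiers from the paper.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards: torch.Tensor,
                         rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry preference loss for reward-model training.

    chosen_rewards / rejected_rewards are the scalar reward-model outputs
    r_theta(x, y) for the preferred and dispreferred completions of the
    same prompt, each of shape (batch,).
    """
    # Maximize the log-probability that the preferred completion scores higher.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Hypothetical usage, assuming `reward_model` maps token ids to a scalar score:
# chosen   = reward_model(prompt_ids, chosen_ids)    # shape (batch,)
# rejected = reward_model(prompt_ids, rejected_ids)  # shape (batch,)
# loss = pairwise_reward_loss(chosen, rejected)
```

The learned reward model then stands in for human raters during policy optimization, which is why the same recipe transfers to settings like prompt evaluation, where querying humans is similarly expensive.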
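Takeaway 5 concerns PPO's conservative updates. For reference, the clipped surrogate objective from the original PPO paper is reproduced below in standard notation ($r_t(\theta)$ is the probability ratio between the current and the data-collecting policy, $\hat{A}_t$ an advantage estimate, and $\epsilon$ the clip range).

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
\qquad
L^{\mathrm{CLIP}}(\theta) =
\hat{\mathbb{E}}_t\!\left[
  \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
              \operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big)
\right]
```

Clipping the ratio keeps each update close to the policy that collected the data; in RLHF, a per-token KL penalty against the SFT policy is typically added to the reward for the same conservative purpose.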