REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback (2312.14436v2)
Abstract: The effectiveness of reinforcement learning (RL) agents in continuous control robotics tasks is heavily dependent on the design of the underlying reward function. However, a misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world. Current methods to mitigate this misalignment work by learning reward functions from human preferences; however, they inadvertently introduce a risk of reward overoptimization. In this work, we address this challenge by advocating for the adoption of regularized reward functions that more accurately mirror the intended behaviors. We propose a novel concept of reward regularization within the robotic RLHF (RL from Human Feedback) framework, which we refer to as agent preferences. Our approach uniquely incorporates not just human feedback in the form of preferences but also considers the preferences of the RL agent itself during the reward function learning process. This dual consideration significantly mitigates the issue of reward function overoptimization in RL. We provide a theoretical justification for the proposed approach by formulating the robotic RLHF problem as a bilevel optimization problem. We demonstrate the efficiency of our algorithm, REBEL, on several continuous control benchmarks, including the DeepMind Control Suite and MetaWorld, as well as high-dimensional visual environments, achieving an improvement of more than 70% in sample efficiency compared to current SOTA baselines. This showcases our approach's effectiveness in aligning reward functions with true behavioral intentions, setting a new benchmark in the field.
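The abstract describes the method only at a high level: a reward model is learned from human preference comparisons, and an additional regularization term based on the agent's own preferences is added to discourage reward overoptimization. As a rough illustration of that idea (a minimal sketch, not the authors' implementation), the code below combines a standard Bradley-Terry preference loss with a hypothetical agent-preference regularizer; the function names, the form of the agent-preference labels, and the weighting coefficient `lam` are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def bradley_terry_loss(reward_model, seg_a, seg_b, human_pref):
    """Standard preference loss used in preference-based reward learning.

    reward_model: callable mapping a segment batch to per-step rewards (batch, T).
    seg_a, seg_b: batches of trajectory segments.
    human_pref:   float tensor in {0, 1}; 1 means the human preferred segment A.
    """
    # Predicted return of each segment = sum of per-step learned rewards.
    r_a = reward_model(seg_a).sum(dim=1)  # (batch,)
    r_b = reward_model(seg_b).sum(dim=1)  # (batch,)
    # Bradley-Terry log-odds that A is preferred over B.
    return F.binary_cross_entropy_with_logits(r_a - r_b, human_pref)


def agent_preference_regularizer(reward_model, seg_a, seg_b, agent_pref):
    """Hypothetical regularizer: agent_pref encodes which segment the current
    policy itself 'prefers' (e.g., derived from its value estimates)."""
    r_a = reward_model(seg_a).sum(dim=1)
    r_b = reward_model(seg_b).sum(dim=1)
    return F.binary_cross_entropy_with_logits(r_a - r_b, agent_pref)


def regularized_reward_loss(reward_model, seg_a, seg_b,
                            human_pref, agent_pref, lam=0.1):
    """Illustrative combined objective: human-preference loss plus an
    agent-preference term weighted by an assumed coefficient lam."""
    return (bradley_terry_loss(reward_model, seg_a, seg_b, human_pref)
            + lam * agent_preference_regularizer(reward_model, seg_a, seg_b, agent_pref))
```

In a PEBBLE-style training loop, `agent_pref` could plausibly be derived from the policy's value estimates for the two segments, but the paper's precise construction and its bilevel formulation should be taken from the technical report cited below.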
- O. Kilinc and G. Montana, “Reinforcement learning for robotic manipulation using simulated locomotion demonstrations,” Machine Learning, pp. 1–22, 2021.
- M. Everett, Y. F. Chen, and J. P. How, “Motion planning among dynamic, decision-making agents with deep reinforcement learning,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 3052–3059.
- L. Liu, D. Dugas, G. Cesari, R. Siegwart, and R. Dubé, “Robot navigation in crowded environments using deep reinforcement learning,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 5671–5677.
- K. Weerakoon, A. J. Sathyamoorthy, U. Patel, and D. Manocha, “TERP: Reliable planning in uneven outdoor environments using deep reinforcement learning,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 9447–9453.
- K. Weerakoon, S. Chakraborty, N. Karapetyan, A. J. Sathyamoorthy, A. Bedi, and D. Manocha, “HTRON: Efficient outdoor navigation with sparse rewards via heavy-tailed adaptive reinforce algorithm,” in 6th Annual Conference on Robot Learning, 2022.
- A. S. Bedi, S. Chakraborty, A. Parayil, B. M. Sadler, P. Tokekar, and A. Koppel, “On the hidden biases of policy mirror ascent in continuous action spaces,” in Proceedings of the 39th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, Eds., vol. 162. PMLR, 17–23 Jul 2022, pp. 1716–1731. [Online]. Available: https://proceedings.mlr.press/v162/bedi22a.html
- “Technical report for ‘Dealing with sparse rewards in continuous control robotics via heavy-tailed policy optimization’,” 2022. [Online]. Available: https://tinyurl.com/43v3783j
- P. Rauber, A. Ummadisingu, F. Mutz, and J. Schmidhuber, “Reinforcement learning in sparse-reward environments with hindsight policy gradients,” Neural Computation, vol. 33, no. 6, pp. 1498–1553, 2021.
- W. B. Knox, A. Allievi, H. Banzhaf, F. Schmitt, and P. Stone, “Reward (mis)design for autonomous driving,” 2022.
- J. Skalse, N. H. R. Howe, D. Krasheninnikov, and D. Krueger, “Defining and characterizing reward hacking,” 2022.
- C. Wirth, R. Akrour, G. Neumann, and J. Fürnkranz, “A survey of preference-based reinforcement learning methods,” Journal of Machine Learning Research, vol. 18, no. 136, pp. 1–46, 2017. [Online]. Available: http://jmlr.org/papers/v18/16-634.html
- K. Lee, L. Smith, and P. Abbeel, “PEBBLE: Feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training,” 2021.
- B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey, “Maximum entropy inverse reinforcement learning,” in Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 3, ser. AAAI’08. AAAI Press, 2008, pp. 1433–1438.
- B. D. Ziebart, J. A. Bagnell, and A. K. Dey, “Modeling interaction via the principle of maximum causal entropy,” in Proceedings of the 27th International Conference on International Conference on Machine Learning, ser. ICML’10. Madison, WI, USA: Omnipress, 2010, pp. 1255–1262.
- A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne, “Imitation learning: A survey of learning methods,” ACM Computing Surveys (CSUR), vol. 50, no. 2, pp. 1–35, 2017.
- D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine, “QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation,” 2018.
- T. Xie, D. J. Foster, Y. Bai, N. Jiang, and S. M. Kakade, “The role of coverage in online reinforcement learning,” 2022.
- P. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” 2017.
- R. A. Bradley and M. E. Terry, “Rank analysis of incomplete block designs: I. the method of paired comparisons,” Biometrika, vol. 39, no. 3–4, pp. 324–345, 1952. [Online]. Available: https://api.semanticscholar.org/CorpusID:125209808
- L. Gao, J. Schulman, and J. Hilton, “Scaling laws for reward model overoptimization,” 2022.
- P. Evans, “Money, output and Goodhart’s law: The U.S. experience,” The Review of Economics and Statistics, vol. 67, no. 1, pp. 1–8, 1985. [Online]. Available: http://www.jstor.org/stable/1928428
- R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” 2023.
- Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al., “Deepmind control suite,” arXiv preprint arXiv:1801.00690, 2018.
- J. Park, Y. Seo, J. Shin, H. Lee, P. Abbeel, and K. Lee, “SURF: Semi-supervised reward learning with data augmentation for feedback-efficient preference-based reinforcement learning,” 2022.
- M. J. Mataric, “Reward functions for accelerated learning,” in Machine learning proceedings 1994. Elsevier, 1994, pp. 181–189.
- A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Overcoming exploration in reinforcement learning with demonstrations,” in 2018 IEEE international conference on robotics and automation (ICRA). IEEE, 2018, pp. 6292–6299.
- R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel, “VIME: Variational information maximizing exploration,” Advances in neural information processing systems, vol. 29, 2016.
- M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz, “Parameter space noise for exploration,” arXiv preprint arXiv:1706.01905, 2017.
- D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” in International conference on machine learning. PMLR, 2017, pp. 2778–2787.
- S. Chakraborty, A. S. Bedi, A. Koppel, M. Wang, F. Huang, and D. Manocha, “STEERING: Stein information directed exploration for model-based reinforcement learning,” 2023. [Online]. Available: https://arxiv.org/abs/2301.12038
- B. Hao and T. Lattimore, “Regret bounds for information-directed reinforcement learning,” 2022.
- A. Y. Ng and S. J. Russell, “Algorithms for inverse reinforcement learning,” in Proceedings of the Seventeenth International Conference on Machine Learning, ser. ICML ’00. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2000, pp. 663–670.
- P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning,” in Proceedings of the Twenty-First International Conference on Machine Learning, ser. ICML ’04. New York, NY, USA: Association for Computing Machinery, 2004, p. 1. [Online]. Available: https://doi.org/10.1145/1015330.1015430
- K. Shiarlis, J. Messias, and S. Whiteson, “Inverse reinforcement learning from failure,” in Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, ser. AAMAS ’16. Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems, 2016, pp. 1060–1068.
- J. Ho, J. K. Gupta, and S. Ermon, “Model-free imitation learning with policy optimization,” 2016.
- D. S. Brown, W. Goo, P. Nagarajan, and S. Niekum, “Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations,” 2019.
- A. Wilson, A. Fern, and P. Tadepalli, “A bayesian approach for policy learning from trajectory preference queries,” in Advances in Neural Information Processing Systems, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds., vol. 25. Curran Associates, Inc., 2012. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2012/file/16c222aa19898e5058938167c8ab6c57-Paper.pdf
- J. Fürnkranz, E. Hüllermeier, W. Cheng, and S.-H. Park, “Preference-based reinforcement learning: A formal framework and a policy iteration algorithm,” Machine Learning, vol. 89, no. 1–2, pp. 123–156, Oct. 2012. [Online]. Available: https://doi.org/10.1007/s10994-012-5313-8
- M. Dudík, K. Hofmann, R. E. Schapire, A. Slivkins, and M. Zoghi, “Contextual dueling bandits,” 2015.
- V. Bengs, R. Busa-Fekete, A. E. Mesaoudi-Paul, and E. Hüllermeier, “Preference-based online learning with dueling bandits: A survey,” Journal of Machine Learning Research, vol. 22, no. 7, pp. 1–108, 2021. [Online]. Available: http://jmlr.org/papers/v22/18-546.html
- T. Lekang and A. Lamperski, “Simple algorithms for dueling bandits,” 2019.
- A. Pacchiano, A. Saha, and J. Lee, “Dueling RL: Reinforcement learning with trajectory preferences,” 2023.
- B. Ibarz, J. Leike, T. Pohlen, G. Irving, S. Legg, and D. Amodei, “Reward learning from human preferences and demonstrations in atari,” 2018.
- Z. Cao, K. Wong, and C.-T. Lin, “Weak human preference supervision for deep reinforcement learning,” 2020.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
- T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
- “Technical report,” 2023. [Online]. Available: https://raaslab.org/projects/REBEL
- Souradip Chakraborty
- Amisha Bhaskar
- Anukriti Singh
- Pratap Tokekar
- Dinesh Manocha
- Amrit Singh Bedi