Two-Step Offline Preference-Based Reinforcement Learning with Constrained Actions (2401.00330v3)

Published 30 Dec 2023 in cs.LG and cs.AI

Abstract: Preference-based reinforcement learning (PBRL) in the offline setting has been highly successful in industrial applications such as chatbots. A two-step learning framework, in which a reinforcement learning step follows a reward-modeling step, has been widely adopted for this problem. However, such a method faces challenges from the risk of reward hacking and the complexity of reinforcement learning. Our insight is that both challenges stem from state-action pairs not supported by the dataset: such state-action pairs are unreliable and increase the complexity of the reinforcement learning problem in the second step. Based on this insight, we develop a novel two-step learning method called PRC: preference-based reinforcement learning with constrained actions. The high-level idea is to limit the reinforcement learning agent to optimizing over a constrained action space that excludes out-of-distribution state-action pairs. We empirically verify that our method achieves high learning efficiency on various datasets in robotic control environments.
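
The abstract describes the method only at a high level. Below is a minimal sketch in PyTorch of what such a two-step pipeline could look like: a Bradley-Terry reward model fit on preference pairs (step one), followed by policy improvement that only scores candidate actions drawn from the dataset's behavior distribution (step two). The names `RewardModel`, `bradley_terry_loss`, and `constrained_greedy_action` are illustrative assumptions, not the paper's released code or exact algorithm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """Step 1: scalar reward estimate for a (state, action) pair."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states, actions):
        # states: (..., state_dim), actions: (..., action_dim) -> rewards: (...)
        return self.net(torch.cat([states, actions], dim=-1)).squeeze(-1)


def bradley_terry_loss(reward_model, seg_a, seg_b, prefs):
    """Fit the reward model so preferred trajectory segments get higher return.

    seg_a / seg_b: (states, actions) tensors of shape (batch, horizon, dim);
    prefs: float tensor of shape (batch,), 1.0 if segment A is preferred.
    """
    return_a = reward_model(*seg_a).sum(dim=-1)
    return_b = reward_model(*seg_b).sum(dim=-1)
    return F.binary_cross_entropy_with_logits(return_a - return_b, prefs)


def constrained_greedy_action(q_net, state, candidate_actions):
    """Step 2: restrict optimization to in-distribution actions.

    candidate_actions should come from the dataset (or a behavior model fit to
    it), so out-of-distribution state-action pairs are never evaluated -- the
    kind of action constraint the abstract describes.
    """
    states = state.unsqueeze(0).expand(candidate_actions.shape[0], -1)
    q_values = q_net(torch.cat([states, candidate_actions], dim=-1)).squeeze(-1)
    return candidate_actions[q_values.argmax()]
```

In a full pipeline of this kind, the learned reward model would relabel the offline transitions, and an offline RL algorithm (for example, a critic whose targets use the constrained action selection above) would then be trained on the relabeled dataset.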
