Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards (2402.17975v1)
Abstract: Preference-based reinforcement learning (PbRL) aligns robot behavior with human preferences via a reward function learned from binary feedback over agent behaviors. We show that dynamics-aware reward functions improve the sample efficiency of PbRL by an order of magnitude. In our experiments we iterate between: (1) learning a dynamics-aware state-action representation z_sa via a self-supervised temporal consistency task, and (2) bootstrapping the preference-based reward function from z_sa, which results in faster policy learning and better final policy performance. For example, on quadruped-walk, walker-walk, and cheetah-run, with 50 preference labels we match the performance of existing approaches trained with 500 preference labels, and we recover 83% and 66% of ground-truth-reward policy performance versus only 38% and 21% for prior approaches. These performance gains demonstrate the benefits of explicitly learning a dynamics-aware reward model. Repo: https://github.com/apple/ml-reed.
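The two objectives the abstract alternates between can be sketched with their standard loss functions: a temporal-consistency loss for the representation step (negative cosine similarity between a predicted and an observed next-state embedding, as in self-predictive representation methods), and the Bradley-Terry cross-entropy loss for the preference step. This is a minimal stdlib-only sketch for illustration, not the paper's implementation; the function names and the unstabilized softmax are simplifying assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def temporal_consistency_loss(z_pred, z_next):
    """Self-supervised consistency objective: negative cosine similarity
    between the predicted next-state embedding (from z_sa and a dynamics
    head) and the observed next-state embedding. Lower is better."""
    return -cosine(z_pred, z_next)

def bradley_terry_loss(seg1_rewards, seg2_rewards, prefer_first):
    """Preference loss under the Bradley-Terry model: the probability that
    segment 1 is preferred is the softmax over the summed predicted rewards
    of the two segments; the loss is the cross-entropy against the label.
    (A real implementation would use a numerically stable log-softmax.)"""
    r1, r2 = sum(seg1_rewards), sum(seg2_rewards)
    p1 = math.exp(r1) / (math.exp(r1) + math.exp(r2))
    return -math.log(p1) if prefer_first else -math.log(1.0 - p1)
```

In the loop described above, step (1) would update the state-action encoder by minimizing something like `temporal_consistency_loss`, and step (2) would fit the reward head on z_sa by minimizing `bradley_terry_loss` over labeled segment pairs.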
- Katherine Metcalf
- Miguel Sarabia
- Natalie Mackraz
- Barry-John Theobald