Sample-Efficient Preference-based Reinforcement Learning with Dynamics Aware Rewards (2402.17975v1)

Published 28 Feb 2024 in cs.AI and cs.LG

Abstract: Preference-based reinforcement learning (PbRL) aligns robot behavior with human preferences via a reward function learned from binary feedback over agent behaviors. We show that dynamics-aware reward functions improve the sample efficiency of PbRL by an order of magnitude. In our experiments we iterate between: (1) learning a dynamics-aware state-action representation z^{sa} via a self-supervised temporal consistency task, and (2) bootstrapping the preference-based reward function from z^{sa}, which results in faster policy learning and better final policy performance. For example, on quadruped-walk, walker-walk, and cheetah-run, with 50 preference labels we achieve the same performance as existing approaches with 500 preference labels, and we recover 83% and 66% of ground-truth reward policy performance versus only 38% and 21%. The performance gains demonstrate the benefits of explicitly learning a dynamics-aware reward model. Repo: https://github.com/apple/ml-reed
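The abstract's two alternating steps lend themselves to a compact illustration. The sketch below is not the authors' REED implementation (see the linked repository for that); it is a minimal PyTorch approximation in which step (1) is realized as a next-state prediction loss with a stop-gradient target, and step (2) as a Bradley-Terry preference loss over segment returns computed from the shared state-action representation. All class names, layer sizes, and the specific consistency objective are assumptions made for illustration.

```python
# Hypothetical sketch of dynamics-aware preference-based reward learning.
# Not the authors' code; module names, dimensions, and losses are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StateActionEncoder(nn.Module):
    """Maps (s, a) to a latent z_sa shared by both objectives."""

    def __init__(self, state_dim, action_dim, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


class DynamicsAwareRewardModel(nn.Module):
    def __init__(self, state_dim, action_dim, latent_dim=128):
        super().__init__()
        self.encoder = StateActionEncoder(state_dim, action_dim, latent_dim)
        # Predicts the embedding of the next state from z_sa (temporal consistency).
        self.dynamics_head = nn.Linear(latent_dim, latent_dim)
        self.state_embed = nn.Linear(state_dim, latent_dim)
        # Scalar reward bootstrapped from the dynamics-aware representation.
        self.reward_head = nn.Linear(latent_dim, 1)

    def consistency_loss(self, state, action, next_state):
        """Step (1): self-supervised temporal consistency on (s, a, s') tuples."""
        z_sa = self.encoder(state, action)
        pred_next = self.dynamics_head(z_sa)
        target = self.state_embed(next_state).detach()  # stop-gradient target
        return -F.cosine_similarity(pred_next, target, dim=-1).mean()

    def segment_return(self, states, actions):
        """Sum of predicted rewards over a segment of shape (batch, time, ...)."""
        z_sa = self.encoder(states, actions)
        return self.reward_head(z_sa).squeeze(-1).sum(dim=1)

    def preference_loss(self, seg0, seg1, labels):
        """Step (2): Bradley-Terry loss over labeled segment pairs.

        labels[i] = 1 if segment 1 is preferred, 0 if segment 0 is preferred.
        """
        r0 = self.segment_return(*seg0)  # seg0 = (states0, actions0)
        r1 = self.segment_return(*seg1)
        logits = r1 - r0  # P(seg1 > seg0) = sigmoid(r1 - r0)
        return F.binary_cross_entropy_with_logits(logits, labels.float())
```

In training, one would alternate between minimizing consistency_loss on unlabeled environment transitions and minimizing preference_loss on the small set of labeled segment pairs, so that the reward head is bootstrapped from a representation that already encodes environment dynamics, which is the mechanism the abstract credits for the order-of-magnitude gain in label efficiency.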

Authors (4)
  1. Katherine Metcalf (16 papers)
  2. Miguel Sarabia (9 papers)
  3. Natalie Mackraz (6 papers)
  4. Barry-John Theobald (34 papers)
Citations (3)