
Inverse Preference Learning: Preference-based RL without a Reward Function (2305.15363v2)

Published 24 May 2023 in cs.LG

Abstract: Reward functions are difficult to design and often hard to align with human intent. Preference-based Reinforcement Learning (RL) algorithms address these problems by learning reward functions from human feedback. However, the majority of preference-based RL methods naïvely combine supervised reward models with off-the-shelf RL algorithms. Contemporary approaches have sought to improve performance and query complexity by using larger and more complex reward architectures such as transformers. Instead of using highly complex architectures, we develop a new and parameter-efficient algorithm, Inverse Preference Learning (IPL), specifically designed for learning from offline preference data. Our key insight is that for a fixed policy, the $Q$-function encodes all information about the reward function, effectively making them interchangeable. Using this insight, we completely eliminate the need for a learned reward function. Our resulting algorithm is simpler and more parameter-efficient. Across a suite of continuous control and robotics benchmarks, IPL attains competitive performance compared to more complex approaches that leverage transformer-based and non-Markovian reward functions while having fewer algorithmic hyperparameters and learned network parameters. Our code is publicly released.
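
The abstract's key insight, that for a fixed policy the $Q$-function and the reward are interchangeable, can be sketched concretely. The snippet below is a minimal illustration and not the authors' released code: it recovers per-step rewards from $Q$ via an inverse Bellman step, $r(s,a) = Q(s,a) - \gamma V(s')$, and plugs those implied rewards into a Bradley-Terry preference loss, so no separate reward network is trained. The names q_net, v_target, and the segment dictionaries are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def implied_rewards(q_net, v_target, obs, act, next_obs, gamma=0.99):
    # Inverse Bellman step: the reward implied by Q under a fixed policy,
    # r(s, a) = Q(s, a) - gamma * V(s').
    q = q_net(obs, act)           # Q(s, a) for each step of a segment
    v_next = v_target(next_obs)   # value of the next state under the fixed policy
    return q - gamma * v_next

def preference_loss(q_net, v_target, seg_a, seg_b, label, gamma=0.99):
    # Bradley-Terry preference loss on two trajectory segments, scored
    # directly from Q instead of a separately learned reward model.
    # seg_a / seg_b: dicts with 'obs', 'act', 'next_obs' tensors of shape [T, ...].
    # label: 1.0 if segment A is preferred, 0.0 if segment B is preferred.
    ret_a = implied_rewards(q_net, v_target, **seg_a, gamma=gamma).sum()
    ret_b = implied_rewards(q_net, v_target, **seg_b, gamma=gamma).sum()
    logits = (ret_a - ret_b).unsqueeze(0)
    target = torch.tensor([label], dtype=torch.float32)
    return F.binary_cross_entropy_with_logits(logits, target)
```

Because the preference likelihood is expressed entirely in terms of $Q$, optimizing this loss (together with whatever policy/value regularization one chooses) learns the $Q$-function directly from offline preference data.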

Authors (2)
  1. Joey Hejna (19 papers)
  2. Dorsa Sadigh (162 papers)
Citations (36)