Online Policy Learning from Offline Preferences (2403.10160v1)

Published 15 Mar 2024 in cs.LG

Abstract: In preference-based reinforcement learning (PbRL), a reward function is learned from a type of human feedback called preferences. To expedite preference collection, recent works have leveraged offline preferences, i.e., preferences collected over some offline data. In this scenario, the learned reward function is fitted to the offline data, so if the learning agent exhibits behaviors that do not overlap with that data, the reward function may generalize poorly. To address this problem, the present study introduces a framework for PbRL that consolidates offline preferences with virtual preferences: comparisons between the agent's own behaviors and the offline data. Critically, the virtual preferences let the reward function track the agent's behaviors, thereby offering well-aligned guidance to the agent. Experiments on continuous control tasks demonstrate the effectiveness of incorporating virtual preferences into PbRL.
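
The mechanism behind both kinds of preference is the same: the reward function is trained to explain pairwise comparisons of trajectory segments, and a virtual preference is simply a comparison pair in which one segment comes from the agent's current rollouts and the other from the offline data. The sketch below shows the standard Bradley-Terry segment-comparison objective common in PbRL, written in PyTorch; the network architecture, tensor shapes, and the rule used to label the virtual pairs (offline segment preferred over agent segment) are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """Illustrative MLP reward model r(s, a) -> scalar."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # obs: (batch, T, obs_dim), act: (batch, T, act_dim) -> (batch, T)
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def preference_loss(reward_model, seg_a, seg_b, prefs):
    """Bradley-Terry cross-entropy loss over segment pairs.

    seg_a, seg_b: dicts with 'obs' and 'act' tensors of shape (batch, T, dim).
    prefs: (batch,) tensor, 1.0 if segment A is preferred, 0.0 if B is.
    """
    ret_a = reward_model(seg_a["obs"], seg_a["act"]).sum(dim=-1)  # (batch,)
    ret_b = reward_model(seg_b["obs"], seg_b["act"]).sum(dim=-1)
    # P(A preferred over B) = sigmoid(sum r(A) - sum r(B))
    return F.binary_cross_entropy_with_logits(ret_a - ret_b, prefs)


def virtual_preference_batch(agent_segs, offline_segs):
    """Form 'virtual' comparison pairs: agent rollouts vs. offline data.

    Assumption (for illustration only): the offline segment in each pair
    is labeled as preferred, so the reward model keeps discriminating the
    offline behavior from whatever the agent currently does.
    """
    batch = agent_segs["obs"].shape[0]
    prefs = torch.zeros(batch)  # 0.0 -> segment B (offline) preferred
    return agent_segs, offline_segs, prefs
```

In a full training loop, the reward model would be updated on a mixture of the labeled offline pairs and these virtual pairs, while the policy (e.g., SAC) is optimized against the learned reward; how the two preference sources are weighted and how the virtual labels are actually assigned is the substance of the paper and is not reproduced in this sketch.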

Authors (3)
  1. Guoxi Zhang (9 papers)
  2. Han Bao (77 papers)
  3. Hisashi Kashima (63 papers)