Online Policy Learning from Offline Preferences (2403.10160v1)
Abstract: In preference-based reinforcement learning (PbRL), a reward function is learned from a type of human feedback called preferences. To expedite preference collection, recent works have leveraged \emph{offline preferences}, i.e., preferences collected over pre-existing offline data. In this scenario, the learned reward function is fitted to the offline data, so if the learning agent exhibits behaviors that do not overlap with the offline data, the reward function may generalize poorly. To address this problem, the present study introduces a framework that consolidates offline preferences with \emph{virtual preferences}: comparisons between the agent's own behaviors and the offline data. Crucially, the virtual preferences let the reward function track the agent's behaviors, thereby offering well-aligned guidance to the agent. Experiments on continuous control tasks demonstrate the effectiveness of incorporating virtual preferences into PbRL.
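To make the mechanism concrete, below is a minimal sketch (in PyTorch) of the Bradley-Terry-style preference objective commonly used in PbRL, extended with virtual preference pairs that compare agent rollouts against offline segments, as the abstract describes. All names here (`RewardNet`, `segment_return`, `preference_loss`, the weighting `beta`) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardNet(nn.Module):
    """Maps a state-action pair to a scalar reward estimate."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # obs: (batch, T, obs_dim), act: (batch, T, act_dim) -> (batch, T)
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def segment_return(reward_net, obs, act):
    # Sum predicted per-step rewards over each segment: (batch, T) -> (batch,)
    return reward_net(obs, act).sum(dim=-1)


def preference_loss(reward_net, seg_a, seg_b, label):
    """Bradley-Terry cross-entropy; label=1 means segment b is preferred."""
    returns = torch.stack(
        [segment_return(reward_net, *seg_a), segment_return(reward_net, *seg_b)],
        dim=-1,
    )  # (batch, 2) logits over the two segments
    return F.cross_entropy(returns, label)


# Offline preferences: pairs of offline segments with human labels.
# Virtual preferences (assumption, following the abstract): pair each agent
# rollout segment with an offline segment, so the reward model is also fitted
# on states the agent actually visits and can keep tracking its behavior.
def total_loss(reward_net, offline_batch, virtual_batch, beta: float = 1.0):
    off = preference_loss(reward_net, *offline_batch)
    vir = preference_loss(reward_net, *virtual_batch)
    return off + beta * vir  # beta trades off the two preference sources
```

A batch here is a tuple `(seg_a, seg_b, label)` with each segment given as `(obs, act)` tensors; how the virtual pairs are labeled (which side is preferred, or whether soft labels are used) is precisely the design question the paper's framework addresses.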
Authors: Guoxi Zhang, Han Bao, Hisashi Kashima