PREDILECT: Preferences Delineated with Zero-Shot Language-based Reasoning in Reinforcement Learning (2402.15420v1)
Abstract: Preference-based reinforcement learning (RL) has emerged as a new field in robot learning, where humans play a pivotal role in shaping robot behavior by expressing preferences over different sequences of state-action pairs. However, formulating realistic policies for robots demands responses from humans to an extensive array of queries. In this work, we address the sample-efficiency challenge by expanding the information collected per query to contain both preferences and optional text prompting. To accomplish this, we leverage the zero-shot capabilities of an LLM to reason from the text provided by humans. To accommodate the additional query information, we reformulate the reward learning objectives to contain flexible highlights -- state-action pairs that carry relatively high information and relate to features extracted in a zero-shot fashion from a pretrained LLM. In both a simulated scenario and a user study, we demonstrate the effectiveness of our approach by analyzing the feedback and its implications. Additionally, the collected feedback is used to train a robot to follow socially compliant trajectories in a simulated social navigation environment. We provide video examples of the trained policies at https://sites.google.com/view/rl-predilect
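The abstract does not spell out the reformulated objective, so the following is a minimal sketch of what a preference-plus-highlight reward loss could look like. It assumes the standard Bradley-Terry cross-entropy loss over segment pairs (as in deep RL from human preferences, Christiano et al., 2017) combined with a hypothetical per-step term that pushes predicted reward up or down on LLM-flagged highlights. The names (`reward_net`, `highlight_loss`, `lam`), the sign convention, and the form of the highlight term are illustrative assumptions, not the authors' exact formulation.

```python
# Sketch of a preference-based reward loss extended with a highlight term.
# Assumes reward_net maps per-step features (..., feat_dim) to per-step rewards (...,).
import torch
import torch.nn.functional as F


def preference_loss(reward_net, seg_a, seg_b, pref_label):
    """Bradley-Terry cross-entropy over which segment is preferred.

    seg_a, seg_b: (batch, T, feat_dim) trajectory segments.
    pref_label:   (batch,) long tensor, 0 if seg_a preferred, 1 if seg_b.
    """
    ret_a = reward_net(seg_a).sum(dim=-1)            # summed predicted return of seg_a
    ret_b = reward_net(seg_b).sum(dim=-1)            # summed predicted return of seg_b
    logits = torch.stack([ret_a, ret_b], dim=-1)     # (batch, 2)
    return F.cross_entropy(logits, pref_label)


def highlight_loss(reward_net, highlighted_sa, signs):
    """Hypothetical highlight term: raise reward on positively flagged
    state-action pairs (+1) and lower it on negatively flagged ones (-1).

    highlighted_sa: (N, feat_dim) state-action features flagged via the LLM.
    signs:          (N,) tensor of +1 / -1 labels.
    """
    r = reward_net(highlighted_sa)                   # (N,) predicted rewards
    return -(signs * r).mean()


def total_loss(reward_net, seg_a, seg_b, pref_label, highlighted_sa, signs, lam=0.1):
    # lam trades off preference fit against highlight consistency (assumed hyperparameter).
    return (preference_loss(reward_net, seg_a, seg_b, pref_label)
            + lam * highlight_loss(reward_net, highlighted_sa, signs))
```

In this sketch the preference term supervises whole segments while the highlight term localizes credit to the specific state-action pairs the human's text (as parsed by the LLM) singles out; how the actual paper weights or normalizes these terms is not stated in the abstract.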
Authors: Simon Holk, Daniel Marta, Iolanda Leite