
PREDILECT: Preferences Delineated with Zero-Shot Language-based Reasoning in Reinforcement Learning (2402.15420v1)

Published 23 Feb 2024 in cs.RO, cs.CL, and cs.LG

Abstract: Preference-based reinforcement learning (RL) has emerged as a new field in robot learning, where humans play a pivotal role in shaping robot behavior by expressing preferences on different sequences of state-action pairs. However, formulating realistic policies for robots demands responses from humans to an extensive array of queries. In this work, we approach the sample-efficiency challenge by expanding the information collected per query to contain both preferences and optional text prompting. To accomplish this, we leverage the zero-shot capabilities of an LLM to reason from the text provided by humans. To accommodate the additional query information, we reformulate the reward learning objectives to contain flexible highlights -- state-action pairs that contain relatively high information and are related to the features processed in a zero-shot fashion from a pretrained LLM. In both a simulated scenario and a user study, we reveal the effectiveness of our work by analyzing the feedback and its implications. Additionally, the collected feedback serves to train a robot on socially compliant trajectories in a simulated social navigation landscape. We provide video examples of the trained policies at https://sites.google.com/view/rl-predilect
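The abstract describes augmenting pairwise preference queries with LLM-derived "highlights" over state-action pairs when learning a reward model. As a rough illustration only, and not the paper's actual formulation, the sketch below shows the standard Bradley-Terry-style preference loss commonly used in preference-based reward learning, with a hypothetical per-step weight standing in for highlight information. All names here (RewardModel, preference_loss, w_a, w_b) are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the authors' implementation): Bradley-Terry-style
# preference-based reward learning, with a hypothetical per-step highlight
# weight that up-weights state-action pairs flagged as relevant to the
# user's optional text prompt.

class RewardModel(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        # Per-step reward for a segment: (T, obs_dim), (T, act_dim) -> (T,)
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(reward_model, seg_a, seg_b, pref, w_a=None, w_b=None):
    """Cross-entropy over the Bradley-Terry preference probability.

    seg_a / seg_b: (obs, act) tensors for the two compared segments.
    pref: 1.0 if segment A was preferred, 0.0 if segment B was preferred.
    w_a / w_b: optional per-step highlight weights (hypothetical), e.g.
    larger where a zero-shot LLM judged the step relevant to the text.
    """
    r_a = reward_model(*seg_a)
    r_b = reward_model(*seg_b)
    if w_a is not None:
        r_a = r_a * w_a
    if w_b is not None:
        r_b = r_b * w_b
    ret_a, ret_b = r_a.sum(), r_b.sum()
    # P(A preferred over B) under the Bradley-Terry model
    p_a = torch.sigmoid(ret_a - ret_b)
    return -(pref * torch.log(p_a + 1e-8) + (1 - pref) * torch.log(1 - p_a + 1e-8))
```

In this reading, unweighted sums recover the usual preference objective, while the optional weights illustrate one way highlighted state-action pairs could contribute more to the learned reward; the paper's precise highlight formulation should be taken from the full text.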

Authors (3)
  1. Simon Holk (3 papers)
  2. Daniel Marta (3 papers)
  3. Iolanda Leite (29 papers)
Citations (4)