
Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning (2405.00746v1)

Published 30 Apr 2024 in cs.LG, cs.AI, and cs.RO

Abstract: To create useful reinforcement learning (RL) agents, step zero is to design a suitable reward function that captures the nuances of the task. However, reward engineering can be a difficult and time-consuming process. Instead, human-in-the-loop (HitL) RL allows agents to learn reward functions from human feedback. Despite recent successes, many of the HitL RL methods still require numerous human interactions to learn successful reward functions. To improve the feedback efficiency of HitL RL methods (i.e., require less feedback), this paper introduces Sub-optimal Data Pre-training, SDP, an approach that leverages reward-free, sub-optimal data to improve scalar- and preference-based HitL RL algorithms. In SDP, we start by pseudo-labeling all low-quality data with rewards of zero. Through this process, we obtain free reward labels to pre-train our reward model. This pre-training phase gives the reward model a head start in learning, allowing it to identify that low-quality transitions should have a low reward, all without any actual feedback. Through extensive experiments with a simulated teacher, we demonstrate that SDP can significantly improve upon, or achieve performance competitive with, state-of-the-art (SOTA) HitL RL algorithms across nine robotic manipulation and locomotion tasks.
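
A short illustration may help here. The sketch below shows one plausible way to implement the pre-training phase described in the abstract: pseudo-label reward-free, sub-optimal transitions with a reward of zero and regress a reward model toward those labels before any human feedback is collected. The class and function names, network sizes, and hyperparameters are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of the SDP pre-training idea described in the abstract:
# pseudo-label reward-free, sub-optimal transitions with zero reward and use
# them to pre-train a reward model before any human feedback is collected.
# All names (RewardModel, pretrain_on_suboptimal_data, hyperparameters) are
# illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Small MLP mapping (state, action) pairs to a scalar reward estimate."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def pretrain_on_suboptimal_data(model: RewardModel,
                                suboptimal_obs: torch.Tensor,
                                suboptimal_act: torch.Tensor,
                                epochs: int = 10,
                                batch_size: int = 256) -> None:
    """Regress the reward model toward zero on sub-optimal transitions.

    The pseudo-label is simply 0 for every transition, so this phase
    requires no human feedback at all.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
    n = suboptimal_obs.shape[0]
    for _ in range(epochs):
        perm = torch.randperm(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            pred = model(suboptimal_obs[idx], suboptimal_act[idx])
            # Pseudo-label: every sub-optimal transition gets reward 0.
            loss = nn.functional.mse_loss(pred, torch.zeros_like(pred))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # After pre-training, the model would be handed to a scalar- or
    # preference-based HitL RL algorithm and fine-tuned with actual feedback.
```

In this reading, the pre-trained reward model simply starts from a prior that low-quality behavior deserves low reward, so subsequent human feedback can be spent on distinguishing good behavior rather than ruling out obviously bad behavior.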

Authors (2)
  1. Calarina Muslimani (4 papers)
  2. Matthew E. Taylor (69 papers)
Citations (1)
