
RIME: Robust Preference-based Reinforcement Learning with Noisy Preferences (2402.17257v4)

Published 27 Feb 2024 in cs.LG, cs.AI, and cs.RO

Abstract: Preference-based Reinforcement Learning (PbRL) circumvents the need for reward engineering by harnessing human preferences as the reward signal. However, current PbRL methods excessively depend on high-quality feedback from domain experts, which results in a lack of robustness. In this paper, we present RIME, a robust PbRL algorithm for effective reward learning from noisy preferences. Our method utilizes a sample selection-based discriminator to dynamically filter out noise and ensure robust training. To counteract the cumulative error stemming from incorrect selection, we suggest a warm start for the reward model, which additionally bridges the performance gap during the transition from pre-training to online training in PbRL. Our experiments on robotic manipulation and locomotion tasks demonstrate that RIME significantly enhances the robustness of the state-of-the-art PbRL method. Code is available at https://github.com/CJReinforce/RIME_ICML2024.
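The abstract describes two mechanisms: a loss-based sample-selection discriminator that filters out suspect preference labels during reward learning, and a warm start of the reward model that bridges pre-training and online training while limiting the cumulative error from incorrect selection. As a rough illustration of the first idea only, the sketch below is hypothetical PyTorch code, not the authors' implementation: the fixed threshold `tau`, the network shapes, and the helper names are placeholder assumptions (RIME computes its own dynamic threshold). It trains a Bradley-Terry reward model on segment pairs while discarding pairs whose cross-entropy loss exceeds the cutoff.

```python
# Minimal sketch (not the authors' code): reward learning from noisy pairwise
# preferences with a loss-based sample-selection filter, in the spirit of the
# abstract. The threshold rule `tau` is an illustrative stand-in for RIME's
# dynamic criterion.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        # Per-step reward; summing over the segment gives the segment return.
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(reward_model, seg0, seg1, label):
    """Bradley-Terry cross-entropy for a batch of segment pairs.

    seg0 / seg1: (obs, act) tensors of shape [B, T, dim]; label is a
    LongTensor in {0, 1} indicating which segment the annotator preferred.
    Returns the per-pair loss, shape [B].
    """
    r0 = reward_model(*seg0).sum(dim=-1)  # predicted return of segment 0
    r1 = reward_model(*seg1).sum(dim=-1)  # predicted return of segment 1
    logits = torch.stack([r0, r1], dim=-1)
    return nn.functional.cross_entropy(logits, label, reduction="none")

def filtered_update(reward_model, optimizer, batch, tau: float = 0.7):
    """One denoised update: drop pairs whose loss exceeds tau (treated as
    likely-corrupted labels) and train the reward model on the rest."""
    seg0, seg1, label = batch
    loss_per_pair = preference_loss(reward_model, seg0, seg1, label)
    keep = loss_per_pair.detach() < tau          # sample selection
    if keep.any():
        optimizer.zero_grad()
        loss_per_pair[keep].mean().backward()
        optimizer.step()
    return keep.float().mean().item()            # fraction of pairs retained
```

Filtering on the training loss works because mislabeled pairs tend to incur larger loss than clean ones, so a cutoff rejects most corrupted labels while keeping most clean ones; per the abstract, RIME additionally warm-starts the reward model so that errors made by the filter do not accumulate across the transition from pre-training to online training.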

Authors (6)
  1. Jie Cheng (80 papers)
  2. Gang Xiong (37 papers)
  3. Xingyuan Dai (14 papers)
  4. Qinghai Miao (5 papers)
  5. Yisheng Lv (26 papers)
  6. Fei-Yue Wang (72 papers)
Citations (4)