RIME: Robust Preference-based Reinforcement Learning with Noisy Preferences (2402.17257v4)
Abstract: Preference-based Reinforcement Learning (PbRL) circumvents the need for reward engineering by harnessing human preferences as the reward signal. However, current PbRL methods depend heavily on high-quality feedback from domain experts, which makes them brittle to noisy preference labels. In this paper, we present RIME, a robust PbRL algorithm for effective reward learning from noisy preferences. Our method uses a sample-selection-based discriminator to dynamically filter out noise and ensure robust training. To counteract the cumulative error arising from incorrect selection, we propose warm-starting the reward model, which also bridges the performance gap during the transition from pre-training to online training in PbRL. Experiments on robotic manipulation and locomotion tasks demonstrate that RIME significantly enhances the robustness of the state-of-the-art PbRL method. Code is available at https://github.com/CJReinforce/RIME_ICML2024.
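The core mechanism the abstract describes, filtering noisy preference labels before they reach the reward update, can be illustrated with a small-loss-style selection rule on the Bradley-Terry cross-entropy. The sketch below is a minimal, assumed implementation rather than RIME's released code: names such as `RewardModel`, `segment_return`, and the fixed threshold `tau` are illustrative, and RIME's actual discriminator uses a dynamically updated threshold rather than a constant cutoff.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Illustrative MLP reward model r(s, a); a stand-in, not RIME's exact network."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def segment_return(model, obs, act):
    # Sum predicted rewards over a trajectory segment: (B, T, ...) -> (B,)
    return model(obs, act).squeeze(-1).sum(dim=1)

def filtered_preference_loss(model, seg0, seg1, labels, tau):
    """Bradley-Terry cross-entropy with small-loss sample selection.

    seg0, seg1: dicts with 'obs' (B, T, obs_dim) and 'act' (B, T, act_dim)
    labels:     (B,) in {0, 1}, possibly corrupted by annotator noise
    tau:        loss threshold (assumed fixed here); samples whose
                cross-entropy exceeds tau are treated as noisy and dropped
    """
    r0 = segment_return(model, seg0['obs'], seg0['act'])
    r1 = segment_return(model, seg1['obs'], seg1['act'])
    logits = torch.stack([r0, r1], dim=-1)            # (B, 2) segment returns
    per_sample = nn.functional.cross_entropy(
        logits, labels.long(), reduction='none')      # (B,) per-pair CE loss
    trusted = per_sample.detach() <= tau              # small-loss selection mask
    if trusted.sum() == 0:
        # No trusted pairs this batch: return a zero loss that keeps the graph.
        return per_sample.sum() * 0.0
    return per_sample[trusted].mean()
```

A fixed `tau` is the simplest possible selection rule; the warm start mentioned in the abstract matters precisely because a poorly initialized reward model makes early per-sample losses uninformative, so correct labels can be discarded and the selection error compounds over training.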
Authors: Jie Cheng, Gang Xiong, Xingyuan Dai, Qinghai Miao, Yisheng Lv, Fei-Yue Wang