
Crowd-PrefRL: Preference-Based Reward Learning from Crowds (2401.10941v1)

Published 17 Jan 2024 in cs.HC, cs.LG, and cs.SI

Abstract: Preference-based reinforcement learning (RL) provides a framework to train agents using human feedback through preferences over pairs of behaviors, enabling agents to learn desired behaviors when it is difficult to specify a numerical reward function. While this paradigm leverages human feedback, it currently treats the feedback as given by a single human user. Meanwhile, incorporating preference feedback from crowds (i.e., ensembles of users) in a robust manner remains a challenge, and the problem of training RL agents using feedback from multiple human users remains understudied. In this work, we introduce Crowd-PrefRL, a framework for performing preference-based RL leveraging feedback from crowds. This work demonstrates the viability of learning reward functions from preference feedback provided by crowds of unknown expertise and reliability. Crowd-PrefRL not only robustly aggregates the crowd preference feedback, but also estimates the reliability of each user within the crowd using only the (noisy) crowdsourced preference comparisons. Most importantly, we show that agents trained with Crowd-PrefRL outperform agents trained with majority-vote preferences or preferences from any individual user in most cases, especially when the spread of user error rates among the crowd is large. Results further suggest that our method can identify minority viewpoints within the crowd.
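
The core step the abstract describes, aggregating noisy pairwise preference labels from users of unknown reliability and estimating each user's error rate without ground truth, can be sketched compactly. The snippet below is an illustrative stand-in, not the paper's actual algorithm: it uses a simple majority-vote-plus-agreement heuristic for reliability estimation, and the function name, toy error rates, and data setup are all hypothetical.

```python
# Illustrative sketch only: a simplified stand-in for crowd preference
# aggregation with unsupervised per-user reliability estimation.
import numpy as np

def aggregate_crowd_preferences(votes):
    """Aggregate noisy pairwise preference labels from a crowd.

    votes: (n_users, n_queries) array with entries in {0, 1}, where
           votes[u, q] = 1 means user u preferred the second behavior
           segment in query q.
    Returns (aggregated_labels, estimated_reliability_per_user).
    """
    # Step 1: unweighted majority vote as an initial consensus.
    consensus = (votes.mean(axis=0) > 0.5).astype(float)
    # Step 2: estimate each user's reliability as agreement with the consensus
    # (a crude proxy for the unsupervised error-rate estimation in the paper).
    reliability = (votes == consensus).mean(axis=1)
    # Step 3: re-aggregate with reliability weights; users near chance level
    # contribute little or nothing.
    weights = np.clip(reliability - 0.5, 0.0, None)
    if weights.sum() == 0:
        weights = np.ones_like(weights)
    weighted = (weights[:, None] * votes).sum(axis=0) / weights.sum()
    return (weighted > 0.5).astype(float), reliability

# Toy usage: 5 simulated users with different error rates answer 200 queries.
rng = np.random.default_rng(0)
true_labels = rng.integers(0, 2, size=200)
error_rates = np.array([0.05, 0.15, 0.30, 0.45, 0.50])  # hypothetical crowd
flips = rng.random((5, 200)) < error_rates[:, None]
votes = np.where(flips, 1 - true_labels, true_labels)

labels, reliability = aggregate_crowd_preferences(votes)
print("aggregate accuracy:", (labels == true_labels).mean())
print("estimated per-user reliability:", np.round(reliability, 2))
```

The weighted re-aggregation upweights users whose estimated error rate is low, which is consistent with the abstract's observation that the benefit over plain majority voting is largest when user error rates vary widely across the crowd.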

Authors (3)
  1. David Chhan (1 paper)
  2. Ellen Novoseller (20 papers)
  3. Vernon J. Lawhern (17 papers)
Citations (4)
