
Batch Active Learning of Reward Functions from Human Preferences (2402.15757v1)

Published 24 Feb 2024 in cs.LG, cs.AI, cs.RO, and stat.ML

Abstract: Data generation and labeling are often expensive in robot learning. Preference-based learning enables reliable labeling by querying users with preference questions. Active querying methods are commonly employed in preference-based learning to generate more informative data, at the expense of parallelization and computation time. In this paper, we develop a set of novel batch active preference-based learning algorithms that enable efficient learning of reward functions from as few data samples as possible while keeping query generation times short and retaining parallelizability. We introduce a method based on determinantal point processes (DPPs) for active batch generation, along with several heuristic-based alternatives. Finally, we present experimental results for a variety of robotics tasks in simulation. Our results suggest that our batch active learning algorithms require only a few queries, each computed in a short amount of time. We showcase one of our algorithms in a user study learning human users' preferences.
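
The abstract's core idea is to use a determinantal point process to select batches of preference queries that are both individually informative and mutually diverse. As a minimal illustrative sketch (not the paper's implementation), the Python snippet below greedily maximizes the log-determinant of a quality-weighted similarity kernel to pick such a batch; the RBF kernel, the `scores` weights, and the function name `greedy_dpp_batch` are assumptions introduced here for illustration.

```python
# Illustrative sketch only: greedy MAP-style selection from a DPP kernel
# to pick a diverse, informative batch of preference queries.
# Kernel construction and scoring below are assumptions, not the paper's method.
import numpy as np

def greedy_dpp_batch(features, scores, batch_size, gamma=1.0):
    """Greedily select `batch_size` indices maximizing the log-determinant
    of a quality-weighted similarity kernel (a common DPP MAP heuristic).

    features : (n, d) array of candidate query feature vectors
               (e.g., trajectory feature differences).
    scores   : (n,) per-query informativeness weights.
    """
    # RBF similarity kernel between candidate queries
    sq_dists = np.sum((features[:, None, :] - features[None, :, :]) ** 2, axis=-1)
    S = np.exp(-gamma * sq_dists)
    # Quality-weighted DPP kernel: L_ij = sqrt(q_i) * S_ij * sqrt(q_j)
    q = np.sqrt(np.asarray(scores, dtype=float))
    L = q[:, None] * S * q[None, :]

    selected, remaining = [], list(range(len(features)))
    for _ in range(batch_size):
        best, best_logdet = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best, best_logdet = i, logdet
        if best is None:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Example: pick 5 diverse queries out of 100 random candidates
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 4))
info = rng.uniform(0.1, 1.0, size=100)
print(greedy_dpp_batch(feats, info, batch_size=5))
```

The greedy log-determinant heuristic is a standard stand-in for exact DPP MAP inference, which is NP-hard; the paper's actual DPP-based and heuristic batch generation methods may differ in kernel choice and optimization strategy.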

Authors (3)
  1. Erdem Bıyık (46 papers)
  2. Nima Anari (43 papers)
  3. Dorsa Sadigh (162 papers)
Citations (2)
