Batch Active Learning of Reward Functions from Human Preferences (2402.15757v1)
Abstract: Data generation and labeling are often expensive in robot learning. Preference-based learning enables reliable labeling by querying users with preference questions. Active querying methods are commonly employed in preference-based learning to generate more informative data, at the expense of parallelization and computation time. In this paper, we develop a set of novel algorithms, batch active preference-based learning methods, that enable efficient learning of reward functions from as few data samples as possible while keeping query generation times short and retaining parallelizability. We introduce a method based on determinantal point processes (DPPs) for active batch generation, along with several heuristic-based alternatives. Finally, we present experimental results for a variety of robotics tasks in simulation. Our results suggest that our batch active learning algorithms require only a few queries, each computed in a short amount of time. We showcase one of our algorithms in a user study to learn human users' preferences.
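The core idea in the abstract, selecting a batch of preference queries that are individually informative yet mutually diverse via a determinantal point process, can be sketched in a few lines. The following is a minimal illustration only, not the paper's exact algorithm: the RBF similarity kernel, the `features`/`scores` inputs, the `gamma` parameter, and the greedy MAP-style selection are all assumptions made for the sketch.

```python
import numpy as np

def greedy_dpp_batch(features, scores, batch_size, gamma=1.0):
    """Greedily pick a batch that approximately maximizes the determinant of a
    quality-diversity DPP kernel (a common MAP heuristic; hypothetical sketch).

    features: (n, d) array, one feature vector per candidate query
    scores:   (n,) informativeness of each candidate (e.g., information gain)
    """
    # Similarity kernel: RBF on pairwise feature distances (an assumption).
    sq_dists = np.sum((features[:, None, :] - features[None, :, :]) ** 2, axis=-1)
    S = np.exp(-gamma * sq_dists)
    # Quality-weighted DPP kernel: L = diag(q) S diag(q).
    q = np.asarray(scores, dtype=float)
    L = q[:, None] * S * q[None, :]

    selected = []
    for _ in range(batch_size):
        best, best_logdet = None, -np.inf
        for i in range(len(q)):
            if i in selected:
                continue
            idx = selected + [i]
            # Log-determinant of the kernel restricted to the candidate batch.
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best, best_logdet = i, logdet
        if best is None:  # remaining candidates only yield singular submatrices
            break
        selected.append(best)
    return selected
```

A larger determinant corresponds to a batch whose members have high individual quality (large `scores`) and low mutual similarity, which is exactly the informativeness/diversity trade-off that batch active query generation targets.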
Authors: Erdem Bıyık, Nima Anari, Dorsa Sadigh