DP-Dueling: Learning from Preference Feedback without Compromising User Privacy (2403.15045v1)
Abstract: We consider the well-studied dueling bandit problem, where a learner aims to identify near-optimal actions using pairwise comparisons, under the constraint of differential privacy. We consider a general class of utility-based preference matrices for large (potentially unbounded) decision spaces and give the first differentially private dueling bandit algorithm for active learning with user preferences. Our proposed algorithms are computationally efficient with near-optimal performance, in terms of both the private and non-private regret bounds. More precisely, we show that when the decision space is of finite size $K$, our proposed algorithm yields the order-optimal regret bound $O\Big(\sum_{i = 2}^{K}\log\frac{KT}{\Delta_i} + \frac{K}{\epsilon}\Big)$ for pure $\epsilon$-DP, where $\Delta_i$ denotes the suboptimality gap of the $i$-th arm. We also present a matching lower bound analysis which proves the optimality of our algorithms. Finally, we extend our results to any general decision space in $d$ dimensions with potentially infinitely many arms and design an $\epsilon$-DP algorithm with regret $\tilde{O}\left( \frac{d^6}{\kappa \epsilon} + \frac{d\sqrt{T}}{\kappa} \right)$, providing privacy for free when $T \gg d$.
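To make the setting concrete, the following is a minimal sketch (not the paper's DP-Dueling algorithm) of the private pairwise-comparison primitive the abstract describes: two arms are dueled by users under an assumed Bradley-Terry utility model, and the win count is released through the standard Laplace mechanism, since changing any single user's comparison shifts the count by at most 1.

```python
import math
import random

def laplace(scale, rng):
    # Sample Laplace(0, scale) via inverse-CDF transform of a uniform draw.
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def private_best_of_pair(u_i, u_j, n, eps, rng):
    """Duel arm i against arm j for n rounds under a Bradley-Terry model,
    then privatize arm i's win count with Laplace(1/eps) noise.
    One user's comparison changes the count by at most 1, so the
    released count is eps-differentially private."""
    p = 1.0 / (1.0 + math.exp(u_j - u_i))   # P(arm i beats arm j)
    wins_i = sum(rng.random() < p for _ in range(n))
    noisy_wins = wins_i + laplace(1.0 / eps, rng)
    return 0 if noisy_wins >= n / 2.0 else 1  # index of the (noisy) winner

rng = random.Random(1)
winner = private_best_of_pair(1.0, 0.0, n=2000, eps=1.0, rng=rng)
```

With a large utility gap and $n$ comparisons, the $O(1/\epsilon)$ Laplace noise is swamped by the $\Omega(n)$ win margin, which is the intuition behind the additive (rather than multiplicative) $K/\epsilon$ privacy cost in the regret bound above.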