Perturbed-History Exploration in Stochastic Linear Bandits (1903.09132v2)
Abstract: We propose a new online algorithm for cumulative regret minimization in a stochastic linear bandit. The algorithm pulls the arm with the highest estimated reward in a linear model trained on its perturbed history. Therefore, we call it perturbed-history exploration in a linear bandit (LinPHE). The perturbed history is a mixture of observed rewards and randomly generated i.i.d. pseudo-rewards. We derive an $\tilde{O}(d \sqrt{n})$ gap-free bound on the $n$-round regret of LinPHE, where $d$ is the number of features. The key steps in our analysis are new concentration and anti-concentration bounds on weighted sums of Bernoulli random variables. To show the generality of our design, we generalize LinPHE to a logistic model. We evaluate our algorithms empirically and show that they are practical.
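The abstract describes the algorithm only at a high level, so the following is a minimal sketch of what one round of perturbed-history exploration in a linear bandit might look like: a ridge-regularized least-squares fit on the observed rewards mixed with freshly drawn Bernoulli pseudo-rewards, followed by a greedy pull. The perturbation scale `a`, the regularizer `lam`, and the `(a + 1)` scaling of the Gram matrix are illustrative assumptions, not necessarily the paper's exact design.

```python
import numpy as np

def linphe(arms, n_rounds, pull, a=1, lam=1.0, rng=None):
    """Sketch of perturbed-history exploration for a linear bandit (LinPHE).

    arms : (K, d) array of arm feature vectors.
    pull : callback pull(arm_index) -> observed reward in [0, 1].
    a    : number of Bernoulli(1/2) pseudo-rewards added per past observation (assumed tuning).
    lam  : ridge regularizer (assumed; the paper's exact regularization may differ).
    """
    rng = np.random.default_rng() if rng is None else rng
    K, d = arms.shape
    G = lam * np.eye(d)      # regularized Gram matrix of pulled features
    history = []             # (feature vector, observed reward) pairs
    theta = np.zeros(d)

    for t in range(n_rounds):
        # Re-perturb the whole history with fresh i.i.d. pseudo-rewards.
        b = np.zeros(d)
        for x, y in history:
            pseudo = rng.binomial(1, 0.5, size=a).sum()
            b += x * (y + pseudo)
        theta = np.linalg.solve(G, b)       # least-squares estimate on the perturbed history
        k = int(np.argmax(arms @ theta))    # greedy arm under the perturbed estimate
        y = pull(k)
        history.append((arms[k], y))
        # (a + 1) scaling is an assumption: each pull contributes one reward and a pseudo-rewards.
        G += (a + 1) * np.outer(arms[k], arms[k])
    return theta
```

Re-drawing the pseudo-rewards for the entire history in every round is what randomizes the estimate and drives exploration, playing a role loosely analogous to the posterior draw in Thompson sampling.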