
Perturbed-History Exploration in Stochastic Linear Bandits (1903.09132v2)

Published 21 Mar 2019 in cs.LG and stat.ML

Abstract: We propose a new online algorithm for cumulative regret minimization in a stochastic linear bandit. The algorithm pulls the arm with the highest estimated reward in a linear model trained on its perturbed history. Therefore, we call it perturbed-history exploration in a linear bandit (LinPHE). The perturbed history is a mixture of observed rewards and randomly generated i.i.d. pseudo-rewards. We derive a $\tilde{O}(d \sqrt{n})$ gap-free bound on the $n$-round regret of LinPHE, where $d$ is the number of features. The key steps in our analysis are new concentration and anti-concentration bounds on the weighted sum of Bernoulli random variables. To show the generality of our design, we generalize LinPHE to a logistic model. We evaluate our algorithms empirically and show that they are practical.
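The abstract specifies the high-level loop of LinPHE: maintain the history of pulled feature vectors and rewards, mix in randomly generated i.i.d. pseudo-rewards, fit a linear model on that perturbed history, and pull the arm with the highest estimated reward. Below is a minimal sketch of that loop. The Bernoulli(1/2) pseudo-rewards, the perturbation count `a`, and the ridge regularizer `lam` are illustrative assumptions for this sketch, not the paper's exact construction or tuning.

```python
import numpy as np

def lin_phe(arms, pull, n, a=1, lam=1.0, seed=0):
    """Sketch of perturbed-history exploration in a linear bandit.

    arms: (K, d) array of arm feature vectors.
    pull: callable, pull(i) returns a stochastic reward in [0, 1] for arm i.
    n:    number of rounds.
    """
    rng = np.random.default_rng(seed)
    K, d = arms.shape
    xs, ys = [], []                            # observed history so far
    for t in range(n):
        if not xs:
            i = int(rng.integers(K))           # no history yet: pull an arbitrary arm
        else:
            X = np.array(xs)                   # (t, d) features of past pulls
            y = np.array(ys)                   # (t,)  observed rewards
            # Perturbed history: pair each past example with `a` i.i.d.
            # Bernoulli(1/2) pseudo-rewards (an assumption of this sketch),
            # then solve regularized least squares on the enlarged data set.
            z = rng.binomial(1, 0.5, size=(len(y), a)).sum(axis=1)
            G = (a + 1) * X.T @ X + lam * np.eye(d)
            theta = np.linalg.solve(G, X.T @ (y + z))
            i = int(np.argmax(arms @ theta))   # greedy w.r.t. the perturbed estimate
        r = pull(i)
        xs.append(arms[i])
        ys.append(r)
    return np.array(xs), np.array(ys)


# Toy usage on a synthetic problem (hypothetical setup, for illustration only):
# 5 unit-norm arms in 3 dimensions with Bernoulli rewards whose means are x^T theta*.
rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))
A /= np.linalg.norm(A, axis=1, keepdims=True)
theta_star = np.array([0.6, 0.3, 0.1])
xs, ys = lin_phe(A, lambda i: rng.binomial(1, np.clip(A[i] @ theta_star, 0.0, 1.0)), n=1000)
```

The randomness of the pseudo-rewards plays the role that posterior sampling plays in Thompson sampling: it injects enough variance into the least-squares estimate that every near-optimal arm keeps being pulled, which is what the paper's concentration and anti-concentration arguments quantify.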
