Efficient and Adaptive Posterior Sampling Algorithms for Bandits (2405.01010v1)
Abstract: We study Thompson Sampling-based algorithms for stochastic bandits with bounded rewards. As the existing problem-dependent regret bound for Thompson Sampling with Gaussian priors [Agrawal and Goyal, 2017] is vacuous when $T \le 288 e^{64}$, we derive a more practical bound that tightens the coefficient of the leading term from $288 e^{64}$ to $1270$. Additionally, motivated by large-scale real-world applications that require scalability, adaptive computational resource allocation, and a balance between utility and computation, we propose two parameterized Thompson Sampling-based algorithms: Thompson Sampling with Model Aggregation (TS-MA-$\alpha$) and Thompson Sampling with Timestamp Duelling (TS-TD-$\alpha$), where $\alpha \in [0,1]$ controls the trade-off between utility and computation. Both algorithms achieve an $O\left(K \ln^{\alpha+1}(T)/\Delta\right)$ regret bound, where $K$ is the number of arms, $T$ is the finite learning horizon, and $\Delta$ denotes the single-round performance loss incurred when pulling a sub-optimal arm.
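For scale, $288 e^{64} \approx 1.8 \times 10^{30}$, so a regret bound that is vacuous below that horizon says nothing about any realistic $T$; tightening the leading coefficient to $1270$ is what makes the guarantee usable. The abstract does not describe TS-MA-$\alpha$ or TS-TD-$\alpha$ in enough detail to reconstruct them, so the sketch below shows only the Gaussian-prior Thompson Sampling baseline analyzed by Agrawal and Goyal [2017], in which each arm $i$ is scored by a sample from $N(\hat{\mu}_i, 1/(n_i+1))$. The function name and the Bernoulli reward model are illustrative assumptions, not the paper's algorithms.

```python
import numpy as np

def thompson_sampling_gaussian(means, T, seed=0):
    """Gaussian-prior Thompson Sampling for rewards bounded in [0, 1],
    following Agrawal and Goyal [2017]: score arm i with a sample from
    N(mu_hat_i, 1 / (n_i + 1)) and pull the arm with the highest sample.
    Returns the cumulative pseudo-regret over T rounds."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(means, dtype=float)   # true (unknown) arm means
    K = len(mu)
    n = np.zeros(K)                       # pull counts per arm
    mu_hat = np.zeros(K)                  # empirical mean rewards
    regret = 0.0
    for _ in range(T):
        # Posterior sample per arm: variance shrinks as 1 / (n_i + 1).
        theta = rng.normal(mu_hat, 1.0 / np.sqrt(n + 1))
        i = int(np.argmax(theta))
        r = rng.binomial(1, mu[i])        # Bernoulli reward in {0, 1}
        n[i] += 1
        mu_hat[i] += (r - mu_hat[i]) / n[i]  # incremental mean update
        regret += mu.max() - mu[i]
    return regret

# Example: two arms with gap Delta = 0.1.
print(thompson_sampling_gaussian([0.5, 0.6], T=10_000))
```

This baseline draws one posterior sample per arm per round; the paper's parameterized variants target exactly this per-round sampling cost, with $\alpha$ trading extra $\ln(T)$ factors in regret for reduced computation.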
- Shipra Agrawal and Navin Goyal. Near-optimal regret bounds for Thompson Sampling. http://www.columbia.edu/~sa3305/papers/j3-corrected.pdf, 2017.
- Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Tuning bandit algorithms in stochastic environments. In International Conference on Algorithmic Learning Theory, pages 150–165. Springer, 2007.
- Peter Auer and Ronald Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65, 2010.
- Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47:235–256, 2002.
- Dorian Baudry, Patrick Saux, and Odalric-Ambrym Maillard. From optimality to robustness: Adaptive re-sampling strategies in stochastic bandits. Advances in Neural Information Processing Systems, 34:14029–14041, 2021.
- Jie Bian and Kwang-Sung Jun. Maillard sampling: Boltzmann exploration done optimally. In International Conference on Artificial Intelligence and Statistics, pages 54–72. PMLR, 2022.
- Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory, pages 359–376. JMLR Workshop and Conference Proceedings, 2011.
- Junya Honda and Akimichi Takemura. An asymptotically optimal bandit algorithm for bounded support models. In COLT, pages 67–79. Citeseer, 2010.
- Junya Honda and Akimichi Takemura. Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards. Journal of Machine Learning Research, 16:3721–3756, 2015.
- Tianyuan Jin, Pan Xu, Jieming Shi, Xiaokui Xiao, and Quanquan Gu. MOTS: Minimax optimal Thompson Sampling. In International Conference on Machine Learning, pages 5074–5083. PMLR, 2021.
- Tianyuan Jin, Pan Xu, Xiaokui Xiao, and Quanquan Gu. Finite-time regret of Thompson Sampling algorithms for exponential family multi-armed bandits. Advances in Neural Information Processing Systems, 35:38475–38487, 2022.
- Thompson Sampling with less exploration is fast and optimal. In International Conference on Machine Learning. PMLR, 2023.
- Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On Bayesian upper confidence bounds for bandit problems. In Artificial Intelligence and Statistics, pages 592–600. PMLR, 2012a.
- Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson Sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory, pages 199–213. Springer, 2012b.
- Tor Lattimore. Refining the confidence level for optimistic bandit strategies. Journal of Machine Learning Research, 19(1):765–796, 2018.
- Pierre Ménard and Aurélien Garivier. A minimax and asymptotically optimal algorithm for stochastic bandits. In International Conference on Algorithmic Learning Theory, pages 223–237. PMLR, 2017.
- Charles Riou and Junya Honda. Bandit algorithms based on Thompson Sampling for bounded reward distributions. In Algorithmic Learning Theory, pages 777–826. PMLR, 2020.
- Bingshan Hu
- Zhiming Huang
- Tianyue H. Zhang
- Nidhi Hegde
- Mathias Lécuyer