Langevin Thompson Sampling with Logarithmic Communication: Bandits and Reinforcement Learning (2306.08803v1)
Abstract: Thompson sampling (TS) is widely used in sequential decision making because it is easy to implement and performs well empirically. However, many existing analytical and empirical results for TS rely on restrictive assumptions on the reward distributions, such as membership in conjugate families, which limits their applicability in realistic scenarios. Moreover, sequential decision making problems are often carried out in a batched manner, either because of the inherent nature of the problem or to reduce communication and computation costs. In this work, we jointly study these problems in two popular settings, namely, stochastic multi-armed bandits (MABs) and infinite-horizon reinforcement learning (RL), where TS is used to learn the unknown reward distributions and transition dynamics, respectively. We propose batched $\textit{Langevin Thompson Sampling}$ algorithms that leverage MCMC methods to sample from approximate posteriors while requiring only a logarithmic number of communication batches. Our algorithms are computationally efficient and retain the same order-optimal regret guarantees as their fully sequential counterparts: $\mathcal{O}(\log T)$ for stochastic MABs and $\mathcal{O}(\sqrt{T})$ for RL. We complement our theoretical findings with experimental results.
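The abstract combines two ideas: approximating per-arm posteriors with Langevin MCMC instead of requiring conjugate families, and updating those posteriors only at batch boundaries so that the number of communication rounds grows logarithmically in the horizon. The sketch below illustrates this combination on a Gaussian-reward multi-armed bandit with unadjusted Langevin steps and a doubling batch schedule. It is a minimal illustration under assumptions made here (Gaussian prior and unit-variance rewards, the step sizes, burn-in lengths, and the doubling schedule are all choices of this sketch), not the paper's exact algorithm.

```python
# Minimal sketch: batched Langevin Thompson sampling for a Gaussian-reward bandit.
# All hyperparameters and modeling choices below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

K = 5                                   # number of arms
T = 10_000                              # horizon
true_means = rng.normal(0.0, 1.0, K)    # unknown arm means; rewards ~ N(mean, 1)
prior_var = 1.0                         # N(0, prior_var) prior on each arm's mean

counts = np.zeros(K)                    # observations collected so far
sums = np.zeros(K)
frozen_counts = counts.copy()           # statistics last communicated to the learner
frozen_sums = sums.copy()

theta = np.zeros(K)                     # current Langevin sample of each arm's mean


def langevin_steps(theta_k, n_k, sum_k, n_steps):
    """Unadjusted Langevin steps targeting one arm's Gaussian posterior.

    Log-posterior (up to a constant): -theta^2/(2*prior_var) - sum_i (r_i - theta)^2/2,
    so its gradient is -theta/prior_var + (sum_k - n_k*theta).
    The step size is scaled by the posterior precision for numerical stability.
    """
    step = 1.0 / (n_k + 1.0 / prior_var)
    for _ in range(n_steps):
        grad = -theta_k / prior_var + (sum_k - n_k * theta_k)
        theta_k = theta_k + 0.5 * step * grad + np.sqrt(step) * rng.normal()
    return theta_k


next_boundary = 1       # doubling schedule => O(log T) batch boundaries / communications
regret = 0.0

for t in range(1, T + 1):
    if t == next_boundary:
        # Batch boundary: communicate all new data and run a longer Langevin burn-in.
        frozen_counts, frozen_sums = counts.copy(), sums.copy()
        burn_in = 100
        next_boundary *= 2
    else:
        burn_in = 5     # short refresh against the (stale) batch posterior

    for k in range(K):
        theta[k] = langevin_steps(theta[k], frozen_counts[k], frozen_sums[k], burn_in)

    arm = int(np.argmax(theta))                 # Thompson step: act greedily on the samples
    reward = true_means[arm] + rng.normal()     # observe a noisy reward
    counts[arm] += 1
    sums[arm] += reward
    regret += true_means.max() - true_means[arm]

print(f"cumulative pseudo-regret over {T} rounds: {regret:.1f}")
```

The doubling schedule yields roughly $\log_2 T$ posterior refreshes, while the per-round Langevin steps reuse the statistics frozen at the last boundary; this separation of data communication from posterior sampling is the mechanism behind the logarithmic communication cost described in the abstract.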