Bayesian Design Principles for Frequentist Sequential Learning (2310.00806v6)
Abstract: We develop a general theory for optimizing the frequentist regret of sequential learning problems, from which efficient bandit and reinforcement-learning algorithms can be derived via unified Bayesian principles. We propose a novel optimization approach that generates "algorithmic beliefs" at each round and uses Bayesian posteriors to make decisions. The optimization objective used to create these algorithmic beliefs, which we term the "Algorithmic Information Ratio," is an intrinsic complexity measure that effectively characterizes the frequentist regret of any algorithm. To the best of our knowledge, this is the first systematic approach to making Bayesian-type algorithms prior-free and applicable to adversarial settings in a generic and optimal manner. Moreover, the resulting algorithms are simple and often efficient to implement. As a major application, we present a novel multi-armed bandit algorithm that achieves "best-of-all-worlds" empirical performance in stochastic, adversarial, and non-stationary environments, and we illustrate how these principles apply to linear bandits, bandit convex optimization, and reinforcement learning.
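To make the per-round structure described above concrete, here is a minimal sketch of the classical Bayesian posterior-sampling baseline (Bernoulli Thompson sampling) that the paper's "algorithmic beliefs" generalize. The Beta(1,1) prior, the Bernoulli reward model, and the function name are illustrative assumptions, not taken from the paper; the paper's actual step, re-optimizing the belief each round to minimize the Algorithmic Information Ratio instead of using a fixed prior's posterior, is indicated only as a comment.

```python
import numpy as np

def thompson_sampling(true_means, horizon, seed=0):
    """Bernoulli Thompson sampling with Beta(1, 1) priors.

    The paper's approach would replace the fixed-prior posterior
    below with an "algorithmic belief" re-optimized at every round
    to minimize the Algorithmic Information Ratio; this sketch keeps
    the classical conjugate update for illustration only.
    """
    rng = np.random.default_rng(seed)
    k = len(true_means)
    alpha = np.ones(k)  # posterior success counts + 1
    beta = np.ones(k)   # posterior failure counts + 1
    best = max(true_means)
    regret = 0.0
    for _ in range(horizon):
        # Draw one plausible mean per arm from the current belief,
        # then play the arm that looks best under that draw.
        sampled = rng.beta(alpha, beta)
        arm = int(np.argmax(sampled))
        reward = float(rng.random() < true_means[arm])
        alpha[arm] += reward
        beta[arm] += 1.0 - reward
        regret += best - true_means[arm]
    return regret

print(thompson_sampling([0.3, 0.5, 0.7], horizon=5000))
```

Note that the fixed prior is exactly what makes this baseline prior-dependent and stochastic-only; the abstract's contribution is replacing that prior with per-round optimized beliefs so the same posterior-sampling loop also handles adversarial and non-stationary environments.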