Regret Minimization via Saddle Point Optimization (2403.10379v1)
Abstract: A long line of works characterizes the sample complexity of regret minimization in sequential decision-making by min-max programs. In the corresponding saddle-point game, the min-player optimizes the sampling distribution against an adversarial max-player that chooses confusing models leading to large regret. The most recent instantiation of this idea is the decision-estimation coefficient (DEC), which was shown to provide nearly tight lower and upper bounds on the worst-case expected regret in structured bandits and reinforcement learning. By re-parametrizing the offset DEC with the confidence radius and solving the corresponding min-max program, we derive an anytime variant of the Estimation-To-Decisions (E2D) algorithm. Importantly, the algorithm optimizes the exploration-exploitation trade-off online instead of via the analysis. Our formulation leads to a practical algorithm for finite model classes and linear feedback models. We further point out connections to the information ratio, decoupling coefficient and PAC-DEC, and numerically evaluate the performance of E2D on simple examples.
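To make the saddle-point game in the abstract concrete, here is a minimal sketch of the offset decision-estimation coefficient in its standard form from the DEC literature; the notation (model class $\mathcal{M}$, decision set $\Pi$, reference estimate $\widehat{M}$, mean reward $f^{M}$, optimal decision $\pi^{M}$, Hellinger distance $D_{\mathrm{H}}$, offset parameter $\gamma$) is assumed here rather than quoted from this paper.

```latex
% Offset DEC as a min-max (saddle-point) program: the min-player chooses a
% sampling distribution p over decisions, the max-player chooses a confusing
% model M, and gamma trades instantaneous regret against information gain
% (squared Hellinger distance to the reference estimate \hat{M}).
% Standard formulation assumed; not copied verbatim from this paper.
\[
  \mathrm{dec}_{\gamma}(\mathcal{M}, \widehat{M})
  \;=\;
  \min_{p \in \Delta(\Pi)} \; \max_{M \in \mathcal{M}} \;
  \mathbb{E}_{\pi \sim p}\!\left[
      f^{M}(\pi^{M}) - f^{M}(\pi)
      \;-\; \gamma \, D_{\mathrm{H}}^{2}\!\bigl( M(\pi), \widehat{M}(\pi) \bigr)
  \right]
\]
```

The re-parametrization described in the abstract replaces the fixed offset $\gamma$ with a quantity tied to the confidence radius of the current estimate, so the exploration-exploitation trade-off is optimized online as the algorithm runs.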