Bayesian Regret Minimization in Offline Bandits
Abstract: We study how to make decisions that minimize Bayesian regret in offline linear bandits. Prior work suggests that one must take actions with maximum lower confidence bound (LCB) on their reward. We argue that the reliance on LCB is inherently flawed in this setting and propose a new algorithm that directly minimizes upper bounds on the Bayesian regret using efficient conic optimization solvers. Our bounds build heavily on new connections to monetary risk measures. Proving a matching lower bound, we show that our upper bounds are tight, and by minimizing them we are guaranteed to outperform the LCB approach. Our numerical results on synthetic domains confirm that our approach is superior to LCB.
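To make the two decision rules in the abstract concrete, here is a minimal Monte Carlo sketch of an offline linear bandit with a Gaussian posterior over the reward parameter. The action set, posterior mean and covariance, and the entropic-risk objective are all illustrative assumptions; the paper's actual algorithm minimizes tight upper bounds on Bayesian regret via conic optimization and is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical offline linear bandit: each action is a feature vector,
# reward(a) = <theta, a>, with theta unknown and a Gaussian posterior
# N(mu, Sigma) assumed to have been fitted from offline data.
actions = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
mu = np.array([0.5, 0.4])
Sigma = np.array([[0.05, 0.0], [0.0, 0.3]])

# LCB baseline: pick the action maximizing a lower confidence bound,
# posterior mean reward minus beta times its posterior std.
beta = 1.0
means = actions @ mu
stds = np.sqrt(np.einsum("ij,jk,ik->i", actions, Sigma, actions))
lcb_action = int(np.argmax(means - beta * stds))

# Risk-measure view of Bayesian regret: sample theta from the posterior,
# form the regret of each action, and minimize an entropic risk of that
# regret (a stand-in for the risk-measure bounds the abstract mentions).
thetas = rng.multivariate_normal(mu, Sigma, size=10_000)
rewards = thetas @ actions.T                       # (samples, actions)
regret = rewards.max(axis=1, keepdims=True) - rewards
alpha = 2.0
risk = np.log(np.mean(np.exp(alpha * regret), axis=0)) / alpha
risk_action = int(np.argmin(risk))
```

The entropic-risk objective penalizes actions whose regret has a heavy upper tail under the posterior, which is the qualitative behavior the abstract's risk-measure bounds capture; the LCB rule instead penalizes reward uncertainty directly and can disagree with it.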