Is Offline Decision Making Possible with Only Few Samples? Reliable Decisions in Data-Starved Bandits via Trust Region Enhancement (2402.15703v1)
Abstract: What can an agent learn in a stochastic Multi-Armed Bandit (MAB) problem from a dataset that contains just a single sample for each arm? Surprisingly, we demonstrate that even in such a data-starved setting it may still be possible to find a policy competitive with the optimal one. This paves the way for reliable decision-making in settings where critical decisions must be made from only a handful of samples. Our analysis reveals that stochastic policies can be substantially better than deterministic ones for offline decision-making. Focusing on offline multi-armed bandits, we design an algorithm called Trust Region of Uncertainty for Stochastic policy enhancemenT (TRUST), which differs markedly from the predominant value-based lower confidence bound (LCB) approach. Its design is enabled by localization laws, critical radii, and relative pessimism. We prove that its sample complexity is comparable to that of LCB on minimax problems while being substantially lower on problems with very few samples. Finally, we consider an application to offline reinforcement learning in the special case where the logging policies are known.
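To make the contrast in the abstract concrete, here is a minimal sketch of the two approaches on a data-starved bandit with one sample per arm. The LCB baseline deterministically picks the arm with the highest lower confidence bound; the alternative searches over stochastic policies on the simplex. Note the surrogate objective, the penalty weight `lam`, and the exponentiated-gradient solver below are illustrative assumptions for exposition only, not the paper's actual TRUST algorithm (which relies on localization laws, critical radii, and relative pessimism).

```python
import numpy as np

rng = np.random.default_rng(0)

def lcb_arm(mean_hat, counts, delta=0.1):
    """Predominant value-based baseline: deterministically pick the arm
    with the highest lower confidence bound on its estimated mean."""
    width = np.sqrt(np.log(len(mean_hat) / delta) / (2.0 * counts))
    return int(np.argmax(mean_hat - width))

def stochastic_policy(mean_hat, counts, lam=1.0, steps=1000, lr=0.5):
    """Illustrative alternative: optimize a *stochastic* policy pi on the
    simplex by maximizing a hypothetical surrogate lower bound
        pi @ mean_hat - lam * sqrt(sum_i pi_i**2 / n_i)
    via exponentiated-gradient ascent (a stand-in, not TRUST itself)."""
    K = len(mean_hat)
    pi = np.full(K, 1.0 / K)                             # start uniform
    for _ in range(steps):
        penalty = np.sqrt(np.sum(pi ** 2 / counts)) + 1e-12
        grad = mean_hat - lam * (pi / counts) / penalty  # gradient of the bound
        pi = pi * np.exp(lr * grad)                      # multiplicative update
        pi /= pi.sum()                                   # renormalize to the simplex
    return pi

# Data-starved setting: K arms, exactly one (noisy) sample per arm.
K = 20
true_means = rng.uniform(0.0, 1.0, size=K)
counts = np.ones(K)
mean_hat = rng.normal(true_means, 0.3)

arm = lcb_arm(mean_hat, counts)
pi = stochastic_policy(mean_hat, counts)
print("LCB policy value:       ", true_means[arm])
print("stochastic policy value:", pi @ true_means)
```

With a single sample per arm the confidence widths are all equally wide, so LCB effectively commits to whichever arm got a lucky draw; the stochastic policy instead spreads mass over several plausibly good arms, which illustrates the abstract's claim that stochastic policies can outperform deterministic ones in this regime.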