Hierarchical Upper Confidence Bounds for Constrained Online Learning (2410.17216v2)
Abstract: The multi-armed bandit (MAB) problem is a foundational framework for sequential decision-making under uncertainty, extensively studied for its applications in areas such as clinical trials, online advertising, and resource allocation. Traditional MAB formulations, however, do not adequately capture scenarios where decisions are structured hierarchically, involve multi-level constraints, or feature context-dependent action spaces. In this paper, we introduce the hierarchical constrained bandits (HCB) framework, which extends the contextual bandit problem to incorporate hierarchical decision structures and multi-level constraints. We propose the hierarchical constrained upper confidence bound (HC-UCB) algorithm, which addresses the HCB problem by leveraging confidence bounds within a hierarchical setting. Our theoretical analysis establishes sublinear regret bounds for HC-UCB and provides high-probability guarantees for constraint satisfaction at all hierarchical levels. Furthermore, we derive a minimax lower bound on the regret for the HCB problem, demonstrating the near-optimality of our algorithm. These results are relevant to real-world applications whose decision-making processes are inherently hierarchical and constrained, as HC-UCB balances exploration and exploitation across multiple levels of decision-making.
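The abstract describes HC-UCB only at a high level, so the following is a minimal illustrative sketch rather than the authors' algorithm: it assumes a two-level setting with linear contextual rewards estimated per level (LinUCB-style ridge estimates plus a confidence bonus) and known per-arm costs checked against per-level budgets. The class name, the parameters `alpha` and `lam`, and the feasibility rule are all assumptions made for illustration.

```python
# Hypothetical sketch of hierarchical UCB selection under multi-level
# budget constraints; not the paper's published algorithm.
import numpy as np


class LinUCBLevel:
    """One decision level: ridge-regression reward estimates plus a UCB bonus."""

    def __init__(self, dim, n_arms, alpha=1.0, lam=1.0):
        self.alpha = alpha                                    # exploration weight (assumed)
        self.A = [lam * np.eye(dim) for _ in range(n_arms)]   # per-arm Gram matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]       # per-arm reward-weighted features

    def ucb(self, x, arm):
        A_inv = np.linalg.inv(self.A[arm])
        theta = A_inv @ self.b[arm]                           # ridge estimate of reward weights
        bonus = self.alpha * np.sqrt(x @ A_inv @ x)           # confidence width
        return float(theta @ x + bonus)

    def update(self, x, arm, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x


def hierarchical_select(levels, context, costs, budgets):
    """Pick one arm per level by UCB score, restricted to arms whose
    (assumed known) cost fits that level's budget."""
    chosen = []
    for lvl, (model, budget) in enumerate(zip(levels, budgets)):
        feasible = [a for a in range(len(model.A)) if costs[lvl][a] <= budget]
        if not feasible:                                      # fallback: cheapest arm
            feasible = [int(np.argmin(costs[lvl]))]
        scores = {a: model.ucb(context, a) for a in feasible}
        chosen.append(max(scores, key=scores.get))
    return chosen


# Toy usage: two levels, 3 arms each, 4-dimensional context.
rng = np.random.default_rng(0)
levels = [LinUCBLevel(dim=4, n_arms=3) for _ in range(2)]
costs = [np.array([0.2, 0.5, 0.9]), np.array([0.1, 0.4, 0.8])]
ctx = rng.normal(size=4)
arms = hierarchical_select(levels, ctx, costs, budgets=[0.6, 0.5])
for lvl, arm in enumerate(arms):
    levels[lvl].update(ctx, arm, reward=rng.uniform())
```

The sketch conveys the core idea stated in the abstract, using confidence bounds at each level of a hierarchy while enforcing level-specific constraints; the paper's actual constraint handling and regret analysis are not reproduced here.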