Achieving Instance-dependent Sample Complexity for Constrained Markov Decision Process (2402.16324v3)
Abstract: We consider the reinforcement learning problem for the constrained Markov decision process (CMDP), which plays a central role in satisfying safety or resource constraints in sequential learning and decision-making. In this problem, we are given finite resources and an MDP with unknown transition probabilities. At each stage, we take an action, collecting a reward and consuming some resources, all of which are unknown and must be learned over time. In this work, we take the first step towards deriving optimal problem-dependent guarantees for CMDP problems. We derive a logarithmic regret bound, which translates into an $O(\frac{1}{\Delta\cdot\epsilon}\cdot\log^2(1/\epsilon))$ sample complexity bound, where $\Delta$ is a problem-dependent parameter that is independent of $\epsilon$. Our sample complexity bound improves upon the state-of-the-art $O(1/\epsilon^2)$ sample complexity for CMDP problems established in the previous literature, in terms of the dependency on $\epsilon$. To achieve this advance, we develop a new framework for analyzing CMDP problems. Specifically, our algorithm operates in the primal space, and we re-solve the primal LP for the CMDP problem at each period in an online manner, with adaptively updated remaining resource capacities. The key elements of our algorithm are: i) a characterization of the instance hardness via the LP basis; ii) an elimination procedure that identifies one optimal basis of the primal LP; and iii) a resolving procedure that adapts to the remaining resources and sticks to the characterized optimal basis.
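To make the object being re-solved concrete, below is a minimal sketch of one standard occupancy-measure primal LP for a discounted CMDP, solved with `scipy.optimize.linprog`. The discounted formulation, the normalization, and all symbol and function names here (`solve_cmdp_lp`, `P`, `r`, `d`, `rho`, `mu`, `gamma`) are illustrative assumptions and are not taken from the paper; the paper's own LP and resolving procedure may differ in detail.

```python
import numpy as np
from scipy.optimize import linprog

def solve_cmdp_lp(P, r, d, rho, mu, gamma=0.95):
    """Solve an occupancy-measure primal LP for a discounted CMDP (illustrative sketch).

    P   : (S, A, S) transition probabilities (treated as known in this sketch)
    r   : (S, A)    rewards
    d   : (K, S, A) per-resource consumption
    rho : (K,)      resource budgets
    mu  : (S,)      initial state distribution

    Returns the optimal occupancy measure q of shape (S, A).
    """
    S, A = r.shape
    K = rho.shape[0]

    # Flow conservation: sum_a q(s',a) - gamma * sum_{s,a} P(s'|s,a) q(s,a) = (1-gamma) * mu(s')
    A_eq = np.zeros((S, S * A))
    for s_next in range(S):
        for s in range(S):
            for a in range(A):
                A_eq[s_next, s * A + a] = float(s == s_next) - gamma * P[s, a, s_next]
    b_eq = (1.0 - gamma) * mu

    # Resource constraints: sum_{s,a} d_k(s,a) q(s,a) <= rho_k for every resource k
    A_ub = d.reshape(K, S * A)
    b_ub = rho

    # linprog minimizes, so negate the rewards to maximize expected return.
    res = linprog(-r.reshape(S * A), A_ub=A_ub, b_ub=b_ub,
                  A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.x.reshape(S, A)
```

In a resolving scheme of the kind sketched in the abstract, one would call `solve_cmdp_lp` repeatedly over time, replacing `rho` at each period with the remaining (suitably normalized) resource capacities, while the basis-identification step restricts attention to the variables of the identified optimal basis.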