Constraint-Generation Policy Optimization (CGPO): Nonlinear Programming for Policy Optimization in Mixed Discrete-Continuous MDPs (2401.12243v2)
Abstract: We propose the Constraint-Generation Policy Optimization (CGPO) framework to optimize policy parameters within compact and interpretable policy classes for mixed discrete-continuous Markov Decision Processes (DC-MDPs). CGPO not only provides bounded policy error guarantees over an infinite range of initial states for many DC-MDPs with expressive nonlinear dynamics, but also provably derives optimal policies in cases where it terminates with zero error. Furthermore, CGPO can generate worst-case state trajectories to diagnose policy deficiencies and provide counterfactual explanations of optimal actions. To achieve these results, CGPO poses a bilevel mixed-integer nonlinear optimization problem for optimizing policies within defined expressivity classes (e.g., piecewise linear) and reduces it to an optimal constraint-generation methodology that adversarially generates worst-case state trajectories. By leveraging modern nonlinear optimizers, CGPO obtains solutions with bounded optimality gap guarantees. We handle stochastic transitions through chance constraints, providing high-probability performance guarantees. We also present a roadmap for understanding the computational complexity of different expressivity classes of policies, rewards, and transition dynamics. We experimentally demonstrate the applicability of CGPO across various domains, including inventory control, management of a water reservoir system, and physics control. In summary, CGPO provides structured, compact, and explainable policies with bounded performance guarantees, enabling worst-case scenario generation and counterfactual policy diagnostics.
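At a high level, constraint generation replaces the intractable "for all initial states" requirement with a growing finite set of adversarially chosen states: an outer problem optimizes the policy parameters against the current set, an inner problem searches for an initial state (and induced trajectory) on which that policy performs worst, and the gap between the two objectives bounds the remaining error. Below is a minimal sketch of this loop on a hypothetical toy problem (a linear policy a = w*s, quadratic cost, additive dynamics); the toy dynamics, the helper names (`rollout_cost`, `inner_adversary`, `outer_policy`), and the use of SciPy's general-purpose optimizers in place of the mixed-integer nonlinear solvers the paper relies on are all illustrative assumptions, not the authors' implementation.

```python
# Minimal constraint-generation sketch for min-max policy optimization.
# Toy problem and all helper names are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize, minimize_scalar

H = 10                     # planning horizon
S0_LO, S0_HI = -1.0, 1.0   # interval of initial states to certify over

def rollout_cost(w, s0):
    """Total cost of the linear policy a = w*s from initial state s0."""
    s, cost = s0, 0.0
    for _ in range(H):
        a = w[0] * s               # compact, interpretable policy class
        cost += s**2 + a**2        # quadratic stage cost (assumed)
        s = s + a                  # simple additive dynamics (assumed)
    return cost

def inner_adversary(w):
    """Inner problem: find the worst-case initial state for policy w."""
    res = minimize_scalar(lambda s0: -rollout_cost(w, s0),
                          bounds=(S0_LO, S0_HI), method="bounded")
    return res.x, -res.fun

def outer_policy(S0_set):
    """Outer problem: minimize the max cost over the generated states."""
    obj = lambda w: max(rollout_cost(w, s0) for s0 in S0_set)
    return minimize(obj, x0=np.zeros(1), method="Nelder-Mead").x

S0_set, w = [0.5], np.zeros(1)
for it in range(20):
    w = outer_policy(S0_set)                   # optimize vs. current set
    s0_worst, worst_cost = inner_adversary(w)  # adversarial trajectory
    incumbent = max(rollout_cost(w, s0) for s0 in S0_set)
    gap = worst_cost - incumbent               # bound on remaining error
    if gap <= 1e-6:                            # worst case already covered
        break
    S0_set.append(s0_worst)                    # generate the new constraint

print(f"w = {w}, worst-case cost = {worst_cost:.4f}, iterations = {it + 1}")
```

In the paper's setting, both levels are solved as mixed-integer nonlinear programs whose solver-reported bounds yield the certified optimality gap; the local SciPy adversary above only approximates that inner step and carries no such certificate.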