Efficiently Training Deep-Learning Parametric Policies using Lagrangian Duality
Abstract: Constrained Markov Decision Processes (CMDPs) are critical in many high-stakes applications where decisions must optimize cumulative rewards while strictly adhering to complex nonlinear constraints. In domains such as power systems, finance, supply chains, and precision robotics, violating these constraints can incur significant financial or societal costs. Existing Reinforcement Learning (RL) methods often struggle with sample efficiency and with finding feasible policies for tightly constrained CMDPs, limiting their applicability in these environments. Stochastic dual dynamic programming is often applied in practice to convex relaxations of the original problem, but it also faces computational challenges and a loss of optimality. This paper introduces a novel approach, Two-Stage Deep Decision Rules (TS-DDR), to efficiently train parametric actor policies using Lagrangian duality. TS-DDR is a self-supervised learning algorithm that trains general decision rules (parametric policies) using stochastic gradient descent (SGD); its forward passes solve deterministic optimization problems to find feasible policies, and its backward passes leverage duality theory to train the parametric policy with closed-form gradients. TS-DDR thus inherits the flexibility and computational performance of deep-learning methodologies for solving CMDPs. Applied to the Long-Term Hydrothermal Dispatch (LTHD) problem using actual power-system data from Bolivia, TS-DDR is shown to enhance solution quality and to reduce computation times by several orders of magnitude compared to current state-of-the-art methods.
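The forward/backward structure described in the abstract can be sketched on a toy parametrized linear program. This is a hedged illustration only, not the paper's implementation: the instance data and the single scalar parameter `theta` are invented here. The forward pass solves a deterministic LP given the policy output `theta`, and the backward pass recovers the gradient of the optimal value in closed form from the constraint duals (LP sensitivity / envelope theorem), which is the kind of closed-form gradient a duality-based training loop can feed into SGD.

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance (hypothetical data): the policy output theta shifts the
# constraint right-hand side. Forward pass: solve the deterministic LP
#   min  c @ x   s.t.  x1 + x2 >= 1 + theta,  x >= 0.
c = np.array([1.0, 2.0])
A_ub = np.array([[-1.0, -1.0]])  # -x1 - x2 <= -(1 + theta)

def value_and_grad(theta):
    b_ub = np.array([-(1.0 + theta)])  # RHS depends on the policy output
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * 2, method="highs")
    assert res.status == 0
    # Backward pass: res.ineqlin.marginals is dV/db_ub (the constraint
    # duals); the chain rule through b_ub = -(1 + theta) gives dV/dtheta.
    dV_db = res.ineqlin.marginals
    grad_theta = float(dV_db @ np.array([-1.0]))
    return res.fun, grad_theta

V, g = value_and_grad(0.5)  # optimum x = (1.5, 0), so V = 1.5 and dV/dtheta = 1
```

No second optimization solve is needed for the gradient: the duals returned by the forward solve already encode the sensitivity of the optimal value to the policy parameters.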