Policy Optimization-Based Restoration Framework
- A policy optimization-based restoration framework is a control-theoretic paradigm that formulates infrastructure recovery as a sequential decision process using MDP formulations and reinforcement learning.
- It leverages techniques such as dynamic programming, PPO, and constrained optimization to achieve efficient, scalable, and equitable recovery of power distribution networks.
- The framework integrates domain-specific physics, operational constraints, and uncertainty modeling to optimize restoration performance under complex real-world conditions.
A policy optimization-based restoration framework is a control-theoretic and algorithmic paradigm for recovering critical infrastructure systems—predominantly power distribution networks—following large-scale exogenous disruptions such as natural disasters. These frameworks formulate restoration as a sequential decision process, model relevant physical, resource, and operational constraints, and compute (or learn) a restoration policy by directly optimizing expected restoration performance, typically through Markov decision process (MDP) formulations and modern policy optimization techniques. Recent advances address nonlinearity, uncertainty, heterogeneity, and scalability via reinforcement learning (RL), rollout dynamic programming, graph neural policies, and constrained optimization, tailored to the domain's physical constraints and operational uncertainty (Dolatyabi et al., 18 Nov 2025, Nozhati et al., 2018, Işık et al., 5 Apr 2024, Li et al., 21 Dec 2025, Zhang et al., 2022, Bose et al., 2021, Maurer et al., 24 Jun 2025, Jiang et al., 6 Aug 2025).
1. Formal Markov Decision Process Formulations
Policy-optimization restoration frameworks consistently formulate the system recovery task as either a standard MDP, a constrained MDP (CMDP), or a multi-agent extension, with:
- State space encoding grid topology, component health, DG/DER/ESS states, crew locations, and/or multi-network (e.g., power-water) coupling (Nozhati et al., 2018, Işık et al., 5 Apr 2024, Dolatyabi et al., 18 Nov 2025).
- Action space comprising switching operations, repair assignments, load/DER dispatch setpoints, mobile generator/crew deployments, or energization attempts, subject to physical and topological constraints (Gol et al., 2019, Maurer et al., 24 Jun 2025, Li et al., 21 Dec 2025).
- Transition function determined by full-physics simulators (such as OpenDSS AC power flow), simplified feasibility tests, or empirical repair/renewable uncertainty models (Dolatyabi et al., 18 Nov 2025, Zhang et al., 2022).
- Reward structure balancing restored load or users served, cost or time minimization, and penalty signals for physical (voltage, thermal, feasibility) or societal objectives (e.g., equity metrics) (Jiang et al., 6 Aug 2025, Dolatyabi et al., 18 Nov 2025).
The control problem seeks an optimal or approximately optimal policy that maximizes expected accumulated reward or minimizes restoration time and cost, often subject to hard or soft constraints (Nozhati et al., 2018, Bose et al., 2021).
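As a concrete (and deliberately simplified) illustration of such a formulation, the sketch below encodes component health and elapsed time as the state, single-crew repair assignments as actions, and restored load discounted by elapsed time as the reward. The class name, the exponential repair-time model, and the reward shaping are illustrative assumptions rather than the formulation of any cited paper, which would additionally run a power-flow feasibility check inside `step`.

```python
import numpy as np

class RestorationMDP:
    """Toy restoration MDP: the state is a binary component-health vector plus
    elapsed time, an action assigns the single crew to repair one component,
    and the reward is that component's load discounted by elapsed time."""

    def __init__(self, load_per_component, mean_repair_hours, rng=None):
        self.load = np.asarray(load_per_component, dtype=float)
        self.mean_repair = np.asarray(mean_repair_hours, dtype=float)
        self.rng = rng if rng is not None else np.random.default_rng(0)

    def reset(self):
        self.health = np.zeros(len(self.load), dtype=int)  # 0 = damaged, 1 = repaired
        self.t = 0.0
        return self.health.copy()

    def step(self, action):
        # Stochastic repair duration (exponential, echoing the uncertainty
        # models cited below); no check that the component is still damaged,
        # and no power-flow feasibility test, both omitted for brevity.
        self.t += self.rng.exponential(self.mean_repair[action])
        self.health[action] = 1
        reward = self.load[action] / (1.0 + self.t)  # favors restoring large loads early
        done = bool(self.health.all())
        return self.health.copy(), reward, done, {"elapsed_hours": self.t}
```

A simple base heuristic for this toy model would repair the damaged component with the highest load-to-expected-repair-time ratio next; the rollout sketch in the following section builds on exactly such a heuristic.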
2. Core Policy Optimization Algorithms
A diverse set of policy optimization methods has been investigated, each tailored to problem scale, stochasticity, and constraint structure; illustrative sketches of several of these follow the list:
- Dynamic Programming and Value Iteration: For small to medium models, value iteration computes the optimal state-value function and induces a greedy or lexicographically optimal policy via exact or relaxed Bellman recursion (Gol et al., 2019, Işık et al., 5 Apr 2024).
- Rollout and Simulation-Based Dynamic Programming: Large-scale, uncertainty-rich problems leverage rollout policies, which simulate forward under a computationally efficient base heuristic (e.g., index-based, local policies), using Monte Carlo to estimate downstream value and perform single-step or multistep policy improvement (Nozhati et al., 2018, Li et al., 21 Dec 2025).
- Proximal Policy Optimization (PPO): High-dimensional, nonlinear, and continuous control environments (e.g., crew-based restoration, microgrid DER dispatch) utilize clipped policy-gradient algorithms such as PPO, with on-policy rollouts, entropy regularization, and advantage estimation (e.g., GAE) for stability and sample efficiency. These can be adapted to multi-agent and graph-structured settings (Dolatyabi et al., 18 Nov 2025, Maurer et al., 24 Jun 2025, Zhang et al., 2022).
- Constrained Policy Optimization (CPO): For CMDPs involving hard physics (non-convex power flow, ESS complementarity, frequency regulation), CPO applies trust-region updates subject to first-order surrogate constraint satisfaction and KL-divergence bounds, typically with Gaussian policies and analytic constraint gradients (Bose et al., 2021).
- Lexicographic DP Filtering: Restoration tasks with prioritized subgoals (e.g., critical load sets) use sequential DP-based action filtering to enforce multi-level goal reachability and minimize expected steps under multiple objectives (Işık et al., 5 Apr 2024).
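For the value-iteration bullet, a minimal tabular sketch, assuming the restoration MDP is small enough to enumerate; `P[a]` (transition matrix) and `R[a]` (expected-reward vector) for each action `a` are placeholders for a model built from the grid topology and damage scenario.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """P: list of (S, S) transition matrices, one per action.
    R: list of length-S expected-reward vectors, one per action.
    Returns the converged state values and the greedy policy."""
    n_actions, n_states = len(P), P[0].shape[0]
    V = np.zeros(n_states)
    while True:
        # Bellman backup for every (action, state) pair at once.
        Q = np.array([R[a] + gamma * P[a] @ V for a in range(n_actions)])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new
```

Exact enumeration is viable only for small models; the rollout and PPO sketches below are the scalable alternatives.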
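For the rollout bullet, a minimal one-step improvement sketch; `simulate(action, rng)` is an assumed user-supplied closure that applies the candidate action from the current state, follows the base heuristic to the horizon, and returns the sampled discounted return (e.g., restored load minus cost).

```python
import numpy as np

def rollout_action(candidate_actions, simulate, n_scenarios=20, rng=None):
    """One-step rollout improvement: for each candidate first action, estimate
    the downstream return by Monte Carlo simulation under the base heuristic
    (encapsulated in `simulate`), then commit to the best-scoring action."""
    rng = rng if rng is not None else np.random.default_rng(0)
    estimates = {
        a: float(np.mean([simulate(a, rng) for _ in range(n_scenarios)]))
        for a in candidate_actions
    }
    # Risk-averse variants replace np.mean with a lower-tail quantile or CVaR.
    return max(estimates, key=estimates.get)
```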
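And for PPO, the clipped surrogate at the core of the policy update, written with PyTorch tensors; value-function and entropy terms, GAE, and the multi-agent/graph extensions mentioned above are omitted for brevity.

```python
import torch

def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Clipped PPO surrogate: bounds the policy ratio so a single update
    cannot move far from the behavior policy that collected the rollout."""
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

In practice this loss is combined with a value-function loss and an entropy bonus and optimized over several epochs of minibatches drawn from each on-policy rollout.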
3. Handling Domain-Specific Operational Constraints and Uncertainties
Restoration policies in real systems require both physics consistency and adaptation under uncertainty:
- Physics-informed environment integration: Use of AC or DistFlow power flow solvers, differentiable penalty terms for constraint violations (e.g., voltage, thermal, DER capacity), and reward design that encourages recovery from infeasible states rather than premature episode termination; a minimal penalty sketch follows this list (Dolatyabi et al., 18 Nov 2025, Bose et al., 2021, Zhang et al., 2022).
- Uncertainty modeling: Repair times, renewable generation, and load are sampled from realistic (e.g., exponential, Weibull) distributions, with fragility curves or real data used for event modeling (Nozhati et al., 2018, Li et al., 21 Dec 2025, Zhang et al., 2022, Jiang et al., 6 Aug 2025).
- Handling risk attitudes: Flexible support for risk-neutral, risk-averse, or conditional value-at-risk (CVaR)-based objectives by replacing mean scenario evaluation with extremal quantiles in downstream value estimation and optimization; see the sampling/CVaR sketch after this list (Nozhati et al., 2018).
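A minimal sketch of the penalty-based reward shaping in the first bullet above; the voltage band and weight are illustrative defaults, not values from the cited work.

```python
import numpy as np

def voltage_penalty(v_pu, v_min=0.95, v_max=1.05, weight=10.0):
    """Soft penalty for per-unit bus voltages outside [v_min, v_max]; the
    (negated) penalty is added to the reward instead of terminating the
    episode, so the agent can learn to recover from infeasible states."""
    v = np.asarray(v_pu, dtype=float)
    violation = np.maximum(0.0, v - v_max) + np.maximum(0.0, v_min - v)
    return weight * float(violation.sum())
```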
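And a sketch of the scenario sampling and risk-attitude handling from the last two bullets; the Weibull parameters, forecast-error model, and toy return definition are assumptions chosen only to make the example runnable.

```python
import numpy as np

def cvar(values, alpha=0.1):
    """CVaR_alpha: mean of the worst alpha-fraction of sampled returns
    (lower tail), used in place of the plain mean for risk-averse evaluation."""
    v = np.sort(np.asarray(values, dtype=float))
    k = max(1, int(np.ceil(alpha * len(v))))
    return float(v[:k].mean())

rng = np.random.default_rng(0)
# Scenario sampling: Weibull-distributed repair times and a clipped-normal
# renewable forecast error, illustrative stand-ins for the distributions above.
repair_hours = rng.weibull(1.5, size=1000) * 8.0            # shape 1.5, scale 8 h
pv_error = np.clip(rng.normal(0.0, 0.1, size=1000), -0.3, 0.3)
scenario_returns = -repair_hours * (1.0 + pv_error)          # toy downstream returns
print("mean:", scenario_returns.mean(), "CVaR_0.1:", cvar(scenario_returns))
```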
4. Multi-Agent and Coupled Network Extensions
Emerging frameworks address the increasing need for coordinated, distributed restoration:
- Heterogeneous Multi-Agent PPO (HAPPO): Networks are partitioned into microgrids, each controlled by an agent with its own observation and policy network; a centralized critic derives global advantage estimates, enabling scalable, stable restoration under strong agent coupling and network heterogeneity (Dolatyabi et al., 18 Nov 2025).
- Graph-NN–guided Assignment: In multi-crew restoration over coupled power and road networks, joint graphs embed both topologies. Graph neural networks parameterize RL policies that produce edge weights for optimal bigraph (bipartite crew-task) matching, efficiently allocating crews to repair tasks (Maurer et al., 24 Jun 2025).
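A minimal sketch of the matching stage in the second bullet: once a graph-neural policy (not shown; a placeholder matrix stands in for its edge scores) has produced crew-task weights, the assignment itself reduces to standard bipartite matching, solved here with `scipy.optimize.linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_crews(edge_scores):
    """Given a crews-by-tasks score matrix (higher = better fit, e.g. the
    output of a graph-neural policy over the joint power-road graph), solve
    the bipartite matching that maximizes total score."""
    rows, cols = linear_sum_assignment(-np.asarray(edge_scores))  # negate to maximize
    return list(zip(rows.tolist(), cols.tolist()))

# Placeholder scores standing in for GNN edge weights (3 crews, 4 tasks).
scores = np.array([[0.9, 0.2, 0.4, 0.1],
                   [0.3, 0.8, 0.1, 0.5],
                   [0.2, 0.4, 0.7, 0.6]])
print(assign_crews(scores))   # e.g. [(0, 0), (1, 1), (2, 2)]
```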
5. Representative Implementations and Empirical Evaluation
Performance evaluation is conducted on large benchmark networks, with comprehensive ablation and robustness analysis:
| Framework | Network/Test System | Policy Method | Result Highlights | Reference |
|---|---|---|---|---|
| HAPPO (HARL with PPO, centralized value) | IEEE 123-bus, 8500-node feeders | Multi-agent PPO | 95.6–96.2% restored power, <35 ms decision, stable convergence | (Dolatyabi et al., 18 Nov 2025) |
| Rollout over index base policy | IEEE 123/8500-bus with crews, mobile resources | Online DP | 24.8–31% cost reduction vs. base/MPC, minutes per step | (Li et al., 21 Dec 2025) |
| CPO (constrained policy optimization) | 36/141-bus islanded MGs | Offline CPO | Matches/exceeds MPC on restoration, constraint satisfaction | (Bose et al., 2021) |
| GNN (PPO, bigraph matching) | IEEE 8500-bus, DFW network | GNN+PPO | 0.98 episode return, speedup over MIP | (Maurer et al., 24 Jun 2025) |
| Curriculum RL (PPO) | IEEE 33/123-bus distribution systems | RL with curriculum | >97% of MPC value under perfect forecasts, robust to forecast errors | (Zhang et al., 2022) |
| Equity-conformalized RL | Real outage data, Tallahassee | ECQR, STA-SAC | 3.6% reduction in average outage, 14.19% reduction in inequity | (Jiang et al., 6 Aug 2025) |
6. Theoretical Guarantees, Limitations, and Future Directions
Policy-optimization-based restoration frameworks provide the following guarantees and caveats:
- Monotonic policy improvement: Simulation-based (rollout) approaches built on a base policy guarantee expected performance no worse than that base policy at every improvement step, stated formally after this list (Nozhati et al., 2018, Li et al., 21 Dec 2025).
- Constraint satisfaction: CPO and penalty/shaped-reward approaches enforce or softly satisfy strict operational limits, crucial for physical infrastructures (Bose et al., 2021, Dolatyabi et al., 18 Nov 2025).
- Scalability: Action/pruning heuristics, graph-based abstraction, and decentralized control achieve scalability to thousands of nodes/agents (Dolatyabi et al., 18 Nov 2025, Maurer et al., 24 Jun 2025).
- Sample/data efficiency: Methods either eschew offline training (rollout) or employ curriculum learning, imitation, or actor-critic pretraining to accelerate RL convergence (Zhang et al., 2022, Li et al., 21 Dec 2025).
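Stated formally (a standard rollout result, assuming exact evaluation of the base policy's value; with Monte Carlo estimates the guarantee holds only approximately), the monotonic-improvement property above reads

$$
J_{\tilde{\pi}}(s) \;\ge\; J_{\bar{\pi}}(s) \quad \text{for all states } s,
\qquad \text{where } J_{\pi}(s) = \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^{t} r_t \,\middle|\, s_0 = s,\ \pi \right],
$$

with $\bar{\pi}$ the base policy and $\tilde{\pi}$ the one-step rollout policy built on it.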
Limitations include sensitivity to the quality of the base policy (for rollout), dependence on scenario-simulation fidelity, communication and partial-observability issues in multi-agent settings, and the computational burden of AC power flow. Research directions include faster physics surrogates, distributed and multi-agent RL under communication latency, multi-criteria and equitable restoration objectives, and extension to other infrastructure domains (water, telecom) (Dolatyabi et al., 18 Nov 2025, Li et al., 21 Dec 2025, Jiang et al., 6 Aug 2025).
7. Broader Impacts, Applications, and Extensions
Policy optimization-based restoration frameworks have demonstrated utility across operational planning for electric power and interdependent systems, including:
- Distribution automation, crew/microgrid DER coordination, mobile resource deployment, and equitable outage mitigation (Dolatyabi et al., 18 Nov 2025, Li et al., 21 Dec 2025, Jiang et al., 6 Aug 2025).
- Multi-utility and coupled infrastructure (power-water, power-transportation) scheduling (Nozhati et al., 2018, Maurer et al., 24 Jun 2025).
- Transferability to other restoration/planning tasks subject to sequential, constrained, stochastic decision-making, with potential for generalization via graph neural policies and constrained RL (Işık et al., 5 Apr 2024, Maurer et al., 24 Jun 2025).
These frameworks, by integrating domain-specific physics, explicit uncertainty modeling, and rigorous policy-optimization algorithms, set the methodological foundation for resilient, efficient, and equitable infrastructure recovery.