- The paper introduces the PD-TD3 algorithm to address multi-network constrained operational optimization in integrated community energy systems with improved profit and constraint adherence.
- It employs a Lagrangian-based C-MDP formulation and twin-critic deep reinforcement learning to navigate non-convex device operations and complex network constraints.
- Results demonstrate faster convergence, effective policy learning, and enhanced modeling precision for safer, energy-efficient system operation.
This paper (arXiv:2402.05412) presents a novel approach to address the operational optimization problem in Integrated Community Energy Systems (ICES) that are constrained by the physical characteristics of multiple energy networks (electricity, natural gas, heat). These systems are complex due to non-linear network constraints and the non-convex operating regions of energy devices like Combined Heat and Power (CHP) units. Traditional mathematical programming methods are computationally expensive and require full system information, raising privacy concerns. Conventional reinforcement learning (RL) methods struggle with constrained optimization problems.
The paper proposes a Safe Reinforcement Learning (SRL) algorithm called Primal-Dual Twin Delayed Deep Deterministic Policy Gradient (PD-TD3) to solve this problem. The core idea is to formulate the constrained optimization problem as a Constrained Markov Decision Process (C-MDP) and then use a Lagrangian-based approach within a Deep Reinforcement Learning (DRL) framework to optimize for maximum profit while keeping constraint violations within a tolerated range.
System and Network Modeling:
The paper models an ICES where an operator (ICESO) serves multiple multi-energy users (MEUs). The ICESO manages distributed energy resources (DERs like PV and WT), energy storage systems (Electric Battery System - EBS, Thermal Energy Storage - TES), a CHP unit, and interacts with external wholesale electricity and gas markets. MEUs have integrated demand response (IDR) capabilities, allowing flexible consumption across electricity and heat.
Detailed models are provided for:
- CHP: Modeled with a non-convex feasible operation region, highlighting its crucial role in efficiency but contributing to problem complexity.
- DERs (PV, WT): Modeled with power output uncertainty using probabilistic distributions.
- EBS and TES: Modeled as charge/dischargeable storage systems with state of charge dynamics.
- MEUs: Modeled with quadratic utility functions for energy consumption and equality/inequality constraints for their devices (Electric Boiler - EB, Gas Boiler - GB). Their objective is to maximize utility minus energy purchase costs.
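The storage dynamics mentioned above can be made concrete with a one-step state-of-charge update. This is a minimal sketch, not the paper's model; the function name, efficiency values, and capacity are illustrative assumptions:

```python
def soc_next(soc, p_charge, p_discharge, eta_ch=0.95, eta_dis=0.95,
             capacity_kwh=100.0, dt_h=1.0):
    """One-step state-of-charge update for a simple EBS/TES-style model.

    Charging adds energy at efficiency eta_ch; discharging removes energy
    at efficiency eta_dis. All parameter values are placeholders.
    """
    delta = (eta_ch * p_charge - p_discharge / eta_dis) * dt_h / capacity_kwh
    # Clamp to the physical range [0, 1].
    return min(max(soc + delta, 0.0), 1.0)
```

Charge/discharge power limits and complementarity (not charging and discharging simultaneously) would be enforced as additional constraints on `p_charge` and `p_discharge`.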
The integrated energy networks are modeled considering physical constraints:
- Electricity Distribution Network: Modeled as a radial network using a linearized DistFlow approach, including constraints on real power, reactive power, and voltage.
- Natural Gas Network: Modeled with unidirectional flow using the Weymouth equation. This introduces a non-convex constraint. Pressure and flow limits are included.
- District Heating Network: Modeled using the Variable Flow Temperature Constant (VFTC) method for supply and return pipelines, considering water flow, heat injection, temperature, and pressure dynamics.
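To see why the Weymouth equation introduces non-convexity, note that for unidirectional flow it ties pipeline flow to the square root of the difference of squared pressures. A minimal sketch, with a hypothetical pipeline constant and unit-consistent pressures assumed:

```python
import math

def weymouth_flow(p_in, p_out, c_ij):
    """Unidirectional Weymouth relation: f = C_ij * sqrt(p_in^2 - p_out^2).

    The square-root dependence on squared nodal pressures makes this
    constraint non-convex in the (pressure, flow) variables.
    """
    assert p_in >= p_out, "unidirectional flow assumes p_in >= p_out"
    return c_ij * math.sqrt(p_in ** 2 - p_out ** 2)
```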
Problem Formulation (C-MDP):
The ICESO's operational optimization problem is formulated to maximize total profits (revenue from selling energy to MEUs minus costs from purchasing from wholesale markets and energy imbalance penalties), subject to:
- Device operational constraints (CHP, DERs, EBS, TES).
- Energy balance equations for electricity, heat, and natural gas.
- Retail energy price limits.
- Physical constraints of the electricity, natural gas, and heat networks (voltage limits, gas flow/pressure limits, heat flow limits).
The paper transforms this into a C-MDP ⟨S, A, R, C, P, u, γ⟩, where:
- State (S): Wholesale market prices (electricity, gas), forecast DER generation (WT, PV).
- Action (A): ICESO's decisions, including retail energy prices for MEUs and operational schedules for energy devices (CHP, EBS, TES).
- Reward (R): The total profit of the ICESO.
- Cost (C): A function quantifying the violation of physical network constraints. This includes violations of electricity voltage limits, natural gas flow/pressure limits, and heat flow limits.
- Policy (u): The ICESO's strategy mapping states to actions.
- Constraint: The long-term discounted cost must be below a specified threshold d, i.e., C(u)≤d.
The objective is to find a policy u∗ that maximizes the expected cumulative reward R(u) while satisfying the cost constraint C(u)≤d.
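As a concrete reading of the constraint, the long-run cost C(u) is a discounted sum of per-step violation costs along a trajectory, and the policy is feasible when it stays at or below d. A minimal sketch (the function name is illustrative):

```python
def discounted_cost(step_costs, gamma=0.99):
    """Discounted cumulative constraint cost C(u) along one sampled
    trajectory of per-step violation costs c_t."""
    return sum((gamma ** t) * c for t, c in enumerate(step_costs))
```

For example, with step costs [1.0, 1.0] and γ = 0.5 the discounted cost is 1.5, so a threshold d = 2 would be satisfied while d = 1 would not.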
PD-TD3 Algorithm for Implementation:
The proposed PD-TD3 algorithm tackles the C-MDP by converting it into an unconstrained min-max problem using a Lagrangian formulation: L(u,λ)=R(u)−λ(C(u)−d). The algorithm uses an iterative primal-dual update process to find the optimal policy u∗ (primal variable) and Lagrangian multiplier λ∗ (dual variable).
Key implementation details of PD-TD3:
- Actor-Critic Architecture: Uses Deep Neural Networks for the actor (policy network μ) and critics (Q-networks).
- Twin Critics: Employs two sets of Q-networks: one for estimating the reward value (QR) and another for estimating the cost value (QC). Each type of critic has two online networks and two target networks, following the TD3 approach to mitigate overestimation of both reward and cost Q-values.
- Reward Critics: QR1(s,a∣θR1), QR2(s,a∣θR2) and targets QR1′(s,a∣θR1′), QR2′(s,a∣θR2′).
- Cost Critics: QC1(s,a∣θC1), QC2(s,a∣θC2) and targets QC1′(s,a∣θC1′), QC2′(s,a∣θC2′).
- Actor: μ(s∣θμ) and target μ′(s∣θμ′).
- Target Computation: Target values for the critics (yi for reward, zi for cost) are computed using the minimum of the twin target critic networks applied to the next state and a target action derived from the target policy network with added clipped noise (target policy smoothing).
- yi = ri + γ · min_{j∈{1,2}} QRj′(si+1, ãi+1 ∣ θRj′)
- zi = ci + γ · min_{j∈{1,2}} QCj′(si+1, ãi+1 ∣ θCj′)
- ãi+1 = μ′(si+1 ∣ θμ′) + ϵ, where ϵ ∼ clip(N(0, σ), −c, c)
- Critic Updates: The online reward and cost critic networks are updated using gradient descent to minimize the Mean Squared Error (MSE) between their predictions and the computed targets.
- LR = E[(yi − QR1(si, ai ∣ θR1))² + (yi − QR2(si, ai ∣ θR2))²]
- LC = E[(zi − QC1(si, ai ∣ θC1))² + (zi − QC2(si, ai ∣ θC2))²]
- Delayed Policy and Multiplier Updates: The policy network and Lagrangian multiplier λ are updated less frequently than the critics (every ϵ steps). This delayed update helps stabilize training.
- Policy Update (gradient ascent on the Lagrangian): ∇θμ L ≈ E[∇a(QR1(si, a ∣ θR1) − λ·QC1(si, a ∣ θC1))∣a=μ(si∣θμ) · ∇θμ μ(si ∣ θμ)], using only the first reward and cost critics for the policy gradient, as in TD3.
- Multiplier Update: λk+1 = [λk + βk(E[C(uk)] − d)]+, where E[C(uk)] is estimated from the sampled costs and the minimum of the target cost Q-values, min_{j∈{1,2}} QCj′(si+1, ãi+1 ∣ θCj′). The [·]+ projection ensures λ ≥ 0.
- Target Network Updates: Target networks are updated softly (slowly) towards the online networks: θ′←ρθ+(1−ρ)θ′.
- Experience Replay Buffer: Transitions (st,at,rt,ct,st+1) are stored and sampled in batches for training.
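The update steps listed above can be sketched in plain NumPy, operating on precomputed critic values rather than real networks. This is an illustrative outline of the technique, not the paper's implementation; all function names and hyperparameter values are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def smoothed_target_action(mu_next, sigma=0.2, clip_c=0.5):
    """Target policy smoothing: add clipped Gaussian noise to the target
    actor's proposed next action (noise scales are illustrative)."""
    eps = np.clip(rng.normal(0.0, sigma, size=np.shape(mu_next)),
                  -clip_c, clip_c)
    return mu_next + eps

def critic_targets(r, c, qr1, qr2, qc1, qc2, gamma=0.99):
    """Clipped double-Q targets: y for the reward critics and z for the
    cost critics, using the minimum of each pair of target critics
    evaluated at (s_{i+1}, a~_{i+1})."""
    y = r + gamma * np.minimum(qr1, qr2)
    z = c + gamma * np.minimum(qc1, qc2)
    return y, z

def twin_critic_losses(y, z, qr1_pred, qr2_pred, qc1_pred, qc2_pred):
    """MSE losses L_R and L_C for the online reward and cost critics
    against the shared targets y and z over a sampled minibatch."""
    l_r = np.mean((y - qr1_pred) ** 2 + (y - qr2_pred) ** 2)
    l_c = np.mean((z - qc1_pred) ** 2 + (z - qc2_pred) ** 2)
    return l_r, l_c

def multiplier_update(lam, cost_estimate, d, beta):
    """Projected dual ascent on lambda: raise lambda when the estimated
    long-run cost exceeds the threshold d, project onto lambda >= 0."""
    return max(lam + beta * (cost_estimate - d), 0.0)

def soft_update(theta_online, theta_target, rho=0.005):
    """Polyak averaging: theta' <- rho * theta + (1 - rho) * theta'."""
    return rho * theta_online + (1.0 - rho) * theta_target
```

In a full implementation the critic values would come from neural networks and the losses would be minimized by gradient descent; the dual ascent on λ and the actor update would run on a delayed schedule relative to the critics.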
Practical Implementation and Case Study:
The algorithm is tested on a system consisting of an IEEE-33 bus electricity network, a 10-node heat network, and a natural gas network, serving 5 MEUs. Real-world data for market prices and DER generation is used. The optimization horizon is 24 hours, divided into hourly intervals. Neural network architectures and hyperparameters are provided (e.g., 2 hidden layers with [128, 32] neurons, Adam optimizer, learning rates 4e-4 for actor, 7e-4 for critic).
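The reported actor architecture can be sketched as a small MLP with the [128, 32] hidden-layer sizes above. ReLU hidden activations and a tanh output squashing actions into [−1, 1] are assumptions here, not details confirmed by the paper:

```python
import numpy as np

def init_params(sizes, rng):
    """Random (weight, bias) pairs for layer sizes, e.g. [4, 128, 32, 2]."""
    return [(rng.normal(0.0, 0.1, (n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def actor_forward(state, params):
    """Forward pass: ReLU hidden layers, tanh-bounded output actions."""
    x = state
    for w, b in params[:-1]:
        x = np.maximum(w @ x + b, 0.0)  # ReLU hidden layers
    w, b = params[-1]
    return np.tanh(w @ x + b)           # actions bounded in [-1, 1]
```

Bounded outputs would then be rescaled to the physical ranges of retail prices and device setpoints.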
The results demonstrate:
- Superior Performance: PD-TD3 achieves higher cumulative rewards (profits) and keeps constraint violations within the allowable range compared to benchmark SRL algorithms (L-SAC, S-DDPG, TD3 with fixed penalties). L-SAC is overly conservative, and S-DDPG violates constraints more frequently.
- Faster Convergence: PD-TD3's reward converges faster than S-DDPG's and at a pace comparable to L-SAC's, while its cost settles within the feasible range sooner than S-DDPG's.
- Effective Operational Policies: PD-TD3 learns policies that lead to smoother electricity consumption curves, higher ICES prices, less reliance on the external market, and strategic CHP operation based on demands and market prices, demonstrating energy-efficient operation while respecting network constraints.
- Importance of Modeling: Using a detailed, non-convex CHP model significantly impacts operational decisions and leads to substantially higher profits compared to a simplified linear model, highlighting the need for accurate physical modeling.
- Hyperparameter Sensitivity: Sensitivity analysis shows that learning rates for actor and critic networks critically influence the reward-cost trade-off and convergence speed, emphasizing the importance of tuning.
Implementation Considerations:
- Computational Requirements: DRL requires significant computational resources, especially for training DNNs and managing the experience replay buffer. The size and complexity of the ICES and networks will impact training time.
- Data Requirements: Training relies on sufficient data generated from interactions with the environment (simulated or real). Accurate models of devices, networks, demands, and market prices are crucial.
- State and Action Space Design: Defining appropriate state and action spaces that capture the relevant information for decision-making is critical. The paper uses wholesale prices and DER forecasts as states, and retail prices and device schedules as actions.
- Cost Function Design: Properly defining the cost function to quantify network constraint violations is key to guiding the safe learning process. Standardizing violations (as done in the paper) helps balance different types of constraints.
- Hyperparameter Tuning: SRL algorithms, like DRL in general, are sensitive to hyperparameters (learning rates, discount factor, target update parameters, noise parameters, Lagrangian step size, cost threshold d). Careful tuning is required for optimal performance and safe convergence.
- Convergence and Stability: The delayed updates and double Q-networks in PD-TD3 are specific techniques used to improve training stability and prevent overestimation, which are common challenges in DRL/SRL.
- Scalability: Applying this to larger and more complex ICES with numerous devices and network nodes might require more complex neural network architectures and distributed training setups.
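On the cost-function design point above, one plausible way to standardize heterogeneous violations (voltages, gas pressures, heat flows) is to normalize each by its limit range before summing them into the cost signal. This is a hypothetical sketch, not the paper's exact formula:

```python
def standardized_violation(value, lower, upper):
    """Normalized constraint violation: 0 inside [lower, upper], otherwise
    the distance beyond the nearer limit scaled by the limit range, so
    violations of differently-scaled constraints are comparable."""
    span = upper - lower
    if value > upper:
        return (value - upper) / span
    if value < lower:
        return (lower - value) / span
    return 0.0
```

A per-step cost c_t would then sum these normalized terms over all monitored network constraints.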
The paper concludes that PD-TD3 is a promising approach for solving complex, network-constrained operational optimization problems in ICES, achieving a good balance between economic profits and operational safety. Future work includes integrating carbon emission trading into the objective.