- The paper introduces the PD-TD3 algorithm to address multi-network constrained operational optimization in integrated community energy systems with improved profit and constraint adherence.
- It employs a Lagrangian-based C-MDP formulation and twin-critic deep reinforcement learning to navigate non-convex device operations and complex network constraints.
- Results demonstrate faster convergence, effective policy learning, and enhanced modeling precision for safer, energy-efficient system operation.
This paper (arXiv:2402.05412) presents a novel approach to address the operational optimization problem in Integrated Community Energy Systems (ICES) that are constrained by the physical characteristics of multiple energy networks (electricity, natural gas, heat). These systems are complex due to non-linear network constraints and the non-convex operating regions of energy devices like Combined Heat and Power (CHP) units. Traditional mathematical programming methods are computationally expensive and require full system information, raising privacy concerns. Conventional reinforcement learning (RL) methods struggle with constrained optimization problems.
The paper proposes a Safe Reinforcement Learning (SRL) algorithm called Primal-Dual Twin Delayed Deep Deterministic Policy Gradient (PD-TD3) to solve this problem. The core idea is to formulate the constrained optimization problem as a Constrained Markov Decision Process (C-MDP) and then use a Lagrangian-based approach within a Deep Reinforcement Learning (DRL) framework to optimize for maximum profit while keeping constraint violations within a tolerated range.
System and Network Modeling:
The paper models an ICES where an operator (ICESO) serves multiple multi-energy users (MEUs). The ICESO manages distributed energy resources (DERs like PV and WT), energy storage systems (Electric Battery System - EBS, Thermal Energy Storage - TES), a CHP unit, and interacts with external wholesale electricity and gas markets. MEUs have integrated demand response (IDR) capabilities, allowing flexible consumption across electricity and heat.
Detailed models are provided for:
- CHP: Modeled with a non-convex feasible operation region, highlighting its crucial role in efficiency but contributing to problem complexity.
- DERs (PV, WT): Modeled with power output uncertainty using probabilistic distributions.
- EBS and TES: Modeled as charge/dischargeable storage systems with state of charge dynamics.
- MEUs: Modeled with quadratic utility functions for energy consumption and equality/inequality constraints for their devices (Electric Boiler - EB, Gas Boiler - GB). Their objective is to maximize utility minus energy purchase costs.
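The storage dynamics mentioned above can be made concrete with a one-step state-of-charge update. This is a minimal sketch, not the paper's model; the function name, efficiency values, and capacity are illustrative assumptions:

```python
def soc_next(soc, p_charge, p_discharge, eta_ch=0.95, eta_dis=0.95,
             capacity_kwh=100.0, dt_h=1.0):
    """One-step state-of-charge update for a simple EBS/TES-style model.

    Charging adds energy at efficiency eta_ch; discharging removes energy
    at efficiency eta_dis. All parameter values are placeholders.
    """
    delta = (eta_ch * p_charge - p_discharge / eta_dis) * dt_h / capacity_kwh
    # Clamp to the physical range [0, 1].
    return min(max(soc + delta, 0.0), 1.0)
```

Charge/discharge power limits and complementarity (not charging and discharging simultaneously) would be enforced as additional constraints on `p_charge` and `p_discharge`.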
The integrated energy networks are modeled considering physical constraints:
- Electricity Distribution Network: Modeled as a radial network using a linearized DistFlow approach, including constraints on real power, reactive power, and voltage.
- Natural Gas Network: Modeled with unidirectional flow using the Weymouth equation. This introduces a non-convex constraint. Pressure and flow limits are included.
- District Heating Network: Modeled using the Variable Flow Temperature Constant (VFTC) method for supply and return pipelines, considering water flow, heat injection, temperature, and pressure dynamics.
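To see why the Weymouth equation introduces non-convexity, note that for unidirectional flow it ties pipeline flow to the square root of the difference of squared pressures. A minimal sketch, with a hypothetical pipeline constant and unit-consistent pressures assumed:

```python
import math

def weymouth_flow(p_in, p_out, c_ij):
    """Unidirectional Weymouth relation: f = C_ij * sqrt(p_in^2 - p_out^2).

    The square-root dependence on squared nodal pressures makes this
    constraint non-convex in the (pressure, flow) variables.
    """
    assert p_in >= p_out, "unidirectional flow assumes p_in >= p_out"
    return c_ij * math.sqrt(p_in ** 2 - p_out ** 2)
```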
Problem Formulation (C-MDP):
The ICESO's operational optimization problem is formulated to maximize total profits (revenue from selling energy to MEUs minus costs from purchasing from wholesale markets and energy imbalance penalties), subject to:
- Device operational constraints (CHP, DERs, EBS, TES).
- Energy balance equations for electricity, heat, and natural gas.
- Retail energy price limits.
- Physical constraints of the electricity, natural gas, and heat networks (voltage limits, gas flow/pressure limits, heat flow limits).
The paper transforms this into a C-MDP ⟨S, A, R, C, P, u, γ⟩, where:
- State (S): Wholesale market prices (electricity, gas), forecast DER generation (WT, PV).
- Action (A): ICESO's decisions, including retail energy prices for MEUs and operational schedules for energy devices (CHP, EBS, TES).
- Reward (R): The total profit of the ICESO.
- Cost (C): A function quantifying the violation of physical network constraints. This includes violations of electricity voltage limits, natural gas flow/pressure limits, and heat flow limits.
- Policy (u): The ICESO's strategy mapping states to actions.
- Constraint: The long-term discounted cost must be below a specified threshold d, i.e., C(u)≤d.
The objective is to find a policy u∗ that maximizes the expected cumulative reward R(u) while satisfying the cost constraint C(u)≤d.
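As a concrete reading of the constraint, the long-run cost C(u) is a discounted sum of per-step violation costs along a trajectory, and the policy is feasible when it stays at or below d. A minimal sketch (the function name is illustrative):

```python
def discounted_cost(step_costs, gamma=0.99):
    """Discounted cumulative constraint cost C(u) along one sampled
    trajectory of per-step violation costs c_t."""
    return sum((gamma ** t) * c for t, c in enumerate(step_costs))
```

For example, with step costs [1.0, 1.0] and γ = 0.5 the discounted cost is 1.5, so a threshold d = 2 would be satisfied while d = 1 would not.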
PD-TD3 Algorithm for Implementation:
The proposed PD-TD3 algorithm tackles the C-MDP by converting it into an unconstrained min-max problem using a Lagrangian formulation: L(u,λ)=R(u)−λ(C(u)−d). The algorithm uses an iterative primal-dual update process to find the optimal policy u∗ (primal variable) and Lagrangian multiplier λ∗ (dual variable).
Key implementation details of PD-TD3:
- Actor-Critic Architecture: Uses Deep Neural Networks for the actor (policy network μ) and critics (Q-networks).
- Twin Critics: Employs two sets of Q-networks: one for estimating the reward value (QR) and another for estimating the cost value (QC). Each type of critic has two online networks and two target networks, following the TD3 approach to mitigate overestimation of both reward and cost Q-values.
- Reward Critics: QR1(s,a∣θR1), QR2(s,a∣θR2) and targets QR1′(s,a∣θR1′), QR2′(s,a∣θR2′).
- Cost Critics: QC1(s,a∣θC1), QC2(s,a∣θC2) and targets QC1′(s,a∣θC1′), QC2′(s,a∣θC2′).
- Actor: μ(s∣θμ) and target μ′(s∣θμ′).
- Target Computation: Target values for the critics (yi for reward, zi for cost) are computed using the minimum of the twin target critic networks applied to the next state and a target action derived from the target policy network with added clipped noise (target policy smoothing).
- yi = ri + γ · min_{j∈{1,2}} QRj′(si+1, ãi+1 ∣ θRj′)
- zi = ci + γ · min_{j∈{1,2}} QCj′(si+1, ãi+1 ∣ θCj′)
- ãi+1 = μ′(si+1 ∣ θμ′) + ϵ, where ϵ ∼ clip(N(0, σ), −c, c)
- Critic Updates: The online reward and cost critic networks are updated using gradient descent to minimize the Mean Squared Error (MSE) between their predictions and the computed targets.
- LR = E[(yi − QR1(si, ai ∣ θR1))² + (yi − QR2(si, ai ∣ θR2))²]
- LC = E[(zi − QC1(si, ai ∣ θC1))² + (zi − QC2(si, ai ∣ θC2))²]
- Delayed Policy and Multiplier Updates: The policy network and Lagrangian multiplier λ are updated less frequently than the critics (every ϵ steps). This delayed update helps stabilize training.
- Policy Update (gradient ascent on the Lagrangian): ∇θμ L ≈ E[∇a(QR1(si, a ∣ θR1) − λ·QC1(si, a ∣ θC1))∣a=μ(si∣θμ) · ∇θμ μ(si ∣ θμ)], using only the first reward and cost critics for the policy gradient, as in TD3.
- Multiplier Update: λk+1 = [λk + βk(E[C(uk)] − d)]+, where E[C(uk)] is estimated from the sampled costs and the minimum of the target cost Q-values, min_{j∈{1,2}} QCj′(si+1, ãi+1 ∣ θCj′). The [·]+ projection ensures λ ≥ 0.
- Target Network Updates: Target networks are updated softly (slowly) towards the online networks: θ′←ρθ+(1−ρ)θ′.
- Experience Replay Buffer: Transitions (st,at,rt,ct,st+1) are stored and sampled in batches for training.
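The update steps listed above can be sketched in plain NumPy, operating on precomputed critic values rather than real networks. This is an illustrative outline of the technique, not the paper's implementation; all function names and hyperparameter values are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def smoothed_target_action(mu_next, sigma=0.2, clip_c=0.5):
    """Target policy smoothing: add clipped Gaussian noise to the target
    actor's proposed next action (noise scales are illustrative)."""
    eps = np.clip(rng.normal(0.0, sigma, size=np.shape(mu_next)),
                  -clip_c, clip_c)
    return mu_next + eps

def critic_targets(r, c, qr1, qr2, qc1, qc2, gamma=0.99):
    """Clipped double-Q targets: y for the reward critics and z for the
    cost critics, using the minimum of each pair of target critics
    evaluated at (s_{i+1}, a~_{i+1})."""
    y = r + gamma * np.minimum(qr1, qr2)
    z = c + gamma * np.minimum(qc1, qc2)
    return y, z

def twin_critic_losses(y, z, qr1_pred, qr2_pred, qc1_pred, qc2_pred):
    """MSE losses L_R and L_C for the online reward and cost critics
    against the shared targets y and z over a sampled minibatch."""
    l_r = np.mean((y - qr1_pred) ** 2 + (y - qr2_pred) ** 2)
    l_c = np.mean((z - qc1_pred) ** 2 + (z - qc2_pred) ** 2)
    return l_r, l_c

def multiplier_update(lam, cost_estimate, d, beta):
    """Projected dual ascent on lambda: raise lambda when the estimated
    long-run cost exceeds the threshold d, project onto lambda >= 0."""
    return max(lam + beta * (cost_estimate - d), 0.0)

def soft_update(theta_online, theta_target, rho=0.005):
    """Polyak averaging: theta' <- rho * theta + (1 - rho) * theta'."""
    return rho * theta_online + (1.0 - rho) * theta_target
```

In a full implementation the critic values would come from neural networks and the losses would be minimized by gradient descent; the dual ascent on λ and the actor update would run on a delayed schedule relative to the critics.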
Practical Implementation and Case Study:
The algorithm is tested on a system consisting of an IEEE-33 bus electricity network, a 10-node heat network, and a natural gas network, serving 5 MEUs. Real-world data for market prices and DER generation is used. The optimization horizon is 24 hours, divided into hourly intervals. Neural network architectures and hyperparameters are provided (e.g., 2 hidden layers with [128, 32] neurons, Adam optimizer, learning rates 4e-4 for actor, 7e-4 for critic).
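The reported actor architecture can be sketched as a small MLP with the [128, 32] hidden-layer sizes above. ReLU hidden activations and a tanh output squashing actions into [−1, 1] are assumptions here, not details confirmed by the paper:

```python
import numpy as np

def init_params(sizes, rng):
    """Random (weight, bias) pairs for layer sizes, e.g. [4, 128, 32, 2]."""
    return [(rng.normal(0.0, 0.1, (n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def actor_forward(state, params):
    """Forward pass: ReLU hidden layers, tanh-bounded output actions."""
    x = state
    for w, b in params[:-1]:
        x = np.maximum(w @ x + b, 0.0)  # ReLU hidden layers
    w, b = params[-1]
    return np.tanh(w @ x + b)           # actions bounded in [-1, 1]
```

Bounded outputs would then be rescaled to the physical ranges of retail prices and device setpoints.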
The results demonstrate:
- Superior Performance: PD-TD3 achieves higher cumulative rewards (profits) and keeps constraint violations within the allowable range compared to benchmark SRL algorithms (L-SAC, S-DDPG, TD3 with fixed penalties). L-SAC is overly conservative, and S-DDPG violates constraints more frequently.
- Faster Convergence: PD-TD3's reward converges faster than S-DDPG's and at a pace comparable to L-SAC's, while its cost settles within the feasible range sooner than S-DDPG's.
- Effective Operational Policies: PD-TD3 learns policies that lead to smoother electricity consumption curves, higher ICES prices, less reliance on the external market, and strategic CHP operation based on demands and market prices, demonstrating energy-efficient operation while respecting network constraints.
- Importance of Modeling: Using a detailed, non-convex CHP model significantly impacts operational decisions and leads to substantially higher profits compared to a simplified linear model, highlighting the need for accurate physical modeling.
- Hyperparameter Sensitivity: Sensitivity analysis shows that learning rates for actor and critic networks critically influence the reward-cost trade-off and convergence speed, emphasizing the importance of tuning.
Implementation Considerations:
- Computational Requirements: DRL requires significant computational resources, especially for training DNNs and managing the experience replay buffer. The size and complexity of the ICES and networks will impact training time.
- Data Requirements: Training relies on sufficient data generated from interactions with the environment (simulated or real). Accurate models of devices, networks, demands, and market prices are crucial.
- State and Action Space Design: Defining appropriate state and action spaces that capture the relevant information for decision-making is critical. The paper uses wholesale prices and DER forecasts as states, and retail prices and device schedules as actions.
- Cost Function Design: Properly defining the cost function to quantify network constraint violations is key to guiding the safe learning process. Standardizing violations (as done in the paper) helps balance different types of constraints.
- Hyperparameter Tuning: SRL algorithms, like DRL in general, are sensitive to hyperparameters (learning rates, discount factor, target update parameters, noise parameters, Lagrangian step size, cost threshold d). Careful tuning is required for optimal performance and safe convergence.
- Convergence and Stability: The delayed updates and double Q-networks in PD-TD3 are specific techniques used to improve training stability and prevent overestimation, which are common challenges in DRL/SRL.
- Scalability: Applying this to larger and more complex ICES with numerous devices and network nodes might require more complex neural network architectures and distributed training setups.
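On the cost-function design point above, one plausible way to standardize heterogeneous violations (voltages, gas pressures, heat flows) is to normalize each by its limit range before summing them into the cost signal. This is a hypothetical sketch, not the paper's exact formula:

```python
def standardized_violation(value, lower, upper):
    """Normalized constraint violation: 0 inside [lower, upper], otherwise
    the distance beyond the nearer limit scaled by the limit range, so
    violations of differently-scaled constraints are comparable."""
    span = upper - lower
    if value > upper:
        return (value - upper) / span
    if value < lower:
        return (lower - value) / span
    return 0.0
```

A per-step cost c_t would then sum these normalized terms over all monitored network constraints.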
The paper concludes that PD-TD3 is a promising approach for solving complex, network-constrained operational optimization problems in ICES, achieving a good balance between economic profits and operational safety. Future work includes integrating carbon emission trading into the objective.