Multi-agent Reinforcement Learning for Low-Carbon P2P Energy Trading among Self-Interested Microgrids

Published 10 Apr 2026 in cs.MA | (2604.08973v1)

Abstract: Uncertainties in renewable generation and demand dynamics challenge day-ahead scheduling. To enhance renewable penetration and maintain intra-day balance, we develop a multi-agent reinforcement learning framework for self-interested microgrids participating in peer-to-peer (P2P) electricity trading. Each microgrid independently bids both price and quantity while optimizing its own profit via storage arbitrage under time-varying main-grid prices. A market-clearing mechanism coordinating trades and promoting incentive compatibility is proposed. Simulation results show that the learned bidding policy improves renewable utilization and reduces reliance on high-carbon electricity, while increasing community-level economic welfare, delivering a win-win situation in emission reduction and local prosperity.


Authors (6)

Summary

The paper proposes a multi-agent reinforcement learning (MARL) framework using MMAPPO that enables self-interested microgrids to optimize bidding in a decentralized, low-carbon P2P energy market.
It employs an LSTM-augmented actor network and a multi-round double auction clearing mechanism, achieving superior profitability and reduced high-carbon emergency power reliance.
Experimental results on realistic microgrid simulations demonstrate rapid convergence, enhanced renewable energy utilization, and robust policy performance under market uncertainties.

Multi-Agent Reinforcement Learning for Low-Carbon Peer-to-Peer Energy Trading Among Self-Interested Microgrids

Problem Formulation and Theoretical Framework

This work addresses the challenge of optimizing intra-day electricity trading among self-interested microgrids in the context of highly uncertain renewable generation and load profiles. The microgrids, each with local PV generation and energy storage, participate in a two-stage market: an initial day-ahead scheduling phase and a real-time peer-to-peer (P2P) trading phase. Due to forecast errors and volatility, the day-ahead allocation is insufficient to guarantee operational or economic optimality in real time, motivating adaptive intra-day mechanisms.
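The gap between the day-ahead schedule and what actually happens can be illustrated with a minimal sketch. The function name, the sign convention, and the kWh units are assumptions made for illustration and are not taken from the paper.

```python
def intraday_imbalance(scheduled_net_kwh, actual_pv_kwh, actual_load_kwh):
    """Energy a microgrid still has to trade in one intra-day interval.

    scheduled_net_kwh: day-ahead net position (positive = planned export,
                       negative = planned import). Hypothetical convention.
    Returns a positive value for a surplus to sell and a negative value
    for a deficit to buy.
    """
    actual_net_kwh = actual_pv_kwh - actual_load_kwh
    return actual_net_kwh - scheduled_net_kwh


# A microgrid planned to export 2 kWh, but PV came in low:
# 5 kWh generated, 4 kWh consumed, so it is 1 kWh short of its schedule.
shortfall = intraday_imbalance(2.0, 5.0, 4.0)
```

Under this convention, the sign of the result determines whether the microgrid enters the intra-day market as a seller or as a buyer.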

The P2P market is modeled as a decentralized partially observable Markov decision process (Dec-POMDP) in which each microgrid maximizes its own profit through storage arbitrage and flexible bidding on both price and quantity, subject to time-varying main-grid prices. The global objective includes maximizing economic welfare while reducing aggregate carbon emissions, consistent with emission-related market signals (e.g., penalties for high-carbon emergency purchases and main-grid feed-in tariffs).
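To make the per-agent objective concrete, the following is a minimal sketch of one interval of storage dynamics and profit accounting. It assumes a lossless battery, a single price per settlement channel, and a penalty-like emergency price; the class, function, and parameter names are hypothetical and the paper's exact reward definition may differ.

```python
from dataclasses import dataclass


@dataclass
class Battery:
    capacity_kwh: float
    max_power_kwh: float  # max energy charged or discharged per interval
    soc_kwh: float = 0.0  # current state of charge

    def step(self, request_kwh: float) -> float:
        """Charge (positive) or discharge (negative) and return the energy moved.

        The request is clipped to the power limit first, then to the
        remaining headroom or stored energy. Losses are ignored (assumption).
        """
        moved = max(-self.max_power_kwh, min(self.max_power_kwh, request_kwh))
        moved = max(-self.soc_kwh, min(self.capacity_kwh - self.soc_kwh, moved))
        self.soc_kwh += moved
        return moved


def interval_profit(p2p_kwh, p2p_price, grid_kwh, grid_buy_price,
                    feed_in_tariff, emergency_kwh, emergency_price):
    """Profit for one interval; positive kWh means sold, negative means bought."""
    p2p_cash = p2p_kwh * p2p_price
    # Exports earn the feed-in tariff; imports pay the main-grid buy price.
    grid_cash = grid_kwh * (feed_in_tariff if grid_kwh >= 0 else grid_buy_price)
    # Emergency power is always a purchase, so it enters as a cost.
    emergency_cost = emergency_kwh * emergency_price
    return p2p_cash + grid_cash - emergency_cost
```

Storage arbitrage then amounts to choosing `request_kwh` so that energy is charged when prices are low and discharged into the P2P market or main grid when prices are high, within the limits enforced by `Battery.step`.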

MARL-Based Bidding Strategy

The core contribution is a multi-agent reinforcement learning (MARL) framework under centralized training and decentralized execution (CTDE). Each microgrid learns an autonomous policy, parameterized by an LSTM-augmented actor network, to bid optimally in each P2P trading interval. The MARL variant employed is Multi-Agent Proximal Policy Optimization (MAPPO) with LSTM-based temporal abstraction (MMAPPO), providing robust policy convergence under non-stationary stochastic market environments and allowing the extraction of multi-scale temporal correlations from local and market data.
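MAPPO builds on the clipped surrogate objective of PPO, which can be sketched in a few lines. This is the standard PPO clipping rule written in plain Python for readability; it is not the paper's training code, and the function name and the default `clip_eps=0.2` are illustrative assumptions. In a CTDE setup, the advantages would typically come from a centralized critic that sees global information during training, while each actor conditions only on its local observations.

```python
import math


def ppo_clipped_objective(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (to be maximized).

    new_logps / old_logps: log-probabilities of the taken actions under the
    current and the data-collecting policy. advantages: advantage estimates.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)  # probability ratio
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes the incentive to move the ratio
        # outside the [1 - eps, 1 + eps] band in the favorable direction.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

The clipping bounds how far a single update can move each agent's policy, which is one reason PPO-style methods are often chosen when the other agents' policies are changing at the same time.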

The market-clearing follows the Multi-Round Double Auction Clearing (MRDAC) mechanism, which balances incentive compatibility, market efficiency, and practical tractability given dynamic supply-demand mismatches and limited agent observability. Matching and settlement mechanisms prioritize lower-price sellers and higher-price buyers, aligning individual profit incentives with community-level emission reduction.
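The priority rule described above (cheapest sellers matched with highest-paying buyers) can be illustrated with a simple single-round double auction. This is a generic sketch, not the MRDAC mechanism itself: the midpoint pricing rule, the tuple layout, and the function name are assumptions, and the multi-round logic is omitted.

```python
def clear_double_auction(bids, asks):
    """Match buyers and sellers in price-priority order.

    bids: list of (buyer, price, qty); asks: list of (seller, price, qty).
    Returns a list of trades as (buyer, seller, qty, price).
    """
    buyers = sorted(([n, p, q] for n, p, q in bids), key=lambda b: -b[1])
    sellers = sorted(([n, p, q] for n, p, q in asks), key=lambda s: s[1])
    trades = []
    i = j = 0
    while i < len(buyers) and j < len(sellers):
        buyer, seller = buyers[i], sellers[j]
        if buyer[1] < seller[1]:
            break  # best remaining bid is below best remaining ask
        qty = min(buyer[2], seller[2])
        price = (buyer[1] + seller[1]) / 2  # assumption: midpoint pricing
        trades.append((buyer[0], seller[0], qty, price))
        buyer[2] -= qty
        seller[2] -= qty
        if buyer[2] <= 0:
            i += 1
        if seller[2] <= 0:
            j += 1
    return trades


trades = clear_double_auction(
    bids=[("A", 0.50, 5.0), ("B", 0.25, 3.0)],
    asks=[("C", 0.25, 4.0), ("D", 0.375, 6.0)],
)
```

In this example buyer A is matched first with seller C and then with seller D, while buyer B's bid is below the remaining ask and stays unmatched. In a multi-round scheme, unmatched participants would get the chance to revise their bids before falling back on the main grid.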

Experimental Protocol and Numerical Findings

The study conducts simulations on a testbed of four heterogeneous microgrids, each instantiated with realistic load and PV profiles derived from residential datasets. Price signals, storage constraints, and the stochasticity of both local demand and renewable supply are explicitly modeled. The MMAPPO algorithm's performance is benchmarked against strong RL baselines (MIPPO, MAPPO-one, MAPPO-s) and examined under alternative market-clearing mechanisms (VDA, Greedy).

Key quantitative results include:

MMAPPO+MRDAC attains the best profitability ($-123.81$/day) and the lowest emergency power purchase, outperforming VDA (by 50%) and Greedy (by 78%) in simulated steady state.
The MARL approach demonstrates fast convergence and a favorable trade-off between exploration and exploitation in policy optimization.
Renewable energy utilization and P2P trading volume are maximized under MMAPPO, alongside minimized reliance on high-carbon grid purchases.

These results indicate that both the market design (the auction mechanism) and the multi-agent learning algorithm critically affect operational and carbon efficiency in distributed energy trading.

Implications and Future Developments

From a practical standpoint, this framework demonstrates that MARL techniques, when combined with adaptive market designs, can solve fully decentralized, nonconvex, and multi-objective coordination problems in P2P energy systems. The explicit modeling of both individual self-interest and system-level low-carbon incentives is significant, as it reflects realistic deployment scenarios in deregulated energy markets.

The theoretical implication is that RL agents can learn coordinated, equilibrium-adjacent policies in complex auction-based markets without access to complete information or centralized optimization. This reduces the dependence on analytically tractable but overly restrictive modeling assumptions that are common in the energy market literature.

The study does not explicitly incorporate physical power-flow constraints, but the proposed multi-agent RL market model can be extended to incorporate network restrictions or additional layers of market interaction (e.g., joint carbon and energy markets (2604.08973, Huang et al., 2023)). Scalability to larger, more heterogeneous networks and domain adaptation to fault-prone renewables are tractable future research directions. The integration of federated or privacy-preserving RL and distributed ledger technologies could allow coordination under data locality and security constraints.

Conclusion

This paper formulates and solves autonomous, low-carbon P2P energy trading among self-interested microgrids as a MARL-enabled, market-based optimization problem. The MMAPPO-based bidding mechanism, coupled with an efficient double auction clearing process, demonstrates strong improvements in both economic performance and emission metrics under uncertainty. These findings support the deployment of multi-agent learning frameworks as core optimization engines in the evolution of smart, sustainable distribution systems.
