
DRL ESGAgents for Sustainable Investing

Updated 9 February 2026
  • Deep Reinforcement Learning (DRL) ESGAgents are autonomous systems that integrate ESG metrics into MDPs to jointly optimize financial returns and sustainability.
  • They employ algorithms like A2C and PPO with tailored reward functions, including grant/tax incentives, to guide investments toward high-ESG companies.
  • Empirical studies across markets such as DJIA and NASDAQ-100 demonstrate that ESG-enabled DRL agents can achieve competitive risk-adjusted returns while supporting sustainable investment.

Deep Reinforcement Learning (DRL) ESGAgents are autonomous portfolio management systems developed to jointly optimize financial return and sustainability objectives as quantified by Environmental, Social, and Governance (ESG) scores. These agents integrate ESG metrics directly into the Markov Decision Process (MDP) and reward structure to incentivize investment strategies that favor companies with higher ESG performance, enabling data-driven sustainable investing without degrading classical risk-adjusted returns (Garrido-Merchán et al., 2023, Garrido-Merchán et al., 17 Dec 2025).

1. Formulation: MDPs for ESG-Aware Portfolio Management

ESGAgent architectures cast the financial portfolio management task as an MDP $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$. At each trading time step $t$, the agent observes the environment state $s_t \in \mathcal{S}$, constructed from:

  • Market data: OHLCV vectors for each asset $j = 1, \dots, T$, where $OHLCV_{t,j} = [Open_{t,j}, High_{t,j}, Low_{t,j}, Close_{t,j}, Volume_{t,j}]$.
  • Technical indicators: A vector of indicators such as $\{MACD_t, BOLL_t, RSI_t, CCI_t, DX_t, SMA_t\}$ per asset.
  • ESG scores: Asset-specific ESG scores $\epsilon_{t,j} \in [0, 10]$ and the portfolio-level mean ESG score.

The combined state is $s_t = [OHLCV_t, technical_t, \epsilon_t]$. The action $a_t$ is a continuous vector $w_t = [w_{t,1}, \dots, w_{t,T}]$ of portfolio weights constrained to $\sum_j w_{t,j} = 1$, $w_{t,j} \ge 0$.
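
A minimal NumPy sketch of this state construction, together with one common way to satisfy the simplex constraint on the action (the softmax projection is our illustrative choice; the cited papers do not prescribe a specific normalization):

```python
import numpy as np

def build_state(ohlcv, indicators, esg):
    """Concatenate per-asset OHLCV, technical indicators, and ESG scores
    into the flat observation vector s_t described above.

    ohlcv:      (T, 5) array of [Open, High, Low, Close, Volume] per asset
    indicators: (T, 6) array of [MACD, BOLL, RSI, CCI, DX, SMA] per asset
    esg:        (T,)   array of ESG scores in [0, 10]
    """
    return np.concatenate([ohlcv.ravel(), indicators.ravel(), esg])

def weights_from_logits(logits):
    """Map unconstrained actor outputs onto the probability simplex
    (weights >= 0, summing to 1) via a numerically stable softmax."""
    z = np.exp(logits - logits.max())  # subtract max to avoid overflow
    return z / z.sum()
```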

Reward functions differ between settings. In standard financial DRL, the reward is the daily portfolio return:

$$r_t = \sum_{j=1}^{T} w_{t,j}\,(p_{t,j} - p_{t-1,j}),$$

where $p_{t,j}$ is the price of asset $j$ at time $t$.

For ESG regulation, the reward $\mathcal{R}_t$ includes a grant or tax proportional to the portfolio’s outperformance in ESG relative to the benchmark mean, parameterized by a grant/tax strength $\lambda$:

  • If the portfolio mean ESG $\varphi_t$ exceeds the index mean ESG $\psi_t$, a grant is applied:

$$\mathcal{R}_t = r_t + \lambda\,|r_t|\,\frac{\varphi_t - \psi_t}{10 - \psi_t},$$

  • Otherwise, a tax is incurred:

$$\mathcal{R}_t = r_t - \lambda\,|r_t|\,\frac{\psi_t - \varphi_t}{\psi_t}.$$
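
Both branches translate directly into code; a sketch of the ESG-regulated reward (variable names are ours, and we assume $0 < \psi_t < 10$ so both denominators are nonzero):

```python
def esg_regulated_reward(r_t, phi_t, psi_t, lam):
    """ESG grant/tax reward R_t from the two cases above.

    r_t:   daily portfolio return
    phi_t: portfolio mean ESG score (0-10 scale)
    psi_t: benchmark index mean ESG score, assumed in (0, 10)
    lam:   grant/tax strength lambda
    """
    if phi_t > psi_t:
        # Grant: boost the reward in proportion to ESG outperformance.
        return r_t + lam * abs(r_t) * (phi_t - psi_t) / (10.0 - psi_t)
    # Tax: shrink the reward in proportion to ESG underperformance.
    return r_t - lam * abs(r_t) * (psi_t - phi_t) / psi_t
```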

2. Agent Architectures and Reinforcement Learning Algorithms

Two major DRL algorithms define the ESGAgent family:

  • Advantage Actor–Critic (A2C): Employs separate actor and critic networks (2-layer MLPs, 64 ReLU units per layer). The actor $\pi_\theta(a_t \mid s_t)$ outputs portfolio weights, while the critic $V_w(s_t)$ estimates the state value. The policy loss incorporates an entropy regularizer for policy exploration.
  • Proximal Policy Optimization (PPO): Utilizes a joint neural network with shared base layers and actor/critic heads (also 2×64 MLP). The key loss is the clipped surrogate objective, which stabilizes updates via probability ratio clipping. Additional components include value and entropy losses.

Hyperparameters are set empirically (learning rates $\sim 10^{-4}$, entropy bonus $c_{ent} \approx 0.005$, discount $\gamma = 0.99$). Batch and update schemes are detailed for reproducibility (Garrido-Merchán et al., 2023).
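
For illustration, these settings map onto a standard Stable-Baselines3 configuration; this is a hedged sketch, not the papers' training script (the placeholder Pendulum-v1 environment stands in for a FinRL portfolio environment, and weight-sharing details may differ from the architectures described above):

```python
import gymnasium as gym
import torch
from stable_baselines3 import A2C, PPO

# Placeholder continuous-control env; in practice this would be a
# Gym-compatible portfolio environment built with FinRL.
env = gym.make("Pendulum-v1")

# Two hidden layers of 64 ReLU units, matching the 2x64 MLPs reported.
arch = dict(net_arch=[64, 64], activation_fn=torch.nn.ReLU)

ppo_agent = PPO(
    "MlpPolicy", env,
    learning_rate=1e-4,   # ~10^-4 as reported
    ent_coef=0.005,       # entropy bonus c_ent
    gamma=0.99,           # discount factor
    clip_range=0.2,       # probability-ratio clipping in the surrogate loss
    policy_kwargs=arch,
)
a2c_agent = A2C("MlpPolicy", env, learning_rate=1e-4, ent_coef=0.005,
                gamma=0.99, policy_kwargs=arch)

ppo_agent.learn(total_timesteps=10_000)
```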

3. Multi-Objective Optimization and Hyperparameter Selection

Recent advances (Garrido-Merchán et al., 17 Dec 2025) formulate portfolio management as a multi-objective optimization over both risk-return (Sharpe ratio) and sustainability (mean ESG score). The agent’s combined reward at each time step is:

$$r_t = \alpha\, r_t^{SR} + (1 - \alpha)\, r_t^{ESG}, \qquad \alpha \in [0, 1],$$

where $r_t^{SR}$ and $r_t^{ESG}$ are stepwise mappings of the daily return and ESG score, respectively. The scalarization coefficient $\alpha$ modulates the trade-off.
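
In code, the scalarization is a one-line convex combination; a trivial sketch (the stepwise mappings producing $r_t^{SR}$ and $r_t^{ESG}$ are defined in the cited paper and not reproduced here):

```python
def scalarized_reward(r_sr, r_esg, alpha):
    """Convex combination of the risk-return and ESG reward components.
    alpha = 1 recovers a purely financial agent; alpha = 0 a purely
    ESG-driven one."""
    assert 0.0 <= alpha <= 1.0
    return alpha * r_sr + (1.0 - alpha) * r_esg
```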

  • Multi-Objective Bayesian Optimization (MOBO): Hyperparameter tuning (including $\alpha$) uses Gaussian Process surrogates for both the Sharpe ratio $f_1$ and the mean portfolio ESG score $f_2$. The Expected Hypervolume Improvement (EHVI) acquisition function sequentially proposes new hyperparameter settings to maximize coverage of the Pareto front:

$$x \mapsto f(x) = \big( SR(\pi_\theta(x)),\ \mathrm{mean\_ESG}(\pi_\theta(x)) \big),$$

where $x$ is the vector of DRL hyperparameters. Parallel and random-search baselines are compared.

  • Pareto Frontier and Hypervolume Metrics: The non-dominated set $F$ of $(SR, ESG)$ results is reported, and its hypervolume quantifies trade-off quality (a minimal construction is sketched after this list).
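
A minimal NumPy sketch of the non-dominated filter and the standard 2-D hypervolume computation for a maximization problem (this is the generic construction, not the papers' implementation):

```python
import numpy as np

def pareto_front(points):
    """Non-dominated subset of (Sharpe, mean ESG) points (both maximized)."""
    pts = np.asarray(points, dtype=float)
    keep = [i for i, p in enumerate(pts)
            if not np.any(np.all(pts >= p, axis=1) & np.any(pts > p, axis=1))]
    return pts[keep]

def hypervolume_2d(front, ref):
    """Area dominated by `front` and bounded below by reference point `ref`.
    Every front point must dominate `ref`."""
    f = front[np.argsort(-front[:, 0])]  # sort by first objective, descending
    hv, prev_y = 0.0, ref[1]
    for x, y in f:                       # y increases as x decreases on a front
        hv += (x - ref[0]) * (y - prev_y)
        prev_y = y
    return hv
```

With a reference point dominated by all candidate solutions, `hypervolume_2d(pareto_front(results), ref)` yields the hypervolume figures of the kind reported in Section 4.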

4. Experimental Evaluation Across Markets

Empirical validation involves both single-objective and multi-objective DRL-ESGAgents, using OpenAI Gym environments wrapped by FinRL and ESG data from Bloomberg.

  • Asset universes: DJIA-30, NASDAQ-100, IBEX-35.
  • Data: Daily OHLCV, monthly ESG, technicals; training 2008–2022, test 2023 (DJIA/NASDAQ) (Garrido-Merchán et al., 17 Dec 2025).
  • Metrics: Cumulative return, Sharpe/Sortino/Calmar ratios, max drawdown, volatility, Omega ratio.
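
The listed metrics follow their conventional definitions; a sketch computing them from a daily-return series (annualization over 252 trading days and a zero Omega threshold are our assumptions, and the papers may use variants):

```python
import numpy as np

def evaluation_metrics(daily_returns, periods=252):
    """Conventional annualized performance metrics from daily returns."""
    r = np.asarray(daily_returns, dtype=float)
    mean, std = r.mean(), r.std(ddof=1)
    wealth = np.cumprod(1.0 + r)
    peak = np.maximum.accumulate(wealth)
    max_dd = ((wealth - peak) / peak).min()          # most negative drawdown
    return {
        "cumulative_return": wealth[-1] - 1.0,
        "volatility": std * np.sqrt(periods),
        "sharpe": mean / std * np.sqrt(periods),
        "sortino": mean / r[r < 0].std(ddof=1) * np.sqrt(periods),
        "calmar": (mean * periods) / abs(max_dd),
        "max_drawdown": max_dd,
        "omega": r[r > 0].sum() / abs(r[r < 0].sum()),  # threshold 0
    }
```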

Key reported results:

| Market | Agent Type | Cumulative Return | Sharpe Ratio | Mean ESG Score |
|--------|------------|-------------------|--------------|----------------|
| DJIA | ESG-DRL | 16.58% | 2.15 | |
| NASDAQ-100 | ESG-DRL | 45.80% | 1.78 | 2.95 |
| IBEX-35 | ESG-DRL | 8.6% | 0.88 | |

  • In most settings, ESG-regulated DRL agents match or modestly exceed “free-market” DRL agents’ risk-adjusted returns (Garrido-Merchán et al., 2023).
  • Multi-objective optimization yields Pareto frontiers spanning Sharpe and mean ESG; hypervolume under BO exceeds random search by an order of magnitude (DJIA: HV≈161.8 BO vs ≈6.1 Random) (Garrido-Merchán et al., 17 Dec 2025).

5. Implications, Limitations, and Practical Insights

  • Behavioral Impact: ESG reward augmentations consistently tilt the DRL policy towards higher-ESG portfolios without return sacrifice, confirming that sustainability incentives are actionable and do not degrade risk/return profiles in US markets.
  • Generalization: The Markov Decision Process and agent architectures generalize across market indices without modification, except for data preparation (e.g., ESG imputation for IBEX) (Garrido-Merchán et al., 2023).
  • Hyperparameter Sensitivity: Agent performance is highly sensitive to hyperparameters. Multi-objective Bayesian optimization is critical for discovering robust risk/ESG trade-offs in noisy, expensive evaluation regimes (Garrido-Merchán et al., 17 Dec 2025).
  • Interpretability: Investors may select portfolios along the empirically derived Pareto frontier to suit individual preferences for risk and sustainability.
  • Implementation: The architecture is built atop FinRL and OpenAI Gym, with reproducibility pointers including hyperparameter ranges and network details in the cited works.

Limitations include minor underperformance of ESG variants in sparser data regimes (e.g., IBEX-35). Note also that purely financial or purely ESG-driven behavior remains recoverable from the scalarized reward at the endpoints $\alpha = 1$ and $\alpha = 0$, as validated through single-objective ablations (Garrido-Merchán et al., 17 Dec 2025).

6. Summary and Prospects

Deep Reinforcement Learning ESGAgents represent a systematic approach to unified financial and sustainability portfolio optimization. Through principled reward engineering (linear grant/tax or convex reward combination) and integrated ESG data, these agents dynamically align portfolio learning with both market and ESG objectives. Empirical evidence from DJIA, NASDAQ-100, and IBEX-35 indicates that such regulation does not diminish performance and often improves risk-adjusted metrics, with multi-objective Bayesian optimization enabling efficient discovery of Pareto-optimal risk–sustainability portfolios. These results suggest the viability of next-generation robo-advisors that operationalize ESG considerations within their fundamental decision-making processes, rather than as after-the-fact overlays (Garrido-Merchán et al., 2023, Garrido-Merchán et al., 17 Dec 2025).
