
DRL ESGAgents for Sustainable Investing

Updated 9 February 2026
  • Deep Reinforcement Learning (DRL) ESGAgents are autonomous systems that integrate ESG metrics into MDPs to jointly optimize financial returns and sustainability.
  • They employ algorithms like A2C and PPO with tailored reward functions, including grant/tax incentives, to guide investments toward high-ESG companies.
  • Empirical studies across markets such as DJIA and NASDAQ-100 demonstrate that ESG-enabled DRL agents can achieve competitive risk-adjusted returns while supporting sustainable investment.

Deep Reinforcement Learning (DRL) ESGAgents are autonomous portfolio management systems developed to jointly optimize financial return and sustainability objectives as quantified by Environmental, Social, and Governance (ESG) scores. These agents integrate ESG metrics directly into the Markov Decision Process (MDP) and reward structure to incentivize investment strategies that favor companies with higher ESG performance, enabling data-driven sustainable investing without degrading classical risk-adjusted returns (Garrido-Merchán et al., 2023, Garrido-Merchán et al., 17 Dec 2025).

1. Formulation: MDPs for ESG-Aware Portfolio Management

ESGAgent architectures cast the financial portfolio management task as an MDP $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$. At each trading time step $t$, the agent observes the environment state $s_t \in \mathcal{S}$, constructed from:

  • Market data: OHLCV vectors for each asset $j = 1, \dots, T$, where $OHLCV_{t,j} = [Open_{t,j}, High_{t,j}, Low_{t,j}, Close_{t,j}, Volume_{t,j}]$.
  • Technical indicators: A vector of indicators such as $\{MACD_t, BOLL_t, RSI_t, CCI_t, DX_t, SMA_t\}$ per asset.
  • ESG scores: Asset-specific ESG scores $\epsilon_{t,j} \in [0, 10]$ and the portfolio-level mean ESG score.

The combined state is $s_t = [OHLCV_t, technical_t, \epsilon_t]$. The action $a_t$ is a continuous vector $w_t = [w_{t,1}, \dots, w_{t,T}]$ of portfolio weights constrained to $\sum_j w_{t,j} = 1$, $w_{t,j} \ge 0$.
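
A minimal NumPy sketch of this state construction, together with one common way to satisfy the simplex constraint on the action (the softmax projection is our illustrative choice; the cited papers do not prescribe a specific normalization):

```python
import numpy as np

def build_state(ohlcv, indicators, esg):
    """Concatenate per-asset OHLCV, technical indicators, and ESG scores
    into the flat observation vector s_t described above.

    ohlcv:      (T, 5) array of [Open, High, Low, Close, Volume] per asset
    indicators: (T, 6) array of [MACD, BOLL, RSI, CCI, DX, SMA] per asset
    esg:        (T,)   array of ESG scores in [0, 10]
    """
    return np.concatenate([ohlcv.ravel(), indicators.ravel(), esg])

def weights_from_logits(logits):
    """Map unconstrained actor outputs onto the probability simplex
    (weights >= 0, summing to 1) via a numerically stable softmax."""
    z = np.exp(logits - logits.max())  # subtract max to avoid overflow
    return z / z.sum()
```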

Reward functions differ between settings. In standard financial DRL, the reward is the daily portfolio return:

$$r_t = \sum_{j=1}^{T} w_{t,j}\,(p_{t,j} - p_{t-1,j}),$$

where $p_{t,j}$ is the price of asset $j$ at time $t$.

For ESG regulation, the reward $\mathcal{R}_t$ includes a grant or tax proportional to the portfolio’s outperformance in ESG relative to the benchmark mean, parameterized by a grant/tax strength $\lambda$:

  • If the portfolio mean ESG $\varphi_t$ exceeds the index mean ESG $\psi_t$, a grant is applied:

$$\mathcal{R}_t = r_t + \lambda\,|r_t|\,\frac{\varphi_t - \psi_t}{10 - \psi_t},$$

  • Otherwise, a tax is incurred:

$$\mathcal{R}_t = r_t - \lambda\,|r_t|\,\frac{\psi_t - \varphi_t}{\psi_t}.$$
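
Both branches translate directly into code; a sketch of the ESG-regulated reward (variable names are ours, and we assume $0 < \psi_t < 10$ so both denominators are nonzero):

```python
def esg_regulated_reward(r_t, phi_t, psi_t, lam):
    """ESG grant/tax reward R_t from the two cases above.

    r_t:   daily portfolio return
    phi_t: portfolio mean ESG score (0-10 scale)
    psi_t: benchmark index mean ESG score, assumed in (0, 10)
    lam:   grant/tax strength lambda
    """
    if phi_t > psi_t:
        # Grant: boost the reward in proportion to ESG outperformance.
        return r_t + lam * abs(r_t) * (phi_t - psi_t) / (10.0 - psi_t)
    # Tax: shrink the reward in proportion to ESG underperformance.
    return r_t - lam * abs(r_t) * (psi_t - phi_t) / psi_t
```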

2. Agent Architectures and Reinforcement Learning Algorithms

Two major DRL algorithms define the ESGAgent family:

  • Advantage Actor–Critic (A2C): Employs separate actor and critic networks (2-layer MLPs, 64 ReLU units per layer). The actor $\pi_\theta(a_t \mid s_t)$ outputs portfolio weights, while the critic $V_w(s_t)$ estimates the state value. The policy loss incorporates an entropy regularizer for policy exploration.
  • Proximal Policy Optimization (PPO): Utilizes a joint neural network with shared base layers and actor/critic heads (also 2×64 MLP). The key loss is the clipped surrogate objective, which stabilizes updates via probability ratio clipping. Additional components include value and entropy losses.

Hyperparameters are set empirically (learning rates $\sim 10^{-4}$, entropy bonus $c_{ent} \approx 0.005$, discount $\gamma = 0.99$). Batch and update schemes are detailed for reproducibility (Garrido-Merchán et al., 2023).
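
For illustration, these settings map onto a standard Stable-Baselines3 configuration; this is a hedged sketch, not the papers' training script (the placeholder Pendulum-v1 environment stands in for a FinRL portfolio environment, and weight-sharing details may differ from the architectures described above):

```python
import gymnasium as gym
import torch
from stable_baselines3 import A2C, PPO

# Placeholder continuous-control env; in practice this would be a
# Gym-compatible portfolio environment built with FinRL.
env = gym.make("Pendulum-v1")

# Two hidden layers of 64 ReLU units, matching the 2x64 MLPs reported.
arch = dict(net_arch=[64, 64], activation_fn=torch.nn.ReLU)

ppo_agent = PPO(
    "MlpPolicy", env,
    learning_rate=1e-4,   # ~10^-4 as reported
    ent_coef=0.005,       # entropy bonus c_ent
    gamma=0.99,           # discount factor
    clip_range=0.2,       # probability-ratio clipping in the surrogate loss
    policy_kwargs=arch,
)
a2c_agent = A2C("MlpPolicy", env, learning_rate=1e-4, ent_coef=0.005,
                gamma=0.99, policy_kwargs=arch)

ppo_agent.learn(total_timesteps=10_000)
```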

3. Multi-Objective Optimization and Hyperparameter Selection

Recent advances (Garrido-Merchán et al., 17 Dec 2025) formulate portfolio management as a multi-objective optimization over both risk-return (Sharpe ratio) and sustainability (mean ESG score). The agent’s combined reward at each time step is:

$$r_t = \alpha\, r_t^{SR} + (1 - \alpha)\, r_t^{ESG}, \qquad \alpha \in [0, 1],$$

where $r_t^{SR}$ and $r_t^{ESG}$ are stepwise mappings of the daily return and ESG score, respectively. The scalarization coefficient $\alpha$ modulates the trade-off.
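
In code, the scalarization is a one-line convex combination; a trivial sketch (the stepwise mappings producing $r_t^{SR}$ and $r_t^{ESG}$ are defined in the cited paper and not reproduced here):

```python
def scalarized_reward(r_sr, r_esg, alpha):
    """Convex combination of the risk-return and ESG reward components.
    alpha = 1 recovers a purely financial agent; alpha = 0 a purely
    ESG-driven one."""
    assert 0.0 <= alpha <= 1.0
    return alpha * r_sr + (1.0 - alpha) * r_esg
```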

  • Multi-Objective Bayesian Optimization (MOBO): Hyperparameter tuning (including $\alpha$) uses Gaussian Process surrogates for both the Sharpe ratio $f_1$ and the mean portfolio ESG score $f_2$. The Expected Hypervolume Improvement (EHVI) acquisition function sequentially proposes new hyperparameter settings to maximize coverage of the Pareto front:

$$x \mapsto f(x) = \big( SR(\pi_\theta(x)),\ \mathrm{mean\_ESG}(\pi_\theta(x)) \big),$$

where $x$ is the vector of DRL hyperparameters. Parallel and random-search baselines are compared.

  • Pareto Frontier and Hypervolume Metrics: The non-dominated set $F$ of $(SR, ESG)$ results is reported, and its hypervolume quantifies trade-off quality (a minimal construction is sketched after this list).
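
A minimal NumPy sketch of the non-dominated filter and the standard 2-D hypervolume computation for a maximization problem (this is the generic construction, not the papers' implementation):

```python
import numpy as np

def pareto_front(points):
    """Non-dominated subset of (Sharpe, mean ESG) points (both maximized)."""
    pts = np.asarray(points, dtype=float)
    keep = [i for i, p in enumerate(pts)
            if not np.any(np.all(pts >= p, axis=1) & np.any(pts > p, axis=1))]
    return pts[keep]

def hypervolume_2d(front, ref):
    """Area dominated by `front` and bounded below by reference point `ref`.
    Every front point must dominate `ref`."""
    f = front[np.argsort(-front[:, 0])]  # sort by first objective, descending
    hv, prev_y = 0.0, ref[1]
    for x, y in f:                       # y increases as x decreases on a front
        hv += (x - ref[0]) * (y - prev_y)
        prev_y = y
    return hv
```

With a reference point dominated by all candidate solutions, `hypervolume_2d(pareto_front(results), ref)` yields the hypervolume figures of the kind reported in Section 4.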

4. Experimental Evaluation Across Markets

Empirical validation involves both single-objective and multi-objective DRL-ESGAgents, using OpenAI Gym environments wrapped by FinRL and ESG data from Bloomberg.

  • Asset universes: DJIA-30, NASDAQ-100, IBEX-35.
  • Data: Daily OHLCV, monthly ESG, technicals; training 2008–2022, test 2023 (DJIA/NASDAQ) (Garrido-Merchán et al., 17 Dec 2025).
  • Metrics: Cumulative return, Sharpe/Sortino/Calmar ratios, max drawdown, volatility, Omega ratio.
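
The listed metrics follow their conventional definitions; a sketch computing them from a daily-return series (annualization over 252 trading days and a zero Omega threshold are our assumptions, and the papers may use variants):

```python
import numpy as np

def evaluation_metrics(daily_returns, periods=252):
    """Conventional annualized performance metrics from daily returns."""
    r = np.asarray(daily_returns, dtype=float)
    mean, std = r.mean(), r.std(ddof=1)
    wealth = np.cumprod(1.0 + r)
    peak = np.maximum.accumulate(wealth)
    max_dd = ((wealth - peak) / peak).min()          # most negative drawdown
    return {
        "cumulative_return": wealth[-1] - 1.0,
        "volatility": std * np.sqrt(periods),
        "sharpe": mean / std * np.sqrt(periods),
        "sortino": mean / r[r < 0].std(ddof=1) * np.sqrt(periods),
        "calmar": (mean * periods) / abs(max_dd),
        "max_drawdown": max_dd,
        "omega": r[r > 0].sum() / abs(r[r < 0].sum()),  # threshold 0
    }
```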

Key reported results:

| Market | Agent Type | Cumulative Return | Sharpe Ratio | Mean ESG Score |
|--------|------------|-------------------|--------------|----------------|
| DJIA | ESG-DRL | 16.58% | 2.15 | |
| NASDAQ-100 | ESG-DRL | 45.80% | 1.78 | 2.95 |
| IBEX-35 | ESG-DRL | 8.6% | 0.88 | |

  • In most settings, ESG-regulated DRL agents match or modestly exceed “free-market” DRL agents’ risk-adjusted returns (Garrido-Merchán et al., 2023).
  • Multi-objective optimization yields Pareto frontiers spanning Sharpe and mean ESG; hypervolume under BO exceeds random search by an order of magnitude (DJIA: HV≈161.8 BO vs ≈6.1 Random) (Garrido-Merchán et al., 17 Dec 2025).

5. Implications, Limitations, and Practical Insights

  • Behavioral Impact: ESG reward augmentations consistently tilt the DRL policy towards higher-ESG portfolios without return sacrifice, confirming that sustainability incentives are actionable and do not degrade risk/return profiles in US markets.
  • Generalization: The Markov Decision Process and agent architectures generalize across market indices without modification, except for data preparation (e.g., ESG imputation for IBEX) (Garrido-Merchán et al., 2023).
  • Hyperparameter Sensitivity: Agent performance is highly sensitive to hyperparameters. Multi-objective Bayesian optimization is critical for discovering robust risk/ESG trade-offs in noisy, expensive evaluation regimes (Garrido-Merchán et al., 17 Dec 2025).
  • Interpretability: Investors may select portfolios along the empirically derived Pareto frontier to suit individual preferences for risk and sustainability.
  • Implementation: The architecture is built atop FinRL and OpenAI Gym, with reproducibility pointers including hyperparameter ranges and network details in the cited works.

Limitations include minor underperformance of ESG variants in sparser data regimes (e.g., IBEX-35). Note also that purely financial or purely ESG-driven behavior remains recoverable from the scalarized reward at the endpoints $\alpha = 1$ and $\alpha = 0$, as validated through single-objective ablations (Garrido-Merchán et al., 17 Dec 2025).

6. Summary and Prospects

Deep Reinforcement Learning ESGAgents represent a systematic approach to unified financial and sustainability portfolio optimization. Through principled reward engineering (linear grant/tax or convex reward combination) and integrated ESG data, these agents dynamically align portfolio learning with both market and ESG objectives. Empirical evidence from DJIA, NASDAQ-100, and IBEX-35 indicates that such regulation does not diminish performance and often improves risk-adjusted metrics, with multi-objective Bayesian optimization enabling efficient discovery of Pareto-optimal risk–sustainability portfolios. These results suggest the viability of next-generation robo-advisors that operationalize ESG considerations within their fundamental decision-making processes, rather than as after-the-fact overlays (Garrido-Merchán et al., 2023, Garrido-Merchán et al., 17 Dec 2025).
