Deep Learning Policy Iteration for HJB Equations
- Deep learning-driven policy iteration schemes are numerical methods that integrate policy evaluation and improvement with neural networks to solve nonlinear HJB equations.
- The methodology leverages physics-informed neural networks and automatic differentiation to eliminate grid discretization errors and mitigate high-dimensional challenges.
- Empirical results demonstrate rapid convergence and practical applications in portfolio optimization under both exogenous and liquidity-driven transaction costs.
A deep learning-driven policy iteration scheme is a class of algorithms that embeds the two stages of policy iteration, policy evaluation and policy improvement, into modern deep learning frameworks, typically to solve high-dimensional, nonlinear Hamilton–Jacobi–Bellman (HJB) equations arising in stochastic optimal control problems such as portfolio selection under complex transaction costs. These schemes use neural network approximators for both the value and policy functions, yielding mesh-free, numerically robust methods that mitigate the curse of dimensionality, support high-dimensional control, and remove discretization errors by exploiting automatic differentiation (Yan et al., 2 Sep 2025).
1. Deep Learning Framework for HJB Equations
The central methodological innovation is the replacement of grid-based discretization by physics-informed neural networks (PINNs). The scheme introduces two deep neural networks:
- Value Network $V_\theta$: Approximates the value function for the optimal control problem.
- Control Network $\pi_\phi$: Parameterizes the trading strategy, i.e., the optimal portfolio allocation.
The HJB equation in the context of portfolio selection with both exogenous (proportional) and endogenous (liquidity-driven) transaction costs is two-dimensional and nonlinear. The value network is trained to minimize a loss consisting of the HJB residual and the terminal condition, of the generic form

$$\mathcal{L}(\theta) = \big\| \partial_t V_\theta + \mathcal{L}^{\pi_\phi} V_\theta \big\|^2 + \big\| V_\theta(T,\cdot) - U(\cdot) \big\|^2,$$

where $\mathcal{L}^{\pi_\phi}$ is the differential operator appearing in the HJB equation under the current policy and $U$ is the utility function (Yan et al., 2 Sep 2025).
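A minimal PyTorch sketch of how such a residual-plus-terminal loss could be assembled is given below; the `MLP` architecture, the sampling shapes, and the `hjb_residual` operator (which would apply the drift and diffusion terms via automatic differentiation) are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Small fully connected network used for both the value and control approximators."""
    def __init__(self, in_dim, out_dim, width=64, depth=3):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(depth):
            layers += [nn.Linear(d, width), nn.Tanh()]
            d = width
        layers.append(nn.Linear(d, out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

def value_loss(value_net, control_net, x_int, wl_term, utility, hjb_residual, T):
    """Squared HJB residual on interior samples plus terminal-condition mismatch.

    x_int:   (N, 3) collocation points (t, w, l) with requires_grad=True
    wl_term: (M, 2) points (w, l) at which the terminal condition is enforced
    """
    pi = control_net(x_int)                    # current policy at the sampled states
    res = hjb_residual(value_net, x_int, pi)   # PDE residual computed via automatic differentiation
    pde_term = (res ** 2).mean()

    t_T = torch.full_like(wl_term[:, :1], float(T))
    v_T = value_net(torch.cat([t_T, wl_term], dim=1))
    term_term = ((v_T - utility(wl_term[:, :1])) ** 2).mean()
    return pde_term + term_term
```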
2. Iterative Policy Iteration via Deep Neural Networks
The approach integrates an iterative, alternating scheme akin to standard policy iteration, but with neural networks:
- Policy Evaluation (PE): Given a fixed current policy $\pi_\phi$, update the parameters $\theta$ of the value network $V_\theta$ by minimizing the loss above.
- Policy Improvement (PI): Given $V_\theta$, update $\phi$ so that $\pi_\phi$ maximizes the expected HJB operator applied to $V_\theta$:

$$\phi \leftarrow \arg\max_{\phi} \, \mathbb{E}\big[ \mathcal{L}^{\pi_\phi} V_\theta \big].$$
Automatic differentiation, batch sampling in space, and non-uniform sampling near regions of high HJB-PDE residual are employed to ensure numerically efficient optimization. Notably, both the value and control approximators are updated end-to-end in a mesh-free fashion, suitable for high-dimensional state and control spaces (Yan et al., 2 Sep 2025).
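A sketch of how the alternating updates could be organized (not the paper's exact training loop); here `sample_interior`, `sample_terminal`, `value_loss`, and `hjb_operator` are placeholders for the collocation samplers, the loss above (with the utility and residual operator already bound), and the differential operator.

```python
import torch

def policy_iteration(value_net, control_net, sample_interior, sample_terminal,
                     value_loss, hjb_operator, n_outer=5, n_inner=500, lr=1e-3):
    """Alternate policy evaluation (fit V) and policy improvement (fit pi)."""
    opt_v = torch.optim.Adam(value_net.parameters(), lr=lr)
    opt_c = torch.optim.Adam(control_net.parameters(), lr=lr)

    for _ in range(n_outer):
        # Policy evaluation: minimize the residual + terminal loss for the frozen policy.
        for _ in range(n_inner):
            x_int = sample_interior().requires_grad_(True)
            loss_v = value_loss(value_net, control_net, x_int, sample_terminal())
            opt_v.zero_grad()
            loss_v.backward()
            opt_v.step()

        # Policy improvement: choose the control maximizing the HJB operator of the current V.
        for _ in range(n_inner):
            x_int = sample_interior().requires_grad_(True)
            pi = control_net(x_int)
            loss_c = -hjb_operator(value_net, x_int, pi).mean()  # maximize => minimize the negative
            opt_c.zero_grad()
            loss_c.backward()
            opt_c.step()
    return value_net, control_net
```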
3. Hamilton–Jacobi–Bellman Equation and Transaction Cost Modeling
The HJB equation solved takes the prototypical form

$$\partial_t V + \sup_{\pi} \mathcal{L}^{\pi} V = 0, \qquad V(T, w, \ell) = U(w),$$

where $\pi$ is the proportion allocated to risky assets, $w$, $\ell$, and $t$ denote wealth, liquidity, and time, and $U$ is a utility function. The HJB operator encodes:
- Drift terms: risk-free growth and excess returns, net of transaction costs.
- Diffusion/covariance structure: arising from both market volatility and liquidity risk.
- Correction terms: a nonlinear addition to the operator induced by the proportional transaction costs.
- Liquidity risk: modeled as a mean-reverting Ornstein–Uhlenbeck process driving the endogenous costs, while exogenous costs are proportional to transaction volume (Yan et al., 2 Sep 2025); a minimal simulation sketch of such a liquidity factor follows below.
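For concreteness, a mean-reverting Ornstein–Uhlenbeck liquidity factor of the kind described above can be simulated with a standard Euler–Maruyama scheme; the parameter values below are illustrative, not taken from the paper.

```python
import numpy as np

def simulate_ou(l0=1.0, kappa=2.0, theta=1.0, sigma_l=0.3, T=1.0,
                n_steps=252, n_paths=1000, seed=0):
    """Euler-Maruyama paths of dl_t = kappa * (theta - l_t) dt + sigma_l dB_t."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    l = np.full(n_paths, l0)
    paths = [l.copy()]
    for _ in range(n_steps):
        dB = rng.normal(0.0, np.sqrt(dt), size=n_paths)
        l = l + kappa * (theta - l) * dt + sigma_l * dB
        paths.append(l.copy())
    return np.stack(paths)  # shape (n_steps + 1, n_paths)
```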
4. Handling Exogenous and Endogenous Transaction Costs
- Exogenous transaction costs: These are proportional to the amount traded; in the wealth dynamics they appear as a deduction equal to a constant cost rate multiplied by the transaction volume.
- Endogenous transaction costs: Driven by the liquidity risk factor, which evolves stochastically and influences the mean-reversion level of the cost through a chosen function of liquidity (e.g., a concave power).
Both costs are integrated into the HJB operator, affecting optimal allocation and policy construction (Yan et al., 2 Sep 2025).
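A hedged sketch of how both cost channels might enter a single Euler step of the wealth dynamics; the drift and diffusion coefficients and the liquidity-cost function `cost_fn` are assumptions chosen for illustration, not the paper's exact specification.

```python
import numpy as np

def wealth_step(w, pi_new, pi_old, liq, dt, r=0.02, mu=0.07, sigma=0.2,
                lam=0.002, cost_fn=lambda l: 0.001 * np.sqrt(max(l, 0.0)), rng=None):
    """One Euler step of wealth under risky fraction pi_new, charging both cost types."""
    rng = rng or np.random.default_rng()
    traded = abs(pi_new - pi_old) * w          # volume rebalanced in the risky asset
    exogenous = lam * traded                   # proportional (exogenous) cost
    endogenous = cost_fn(liq) * traded         # liquidity-driven (endogenous) cost
    dB = rng.normal(0.0, np.sqrt(dt))
    dw = w * (r + pi_new * (mu - r)) * dt + w * pi_new * sigma * dB
    return w + dw - exogenous - endogenous
```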
5. Numerical Analysis, Convergence, and Validation
- The scheme is validated against analytic benchmarks (e.g., the classical Merton problem without transaction costs and liquidity risk), demonstrating that both the value function and the policy approximation converge rapidly to the analytic solution; a minimal version of this check is sketched after this list.
- Empirically, convergence is exponential or faster: runs typically reach small relative differences from the analytic benchmark in fewer than five policy iteration steps.
- Numerical results confirm that greater liquidity sensitivity and a larger proportional cost rate decrease the allocation to risky assets, illustrating the direct impact of transaction frictions.
- Plots from Monte Carlo simulations (including shaded standard deviations) show robust, repeatable convergence properties, with the mesh-free neural approach maintaining numerical stability (Yan et al., 2 Sep 2025).
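The frictionless Merton benchmark used for validation has the closed-form risky fraction $\pi^* = (\mu - r)/(\gamma \sigma^2)$ under CRRA utility, so the comparison is easy to reproduce; the snippet below (with illustrative parameters) shows a relative-error check of that kind against a learned allocation.

```python
import numpy as np

def merton_fraction(mu, r, sigma, gamma):
    """Closed-form optimal risky fraction under CRRA utility, no frictions."""
    return (mu - r) / (gamma * sigma ** 2)

def relative_error(pi_learned, mu=0.07, r=0.02, sigma=0.2, gamma=2.0):
    """Relative deviation of a learned constant allocation from the Merton fraction."""
    pi_star = merton_fraction(mu, r, sigma, gamma)
    return abs(pi_learned - pi_star) / abs(pi_star)

# Example: relative_error(0.61) ~ 0.024 against pi* = 0.625 with the parameters above.
```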
6. Advantages and Limitations
Advantages
- Curse of Dimensionality Mitigated: Mesh-free solution and batch sampling in the neural network optimization facilitate scaling to higher-dimensional state/control spaces.
- Elimination of Truncation Errors: Automatic differentiation removes finite-difference and grid-discretization artifacts present in classical numerical methods.
- Adaptability: The control policy is an explicit neural network; the framework is directly extensible to problems requiring high-dimensional control, nonlinear costs, or complex transaction cost structures.
- Fast Convergence: Both the theoretical analysis (via universal approximation guarantees) and the experiments confirm rapid convergence.
Limitations
- Non-convex Training: The interplay between the two networks introduces non-convexity in the optimization landscape, making convergence sensitive to initialization, architecture, and optimizer choice.
- Sampling Strategy: The scheme argues for concentrating samples in regions of large HJB-PDE residual, but practical performance depends on high-quality sampling and loss weighting.
- Resource Demands: While the method is more scalable compared to mesh-based solvers, computational requirements for training deep networks in high dimensions remain substantial (Yan et al., 2 Sep 2025).
7. Applications and Broader Implications
The approach is directly applicable to portfolio optimization problems with both proportional trading frictions and liquidity-driven costs, as well as a broad class of high-dimensional nonlinear stochastic control problems with similarly structured HJB equations. The policy iteration framework using PINNs enables quantitative study of transaction cost effects, facilitates robust and scalable numerical experimentation with real-world financial models, and opens new avenues in computational finance for problems where analytic or grid-based numerical solutions are infeasible. The method provides a flexible blueprint for integrating advanced transaction cost models, control constraints, and high-dimensional risk factors in quantitative control analytics.
In summary, this deep learning–driven policy iteration framework represents a substantial advance in the numerical study of constrained stochastic control, enabling tractable and robust solutions to complex, high-dimensional HJB equations relevant to dynamic portfolio optimization under realistic market frictions and liquidity effects (Yan et al., 2 Sep 2025).