Deep Galerkin Methods (DGMs)

Updated 4 August 2025
  • Deep Galerkin Methods (DGMs) are mesh-free neural network approaches that approximate PDE solutions by minimizing the strong residual over space–time.
  • DGMs leverage Monte Carlo sampling for interior, boundary, and initial conditions while enforcing constraints through penalty terms, effectively addressing the curse of dimensionality.
  • Theoretical convergence guarantees and versatile applications in finance, physics, and control underscore DGMs as a promising tool that complements traditional numerical methods.

Deep Galerkin Methods (DGMs) represent a class of mesh-free, neural network-based algorithms for approximating solutions to high-dimensional partial differential equations (PDEs). Instead of expressing the solution as a linear combination of basis functions (as in classical Galerkin or finite element methods), DGMs model it with a deep neural network trained to minimize the residual (in the appropriate norm) of the differential operator, initial condition, and boundary conditions, sampled at randomly chosen points throughout the problem domain. This approach enables practical solution of PDEs in dimensional regimes that are intractable for traditional mesh-based solvers due to the curse of dimensionality, and it leverages the universal approximation properties of neural architectures.

1. Formulation and Theoretical Foundations

DGMs approximate the solution u(t,x) of a PDE with a neural network f(t,x;\theta), where \theta denotes the trainable parameters. The network is optimized to minimize the loss function

J(f) = \|\partial_t f(t,x;\theta) + \mathcal{L} f(t,x;\theta)\|^2_{[0,T]\times\Omega,\,\nu_1} + \|f(t,x;\theta) - g(t,x)\|^2_{[0,T]\times\partial\Omega,\,\nu_2} + \|f(0,x;\theta) - u_0(x)\|^2_{\Omega,\,\nu_3}

where \mathcal{L} is the spatial differential operator, u_0(x) the initial data, g(t,x) the boundary condition, and \nu_1, \nu_2, \nu_3 are measures over the interior, boundary, and initial time slice, respectively (Sirignano et al., 2017). This L²-type loss integrates the squared residuals of the strong differential form, boundary data, and initial data against suitable measures.

A central theorem establishes the existence of a sequence of neural networks \{f^n\} such that J(f^n) \to 0 and f^n \to u in L^\rho for every \rho < 2, provided u is sufficiently smooth. This result builds on the universal approximation theorems of Cybenko and Hornik and confirms that deep networks not only approximate solutions but can also drive the PDE residual arbitrarily close to zero for quasilinear parabolic PDEs, supporting the theoretical viability of the method (Sirignano et al., 2017).
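
As a concrete instantiation (an illustrative choice, not taken from the cited papers), consider the heat equation \partial_t u = \Delta u on [0,T]\times\Omega with Dirichlet data g and initial condition u_0; here \mathcal{L}f = -\Delta f, and the loss becomes

J(f) = \|\partial_t f - \Delta f\|^2_{[0,T]\times\Omega,\,\nu_1} + \|f - g\|^2_{[0,T]\times\partial\Omega,\,\nu_2} + \|f(0,\cdot) - u_0\|^2_{\Omega,\,\nu_3}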

2. Algorithmic Structure and Implementation

DGMs are mesh-free and rely on Monte Carlo sampling of space–time points in the domain, on the boundary, and at initial time:

  • For each minibatch, points are sampled randomly from [0,T]\times\Omega, [0,T]\times\partial\Omega, and \{0\}\times\Omega.
  • The network f(t,x;\theta) is evaluated at these points, and the loss J(f) is estimated stochastically.
  • Stochastic gradient descent (SGD) or a variant is used to update network parameters via

\theta \leftarrow \theta - \alpha \nabla_\theta G(\theta, s)

where G(\theta, s) estimates J(f) over minibatch s.
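
A minimal sketch of one such training step in PyTorch (an illustrative implementation, not the reference code of Sirignano et al.; the network architecture, the unit-cube domain [0,1]^d, the heat-equation operator, and the zero Dirichlet data are all assumptions of this sketch):

```python
import torch

d, T, batch = 5, 1.0, 256            # dimension, horizon, minibatch size (assumed)
f_net = torch.nn.Sequential(         # placeholder network for f(t, x; theta)
    torch.nn.Linear(d + 1, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)
opt = torch.optim.Adam(f_net.parameters(), lr=1e-3)

def residual(t, x):
    """Strong residual for an assumed heat equation u_t = laplacian(u), i.e. Lf = -laplacian(f)."""
    t.requires_grad_(True); x.requires_grad_(True)
    u = f_net(torch.cat([t, x], dim=1))
    u_t = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    u_x = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    lap = sum(torch.autograd.grad(u_x[:, i].sum(), x, create_graph=True)[0][:, i:i + 1]
              for i in range(d))
    return u_t - lap

for step in range(10_000):
    # Monte Carlo samples from the interior, the boundary, and the initial slice
    t_in, x_in = T * torch.rand(batch, 1), torch.rand(batch, d)
    t_bd, x_bd = T * torch.rand(batch, 1), torch.rand(batch, d)
    x_bd[torch.arange(batch), torch.randint(0, d, (batch,))] = \
        torch.randint(0, 2, (batch,)).float()             # project one coordinate to a face
    x_ic = torch.rand(batch, d)

    g = torch.zeros(batch, 1)                              # assumed Dirichlet data g = 0
    u0 = torch.sin(torch.pi * x_ic).prod(dim=1, keepdim=True)  # assumed initial condition

    loss = (residual(t_in, x_in).pow(2).mean()                                    # PDE residual term
            + (f_net(torch.cat([t_bd, x_bd], dim=1)) - g).pow(2).mean()           # boundary term
            + (f_net(torch.cat([torch.zeros(batch, 1), x_ic], dim=1)) - u0).pow(2).mean())  # initial term
    opt.zero_grad(); loss.backward(); opt.step()
```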

High-dimensional second derivatives (e.g., for diffusion terms) are approximated with an efficient Monte Carlo finite-difference scheme. For a term such as \tfrac{1}{2}\sum_{i,j} \rho_{ij}\,\sigma_i(x)\sigma_j(x)\,\partial^2 f/\partial x_i \partial x_j, DGM uses

\lim_{\Delta \to 0} \mathbb{E}\left[\sum_i \frac{\partial f/\partial x_i(t, x + \sigma(x) W_\Delta; \theta) - \partial f/\partial x_i(t, x; \theta)}{\Delta}\, \sigma_i(x) W_\Delta^i \right]

with W_\Delta a standard Brownian increment and \Delta small, using antithetic variates to reduce bias and achieve O(\Delta) error (Sirignano et al., 2017).
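
A sketch of this estimator, with antithetic variates implemented by averaging the draws +W_\Delta and -W_\Delta (the network f_net, the diffusion function sigma, and the use of autograd for the first derivatives are assumptions of this sketch):

```python
import torch

def mc_second_order_estimate(f_net, t, x, sigma, delta=1e-4):
    """Monte Carlo estimate of the displayed quantity
    sum_i [f_{x_i}(t, x + sigma(x) W) - f_{x_i}(t, x)] / delta * sigma_i(x) W^i,
    averaged over the antithetic pair (+W, -W) to reduce bias.
    Correlations between components are assumed folded into sigma (simplification)."""
    def grad_x(xx):
        xx = xx.detach().requires_grad_(True)
        u = f_net(torch.cat([t, xx], dim=1))
        return torch.autograd.grad(u.sum(), xx, create_graph=True)[0]

    w = torch.randn_like(x) * delta ** 0.5                 # Brownian increment W_Delta
    base = grad_x(x)
    plus = ((grad_x(x + sigma(x) * w) - base) / delta * sigma(x) * w).sum(dim=1, keepdim=True)
    minus = ((grad_x(x - sigma(x) * w) - base) / delta * sigma(x) * (-w)).sum(dim=1, keepdim=True)
    return 0.5 * (plus + minus)                            # antithetic average
```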

The neural network architecture is chosen to efficiently model sharp solution features, such as shock or free boundary layers. LSTM- and highway network-inspired blocks are used, e.g.,

S^1 = \sigma(W^1 \bar{x} + b^1), \qquad Z^\ell = \sigma(U^{\ell,z} \bar{x} + W^{\ell,z} S^\ell + b^{\ell,z})

with nonlinear activation \sigma and \bar{x} = (t, x).
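
A sketch of one such gated block (a simplified highway/LSTM-style layer; the published DGM architecture uses several gates per layer, so the exact parameterization here is an assumption):

```python
import torch

class DGMLayer(torch.nn.Module):
    """Simplified DGM-style gated layer: a gate Z computed from the input
    x_bar = (t, x) and the running state S decides how much of a candidate
    update H replaces S (illustrative, not the exact published block)."""
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.Uz = torch.nn.Linear(d_in, d_hidden)
        self.Wz = torch.nn.Linear(d_hidden, d_hidden)
        self.Uh = torch.nn.Linear(d_in, d_hidden)
        self.Wh = torch.nn.Linear(d_hidden, d_hidden)

    def forward(self, x_bar, S):
        Z = torch.sigmoid(self.Uz(x_bar) + self.Wz(S))   # gate
        H = torch.tanh(self.Uh(x_bar) + self.Wh(S))      # candidate state
        return (1.0 - Z) * H + Z * S                     # gated update
```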

3. Comparison with Classical Galerkin and Discontinuous Galerkin Methods

While DGMs are distinct from traditional finite element and discontinuous Galerkin (DG) methods, there are conceptual parallels. Classical DG methods use finite element expansions in broken function spaces and weakly enforce continuity via numerical fluxes and penalty terms, enabling stability even with discontinuous trial functions (Hong et al., 2017). DGMs, in contrast, approximate the global solution directly with a neural network, whose smoothness is controlled by the loss function's penalty terms for boundary and initial conditions. Both methodologies impose constraints weakly (classical Galerkin through test-function orthogonality, DGMs by summing squared residuals), although DGMs do so in the strong (pointwise) form rather than the variational form.

Stabilization and penalization approaches in DG provide a philosophical foundation for DGMs’ penalty-based enforcement of boundary and regularity conditions and suggest avenues for designing improved loss functions in neural PDE solvers (Hong et al., 2017).

4. Extensions, Applications, and Boundary Conditions

DGMs have been extended to more complex classes of PDEs arising in stochastic control, mean field games, and high-dimensional finance (Al-Aradi et al., 2019). For PDEs with positivity/integration constraints (e.g., Fokker–Planck), DGM reparameterizes the solution as a normalized exponential of a neural network output, ensuring positivity and unit mass, and handles the resulting nonlinear partial integro-differential equations using importance sampling for the needed integrals. For Hamilton–Jacobi–Bellman equations containing an intrinsic optimization, dual-network training is employed: one network represents the value function and another the optimal control, with updates alternated in a policy improvement scheme.
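
For the positivity/normalization reparameterization, a minimal sketch (the network `net`, the importance-sampling proposal, and the scalar-output convention are all assumptions of this sketch):

```python
import torch

def normalized_density(net, x, proposal_samples, proposal_logpdf):
    """p(x) = exp(N(x; theta)) / Z, with the normalizing constant Z estimated by
    importance sampling: Z = E_q[ exp(N(X)) / q(X) ] for X ~ q.  This guarantees
    positivity and (approximately) unit mass during training."""
    log_unnorm = net(x).squeeze(-1)                                   # N(x; theta)
    weights = torch.exp(net(proposal_samples).squeeze(-1)
                        - proposal_logpdf(proposal_samples))          # exp(N)/q on proposal draws
    Z = weights.mean()                                                # Monte Carlo estimate of Z
    return torch.exp(log_unnorm) / Z
```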

Boundary conditions in DGMs are typically enforced via penalty terms in the total loss, with the specific form adapted to Dirichlet, Neumann, Robin, or periodic conditions (Chen et al., 2020). For Dirichlet BCs, an alternative is to build the solution as u(x;\theta) = L_D(x)\,\mathrm{DNN}(x;\theta) + G(x), where L_D(x) vanishes on the boundary and G(x) extends the boundary data into \Omega, so that the BCs are satisfied by construction.
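
For instance, on the unit interval with boundary values u(0) = a and u(1) = b, one possible choice (an assumption of this sketch, not prescribed by the cited work) is L_D(x) = x(1-x) and G(x) = a + (b-a)x:

```python
import torch

def hard_dirichlet_solution(dnn, x, a=0.0, b=1.0):
    """u(x; theta) = L_D(x) * DNN(x; theta) + G(x) on [0, 1], with
    L_D(x) = x(1 - x) vanishing at the boundary and G(x) = a + (b - a)x
    interpolating the Dirichlet data (illustrative choices)."""
    L_D = x * (1.0 - x)
    G = a + (b - a) * x
    return L_D * dnn(x) + G
```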

5. Convergence Properties and Practical Limitations

Rigorous convergence guarantees for DGMs have been established under appropriate regularity and statistical learning assumptions:

  • Minimizing the objective drives the neural network solution toward the true PDE solution in the relevant Sobolev norm, with convergence in L^2 and H^1 as the number of hidden units and the number of training samples grow (Jiao et al., 2023).
  • The convergence rate is O(n^{-1/d}) with n training samples in d dimensions, reflecting the curse of dimensionality.
  • Global convergence in the “wide network limit” (number of units \to \infty) is proven: the training dynamics converge to an infinite-dimensional linear ODE whose stationary solution matches the PDE solution, provided standard regularity and an invertibility condition on the PDE operator hold (Jiang et al., 2023).

Practical bottlenecks include the difficulty of high-dimensional optimization, the challenge of representing low-regularity or discontinuous solutions, and the computational cost of sampling and of backpropagating high-order derivatives, especially as the dimension grows or when the solution has sharp features. Empirical studies demonstrate DGM robustness in up to 200 dimensions for free-boundary (American option) and stochastic control problems (Sirignano et al., 2017), but stable training requires careful tuning of the architecture, penalty weights, and activation functions.

6. Relation to PINNs and Deep Ritz Methods

DGMs are closely related to Physics-Informed Neural Networks (PINNs) and Deep Ritz methods (DRM). The principal difference lies in the formulation of the loss: DGM minimizes the strong residual; PINNs may additionally include measurement-data terms; DRM employs variational (energy-based) losses that often require only first derivatives and yield enhanced numerical stability for low-regularity problems (Chen et al., 2020, Musco et al., 12 Sep 2024). For elliptic PDEs, DRM is sometimes favored for high-dimensional or non-smooth cases, while DGM can outperform DRM for specific low-regularity or highly smooth problems, depending on network architecture, activation selection, and penalty weighting.
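
To make the contrast concrete, here is a sketch of the two interior loss terms for the Poisson problem -\Delta u = h with homogeneous Dirichlet data (boundary penalties omitted; `u_net` and the source `h` are assumptions of this sketch):

```python
import torch

def dgm_interior_loss(u_net, x, h):
    """DGM: mean squared strong residual (-laplacian(u) - h)^2; needs second derivatives."""
    x = x.detach().requires_grad_(True)
    u = u_net(x)
    grad_u = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    lap = sum(torch.autograd.grad(grad_u[:, i].sum(), x, create_graph=True)[0][:, i:i + 1]
              for i in range(x.shape[1]))
    return ((-lap - h(x)) ** 2).mean()

def deep_ritz_interior_loss(u_net, x, h):
    """Deep Ritz: Monte Carlo estimate (up to the domain-volume factor) of the energy
    integral of (1/2)|grad u|^2 - h u; only first derivatives are required."""
    x = x.detach().requires_grad_(True)
    u = u_net(x)
    grad_u = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    return (0.5 * (grad_u ** 2).sum(dim=1, keepdim=True) - h(x) * u).mean()
```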

7. Impact and Outlook

DGMs have significantly expanded the tractable regime for PDE-based modeling in quantitative finance, physics, and engineering domains, especially for high-dimensional, nonlinear, and free-boundary problems. Their mesh-free and parallelizable character, in conjunction with theoretical convergence results, makes them a compelling tool for scenarios where classical methods falter. Ongoing work incorporates stabilization techniques from DG/FEM literature, convergence proofs for weak solutions, and the extension to multiphysics and coupled PDE systems. Hybrid methods and domain-aware architectures, as well as advances in optimization (e.g., variance reduction, adaptive sampling), are driving further improvements in efficiency, accuracy, and interpretability of DGMs in real-world scientific computing.