Deep Galerkin Method (DGM) Overview
- Deep Galerkin Method (DGM) is a meshfree, neural network approach that approximates solutions to high-dimensional PDEs by minimizing a residual loss enforcing operator, boundary, and initial conditions.
- It uses stochastic sampling of the domain to sidestep the curse of dimensionality, together with specially designed DGM layers whose LSTM-like gating mitigates issues such as vanishing gradients.
- DGM has achieved impressive results in applications such as financial option pricing and physics, often reaching relative errors below 0.3% in benchmark tests.
The Deep Galerkin Method (DGM) is a meshfree, neural-network-based framework for numerically solving partial differential equations (PDEs) of high complexity and dimensionality. DGM approximates the solution to a given PDE using a deep neural network trained to satisfy the equation’s operator, as well as corresponding initial and boundary conditions, directly at randomly sampled points across the domain. This approach overcomes the curse of dimensionality that plagues classical grid-based methods, and supports solving a wide range of nonlinear, high-dimensional, and free-boundary PDEs, including those in mathematical finance, stochastic control, and physics.
1. Fundamental Approach and Formulation
DGM replaces classical mesh-based and Galerkin schemes with a meshfree approach in which the solution $u(t,x)$ is approximated by a neural network $f(t,x;\theta)$ parameterized by $\theta$. Training is performed by minimizing a residual-based loss functional that enforces the strong (pointwise) form of the equation:

$$
J(f) = \left\| \partial_t f + \mathcal{L} f \right\|^2_{[0,T]\times\Omega,\;\nu_1}
+ \left\| f - g \right\|^2_{[0,T]\times\partial\Omega,\;\nu_2}
+ \left\| f(0,\cdot;\theta) - u_0 \right\|^2_{\Omega,\;\nu_3}
$$

Here, $\mathcal{L}$ is the (generally nonlinear) spatial operator, $g$ specifies boundary conditions, $u_0$ is the initial condition, and the norms are weighted $L^2$ norms with respect to sampling measures $\nu_1, \nu_2, \nu_3$ (Sirignano et al., 2017). Unlike the classical Galerkin method, which expresses $u$ as a linear combination of basis functions and projects the PDE in the weak form, DGM employs a single neural network to represent the solution across the entire domain, learning to minimize this loss in a meshfree, data-driven manner.
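To make the loss concrete, the sketch below computes a Monte Carlo estimate of $J(f)$ for the simple case of the heat equation $u_t = \Delta u$, using automatic differentiation for the derivatives. It is a minimal PyTorch sketch under those assumptions; the function name `pde_residual_loss` and the argument layout are illustrative and not taken from any cited implementation.

```python
import torch

def pde_residual_loss(net, t_int, x_int, t_bdy, x_bdy, g_bdy, x_init, u0_init):
    """Monte Carlo estimate of J(f) for a toy heat equation u_t = Δu (illustrative only).

    net                 : callable mapping (t, x) -> f(t, x; θ)
    t_int, x_int        : interior samples from [0, T] x Ω
    t_bdy, x_bdy, g_bdy : boundary samples and target boundary values
    x_init, u0_init     : initial-time samples and target initial values
    """
    t_int = t_int.requires_grad_(True)
    x_int = x_int.requires_grad_(True)
    f = net(t_int, x_int)

    # First derivatives via automatic differentiation.
    f_t = torch.autograd.grad(f, t_int, torch.ones_like(f), create_graph=True)[0]
    f_x = torch.autograd.grad(f, x_int, torch.ones_like(f), create_graph=True)[0]

    # Laplacian: sum of second derivatives in each spatial coordinate.
    lap = 0.0
    for i in range(x_int.shape[1]):
        f_xi = f_x[:, i:i + 1]
        f_xixi = torch.autograd.grad(f_xi, x_int, torch.ones_like(f_xi),
                                     create_graph=True)[0][:, i:i + 1]
        lap = lap + f_xixi

    interior = ((f_t - lap) ** 2).mean()                  # operator residual term
    boundary = ((net(t_bdy, x_bdy) - g_bdy) ** 2).mean()  # boundary condition term
    initial = ((net(torch.zeros_like(x_init[:, :1]), x_init) - u0_init) ** 2).mean()
    return interior + boundary + initial
```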
2. Meshfree Stochastic Optimization and Sampling
The DGM algorithm is fundamentally meshfree—no spatial grid or mesh is constructed. At each training iteration, the following procedure is repeated:
- Randomly sample batches of points from the interior of the domain, boundary, and initial time.
- Compute the (pointwise) squared error components of the loss functional at these samples.
- Update the neural network parameters using stochastic gradient descent (SGD) based on this minibatch.
For PDEs with high-dimensional spatial domains, this random sampling—interpretable as a Monte Carlo approximation—avoids the exponential growth in computational requirements associated with mesh-based methods (Sirignano et al., 2017, Al-Aradi et al., 2018). Sampling strategies may be adapted (e.g., “box” sampling for free-boundary problems) to ensure adequate coverage near boundaries and in regions of rapid solution variation (Rou, 23 Jul 2025).
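The loop below sketches this resample-and-update procedure. It assumes the `pde_residual_loss` helper sketched in Section 1, a unit hypercube spatial domain $[0,1]^d$, homogeneous Dirichlet boundary data, and a user-supplied initial condition `u0_fn`; these choices, and the use of Adam in place of plain SGD, are illustrative.

```python
import torch

def train_dgm(net, u0_fn, d, T=1.0, n_iters=10_000, batch=1024, lr=1e-3):
    """Meshfree DGM training: resample random points and take one gradient step per iteration."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(n_iters):
        # Interior points sampled uniformly from [0, T] x [0, 1]^d (Monte Carlo sampling).
        t_int, x_int = T * torch.rand(batch, 1), torch.rand(batch, d)

        # Boundary points: push one random coordinate of each sample onto a face of the cube.
        t_bdy, x_bdy = T * torch.rand(batch, 1), torch.rand(batch, d)
        face = torch.randint(0, d, (batch,))
        x_bdy[torch.arange(batch), face] = torch.randint(0, 2, (batch,)).float()
        g_bdy = torch.zeros(batch, 1)          # homogeneous Dirichlet data (illustrative)

        # Initial-time points.
        x_init = torch.rand(batch, d)
        u0_init = u0_fn(x_init)

        loss = pde_residual_loss(net, t_int, x_int, t_bdy, x_bdy, g_bdy, x_init, u0_init)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return net
```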
3. Neural Network Architecture and Implementation
DGM utilizes deep feedforward architectures designed to efficiently capture high-dimensional, nonlinear dependencies. It introduces “DGM layers,” which combine residual connections with LSTM-inspired gating mechanisms:
$$
\begin{aligned}
S^1 &= \sigma(W^1 \mathbf{x} + b^1), \\
Z^\ell &= \sigma(U^{z,\ell} \mathbf{x} + W^{z,\ell} S^\ell + b^{z,\ell}), &\ell = 1,\dots,L, \\
G^\ell &= \sigma(U^{g,\ell} \mathbf{x} + W^{g,\ell} S^1 + b^{g,\ell}), &\ell = 1,\dots,L, \\
R^\ell &= \sigma(U^{r,\ell} \mathbf{x} + W^{r,\ell} S^\ell + b^{r,\ell}), &\ell = 1,\dots,L, \\
H^\ell &= \sigma\!\left(U^{h,\ell} \mathbf{x} + W^{h,\ell} (S^\ell \odot R^\ell) + b^{h,\ell}\right), &\ell = 1,\dots,L, \\
S^{\ell+1} &= (1 - G^\ell) \odot H^\ell + Z^\ell \odot S^\ell, &\ell = 1,\dots,L, \\
f(t,x;\theta) &= W\, S^{L+1} + b,
\end{aligned}
$$

where $\mathbf{x} = (t,x)$, $\sigma$ is an elementwise nonlinearity, and $\odot$ denotes the elementwise (Hadamard) product.
The architecture’s design mitigates vanishing gradient issues and enables training of deep networks required for high-dimensional solution approximation. The output is typically an affine transformation of the final hidden state. Derivatives needed for the residual loss are computed via automatic differentiation (Al-Aradi et al., 2018).
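A compact PyTorch sketch of this architecture is given below, assuming a scalar-valued solution and tanh nonlinearities; for brevity the gate $G^\ell$ here acts on the running state $S^\ell$ rather than on $S^1$, and the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class DGMLayer(nn.Module):
    """One DGM layer: LSTM-like gates Z, G, R, H acting on the running state S."""
    def __init__(self, in_dim, hidden):
        super().__init__()
        self.Uz, self.Wz = nn.Linear(in_dim, hidden), nn.Linear(hidden, hidden)
        self.Ug, self.Wg = nn.Linear(in_dim, hidden), nn.Linear(hidden, hidden)
        self.Ur, self.Wr = nn.Linear(in_dim, hidden), nn.Linear(hidden, hidden)
        self.Uh, self.Wh = nn.Linear(in_dim, hidden), nn.Linear(hidden, hidden)

    def forward(self, x, S):
        Z = torch.tanh(self.Uz(x) + self.Wz(S))
        G = torch.tanh(self.Ug(x) + self.Wg(S))
        R = torch.tanh(self.Ur(x) + self.Wr(S))
        H = torch.tanh(self.Uh(x) + self.Wh(S * R))
        return (1 - G) * H + Z * S        # gated residual-style update

class DGMNet(nn.Module):
    """Input layer, a stack of DGM layers, and an affine output layer."""
    def __init__(self, d, hidden=50, n_layers=3):
        super().__init__()
        self.input = nn.Linear(d + 1, hidden)        # network input is (t, x)
        self.layers = nn.ModuleList([DGMLayer(d + 1, hidden) for _ in range(n_layers)])
        self.output = nn.Linear(hidden, 1)

    def forward(self, t, x):
        z = torch.cat([t, x], dim=1)
        S = torch.tanh(self.input(z))
        for layer in self.layers:
            S = layer(z, S)
        return self.output(S)
```

A network of this shape plugs directly into the training loop sketched above, e.g. `net = DGMNet(d=10)`.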
For problems with additional constraints, such as enforcing solution positivity or mass conservation (as in Fokker-Planck PDEs), the output can be reparameterized, e.g., as a normalized exponential, and auxiliary integral terms are then included in the residual (Al-Aradi et al., 2019).
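One way to realize such a reparameterization is sketched below: the raw network output is exponentiated (ensuring positivity) and divided by a Monte Carlo estimate of its spatial integral (approximate normalization). This assumes a unit-volume spatial domain and is a generic illustration, not the exact construction of the cited work.

```python
import torch

def density_output(net, t, x, x_mc):
    """Reparameterize f = exp(net) / Z_t so the output is positive and (approximately) normalized.

    x_mc : Monte Carlo points covering the spatial domain, used to estimate the
           normalizing constant Z_t = ∫ exp(net(t, y)) dy (here, over a unit-volume domain).
    """
    log_unnormalized = net(t, x)
    # Estimate Z_t for each time in the batch using the same Monte Carlo cloud.
    t_rep = t.repeat_interleave(x_mc.shape[0], dim=0)
    x_rep = x_mc.repeat(t.shape[0], 1)
    Z = torch.exp(net(t_rep, x_rep)).view(t.shape[0], -1).mean(dim=1, keepdim=True)
    return torch.exp(log_unnormalized) / Z
```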
4. Theoretical Guarantees and Convergence
A key theoretical foundation is provided by universal approximation theorems: for a suitable class of neural networks (with increasing depth or width), there exists a sequence of networks $f_n$ such that the residual loss $J(f_n) \to 0$ as $n \to \infty$, and the network solutions converge (in $L^\rho$ for $\rho < 2$) to the true PDE solution (Sirignano et al., 2017). Further, for weak solutions to elliptic PDEs, convergence rates in the $H^1$ norm have been established: with $n$ training samples, the error decays at an explicit algebraic rate in $n$ under appropriate scaling of network size and sample count (Jiao et al., 2023).
Extensions to viscosity solutions for nonlinear first-order HJB equations demonstrate that, when the DGM loss (now in a sup-norm sense) vanishes, the neural network approximation converges uniformly to the unique solution (Hofgard et al., 22 May 2024). In the infinite-width limit and with sufficient training, the parameter dynamics of DGM converge to those of a limiting ODE whose solution is the true PDE solution; the residual decays exponentially under mild conditions (Jiang et al., 2023).
5. Applications in High-Dimensional and Nonlinear PDEs
DGM has been applied to several challenging classes of PDEs:
- Free-boundary PDEs such as American option pricing: DGM solves the variational inequality under the obstacle constraint $u(t,x) \geq g(x)$, with $g$ the option payoff, accurately resolving the early exercise boundary even in up to 200 dimensions (Sirignano et al., 2017, Rou, 23 Jul 2025); a penalty-based sketch of this constraint appears after this list.
- Hamilton-Jacobi-Bellman (HJB) equations arising in stochastic control and mean field games: Both classical (primal) and “unsimplified” forms (containing a supremum over controls) can be addressed by representing both the value function and the control as neural networks, trained via alternating updates inspired by policy improvement (Al-Aradi et al., 2019).
- Fokker-Planck equations and mean field systems: Network outputs are constructed to enforce nonnegativity and normalization, with integral terms in the PDE handled by importance sampling (Al-Aradi et al., 2019).
- Burgers' equation, Stokes equations, and other nonlinear, time-dependent, and parameterized systems: DGM has shown robust accuracy in capturing shocks, layers, and solution dependencies in continuous parameter spaces (Sirignano et al., 2017, Li et al., 2020).
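For the free-boundary case referenced above, one simple way to encode the obstacle constraint in the DGM loss is a squared-hinge penalty that punishes violations of $f \geq g$. The sketch below is a generic illustration of that idea (function names are hypothetical), not necessarily the exact scheme used in the cited papers.

```python
import torch

def obstacle_penalty(net, t, x, payoff_fn, weight=1.0):
    """Penalty term encouraging f(t, x; θ) >= payoff(x) for American-style obstacle problems.

    payoff_fn : callable mapping spatial points to the option payoff g(x).
    The squared hinge max(g - f, 0)^2 vanishes wherever the constraint holds.
    """
    violation = torch.clamp(payoff_fn(x) - net(t, x), min=0.0)
    return weight * (violation ** 2).mean()
```

This term is simply added to the residual loss at each iteration, so the sampled interior points also serve to enforce the constraint.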
6. Performance, Limitations, and Comparison with Related Methods
In benchmark studies, DGM achieves relative errors below 0.3% in very high-dimensional American option pricing problems, with similarly high accuracy reported in other application domains (Sirignano et al., 2017). The meshfree sampling and the absence of matrix assembly considerably reduce memory and CPU overhead compared to finite-element or finite-difference schemes.
Limitations of DGM include:
- Solution quality is sensitive to the distribution of sampled points; undersampled regions can degrade approximation.
- Neural networks may perform poorly for PDEs whose solution includes singularities or for regions where higher-order derivatives have large dynamic range.
- Training is nonconvex and may converge to local minima, though empirical results indicate generally robust convergence.
Comparative studies with deep Ritz and mixed residual methods reveal that DGM is especially effective for smooth solutions and high dimensions but may be outperformed for certain variational problems or low-regularity solutions (Chen et al., 2020, Lyu et al., 2020). Hybrid and domain-specific architectures (e.g., discontinuous Galerkin variants, ML-enhanced surrogates) further extend the potential and efficiency of the method (Chen et al., 2021, Feng et al., 14 Nov 2024).
7. Advances, Extensions, and Future Directions
DGM has seen numerous methodological advances:
- Incorporation of fractional-boundary regularization via Sobolev-Slobodeckij norms improves convergence of the solution and its derivatives under challenging free-boundary or variational inequality settings (Zhao et al., 25 May 2025).
- Extensions for simultaneous approximation of state and control, reparameterization for constrained densities, and adoption of “weak/Galerkin” frameworks with dual neural networks expand the methodological reach to more general classes of PDEs (Al-Aradi et al., 2019, Jiao et al., 2023).
- Practical applications in geoscience, finance, fluid mechanics, and other fields demonstrate the flexibility and scalability of DGM—often in regimes previously not accessible to classical numeric methods (Zhang et al., 2022, Rou, 23 Jul 2025).
Ongoing research directions include rigorously analyzing approximation rates, adaptive sampling and regularization strategies, the design of problem-specific network architectures, and benchmarks for structures such as primal-dual consistency and solution self-testing. The method’s proven convergence in the wide-network or large-sample limit underpins its adoption for scientific computing in high-dimensional and nonlinear PDEs.
In summary, the Deep Galerkin Method is a rigorously justified, flexible, and scalable deep learning paradigm for the solution of high-dimensional, nonlinear, and free-boundary PDEs. It merges meshfree stochastic optimization with deep neural network approximation, achieving accuracy and computational tractability in problem settings inaccessible to classical alternatives. Extensions continue to broaden its applicability, supported by ongoing advances in theory, sampling, architecture, and application domain expertise.