Convergence of Stochastic Gradient Methods for Wide Two-Layer Physics-Informed Neural Networks
(2508.21571v1)
Published 29 Aug 2025 in cs.LG, cs.NA, math.NA, and stat.ML
Abstract: Physics-informed neural networks (PINNs) are a very popular class of neural solvers for partial differential equations. In practice, one often employs stochastic gradient descent-type algorithms to train the neural network, so convergence guarantees for stochastic gradient descent are of fundamental importance. In this work, we establish the linear convergence, with high probability, of stochastic gradient descent / flow in training over-parameterized two-layer PINNs for a general class of activation functions. These results extend the existing result [18], in which gradient descent was analyzed. The challenge of the analysis lies in handling the dynamic randomness introduced by stochastic optimization methods; the key is to ensure the positive definiteness of suitable Gram matrices during training. The analysis sheds light on the dynamics of the optimization process and provides guarantees for neural networks trained by stochastic algorithms.
Summary
The paper demonstrates that mini-batch SGD and SGF achieve linear convergence under high-probability NTK Gram matrix control for PINNs.
It employs rigorous analysis using concentration inequalities and the lazy training regime to ensure minimal deviation from initialization.
The work provides explicit conditions on network width, learning rate, and activation function to effectively solve linear PDEs.
Introduction and Motivation
Physics-Informed Neural Networks (PINNs) have become a standard approach for solving partial differential equations (PDEs) by embedding the governing equations and boundary conditions directly into the loss function of a neural network. While empirical success is well-documented, theoretical understanding—especially regarding the convergence of stochastic optimization methods such as stochastic gradient descent (SGD)—remains limited. This work rigorously addresses the convergence properties of SGD and stochastic gradient flow (SGF) for over-parameterized (wide) two-layer PINNs, extending prior results that were restricted to deterministic gradient descent and specific activation functions.
Problem Setting and PINN Formulation
The analysis focuses on the Poisson equation with Dirichlet boundary conditions, but the framework extends to other linear second-order PDEs. The PINN architecture is a standard two-layer fully connected network:
$$\phi(x; w, a) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r\, \sigma(w_r^\top x),$$
where $m$ is the width, $\sigma$ is a locally Lipschitz activation function (piecewise $C^3$), and the parameters are initialized with Gaussian and Rademacher distributions for the weights $w_r$ and the output coefficients $a_r$, respectively.
The empirical PINN loss is:
$$L(w, a) = \frac{1}{2 n_1} \sum_{p=1}^{n_1} \big(\Delta \phi(x_p; w, a) - f(x_p)\big)^2 + \frac{\gamma}{2 n_2} \sum_{q=1}^{n_2} \big(\phi(y_q; w, a) - g(y_q)\big)^2,$$
where $\{x_p\}_{p=1}^{n_1}$ and $\{y_q\}_{q=1}^{n_2}$ are the interior and boundary sample points, respectively, and $\gamma > 0$ weights the boundary term.
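To make the formulation concrete, here is a minimal PyTorch sketch (illustrative, not the authors' code) of the two-layer PINN and this loss, assuming a tanh activation and the 1/sqrt(m) scaling above; the names phi, laplacian, and pinn_loss are hypothetical helpers.
import torch

torch.manual_seed(0)
d, m = 2, 256                                      # input dimension, network width
w = torch.randn(m, d, requires_grad=True)          # Gaussian hidden-layer weights
a = (2 * torch.randint(0, 2, (m,)) - 1).float()    # Rademacher output coefficients
a.requires_grad_(True)

def phi(x):
    # Two-layer network: phi(x) = (1/sqrt(m)) * sum_r a_r * sigma(w_r^T x), with sigma = tanh
    return (a * torch.tanh(x @ w.t())).sum(dim=1) / m ** 0.5

def laplacian(x):
    # Delta phi(x) via nested automatic differentiation (sum of pure second derivatives)
    x = x.requires_grad_(True)
    grad = torch.autograd.grad(phi(x).sum(), x, create_graph=True)[0]
    return sum(torch.autograd.grad(grad[:, i].sum(), x, create_graph=True)[0][:, i]
               for i in range(x.shape[1]))

def pinn_loss(x_int, f_int, y_bdy, g_bdy, gamma=1.0):
    # Interior residual (Delta phi - f) and boundary residual (phi - g), averaged as in the loss above
    res_int = laplacian(x_int) - f_int
    res_bdy = phi(y_bdy) - g_bdy
    return 0.5 * res_int.pow(2).mean() + 0.5 * gamma * res_bdy.pow(2).mean()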
Main Theoretical Results
Gram Matrix Analysis and NTK Regime
A central technical challenge is to ensure the positive definiteness of the empirical neural tangent kernel (NTK) Gram matrices throughout training. The analysis leverages the "lazy training" regime, where parameter updates remain close to initialization, and the NTK remains nearly constant. The key assumptions are:
The activation function is locally Lipschitz up to third order.
The infinite-width Gram matrices (expectation over initialization) are strictly positive definite for the chosen sample points.
Under these conditions, the smallest eigenvalue of the empirical Gram matrix remains bounded away from zero with high probability throughout training, provided m is sufficiently large (scaling polynomially in the problem parameters and logarithmically in the number of samples and the inverse failure probability).
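This condition can also be monitored numerically. The following self-contained PyTorch sketch (illustrative, not from the paper) assembles the interior block of the empirical NTK Gram matrix from per-sample gradients of the PDE residual with respect to the hidden weights and reports its smallest eigenvalue; the tanh activation and 1/sqrt(m) scaling are the same assumptions as in the earlier sketch.
import torch

torch.manual_seed(0)
d, m, n = 2, 512, 20
w = torch.randn(m, d, requires_grad=True)          # Gaussian hidden-layer weights
a = (2 * torch.randint(0, 2, (m,)) - 1).float()    # Rademacher output coefficients
x = torch.rand(n, d, requires_grad=True)           # interior sample points

def residual(x):
    # PDE residual Delta phi(x) for phi(x) = (1/sqrt(m)) * sum_r a_r * tanh(w_r^T x)
    out = (a * torch.tanh(x @ w.t())).sum(dim=1) / m ** 0.5
    grad = torch.autograd.grad(out.sum(), x, create_graph=True)[0]
    return sum(torch.autograd.grad(grad[:, i].sum(), x, create_graph=True)[0][:, i]
               for i in range(d))

r = residual(x)
# Rows of the Jacobian of the residual vector with respect to the hidden weights
rows = [torch.autograd.grad(r[p], w, retain_graph=True)[0].reshape(-1) for p in range(n)]
J = torch.stack(rows)                              # shape (n, m*d)
K = J @ J.t()                                      # interior block of the empirical Gram matrix
print("smallest eigenvalue:", torch.linalg.eigvalsh(K).min().item())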
Convergence of SGD
The main result establishes that, for sufficiently wide networks and small enough learning rates, mini-batch SGD achieves linear convergence in expectation to the global minimum of the PINN loss, with high probability over random initialization and stochasticity in the optimization. Specifically, for step size $\eta$ and smallest Gram eigenvalue $\lambda_\theta$,
$$\mathbb{E}[L(t)] \;\le\; \Big(1 - \frac{\eta \lambda_\theta}{2}\Big)^{t} L(0).$$
The required network width m and step size η are explicitly characterized in terms of problem dimension, sample size, and spectral properties of the Gram matrices. The analysis carefully controls the deviation of parameters from initialization to ensure the NTK regime persists throughout training.
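As a quick consequence of this rate (a standard calculation, not a statement from the paper), driving the expected loss below a tolerance $\varepsilon$ requires on the order of
$$t \;\ge\; \frac{2}{\eta \lambda_\theta} \log \frac{L(0)}{\varepsilon}$$
iterations, since $(1 - \eta\lambda_\theta/2)^t \le \exp(-\eta\lambda_\theta t/2)$; the iteration count grows only logarithmically in the target accuracy.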
Convergence of Stochastic Gradient Flow
The continuous-time analogue, SGF, is analyzed via stochastic differential equations. Using Itô calculus and the Polyak-Łojasiewicz (PL) property of the loss (guaranteed by the positive definiteness of the Gram matrix), the expected loss decays exponentially:
$$\mathbb{E}[L(t)] \;\le\; \exp\Big(-\frac{\lambda_\theta t}{2}\Big)\, L(0).$$
The analysis requires the activation function to be locally Lipschitz up to fourth order to control the Hessian terms in the Itô expansion.
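Schematically, and in notation not taken from the paper: assuming a PL-type bound $\|\nabla L(\theta)\|^2 \ge \lambda_\theta\, L(\theta)$ along the trajectory and that the Itô correction term consumes at most half of the resulting decrease,
$$\frac{d}{dt}\,\mathbb{E}[L(\theta_t)] \;\le\; -\,\mathbb{E}\big[\|\nabla L(\theta_t)\|^2\big] + \text{(Itô correction)} \;\le\; -\frac{\lambda_\theta}{2}\,\mathbb{E}[L(\theta_t)],$$
and Grönwall's inequality yields the exponential decay stated above.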
Technical Innovations
The analysis employs concentration inequalities for sub-Weibull random variables (recalled after this list) to control the random initialization and the stochasticity in SGD.
Uniform control over all parameter deviations is achieved via a stopping time argument, ensuring the NTK regime is maintained with high probability.
The results hold for a broad class of activation functions, including smooth activations and high-order RePU, under milder regularity assumptions than in previous works.
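For reference, the standard sub-Weibull tail condition (parameter conventions may differ from the paper's): a random variable $X$ is sub-Weibull with tail parameter $\alpha > 0$ if there is a constant $K > 0$ such that
$$\mathbb{P}\big(|X| \ge s\big) \;\le\; 2 \exp\!\big(-(s/K)^{\alpha}\big) \quad \text{for all } s \ge 0.$$
The case $\alpha = 2$ recovers sub-Gaussian and $\alpha = 1$ sub-exponential tails, while $\alpha < 1$ permits the heavier tails that arise, for example, from powers of Gaussian variables under polynomially growing activations.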
Numerical and Theoretical Implications
Strong claims in the paper include:
SGD and SGF achieve linear convergence to global minima for over-parameterized two-layer PINNs under high-probability control of the NTK Gram matrix.
The required network width for SGD is higher than for deterministic gradient descent, reflecting the additional randomness in the optimization trajectory.
The analysis is robust to a wide class of activation functions, not limited to $\mathrm{ReLU}^3$ or analytic activations.
Limitations:
The results are restricted to linear PDEs; for nonlinear PDEs, the positive definiteness of the Gram matrix may fail, and the analysis does not extend.
The spectral properties of the Gram matrix, which are critical for convergence rates, are not fully characterized and depend on the sampling scheme and network architecture.
Practical Considerations for Implementation
Network Width: To guarantee convergence, the width m must scale polynomially with the problem dimension and logarithmically with the number of samples and the desired confidence level.
Learning Rate: The step size η must be chosen sufficiently small, scaling inversely with the smallest eigenvalue of the Gram matrix and polynomially with the network width.
Initialization: Gaussian initialization for weights and Rademacher for output coefficients are required to ensure the concentration results.
Activation Function: Any piecewise $C^3$ locally Lipschitz activation (e.g., high-order RePU, softplus, tanh) is admissible.
Batch Size: The analysis covers mini-batch SGD, and the batch size can be chosen flexibly, provided unbiasedness is maintained.
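These choices can be combined into a short training loop. The sketch below is illustrative rather than the authors' implementation: compute_PINN_gradients is an assumed helper returning unbiased mini-batch gradients of the loss with respect to w and a, and T, eta, batch_size, n1, n2, and the initial parameters (w0, a0) are taken as given.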
import numpy as np

rng = np.random.default_rng(0)
for t in range(T):
    # Sample mini-batch indices for interior and boundary points (uniform, without replacement)
    batch_I = rng.choice(n1, size=batch_size, replace=False)
    batch_J = rng.choice(n2, size=batch_size, replace=False)
    # Unbiased mini-batch gradients of the PINN loss (assumed helper)
    grad_w, grad_a = compute_PINN_gradients(w, a, batch_I, batch_J)
    w -= eta * grad_w
    a -= eta * grad_a
    # Optionally: monitor ||w - w0|| and ||a - a0|| to check that the NTK regime is maintained
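Because the mini-batch indices are drawn uniformly, the resulting gradient estimate is unbiased for the full-batch gradient, matching the unbiasedness requirement above; the drift check in the final comment is the practical counterpart of the lazy-training condition used in the analysis.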
Resource Requirements:
Memory and compute scale linearly with m and the number of samples.
Automatic differentiation is required for efficient computation of second derivatives in the loss.
Deployment:
The results justify the use of SGD for large-scale PINN training in high-dimensional linear PDEs, provided the network is sufficiently wide.
For practical applications, empirical monitoring of the NTK spectrum and parameter drift is recommended to ensure the theoretical regime is maintained.
Open Problems and Future Directions
Nonlinear PDEs: Extending the convergence analysis to nonlinear PDEs remains open, as the NTK Gram matrix may lose positive definiteness.
Spectral Analysis: Precise characterization of the smallest eigenvalue of the PINN NTK Gram matrix as a function of network architecture, activation, and sampling is an important open problem.
Beyond Two Layers: Generalization to deeper PINN architectures and other optimization algorithms (e.g., Adam, L-BFGS) is of practical interest.
Generalization Error: While optimization convergence is established, the link to generalization (the approximation error to the true PDE solution) requires further study.
Conclusion
This work provides a rigorous convergence theory for stochastic gradient methods applied to wide two-layer PINNs solving linear PDEs. The results establish linear convergence rates in expectation for both SGD and SGF under high-probability control of the NTK Gram matrix, with explicit conditions on network width, learning rate, and activation function regularity. These findings offer theoretical justification for the widespread empirical use of SGD in PINN training and highlight key directions for future research in the optimization and analysis of neural PDE solvers.