Subsampled Natural Gradient Descent
- Subsampled Natural Gradient Descent is an optimization method that uses statistical manifold geometry and mini-batch approximations to scale natural gradient updates.
- It employs scalable curvature approximations, such as block-diagonal Fisher matrices and Kronecker-factored methods, to reduce computational complexity while preserving key geometric properties.
- The method achieves fast convergence under quadratic losses with structured updates and adaptive learning strategies, making it suitable for high-dimensional, large-scale applications.
Subsampled Natural Gradient Descent (SNGD) is an optimization methodology that incorporates statistical manifold geometry into parameter updates while leveraging subsampled data or curvature approximations to achieve computational scalability. SNGD generalizes natural gradient descent (NGD) by estimating the Fisher information (or related curvature metrics) and the gradient from subsets or mini-batches, making it suitable for large-scale machine learning, scientific computing, and high-dimensional statistical problems. The following sections provide an in-depth account of SNGD’s theoretical underpinnings, algorithmic techniques, convergence properties, scalable architectures, practical implications, and advanced variants.
1. Theoretical Foundations and Statistical Geometry
SNGD is rooted in the geometry of the parameter space, typically a statistical manifold endowed with a Riemannian metric, most often the Fisher Information Matrix (FIM). The generic NGD update is

$$\theta_{t+1} = \theta_t - \eta\, F(\theta_t)^{-1} \nabla L(\theta_t),$$

where $F(\theta)$ is the FIM and $L(\theta)$ is the objective. The natural gradient corresponds to the steepest descent direction with respect to the information-geometric metric, correcting for anisotropies and non-uniform curvature.
SNGD modifies this scheme by computing both $F(\theta_t)$ and $\nabla L(\theta_t)$ (or approximations thereof) using a subset (mini-batch) of data. This enables updates at lower computational cost while preserving geometric information in the descent direction (Cao, 2015, Goldshlager et al., 28 Aug 2025).
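To make the subsampled update concrete, here is a minimal NumPy sketch of a single SNGD step. It assumes a user-supplied `per_example_grads(theta, batch)` routine and uses the empirical Fisher (outer products of per-example gradients) with a damping term `lam`; these interface and regularization choices are illustrative, not taken from the cited works.

```python
import numpy as np

def sngd_step(theta, per_example_grads, batch, lr=0.1, lam=1e-3):
    """One subsampled natural gradient step.

    per_example_grads(theta, batch) -> (b, n) array of per-example gradients.
    Both the empirical Fisher and the loss gradient are estimated from the
    same mini-batch; the damping lam keeps the solve well conditioned.
    """
    G = per_example_grads(theta, batch)   # shape (b, n)
    g = G.mean(axis=0)                    # mini-batch gradient estimate
    F = G.T @ G / G.shape[0]              # empirical (subsampled) Fisher
    n = theta.shape[0]
    nat_grad = np.linalg.solve(F + lam * np.eye(n), g)
    return theta - lr * nat_grad
```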
Theoretical analyses for linear least-squares and strongly convex quadratic problems prove that SNGD inherits contraction rates and stability properties from regularized Kaczmarz-type methods and projected stochastic approximation frameworks. For least-squares with mini-batch design matrix $X_t$ and targets $y_t$, the subsampled update

$$\theta_{t+1} = \theta_t - \eta\, X_t^\top \left(X_t X_t^\top + \lambda I\right)^{-1} \left(X_t \theta_t - y_t\right)$$

is equivalent to block-wise regularized Kaczmarz, with fast convergence governed by the spectrum of the expected projector (Goldshlager et al., 28 Aug 2025).
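The equivalence rests on the push-through identity $(X_t^\top X_t + \lambda I)^{-1} X_t^\top = X_t^\top (X_t X_t^\top + \lambda I)^{-1}$, which can be checked numerically; the snippet below uses synthetic data and an illustrative damping value.

```python
import numpy as np

rng = np.random.default_rng(0)
n, b, lam = 20, 5, 1e-2
X_t = rng.normal(size=(b, n))       # mini-batch rows of the design matrix
y_t = rng.normal(size=b)
theta = rng.normal(size=n)
r_t = X_t @ theta - y_t             # mini-batch residual

# Natural-gradient form: damped n x n solve in parameter space.
step_ngd = np.linalg.solve(X_t.T @ X_t + lam * np.eye(n), X_t.T @ r_t)

# Regularized block Kaczmarz form: b x b solve in sample space.
step_kac = X_t.T @ np.linalg.solve(X_t @ X_t.T + lam * np.eye(b), r_t)

print(np.allclose(step_ngd, step_kac))   # True: the two directions coincide
```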
2. Efficient Curvature Approximations and Matrix Factorization
Direct NGD is impractical for large parameter counts $n$ due to the prohibitive $O(n^2)$ storage and $O(n^3)$ inversion cost of the full FIM. SNGD leverages several scalable approximations:
- Block-Diagonal FIM: Splits the global Fisher matrix into blockwise (layerwise) components, vastly reducing storage and computation (Shrestha, 2023).
- Kronecker-Factored Approximate Curvature (KFAC): Approximates blockwise FIMs by Kronecker products of smaller matrices, $F_\ell \approx A_\ell \otimes G_\ell$, so that each block's inverse factors as $A_\ell^{-1} \otimes G_\ell^{-1}$ and per-layer inversion involves only the small factor matrices (Palacci et al., 2018, Izadi et al., 2020); see the sketch after this list.
- Woodbury Matrix Identity: Converts high-dimensional inversions in parameter space to lower-dimensional sample-space computations via $(J^\top J + \lambda I_n)^{-1} J^\top = J^\top (J J^\top + \lambda I_b)^{-1}$, reducing cost from $O(n^3)$ to roughly $O(n b^2 + b^3)$ for batch size $b \ll n$ (Guzmán-Cordero et al., 17 May 2025, Ren et al., 2019).
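As a sketch of the Kronecker-factored approximation mentioned in the KFAC bullet, the snippet below applies the factored inverse to a layer's weight gradient without forming the full Fisher block. The layer sizes, damping, and the synthetic factors `A` and `G` (standing in for activation and backpropagated-gradient second moments) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, lam = 64, 32, 1e-2

# Kronecker factors of one layer's Fisher block: F ~ A (x) G.
A = np.cov(rng.normal(size=(d_in, 512))) + lam * np.eye(d_in)    # input-side factor
G = np.cov(rng.normal(size=(d_out, 512))) + lam * np.eye(d_out)  # output-side factor

grad_W = rng.normal(size=(d_out, d_in))   # gradient of the layer's weight matrix

# (A (x) G)^{-1} vec(grad_W) = vec(G^{-1} grad_W A^{-1}):
# two small inversions replace one (d_in * d_out)^3 inversion.
nat_grad_W = np.linalg.solve(G, grad_W) @ np.linalg.inv(A)
```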
Structured variants implement block-triangular or group-based parameterizations to restrict the curvature matrix to tractable forms, preserving symmetry, sparsity, and invariance (Lin et al., 2021, Lin et al., 2021). Matrix square-root and iterative methods further reduce complexity for Fisher layer normalization (Liu et al., 10 Dec 2024).
3. Algorithmic Frameworks: Surrogates, Momentum, and Acceleration
SNGD extends NGD to stochastic and subsampled regimes by considering surrogate or regularized loss functions and adaptation strategies:
- Stochastic Loss Surrogates: Bregman divergences and alternative geometric loss functions allow for more flexible and robust updates, with standard MLE recovered as a special case (Cao, 2015).
- SPRING (Subsampled Projected-Increment NGD): Incorporates Nesterov-like acceleration and stochastic momentum, yielding provably faster convergence under quadratic losses. The update, $\phi_t = \mu\,\phi_{t-1} + J_t^\top (J_t J_t^\top + \lambda I)^{-1}\left(r_t - \mu\, J_t \phi_{t-1}\right)$ and $\theta_{t+1} = \theta_t - \eta\,\phi_t$, with $J_t$ and $r_t$ the mini-batch Jacobian and residual, realizes an accelerated regularized Kaczmarz method (Guzmán-Cordero et al., 17 May 2025, Goldshlager et al., 28 Aug 2025); a sketch appears after this list.
- Randomized Sketching: Nyström and sketch-and-solve approaches approximate kernel matrices for high-batch scenarios, further reducing per-iteration time—though accuracy may be limited by effective kernel rank (Guzmán-Cordero et al., 17 May 2025).
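A minimal sketch of the momentum recursion from the SPRING bullet above, written in the same least-squares notation as Section 1; the step size, momentum, and damping values are illustrative.

```python
import numpy as np

def spring_step(theta, phi_prev, X_t, y_t, lr=0.5, mu=0.9, lam=1e-2):
    """One accelerated subsampled natural-gradient (SPRING-style) step.

    The previous increment phi_prev is carried as momentum, and the
    regularized Kaczmarz correction is computed against the residual
    with the momentum contribution projected out.
    """
    b = X_t.shape[0]
    r_t = X_t @ theta - y_t                      # mini-batch residual
    correction = X_t.T @ np.linalg.solve(
        X_t @ X_t.T + lam * np.eye(b),
        r_t - mu * (X_t @ phi_prev),
    )
    phi = mu * phi_prev + correction             # momentum + projected increment
    return theta - lr * phi, phi
```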
Adaptive blending of natural gradient and Euclidean gradient directions (e.g., AsymptoticNG, spherical or linear interpolation) combats overfitting and poor generalization, enabling dynamic transitions between second- and first-order behavior (Tang et al., 2020).
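A minimal sketch of the blending idea, assuming a simple linear interpolation schedule (the annealing policy for `alpha` is an assumption, not the AsymptoticNG recipe):

```python
import numpy as np

def blended_direction(nat_grad, grad, alpha):
    """Linearly interpolate between the natural-gradient and Euclidean
    gradient directions. alpha = 1 gives pure (subsampled) NGD behavior,
    alpha = 0 plain gradient descent; annealing alpha toward 0 transitions
    the optimizer from second-order to first-order updates."""
    return alpha * np.asarray(nat_grad) + (1.0 - alpha) * np.asarray(grad)
```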
4. Convergence Rates and Theoretical Guarantees
Recent theoretical analyses establish explicit rates for SNGD under strong convexity and idealized linear model assumptions:
- For consistent linear least-squares, the expected error satisfies $\mathbb{E}\,\|\theta_t - \theta^\ast\|^2 \le (1-\rho)^t\,\|\theta_0 - \theta^\ast\|^2$ with $\rho = \lambda_{\min}(\mathbb{E}[P_t])$ (minimum eigenvalue of the expected projection), guaranteeing geometric (fast) convergence (Goldshlager et al., 28 Aug 2025).
- Accelerated variants (e.g., SPRING) achieve square-root speedups, with error contractions of the form $(1-\sqrt{\rho}\,)^t$ in place of $(1-\rho)^t$.
- For general quadratic losses, fast rates require strong consistency: optimal parameters must lie in the function space spanned by the Jacobian, and range alignment conditions must hold.
When the assumptions are relaxed or losses are non-quadratic, robust convergence is still typically observed—though rates may slow and careful regularization is required.
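The geometric rate in the consistent least-squares case is easy to observe numerically. The sketch below runs repeated regularized block Kaczmarz (i.e., subsampled natural gradient) steps on synthetic data with illustrative batch size and damping; the printed error should decay roughly like $(1-\rho)^t$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, b, lam, eta = 10, 200, 20, 1e-3, 1.0
X = rng.normal(size=(N, n))
theta_star = rng.normal(size=n)
y = X @ theta_star                       # consistent system: exact solution exists

theta = np.zeros(n)
for t in range(51):
    idx = rng.choice(N, size=b, replace=False)       # subsample a mini-batch
    X_t, y_t = X[idx], y[idx]
    r_t = X_t @ theta - y_t
    theta -= eta * X_t.T @ np.linalg.solve(X_t @ X_t.T + lam * np.eye(b), r_t)
    if t % 10 == 0:
        print(t, np.linalg.norm(theta - theta_star))  # geometric decay in t
```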
5. Structured and Hierarchical SNGD Architectures
Scalability in deep neural networks is achieved by hierarchical decomposition:
- Local Fisher Layers: Each network layer computes its local Fisher matrix $F_\ell$ and applies “whitening” transforms to the parameters, with tractable inversion via iterative or matrix square-root methods (Liu et al., 10 Dec 2024).
- Parameter Space Transformations: Theoretical equivalence is established between NGD in the original space and fast GD in the transformed (“whitened”) space $\tilde{\theta} = F^{1/2}\theta$ (Liu et al., 10 Dec 2024); a sketch follows this list.
- Group Structures: Restricting parameterizations (block triangular, Heisenberg, Toeplitz) ensures second-order updates are structure-preserving, invariant, and computationally efficient (Lin et al., 2021, Lin et al., 2021).
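The equivalence in the parameter-space-transformation bullet can be illustrated for a fixed, known curvature matrix (a simplification; in practice the Fisher factors are estimated per layer and updated during training). A Cholesky factor is used as the square root here.

```python
import numpy as np

rng = np.random.default_rng(3)
n, eta = 6, 0.1
M = rng.normal(size=(n, n))
F = M @ M.T + np.eye(n)                 # fixed SPD stand-in for the Fisher
g = rng.normal(size=n)                  # gradient at the current point

# NGD step in the original parameter space.
step_ngd = -eta * np.linalg.solve(F, g)

# Plain GD in whitened coordinates theta_tilde = L.T @ theta, with F = L @ L.T.
L = np.linalg.cholesky(F)
g_tilde = np.linalg.solve(L, g)         # chain rule: gradient w.r.t. theta_tilde
step_tilde = -eta * g_tilde
step_back = np.linalg.solve(L.T, step_tilde)   # map the step back to theta space

print(np.allclose(step_ngd, step_back))  # True: whitened GD reproduces NGD
```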
These architectural choices directly facilitate SNGD’s practical deployment for large-scale, deep, and structured models.
6. Connections to Bayesian Sampling, Generalizations, and Control
SNGD’s stochastic updates correspond to posterior sampling under suitable temperature scaling: when the noise temperature (set jointly by the learning rate, dataset size, and mini-batch size) equals one, the iterates yield Laplace-approximated posterior samples (Smith et al., 2018). Correction terms ensure global parameterization invariance and implicit Jeffreys prior weighting when the Fisher matrix varies with parameters.
Generalizations of SNGD employ pullback metrics from reference Riemannian manifolds, allowing flexible metric design that better matches the objective’s geometry and improves convergence properties (Dong et al., 2022). In control and trajectory optimization, SNGD links covariance-based Fisher preconditioning to adaptive, contractive controller design under explicit stability constraints (Esmzad et al., 8 Mar 2025).
Recent results show any effective learning rule can be written as NGD with respect to a symmetric positive-definite metric, even in the subsampled setting; optimal metrics can be constructed to minimize the condition number and improve robustness to gradient noise (Shoji et al., 24 Sep 2024).
7. Practical Performance, Limitations, and Future Directions
Empirical evaluations demonstrate SNGD’s acceleration and scaling advantages across multiple domains:
- Fast convergence and improved generalization in deep learning (ResNets, VGG, transformers) and physics-informed neural networks (Liu et al., 10 Dec 2024, Guzmán-Cordero et al., 17 May 2025).
- Superior error and training throughput, with reported speedups relative to baseline NGD and KFAC for PDE-constrained learning (Guzmán-Cordero et al., 17 May 2025).
- Robustness to high-dimensionality via efficient sketching and blockwise curvature approximations (Shrestha, 2023, Ren et al., 2019).
Challenges include conditioning the empirical FIM (batch size sensitivity), tuning hyperparameters for stability, and accuracy degradation under over-aggressive randomization or subsampling. Adaptive momentum, dynamic metric selection, and structure-preserving design remain active areas of research (Guzmán-Cordero et al., 17 May 2025, Goldshlager et al., 28 Aug 2025).
Promising future directions entail extending theory to broader loss landscapes, improving adaptive sketching strategies, developing black-box robust SNGD variants, and integrating SNGD into control, scientific computing, and reinforcement learning pipelines.
In summary, Subsampled Natural Gradient Descent leverages statistical manifold geometry with scalable and curvature-aware updates derived from subsampled data, structured approximations, and efficient matrix factorizations. Its convergence is theoretically well-understood for quadratic losses, its architectures exploit blockwise and hierarchical Fisher estimates, and its empirical performance rivals or exceeds legacy first- and second-order methods for large-scale optimization tasks. SNGD unifies geometric, stochastic, and randomized linear algebra techniques into a flexible class of optimization algorithms for contemporary scientific machine learning.