Natural Gradient Methods Overview

Updated 7 June 2026

Natural gradient methods are optimization algorithms that use information geometry, typically via the Fisher information, to adjust descent directions.
They generalize gradient descent by introducing a Riemannian metric, achieving parameterization invariance and faster convergence compared to traditional first-order methods.
Scalable approximations such as Kronecker-factored and block-diagonal methods enable practical application in deep learning, reinforcement learning, and PDE-constrained optimization.

Natural gradient methods are a class of optimization algorithms that generalize gradient descent by accounting for the geometry induced by a chosen similarity or divergence measure on a parameterized family of probability distributions. By replacing the standard Euclidean structure of parameter space with a Riemannian metric pulled back from the distributional geometry (typically the Fisher information induced by the Kullback–Leibler divergence), natural gradient approaches produce preconditioned updates that are invariant to reparametrization to first order and frequently accelerate convergence relative to standard first-order methods. These techniques also extend beyond the Fisher metric, linking optimization with wider families of Riemannian and Finslerian metrics derived from variational and information-theoretic principles.

1. Mathematical Framework and Metric Construction

Let $\mathcal{P}$ denote a family of probability densities on a sample space $X$, parameterized by a smooth map $\theta \mapsto p_\theta$ for $\theta \in \Theta \subset \mathbb{R}^n$. Natural gradient methods start from a twice-differentiable, positive-definite similarity or cost measure $S: \mathcal{P} \times \mathcal{P} \to \mathbb{R}_{\geq 0}$, such as the Kullback–Leibler divergence or a general $f$-divergence. The intrinsic geometry is defined by the metric tensor $G(\theta)$:

$G_{ij}(\theta) = \left. \frac{\partial^2}{\partial\theta'_i \partial\theta'_j} S(p_\theta \Vert p_{\theta'}) \right|_{\theta' = \theta}$

The metric $G(\theta)$, which must be positive definite, encodes the local curvature of the chosen divergence around $p_\theta$, rather than the curvature of the loss itself. For the KL case, this recovers the Fisher information matrix:

$I_{ij}(\theta) = \mathbb{E}_{p_\theta} [\partial_i \log p_\theta(X)\, \partial_j \log p_\theta(X)]$
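As a small illustration of the Fisher formula above, the following sketch evaluates the expectation exactly for a categorical model with softmax logits, where the score is $\partial_\theta \log p_\theta(x) = e_x - p$ and the closed form is $\mathrm{diag}(p) - p p^\top$. The model choice and the function names (`softmax`, `fisher_from_scores`) are assumptions made for this example, not part of the cited works.

```python
import numpy as np

def softmax(theta):
    """Numerically stable softmax mapping logits to probabilities."""
    e = np.exp(theta - np.max(theta))
    return e / e.sum()

def fisher_from_scores(theta):
    """Exact Fisher matrix: sum over outcomes x of p(x) * score_x score_x^T."""
    p = softmax(theta)
    n = p.size
    fisher = np.zeros((n, n))
    for x in range(n):
        score = -p.copy()      # gradient of log p_theta(x) w.r.t. the logits
        score[x] += 1.0        # equals e_x - p
        fisher += p[x] * np.outer(score, score)
    return fisher

theta = np.array([0.2, -1.0, 0.5])   # illustrative logits
p = softmax(theta)
I_scores = fisher_from_scores(theta)
I_closed = np.diag(p) - np.outer(p, p)
```

In this parameterization the Fisher matrix is only positive semidefinite, because adding a constant to all logits leaves $p_\theta$ unchanged. This is one reason practical implementations add damping or use a pseudo-inverse.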

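Natural gradient descent seeks direction vectors $\delta\theta$ in parameter space that decrease the objective $L(\theta)$ most rapidly, subject to a fixed-size step in the metric induced by $S$. The steepest descent direction is

$\delta\theta^{*} \propto -G(\theta)^{-1} \nabla_\theta L(\theta)$

and the update is $\theta_{t+1} = \theta_t - \eta\, G(\theta_t)^{-1} \nabla_\theta L(\theta_t)$, with step size $\eta > 0$ (Mallasto et al., 2019).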
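A minimal sketch of the update $\theta \leftarrow \theta - \eta\, G(\theta)^{-1} \nabla_\theta L(\theta)$, assuming a one-dimensional Gaussian model $N(\mu, \sigma^2)$ parameterized by $(\mu, \log\sigma)$ and fitted by expected negative log-likelihood to data with mean `m` and variance `v`. The Fisher matrix in this parameterization is $\mathrm{diag}(1/\sigma^2, 2)$. The helper names, the target values, and the optional `damping` argument are illustrative assumptions.

```python
import numpy as np

def natural_gradient_step(theta, grad, metric, lr, damping=0.0):
    """One step theta - lr * (G + damping*I)^{-1} grad, via a linear solve."""
    G = metric + damping * np.eye(theta.size)
    return theta - lr * np.linalg.solve(G, grad)

def loss_grad(theta, m, v):
    """Gradient of the expected NLL  s + (v + (m - mu)^2) / (2 e^{2s})."""
    mu, s = theta
    var = np.exp(2.0 * s)
    second_moment = v + (m - mu) ** 2
    return np.array([-(m - mu) / var, 1.0 - second_moment / var])

def fisher(theta):
    """Fisher matrix of N(mu, sigma^2) in the coordinates (mu, log sigma)."""
    _, s = theta
    return np.diag([np.exp(-2.0 * s), 2.0])

m, v = 3.0, 4.0                      # illustrative data mean and variance
theta = np.array([0.0, 0.0])         # start at mu = 0, sigma = 1
for _ in range(200):
    theta = natural_gradient_step(theta, loss_grad(theta, m, v), fisher(theta), lr=0.5)
```

Solving the linear system instead of forming $G^{-1}$ explicitly is the usual choice for numerical stability. In this example the iterates approach $\mu = 3$ and $\log\sigma = \log 2$.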

Generalizations to other similarity measures, such as $f$-divergences or Wasserstein distances, yield new metric tensors (e.g., scaled Fisher or operator-valued Finsler metrics) and thus new families of natural-gradient updates, as in the Otto–Wasserstein geometry for $W_2$ (Mallasto et al., 2019).

2. Connections to Second-Order Optimization and Invariance Properties

Natural gradient methods sit between first-order and Newton's methods. In standard Newton optimization, second-order information is taken from the Hessian of the objective. By contrast, natural gradient methods precondition using the Hessian (second derivative) of the chosen divergence or similarity measure, not the loss function.

Formally, the methods interpolate between:

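First-order (Euclidean) descent: $\theta_{t+1} = \theta_t - \eta\, \nabla_\theta L(\theta_t)$
Newton's method: $\theta_{t+1} = \theta_t - \eta\, [\nabla_\theta^2 L(\theta_t)]^{-1} \nabla_\theta L(\theta_t)$
Natural gradient: $\theta_{t+1} = \theta_t - \eta\, G(\theta_t)^{-1} \nabla_\theta L(\theta_t)$, where $G(\theta_t)$ is the Hessian of $S$ at $\theta_t$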

This distinction yields several important mathematical and practical characteristics:

Parameterization invariance: Because the metric is induced from the distributional geometry, natural gradient steps are invariant (at first order) to reparameterizations of $\theta$. Under any smooth invertible transformation, the trajectory in model distribution space is unchanged (Martens, 2014).
Connection to generalized Gauss–Newton methods: In exponential family models, the Fisher information coincides with the Generalized Gauss–Newton (GGN) matrix, providing a robust Hessian substitute that is always positive semidefinite and more stable than the true Hessian (Martens, 2014, Shrestha, 2023).
Trust-region/proximal equivalence: Natural gradient steps can be derived as trust-region or proximal updates under the chosen metric, with quadratic constraint $\tfrac{1}{2}\, \delta\theta^\top G(\theta)\, \delta\theta \leq \epsilon$ (Mallasto et al., 2019).
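The invariance property in the list above can be checked numerically on a toy case. The sketch below assumes a Bernoulli model written in two coordinate systems, the probability $p$ and the logit $z = \log(p/(1-p))$, with a cross-entropy loss toward an illustrative target of 0.9. It compares the natural-gradient and plain-gradient directions after mapping both to $p$-coordinates. All variable names are hypothetical.

```python
import numpy as np

target = 0.9                          # illustrative target probability

def dloss_dp(p):
    """Derivative of the cross-entropy -[t log p + (1 - t) log(1 - p)] w.r.t. p."""
    return -target / p + (1.0 - target) / (1.0 - p)

p = 0.3
dp_dz = p * (1.0 - p)                 # Jacobian of p = sigmoid(z)

# Coordinates p: Fisher information of a Bernoulli is 1 / (p (1 - p)).
fisher_p = 1.0 / (p * (1.0 - p))
grad_p = dloss_dp(p)
nat_p = -grad_p / fisher_p
plain_p = -grad_p

# Coordinates z (logit): Fisher information is p (1 - p).
fisher_z = p * (1.0 - p)
grad_z = grad_p * dp_dz               # chain rule
nat_z = -grad_z / fisher_z
plain_z = -grad_z

# Push the z-space directions forward to p-coordinates for comparison.
nat_z_in_p = dp_dz * nat_z
plain_z_in_p = dp_dz * plain_z
```

Under these assumptions the two natural-gradient directions coincide once expressed in the same coordinates (both equal `target - p`), while the plain gradient directions differ by a factor of $(p(1-p))^2$.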

3. Practical Algorithmic Realizations and Large-Scale Approximations

Direct computation and inversion of full Fisher metrics or general pull-back metrics is infeasible for large-scale models. The field has therefore developed a suite of scalable approximations:

Approximation	Memory	Update Complexity	Notes
Full matrix	$O(n^2)$	$O(n^3)$	Infeasible; only for small $n$ (Shrestha, 2023)
Diagonal	$O(n)$	$O(n)$	Basis for Adam, RMSProp (Shrestha, 2023)
Block-diagonal	$O(\sum_b n_b^2)$	$O(\sum_b n_b^3)$	Natural per-layer updates with block sizes $n_b$; practical for medium networks (Yang et al., 2020)
Kronecker-factored (KFAC)	$O(\sum_l (d_{\mathrm{in},l}^2 + d_{\mathrm{out},l}^2))$	$O(\sum_l (d_{\mathrm{in},l}^3 + d_{\mathrm{out},l}^3))$	Standard in deep learning; robust curvature estimates (Shrestha, 2023, Yang et al., 2021)
Low-rank+diag	$O(nk)$	$O(nk^2)$	TONGA and variants, with rank $k$ (Shrestha, 2023)
Sketch-based	$O(ns)$	$O(ns^2)$	SENG, with sketch size $s$; scalable to very wide or convolutional nets (Yang et al., 2020)
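To illustrate why the Kronecker-factored row is cheap, the sketch below assumes a single fully connected layer whose Fisher block is approximated as $G \otimes A$, with $A$ the second moment of layer inputs and $G$ the second moment of back-propagated output gradients. With row-major flattening, $(G \otimes A)^{-1} \mathrm{vec}(V) = \mathrm{vec}(G^{-1} V A^{-1})$, so only the two small factors are ever solved against. The random data, the shapes, and adding `damping` to each factor separately are simplifying assumptions, not the exact scheme of any cited method.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 64, 5, 3
a = rng.normal(size=(batch, d_in))    # layer inputs (illustrative data)
g = rng.normal(size=(batch, d_out))   # gradients w.r.t. layer outputs
damping = 1e-2

A = a.T @ a / batch + damping * np.eye(d_in)     # input factor, d_in x d_in
G = g.T @ g / batch + damping * np.eye(d_out)    # output factor, d_out x d_out
grad_W = g.T @ a / batch                         # weight gradient, d_out x d_in

def kfac_precondition(grad_W, A, G):
    """Return G^{-1} grad_W A^{-1} using two small solves (A is symmetric)."""
    left = np.linalg.solve(G, grad_W)
    return np.linalg.solve(A, left.T).T

nat_W = kfac_precondition(grad_W, A, G)

# Reference: solve against the explicit Kronecker product (only viable when tiny).
dense = np.linalg.solve(np.kron(G, A), grad_W.ravel()).reshape(d_out, d_in)
```

The factored solve costs on the order of $d_{\mathrm{in}}^3 + d_{\mathrm{out}}^3$ operations, whereas the dense reference works with a $(d_{\mathrm{in}} d_{\mathrm{out}}) \times (d_{\mathrm{in}} d_{\mathrm{out}})$ matrix.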

Efficient implementation relies on block or layerwise factorization, randomized sketching, or exploitation of structural properties such as Kronecker form (Yang et al., 2021, Yang et al., 2020). Line search, adaptive damping (e.g., Levenberg–Marquardt), or trust-region logic is usually incorporated to ensure stability and robust convergence (Martens, 2014, Ren et al., 2019).
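The adaptive damping mentioned above is often implemented with a Levenberg–Marquardt style rule driven by the reduction ratio, i.e. the actual change in loss divided by the change predicted by the quadratic model. The sketch below is one such heuristic under stated assumptions: the thresholds 0.25 and 0.75, the factor 1.5, and the function names are illustrative choices rather than values taken from the cited works.

```python
import numpy as np

def damped_step(grad, metric, damping):
    """Step -(G + damping*I)^{-1} grad."""
    return -np.linalg.solve(metric + damping * np.eye(grad.size), grad)

def reduction_ratio(loss_fn, theta, step, grad, metric):
    """Actual loss change divided by the change predicted by the quadratic model."""
    predicted = grad @ step + 0.5 * step @ metric @ step
    actual = loss_fn(theta + step) - loss_fn(theta)
    return actual / predicted

def adjust_damping(damping, rho, factor=1.5):
    """Shrink damping when the model is trusted, grow it when it is not."""
    if rho > 0.75:
        return damping / factor
    if rho < 0.25:
        return damping * factor
    return damping

# Toy quadratic loss whose Hessian doubles as the metric (illustrative only).
H = np.array([[3.0, 0.5], [0.5, 1.0]])
loss_fn = lambda th: 0.5 * th @ H @ th
theta = np.array([1.0, -2.0])
grad = H @ theta
damping = 1.0

step = damped_step(grad, H, damping)
rho = reduction_ratio(loss_fn, theta, step, grad, H)
new_damping = adjust_damping(damping, rho)
```

On this toy problem the quadratic model is exact, so the ratio equals one and the damping is reduced. On a real network the ratio fluctuates and the rule adapts the effective trust region accordingly.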

Empirical and theoretical analyses confirm that natural gradient methods—when equipped with such scalable approximations—yield rapid convergence in training deep neural networks and structured models (Shrestha, 2023, Yang et al., 2021, Yang et al., 2020).

4. Extensions: Riemannian, Quantum, Wasserstein, and Non-Classical Metrics

Recent progress has further generalized natural-gradient methods along several axes:

Riemannian/Manifold-valued parameterizations: For models whose parameters reside on matrix manifolds (e.g., Stiefel, Grassmann, positive-definite), generalized "Riemannian Natural Gradient" methods use the Riemannian Fisher Information Matrix—constructed with tangent-space gradients and manifold retractions—yielding provable almost-sure convergence and locally (super)linear rates under curvature and stability assumptions (Hu et al., 2022, Hu et al., 2023).

Quantum natural gradient: In quantum parameter estimation and variational quantum algorithms, the quantum Fisher information (as SLD or via Petz operator monotone functionals) underpins the “quantum natural gradient.” Use of non-monotone Petz functions yields updates with improved empirical convergence over traditional SLD-based metrics, challenging long-held physical assumptions about monotonicity (Sasaki et al., 2024).

Wasserstein and Finsler geometry: Pulling back the 2-Wasserstein geometry to parameter space yields "Wasserstein natural gradient" methods, equipping statistical models with non-Fisher, often Finslerian, preconditioning that can encode non-local or transport-based similarity (Arbel et al., 2019, Nurbekyan et al., 2022). These formulations support kernelized or adjoint-based solvers to enable practical computation even for large and implicit models.

Generalized similarity measures and manifold transfer: Given an arbitrary divergence $S$, the formalism of Mallasto et al. (2019) provides explicit recipes for constructing the natural-gradient direction, including for divergence forms arising in Bregman, mirror descent, or more abstract function spaces.
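As a concrete, low-dimensional illustration of building a metric from a chosen divergence, the sketch below takes the Hessian of a divergence in its second argument by central finite differences, for a one-dimensional Gaussian with parameters $(\mu, \sigma)$. With the KL divergence this gives the Fisher matrix $\mathrm{diag}(1/\sigma^2, 2/\sigma^2)$, while half the squared 2-Wasserstein distance gives the identity. The finite-difference helper and its step size `h` are assumptions for illustration. Practical methods use analytic or automatic differentiation.

```python
import numpy as np

def kl_gauss(theta, theta_p):
    """KL( N(mu, sigma^2) || N(mu', sigma'^2) ) in closed form."""
    mu, sigma = theta
    mu_p, sigma_p = theta_p
    return np.log(sigma_p / sigma) + (sigma**2 + (mu - mu_p)**2) / (2.0 * sigma_p**2) - 0.5

def half_w2_sq(theta, theta_p):
    """Half the squared 2-Wasserstein distance between two 1-D Gaussians."""
    mu, sigma = theta
    mu_p, sigma_p = theta_p
    return 0.5 * ((mu - mu_p)**2 + (sigma - sigma_p)**2)

def metric_from_divergence(div, theta, h=1e-4):
    """Finite-difference Hessian of theta' -> div(theta, theta') at theta' = theta."""
    n = theta.size
    G = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            G[i, j] = (div(theta, theta + ei + ej) - div(theta, theta + ei - ej)
                       - div(theta, theta - ei + ej) + div(theta, theta - ei - ej)) / (4.0 * h * h)
    return G

theta = np.array([0.5, 2.0])                      # mu = 0.5, sigma = 2 (illustrative)
G_kl = metric_from_divergence(kl_gauss, theta)    # approx. diag(1/sigma^2, 2/sigma^2)
G_w2 = metric_from_divergence(half_w2_sq, theta)  # approx. identity
```

The two metrics precondition the same gradient differently: the Fisher metric rescales steps by $\sigma^2$, whereas the Wasserstein metric in these coordinates leaves the Euclidean gradient unchanged.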

5. Applications and Empirical Behavior Across Domains

Deep learning: In neural networks, natural gradient methods and their approximations (block-diagonal, KFAC, SENG, NG+) are reported to outperform first-order methods such as SGD and Adam, especially in ill-conditioned or large-batch regimes, with faster progress per iteration and less sensitivity to hyperparameter choices (Yang et al., 2021, Yang et al., 2020, Lu et al., 2025). Empirical results on tasks including ImageNet classification, quantum chemistry, machine translation, and recommendation systems show wall-clock and data efficiency gains over SGD and momentum methods.

Reinforcement learning: Natural policy gradient (NPG) methods, which use the Fisher matrix of the policy distribution averaged over trajectories, are provably parameterization-invariant and under standard assumptions attain linear or quadratic convergence rates, with improved sample efficiency over standard PG (Müller et al., 2022, Liu et al., 2022). Recent variance-reduced NPG variants further enhance sample complexity and global convergence guarantees (Liu et al., 2022).

PDE-constrained and physics-informed learning: Natural gradient formulations extend to inverse problems and PDE-constrained optimization, offering superior performance in nonconvex settings where standard gradients stall (Nurbekyan et al., 2022).

Probabilistic models and variational inference: Surrogate natural-gradient approaches, which apply the update in a tractable surrogate manifold, enable efficient optimization in settings with intractable or singular Fisher matrices (So et al., 2023), supporting variational inference and maximum likelihood estimation.
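For the reinforcement-learning case described above, a minimal sketch of a natural policy gradient step can be written for a softmax policy on a multi-armed bandit, which is an assumption chosen to keep the example tiny. The policy Fisher matrix $\mathrm{diag}(\pi) - \pi\pi^\top$ is singular, so the sketch uses a symmetric pseudo-inverse. The helper `pinv_sym`, the reward values, and the step size are all hypothetical.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def pinv_sym(mat, tol=1e-10):
    """Pseudo-inverse of a symmetric PSD matrix via its eigendecomposition."""
    w, v = np.linalg.eigh(mat)
    keep = w > tol
    inv_w = np.where(keep, 1.0 / np.where(keep, w, 1.0), 0.0)
    return (v * inv_w) @ v.T

def npg_direction(theta, rewards):
    """Natural policy gradient direction F^+ grad J for a softmax bandit policy."""
    pi = softmax(theta)
    advantage = rewards - pi @ rewards
    grad = pi * advantage                       # gradient of J = sum_a pi_a r_a
    fisher = np.diag(pi) - np.outer(pi, pi)     # Fisher matrix of the policy
    return pinv_sym(fisher) @ grad

rewards = np.array([1.0, 0.2, -0.5])            # illustrative arm rewards
theta = np.zeros(3)
direction = npg_direction(theta, rewards)       # direction at the uniform policy
for _ in range(10):
    theta = theta + 0.5 * npg_direction(theta, rewards)
pi_final = softmax(theta)
```

In this setting the natural-gradient direction equals the centered reward vector, independent of the current policy, whereas the plain gradient is scaled by the action probabilities and slows down for rarely chosen arms.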

6. Methodological Interrelations and Theoretical Results

Natural gradient methods unify and generalize several families of optimization algorithms:

Trust-Region and Proximal Methods: Both can be interpreted as constrained or penalized natural-gradient adaptations with the Euclidean metric replaced by the Riemannian metric induced from the chosen divergence (Mallasto et al., 2019).
Mirror Descent: Replaces the quadratic penalty with a Bregman divergence, recovering natural gradient as the Riemannian limit of the more general mirror descent formalism (Mallasto et al., 2019).
Newton and Hessian-free methods: If one uses the Hessian of the objective instead of the similarity measure, Newton's method is obtained, with natural gradient forming an intermediate between first- and second-order strategies (Martens, 2014, Mallasto et al., 2019).
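The proximal interpretation in the list above can be made explicit with a short derivation. It is a sketch under the assumptions that $S$ is smooth, vanishes on the diagonal, and has the positive-definite Hessian $G(\theta_t)$ there. The penalty weight $1/\eta$ is a notational choice made here.

```latex
\begin{align*}
\theta_{t+1} &= \arg\min_{\theta} \Big\{ \langle \nabla_\theta L(\theta_t),\, \theta - \theta_t \rangle
  + \tfrac{1}{\eta}\, S(p_{\theta_t} \Vert p_{\theta}) \Big\}, \\
S(p_{\theta_t} \Vert p_{\theta_t + \delta}) &= \tfrac{1}{2}\, \delta^\top G(\theta_t)\, \delta + O(\|\delta\|^3)
  \quad \text{(zeroth- and first-order terms vanish at the minimum } \delta = 0\text{)}, \\
0 &= \nabla_\theta L(\theta_t) + \tfrac{1}{\eta}\, G(\theta_t)\, \delta
  \;\;\Longrightarrow\;\; \delta = -\eta\, G(\theta_t)^{-1} \nabla_\theta L(\theta_t).
\end{align*}
```

Replacing $S$ by the squared Euclidean distance recovers ordinary gradient descent, and keeping a Bregman divergence without the quadratic approximation gives mirror descent.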

Theoretical convergence guarantees support these methods. Stochastic and deterministic variants are shown to converge to stationary points, with linear rates under strong convexity and quadratic rates under smoothness and Jacobian stability, and these results extend to manifold and non-Euclidean settings (Hu et al., 2022, Yang et al., 2021, Arbel et al., 2019). In policy gradient settings, the reported sample-efficiency and global-optimality guarantees improve on those for standard PG/SGD (Liu et al., 2022).

7. Recent Advancements and Limitations

Recent innovations include techniques for robustification and scalability:

Variance-aware and orthogonal projection schemes: For large-batch stochastic settings, Fisher-Orthogonal Projection (FOP) and related corrections inject curvature-aware gradient variability lost in pure averaging, stabilizing convergence at extreme batch sizes where first- and second-order methods otherwise degrade (Lu et al., 2025).
Gradient-regularized variants: Explicit addition of gradient-norm penalties to the Fisher geometry (Gradient-Regularized Natural Gradients) improves both generalization and numerical conditioning, outperforming standard and adaptive first-order methods in vision and language benchmarks (Dash et al., 2026).
Decentralized and parallel natural gradient: Algorithms with Kronecker-product RFIM approximations and decentralized consensus protocols extend geometric optimization to federated and distributed settings (Hu et al., 2023).

However, several challenges remain active areas of research: stability under ill-conditioning, generalization performance (as opposed to training-loss convergence alone), the cost of metric inversion at extreme scale, hyperparameter tuning for damping and regularization, and the choice of a metric that matches the data and model geometry (Shrestha, 2023, Dash et al., 2026).

References: