
Natural Gradient Descent

Updated 5 December 2025
  • Natural Gradient Descent is a second-order optimization method that leverages Riemannian geometry to navigate the parameter manifold for faster convergence.
  • It replaces the Euclidean steepest descent direction with one derived from problem-adaptive metrics, such as the Fisher Information Matrix, ensuring invariant updates.
  • Applications span deep learning, quantum variational optimization, and control systems, with scalable approximations like block-diagonal and Kronecker-factored methods.

Natural Gradient Descent (NGD) is a second-order optimization methodology in which the search for minima proceeds along directions determined by the local geometry of the underlying parameter manifold, not merely by the Euclidean metric. Unlike ordinary gradient descent, NGD adapts to the statistical or physical structure of the model, yielding more efficient convergence properties in areas such as deep learning, quantum optimization, control, and variational inference. The foundational principle is to replace the Euclidean steepest-descent direction with the direction of steepest descent as measured by a problem-adapted Riemannian metric, typically induced by the Fisher Information or an analogous quadratic form (Dong et al., 2022).

1. Geometric Foundations and Generalization of the Natural Gradient

The central concept underlying NGD is steepest descent on a Riemannian manifold. Given a cost function $L: X \to \mathbb{R}$, where $X$ is a parameter manifold, NGD introduces a reference manifold $Y$ with a smooth map $f: X \to Y$ and a positive-definite metric $G_Y$ on $Y$. The induced metric on $X$ is the pull-back

$$G_X(\theta) = J_f(\theta)^\top G_Y(f(\theta))\, J_f(\theta)$$

where $J_f(\theta)$ is the Jacobian of the map $f$ at $\theta$ (Dong et al., 2022). The update direction is then

$$d\theta = -G_X(\theta)^{-1} \nabla_\theta L(\theta)$$

and the next iterate is

$$\theta_{t+1} = \theta_t - \eta\, G_X(\theta_t)^{-1} \nabla_\theta L(\theta_t)$$

This framework generalizes classical NGD—recovering it when $Y$ is a statistical manifold endowed with the Fisher–Rao metric—while enabling a vast spectrum of problem-adaptive metrics (Dong et al., 2022).
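As a concrete illustration, the following minimal numpy sketch applies this generalized update to a toy nonlinear least-squares problem: the map $f$ sends two parameters to the vector of model predictions, $G_Y$ is the Euclidean metric on the output space, and the pull-back $G_X = J_f^\top J_f$ reduces to the Gauss–Newton matrix. The model, synthetic data, step size, and damping are illustrative assumptions, not taken from the cited works.

```python
import numpy as np

# Minimal sketch (illustrative model, synthetic data): fit y ≈ a * exp(b * x).
# f: X -> Y maps parameters (a, b) to the prediction vector; G_Y is Euclidean,
# so the pull-back G_X = J^T J coincides with the Gauss–Newton matrix.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y_obs = 2.0 * np.exp(-1.5 * x) + 0.01 * rng.normal(size=x.size)

def f(theta):                               # the map f: X -> Y
    a, b = theta
    return a * np.exp(b * x)

def jacobian(theta):                        # J_f(theta), shape (len(x), 2)
    a, b = theta
    return np.stack([np.exp(b * x), a * x * np.exp(b * x)], axis=1)

theta = np.array([1.0, -1.0])
eta, damping = 0.5, 1e-6                    # illustrative step size and damping
for _ in range(100):
    residual = f(theta) - y_obs             # L(theta) = 0.5 * ||residual||^2
    J = jacobian(theta)
    grad = J.T @ residual                   # Euclidean gradient of L
    G_X = J.T @ J + damping * np.eye(2)     # pull-back metric (damped for stability)
    theta -= eta * np.linalg.solve(G_X, grad)

print(theta)                                # approaches roughly (2.0, -1.5)
```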

2. Information Geometry: Fisher Metric and Problem-Adaptive Metrics

Classical NGD is defined in terms of the Fisher Information Matrix (FIM), which for distributional models is

$$F(\theta) = \mathbb{E}_{x \sim p(\theta)}\left[\nabla_\theta \log p(x|\theta)\, \nabla_\theta \log p(x|\theta)^\top\right]$$

The update is

$$\Delta\theta = -\eta\, F(\theta)^{-1} \nabla_\theta L(\theta)$$

which is exactly the steepest descent in the KL-divergence geometry (Martens, 2014). The Fisher metric coincides with the Generalized Gauss–Newton (GGN) matrix in many exponential-family scenarios and serves as a robust surrogate for the Hessian when the latter is indefinite or ill-conditioned (Shrestha, 2023). NGD’s invariance under reparameterization is mathematically guaranteed, as shown by explicit chain-rule computations (Kerekes et al., 2021, Martens, 2014).
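A minimal Monte Carlo sketch of this update for a toy softmax-regression model is given below: the Fisher is estimated by sampling labels from the model's own predictive distribution, and a small damping term keeps the estimated matrix invertible. Problem sizes, step size, and damping are illustrative assumptions.

```python
import numpy as np

# Toy softmax-regression model with a Monte Carlo Fisher estimate.
rng = np.random.default_rng(0)
N, D, C = 256, 5, 3                          # samples, features, classes
X = rng.normal(size=(N, D))
y = rng.integers(0, C, size=N)               # synthetic labels
W = np.zeros((D, C))

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def score(x_n, label, p_n):
    """Gradient of log p(label | x_n) w.r.t. vec(W)."""
    return np.outer(x_n, np.eye(C)[label] - p_n).ravel()

eta, damping = 0.5, 1e-3
for _ in range(50):
    P = softmax(X @ W)
    # Loss gradient: average negative score at the observed labels.
    g = -np.mean([score(X[n], y[n], P[n]) for n in range(N)], axis=0)
    # Monte Carlo Fisher: scores at labels sampled from the model itself.
    y_model = np.array([rng.choice(C, p=P[n]) for n in range(N)])
    S = np.stack([score(X[n], y_model[n], P[n]) for n in range(N)])
    F = S.T @ S / N + damping * np.eye(D * C)
    W -= eta * np.linalg.solve(F, g).reshape(D, C)
```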

In generalized settings, the reference metric $G_Y$ may be chosen as a Hessian, Fubini–Study, Wasserstein, or other problem-adapted geometric tensor (Dong et al., 2022, Nurbekyan et al., 2022, Yao et al., 2021), leading to improved convergence and optimization landscapes.

3. Computational Strategies and Structured Approximations

The principal bottleneck in NGD is the formation and inversion of large Fisher or metric tensors, which for modern neural networks scale as $O(p^2)$ in memory and $O(p^3)$ in compute for $p$ parameters. Contemporary strategies include:

  • Block-diagonal and layer-wise Fisher approximations: Treat the global metric as block-diagonal, enabling cheap local inversion. Component-Wise NGD (CW-NGD) and Kronecker-factored methods (K-FAC) partition the FIM at the layer or even per-component level, exploiting independence and sparsity to reduce computational complexity (Sang et al., 2022, Lin et al., 2021, Izadi et al., 2020); a minimal K-FAC-style sketch follows this list.
  • Structured metrics via local parameterization: Employing matrix-group-based structured parameterizations (e.g., block-triangular, Kronecker, low-rank, Heisenberg subgroups) facilitates tractable inversion and invariance on the structured subspace (Lin et al., 2021, Lin et al., 2021). Local parameter coordinates are mapped through a Jacobian, ensuring nondegeneracy and scalability (Lin et al., 2021).
  • Inverse-free NGD: Fast NGD variants precompute per-sample gradient weights and freeze them, avoiding recurrent matrix inversion after the initial epoch while closely matching performance and accuracy of full NGD (Ou et al., 6 Mar 2024).
  • Hybrid digital–analog computation: Thermodynamic NGD exploits analog processors to solve for $F^{-1}\nabla L$ via equilibrium in Ornstein–Uhlenbeck dynamics, drastically cutting per-iteration wall-clock time and facilitating scaling (Donatella et al., 22 May 2024).
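As referenced in the first item above, the following sketch shows a K-FAC-style update for a single fully connected layer: the layer's Fisher block is approximated by a Kronecker product of the input-activation covariance $A$ and the pre-activation-gradient covariance $G$, so applying the inverse metric reduces to two small solves, $G^{-1}\nabla_W\, A^{-1}$. The random arrays stand in for real activations and backpropagated gradients, and the plain Tikhonov damping is a simplifying assumption; production K-FAC implementations handle damping and averaging more carefully.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, batch = 64, 32, 128
lam = 1e-2                                   # simple Tikhonov damping (assumption)

# Placeholders: layer inputs a_n and backpropagated pre-activation gradients g_n.
a = rng.normal(size=(batch, d_in))
g = rng.normal(size=(batch, d_out))
grad_W = g.T @ a / batch                     # Euclidean gradient, shape (d_out, d_in)

# Kronecker factors of this layer's Fisher block (up to vec convention).
A = a.T @ a / batch + lam * np.eye(d_in)     # input-activation covariance
G = g.T @ g / batch + lam * np.eye(d_out)    # pre-activation-gradient covariance

# Applying the inverse Kronecker-factored metric: G^{-1} grad_W A^{-1}.
nat_grad_W = np.linalg.solve(G, grad_W) @ np.linalg.inv(A)
W_update = -0.01 * nat_grad_W                # learning rate is arbitrary here
```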

Approximation techniques such as diagonal NGD, Kronecker-factored curvature, Woodbury inversions, or conjugate-gradient solvers are critical for scaling NGD to settings with $10^6$–$10^9$ parameters (Shrestha, 2023, Pascanu et al., 2013).
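When even factored inverses are too costly, the damped system $(F + \lambda I)x = \nabla L$ can be solved matrix-free with conjugate gradients, using only Fisher-vector products. The sketch below is a generic CG routine with a dense toy matrix standing in for the Fisher-vector product; it is not tied to any specific cited implementation.

```python
import numpy as np

def conjugate_gradient(fvp, g, damping=1e-3, max_iter=50, tol=1e-8):
    """Solve (F + damping*I) x = g using only Fisher-vector products fvp(v) ≈ F v."""
    x = np.zeros_like(g)
    r = g - (fvp(x) + damping * x)           # residual (x starts at zero, so r = g)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = fvp(p) + damping * p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy check with an explicit SPD matrix standing in for the Fisher.
rng = np.random.default_rng(0)
M = rng.normal(size=(20, 20))
F = M @ M.T / 20
grad = rng.normal(size=20)
nat_grad = conjugate_gradient(lambda v: F @ v, grad)
```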

4. Applications Across Domains

NGD underpins optimization in several research domains:

  • Deep neural networks: NGD yields faster convergence, improved plateau traversal, and robustness to data ordering, outperforming SGD and Adam in both iteration count and generalization on standard benchmarks. Structured variants further accelerate convergence (Pascanu et al., 2013, Sang et al., 2022, Liu et al., 2021).
  • Tensor networks and quantum variational optimization: Pull-back metrics from specialized ansatz spaces enable drastic acceleration in wavefunction optimization and state preparation, outperforming gradient and conjugate-gradient competitors (Dong et al., 2022, Yao et al., 2021).
  • Graph neural networks: K-FAC approximations for NGD improve accuracy, convergence speed, and wall-clock performance in node classification, and are generalizable to semi-supervised settings (Izadi et al., 2020).
  • Control and system design: NGD forms the basis for feedback control synthesis, where the Fisher metric encodes covariance-informed adjustments and enables explicit trajectory shaping with robust stability properties (Esmzad et al., 8 Mar 2025).
  • Variational inference and probabilistic modeling: NGD (CVI, VOGN) exploits exponential-family duality, yielding closed-form, fast updates for moments of Gaussian and more general models, even in high-dimensional regimes, using natural-parameter or precision-parameter representations (Barfoot, 2020, Khan et al., 2018); a minimal Gaussian example follows this list.
  • Optimization in metric spaces and PDE-based settings: NGD formulated as least-squares over generic metrics (Wasserstein, Sobolev) significantly alters convergence properties and escapes local minima in large-scale PDE and physics-informed learning (Nurbekyan et al., 2022).
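To make the variational-inference case concrete (as referenced above), the following 1-D sketch performs NGD on $\mathrm{KL}(q\,\|\,p)$ for a Gaussian $q = \mathcal{N}(\mu, \sigma^2)$, using the closed-form Fisher $\mathrm{diag}(1/\sigma^2,\, 1/(2\sigma^4))$ in $(\mu, \sigma^2)$ coordinates. This shows only the underlying natural-gradient update, not the full CVI or VOGN algorithms; the target, initialization, and step size are illustrative.

```python
import numpy as np

# Target density p = N(mu_star, s2_star); variational family q = N(mu, s2).
mu_star, s2_star = 3.0, 0.5
mu, s2 = 0.0, 4.0                        # initial variational parameters (arbitrary)

eta = 0.1
for _ in range(200):
    # Closed-form gradients of KL(q || p) with respect to (mu, s2).
    d_mu = (mu - mu_star) / s2_star
    d_s2 = 0.5 / s2_star - 0.5 / s2
    # Fisher of N(mu, s2) in (mu, s2) coordinates is diag(1/s2, 1/(2*s2^2)),
    # so the natural gradient rescales the two components as follows.
    mu -= eta * s2 * d_mu
    s2 -= eta * 2.0 * s2**2 * d_s2

print(mu, s2)                            # approaches (mu_star, s2_star)
```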

5. Inductive Bias, Invariance, and Limitations

Natural Gradient Descent is approximately invariant to smooth reparameterizations: the optimization trajectory depends solely on the geometry induced by the metric (Fisher or otherwise), not the coordinates (Kerekes et al., 2021, Martens, 2014). This removes architectural biases—such as margin or sparsity in supervised classifiers—that emerge from parameterization in ordinary GD. Consequently, while NGD may accelerate convergence and stabilize optimization, it can harm generalization in tasks requiring implicit regularization or bias propagation (e.g., sparse recovery, deep matrix completion), as demonstrated in extensive empirical comparisons (Kerekes et al., 2021).

Controversies arise regarding the empirical Fisher, which averages outer products of gradients evaluated at the observed data rather than at labels drawn from the model; it often diverges from the true metric and may result in suboptimal curvature scaling and convergence (Martens, 2014). Further, the choice of reference metric or pull-back ansatz is a non-algorithmic “art” requiring domain knowledge; automated selection remains unresolved (Dong et al., 2022).
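The distinction can be made explicit for a softmax model, where the exact per-sample Fisher has the closed form $x x^\top \otimes (\mathrm{diag}(p) - p p^\top)$ while the empirical Fisher averages outer products of score vectors at the observed labels. The toy numpy comparison below (synthetic data, arbitrary weights) shows that the two matrices generally differ.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, C = 200, 4, 3
X = rng.normal(size=(N, D))
y = rng.integers(0, C, size=N)           # observed labels, unrelated to the model
W = rng.normal(scale=0.1, size=(D, C))   # arbitrary weights

Z = X @ W
P = np.exp(Z - Z.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)

emp_F = np.zeros((D * C, D * C))         # empirical Fisher: scores at observed labels
true_F = np.zeros((D * C, D * C))        # exact Fisher: expectation under the model
for n in range(N):
    s = np.outer(X[n], np.eye(C)[y[n]] - P[n]).ravel()
    emp_F += np.outer(s, s) / N
    cov = np.diag(P[n]) - np.outer(P[n], P[n])   # covariance of one-hot label under model
    true_F += np.kron(np.outer(X[n], X[n]), cov) / N

print(np.linalg.norm(emp_F - true_F))    # generally nonzero: the two metrics differ
```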

Scalability hinges on careful block-structure, sparsity, and fast solvers; as problem size grows, bottlenecks emerge in Jacobian products and metric estimation—the subject of ongoing research (Dong et al., 2022, Shrestha, 2023).

6. Practical Implementation and Algorithmic Variants

Practical implementations hinge on how the Fisher (or surrogate) metric is represented, approximated, and inverted; the choice of structure determines both cost and fidelity. Representative structured approximations from deep learning are summarized below.

Table: Structured Fisher Approximations (examples from deep learning)

| Method       | Metric Structure         | Complexity                   |
|--------------|--------------------------|------------------------------|
| Exact NGD    | Full Fisher              | $O(p^3)$                     |
| CW-NGD       | Block-diagonal, per-unit | $O(\sum s_g^3)$ per layer    |
| K-FAC        | Kronecker-factored       | $O(d_i^3 + d_o^3)$ per layer |
| Inverse-Free | Per-sample coefficients  | $O(p)$ after initial epoch   |

Each method represents a tradeoff between computational efficiency, expressivity of the metric, and convergence behavior (Sang et al., 2022, Shrestha, 2023, Ou et al., 6 Mar 2024).
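A back-of-envelope calculation illustrates the scale of the gap in the complexity column for a single fully connected layer; the layer sizes here are arbitrary.

```python
# Rough cost comparison for one fully connected layer, following the table above.
d_in, d_out = 1024, 1024
p = d_in * d_out                        # ~1.0e6 parameters in this layer

exact_flops = p ** 3                    # inverting the full Fisher block: O(p^3)
kfac_flops = d_in ** 3 + d_out ** 3     # inverting the two Kronecker factors: O(d_i^3 + d_o^3)

print(f"exact: {exact_flops:.1e} flops, K-FAC: {kfac_flops:.1e} flops")
# The ratio is roughly nine orders of magnitude for this layer size.
```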

7. Unified Perspective and Extensions

Recent work establishes that any effective learning rule with a strictly decreasing scalar objective can be rewritten as a form of natural gradient descent under an appropriately constructed symmetric positive-definite metric (Shoji et al., 24 Sep 2024). This unifies discrete, continuous, stochastic, higher-order, or time-varying updates into the template

$$\Delta\theta = -M^{-1}(\theta)\, \nabla L(\theta)$$

where $M(\theta)$ is chosen to optimally condition the learning update for descent. Explicit formulas for $M$ minimize the condition number, and the construction generalizes across domains, scales, and learning protocols (Shoji et al., 24 Sep 2024).
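A simple special case of this template is any diagonal preconditioner: for instance, an RMSProp-style update can be read as $\Delta\theta = -M^{-1}\nabla L$ with $M = \mathrm{diag}((\sqrt{v} + \epsilon)/\eta)$, which is symmetric positive-definite. The sketch below checks this identity numerically; it is only an illustration of the template, not the condition-number-optimal construction of Shoji et al.

```python
import numpy as np

# Illustration only: an RMSProp-style step rewritten as a metric-preconditioned step.
rng = np.random.default_rng(0)
grad = rng.normal(size=10)
v = rng.uniform(0.1, 1.0, size=10)       # running average of squared gradients (assumed given)
lr, eps = 1e-3, 1e-8

M = np.diag((np.sqrt(v) + eps) / lr)     # symmetric positive-definite by construction
delta_rmsprop = -lr * grad / (np.sqrt(v) + eps)
delta_metric = -np.linalg.solve(M, grad)
assert np.allclose(delta_rmsprop, delta_metric)
```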

In summary, Natural Gradient Descent is both a principled information-geometric optimization method and a practical, adaptable toolset spanning domains from machine learning to physics and control. Its fast convergence, invariance properties, and geometric soundness make it central to many modern computational methodologies, but its practical implementation mandates judicious approximation and metric selection tailored to algorithmic and architectural constraints.
