Fisher–Rao Norm in Information Geometry

Updated 17 March 2026

The Fisher–Rao norm is a reparameterization-invariant norm that measures tangent vectors on statistical manifolds via generalized Fisher information.
It bridges finite-dimensional parametric models and infinite-dimensional nonparametric analysis, supporting statistical inference, neural network complexity, and functional data analysis.
Its applications extend to robust machine learning, gradient flow optimization, and establishing tight generalization error bounds in deep learning.

The Fisher–Rao norm is a fundamental concept in information geometry, providing an intrinsic, geometrically invariant way to measure tangent vectors and distances on statistical manifolds of probability densities, parametric families, and function spaces. It generalizes the classic notion of Fisher information from finite-dimensional parametric models to infinite-dimensional nonparametric contexts and underpins applications in statistical inference, machine learning, functional data analysis, gradient flow theory, and the complexity analysis of deep networks.

1. Geometric Definition and Structure on Densities

On the infinite-dimensional Fréchet manifold $\mathrm{Dens}_+(M)$ of smooth positive densities on a compact manifold $M$, every smooth weak Riemannian metric invariant under $\mathrm{Diff}(M)$ assumes the canonical form

$G_\mu(v, w) = C_1(m) \int_M \frac{v}{\mu} \frac{w}{\mu} \ \mu + C_2(m)\left(\int_M v\right)\left(\int_M w\right), \qquad m \equiv \mu(M),$

for $v, w \in T_\mu \mathrm{Dens}_+(M)$ and $C_1, C_2 \in C^\infty(\mathbb{R}_{>0})$ (Bruveris et al., 2016). The induced Fisher–Rao norm is

$\|v\|_{FR}(\mu) = \sqrt{C_1(m) \int_M (v/\mu)^2 \, \mu + C_2(m) \left(\int_M v\right)^2}.$

Restricting to the unit-mass probability submanifold $\mathrm{Prob}(M)$, the second term vanishes because tangent vectors to $\mathrm{Prob}(M)$ integrate to zero. What remains is (up to scaling) the classical Fisher–Rao metric, characterized by the normalization $C_1(m) = 1/m$, $C_2(m) = 0$.
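The norm above can be illustrated on a finite partition of $M$, where integrals become sums over cells. The following is a minimal sketch under that discretization assumption; the function name `fisher_rao_norm`, the cell masses, and the constant used for $C_2$ are hypothetical choices made for illustration only.

```python
import math

def fisher_rao_norm(v, mu, c1=lambda m: 1.0 / m, c2=lambda m: 0.0):
    """Discrete analogue of ||v||_FR at mu on a finite set of cells.

    mu     : positive cell masses (a density on a finite partition of M)
    v      : tangent vector, one signed mass per cell
    c1, c2 : the functions C_1 and C_2 of the total mass m = mu(M)
    """
    m = sum(mu)                                        # total mass m = mu(M)
    quad = sum(vi * vi / mi for vi, mi in zip(v, mu))  # int (v/mu)^2 mu
    total = sum(v)                                     # int v
    return math.sqrt(c1(m) * quad + c2(m) * total * total)

mu = [0.2, 0.3, 0.5]       # unit mass, so mu lies in Prob(M)
v = [0.1, -0.04, -0.06]    # sums to zero, so v is tangent to Prob(M)

# On Prob(M) the C_2 term cannot contribute, whatever C_2 is.
with_c2 = fisher_rao_norm(v, mu, c2=lambda m: 7.0)
without_c2 = fisher_rao_norm(v, mu)

# Splitting a cell proportionally leaves the norm unchanged, a discrete
# stand-in for the invariance discussed in the text.
mu_split = [0.2, 0.3, 0.25, 0.25]
v_split = [0.1, -0.04, -0.03, -0.03]
split_norm = fisher_rao_norm(v_split, mu_split)
```

Here `with_c2` and `without_c2` agree up to rounding, which is the discrete counterpart of the second term dropping out on $\mathrm{Prob}(M)$.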

This result establishes uniqueness, up to the freedom in choosing $C_1$ and $C_2$, of $\mathrm{Diff}(M)$-invariant metrics on densities. In this sense the Fisher–Rao metric is the only Riemannian metric that measures infinitesimal differences between densities while respecting diffeomorphism invariance (Bruveris et al., 2016).

2. Parametric Form and Statistical Interpretation

In parametric families, such as the beta distribution with parameters $(\alpha, \beta)$, the Fisher–Rao metric coincides with the Fisher information matrix. For the beta family,

$I(\alpha, \beta) = \begin{pmatrix} \psi'(\alpha) - \psi'(\alpha+\beta) & -\psi'(\alpha+\beta) \\ -\psi'(\alpha+\beta) & \psi'(\beta) - \psi'(\alpha+\beta) \end{pmatrix},$

where $\psi'$ is the trigamma function (Brigant et al., 2019). The FR norm of a tangent vector $v=(v_\alpha, v_\beta)$ is $\|v\|_{FR} = \sqrt{v^T I(\alpha, \beta) v}$.
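As a minimal sketch of the beta-family formula, the code below evaluates $I(\alpha, \beta)$ and the resulting norm using only the standard library. The `trigamma` helper is a hand-rolled approximation (recurrence plus a truncated asymptotic series) written for this illustration, not a library routine; all function names are hypothetical.

```python
import math

def trigamma(x):
    """Approximate psi'(x) for x > 0: shift x up to >= 6, then use an asymptotic series."""
    if x <= 0:
        raise ValueError("this sketch only handles x > 0")
    acc = 0.0
    while x < 6.0:
        acc += 1.0 / (x * x)      # psi'(x) = psi'(x + 1) + 1 / x^2
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # 1/x + 1/(2x^2) + 1/(6x^3) - 1/(30x^5) + 1/(42x^7) - 1/(30x^9)
    series = inv + 0.5 * inv2 + inv * inv2 * (
        1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 * (1.0 / 42.0 - inv2 / 30.0)))
    return acc + series

def beta_fisher_information(alpha, beta):
    """Fisher information matrix of the beta family at (alpha, beta)."""
    t_ab = trigamma(alpha + beta)
    return [[trigamma(alpha) - t_ab, -t_ab],
            [-t_ab, trigamma(beta) - t_ab]]

def beta_fisher_rao_norm(v, alpha, beta):
    """sqrt(v^T I(alpha, beta) v) for a tangent vector v = (v_alpha, v_beta)."""
    info = beta_fisher_information(alpha, beta)
    quad = sum(v[i] * info[i][j] * v[j] for i in range(2) for j in range(2))
    return math.sqrt(quad)

info_11 = beta_fisher_information(1.0, 1.0)
norm_e1 = beta_fisher_rao_norm((1.0, 0.0), 1.0, 1.0)
```

At $(\alpha, \beta) = (1, 1)$ the top-left entry is $\psi'(1) - \psi'(2) = 1$, so a unit step in $\alpha$ has FR norm 1 there, up to the accuracy of the approximation.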

In machine learning, the Fisher–Rao metric on the output space of neural networks with softmax outputs yields the so-called Fisher–Rao distance (FRD), which has the closed form $\mathrm{FRD}(q, q') = 2 \arccos\left( \sum_{y=1}^M \sqrt{q_y q'_y} \right)$ for class-probability vectors $q$ and $q'$ over $M$ classes (Picot et al., 2021). On the space of all positive densities, without the unit-mass constraint, the Fisher–Rao geodesic distance coincides with the Hellinger distance up to scaling (Bruveris et al., 2016).
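The FRD formula is simple enough to evaluate directly. The sketch below assumes the inputs are already valid probability vectors (or logits passed through `softmax`); the function names are illustrative and are not taken from the cited work.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)                        # subtract the max for stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fisher_rao_distance(q, q_prime):
    """FRD(q, q') = 2 * arccos(sum_y sqrt(q_y * q'_y))."""
    bc = sum(math.sqrt(a * b) for a, b in zip(q, q_prime))
    bc = min(1.0, max(0.0, bc))                # guard against rounding outside [0, 1]
    return 2.0 * math.acos(bc)

uniform = softmax([0.0, 0.0])                  # [0.5, 0.5]
d_uniform_to_vertex = fisher_rao_distance(uniform, [1.0, 0.0])
d_disjoint = fisher_rao_distance([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

The distance ranges over $[0, \pi]$: identical predictions give 0, and predictions with disjoint support give $\pi$.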

3. Analytical and Functional Frameworks

The infinite-dimensional generalization of the Fisher–Rao norm is mathematically tractable when restricted to observable subspaces. The orthogonal decomposition of the tangent space at a density $f$ as $T_f M = S \oplus S^\perp$, where $S$ is spanned by the derivatives $\partial f/\partial x_i$, allows the construction of the covariate Fisher information matrix $G_f$:

$(G_f)_{ij} = \int_{\mathbb{R}^n} \frac{\partial f}{\partial x_i} \frac{\partial f}{\partial x_j} \frac{1}{f} \, dx = \mathbb{E}_f[s_i(X)s_j(X)],$

with $s_i(x) = \partial_i \log f(x)$ the score functions (Cheng et al., 25 Dec 2025). The Fisher–Rao norm on $S$ is $\|h_S\|_F^2 = g_f(h_S, h_S) = v^T G_f^{-1} v$ for the appropriate cross-information vector $v$ (the vector of pairings of $h$ with the directions $\partial f/\partial x_i$ that span $S$).

The trace $H_G(f) = \operatorname{Tr} G_f$ (G-entropy) gives the total explainable information captured by observable covariates. Curvature of the Kullback–Leibler divergence in observable directions is also governed by $G_f$ , linking FR geometry to information-theoretic optimality and the semi-parametric efficient information bound.
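To make $G_f$ and its trace concrete, the following sketch estimates $\mathbb{E}_f[s_i s_j]$ for a two-dimensional density by midpoint quadrature, with scores obtained from central differences of $\log f$. The test density is a centered bivariate Gaussian, for which $G_f$ equals the inverse covariance matrix; the parameter values, grid size, and function names are assumptions made for illustration.

```python
import math

SIGMA1, SIGMA2, RHO = 1.0, 1.0, 0.5    # hypothetical example values

def log_gaussian(x1, x2):
    """Log-density of a centered bivariate Gaussian with correlation RHO."""
    z = ((x1 / SIGMA1) ** 2
         - 2 * RHO * x1 * x2 / (SIGMA1 * SIGMA2)
         + (x2 / SIGMA2) ** 2)
    norm = 2 * math.pi * SIGMA1 * SIGMA2 * math.sqrt(1 - RHO ** 2)
    return -z / (2 * (1 - RHO ** 2)) - math.log(norm)

def covariate_fisher_matrix(log_f, lo, hi, n=200, eps=1e-4):
    """Midpoint-rule estimate of (G_f)_ij = E_f[s_i s_j] for a 2-D density.

    log_f  : callable (x1, x2) -> log f(x1, x2)
    lo, hi : per-axis integration bounds (the box must hold almost all mass)
    """
    h = [(hi[k] - lo[k]) / n for k in range(2)]
    G = [[0.0, 0.0], [0.0, 0.0]]
    for a in range(n):
        x1 = lo[0] + (a + 0.5) * h[0]
        for b in range(n):
            x2 = lo[1] + (b + 0.5) * h[1]
            # score s_i = d/dx_i log f, by central differences
            s = [(log_f(x1 + eps, x2) - log_f(x1 - eps, x2)) / (2 * eps),
                 (log_f(x1, x2 + eps) - log_f(x1, x2 - eps)) / (2 * eps)]
            w = math.exp(log_f(x1, x2)) * h[0] * h[1]   # f(x) dx
            for i in range(2):
                for j in range(2):
                    G[i][j] += s[i] * s[j] * w
    return G

G = covariate_fisher_matrix(log_gaussian, lo=[-8.0, -8.0], hi=[8.0, 8.0])
g_entropy = G[0][0] + G[1][1]          # H_G(f) = Tr G_f
```

For these values the inverse covariance is $\frac{1}{1-\rho^2}\begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}$, so the estimate can be checked against a known answer before the routine is applied to densities without a closed form.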

4. Fisher–Rao Norm in Neural Networks and Complexity Control

The Fisher–Rao norm in deep learning is defined as

$\|\theta\|^2_{FR} = \mathbb{E}\left[ \left\langle \nabla_\theta \ell(f_\theta(X),Y),\, \theta \right\rangle^2 \right],$

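with the Fisher information matrix $I(\theta) = \mathbb{E}\left[\nabla_\theta \ell \, \nabla_\theta \ell^T\right]$ entering as the metric tensor on parameter space, so that $\|\theta\|_{FR}^2 = \theta^T I(\theta)\, \theta$ (Liang et al., 2017). For ReLU networks of depth $L$ (counted as $L$ hidden layers, hence $L+1$ weight matrices), the analytic form is

$\|\theta\|_{FR}^2 = (L+1)^2 \, \mathbb{E}_{(X,Y)} \left[ \left( \partial_f \ell(f_\theta(X), Y) \cdot f_\theta(X) \right)^2 \right],$

established using structural lemmas about derivatives in homogeneous networks: the output is homogeneous of degree $L+1$ in the weights, so $\langle \nabla_\theta f_\theta(x), \theta \rangle = (L+1) f_\theta(x)$.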

The Fisher–Rao norm is functionally invariant: if $\theta_1, \theta_2$ parameterize the same function, then $\|\theta_1\|_{FR} = \|\theta_2\|_{FR}$. This invariance ensures the norm captures functional complexity rather than the scale of the weights.
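Both the analytic form and the invariance can be checked numerically on a toy model. The sketch below assumes a bias-free ReLU network with one hidden layer ($L = 1$), squared loss, and a three-point dataset; the weights, data, and function names are hypothetical, and gradients are taken by central differences rather than automatic differentiation.

```python
def relu_net(theta, x):
    """f_theta(x) = w1 . relu(W0 x): one hidden layer, no biases (so L = 1)."""
    w0, w1 = theta[:4], theta[4:]
    hidden = [max(0.0, w0[0] * x[0] + w0[1] * x[1]),
              max(0.0, w0[2] * x[0] + w0[3] * x[1])]
    return w1[0] * hidden[0] + w1[1] * hidden[1]

def loss(f, y):
    return 0.5 * (f - y) ** 2

def fr_norm_sq_definition(theta, data, eps=1e-5):
    """E[<grad_theta loss, theta>^2], with the gradient from central differences."""
    total = 0.0
    for x, y in data:
        inner = 0.0
        for k in range(len(theta)):
            up = list(theta)
            up[k] += eps
            dn = list(theta)
            dn[k] -= eps
            grad_k = (loss(relu_net(up, x), y) - loss(relu_net(dn, x), y)) / (2 * eps)
            inner += grad_k * theta[k]
        total += inner ** 2
    return total / len(data)

def fr_norm_sq_closed_form(theta, data, n_hidden_layers=1):
    """(L + 1)^2 E[(dloss/df * f)^2], where dloss/df = f - y for squared loss."""
    total = 0.0
    for x, y in data:
        f = relu_net(theta, x)
        total += ((f - y) * f) ** 2
    return (n_hidden_layers + 1) ** 2 * total / len(data)

theta = [1.0, 0.5, -0.3, 0.8, 0.7, -1.2]
data = [((1.0, 2.0), 1.0), ((-1.0, 0.5), -0.5), ((2.0, -1.0), 0.3)]
# Same function, different weights: scale W0 by 2 and w1 by 1/2.
theta_rescaled = [2.0 * w for w in theta[:4]] + [0.5 * w for w in theta[4:]]

by_definition = fr_norm_sq_definition(theta, data)
by_closed_form = fr_norm_sq_closed_form(theta, data)
by_definition_rescaled = fr_norm_sq_definition(theta_rescaled, data)
```

The three quantities agree up to finite-difference error. The data points were chosen so that no pre-activation sits at the ReLU kink, where a finite-difference gradient would be unreliable.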

Norm-comparison results show that the FR norm is upper-bounded by the spectral, group, and path norms (modulo scaling), so the FR-ball contains the sets arising in standard norm-based generalization bounds. Empirically, whereas ordinary norms increase with network width, $\|\theta\|_{FR}$ often remains constant or decreases, closely tracking generalization performance under both clean and corrupted labels (Liang et al., 2017, Yin et al., 2024).

5. Fisher–Rao Norm in Nonparametric and Functional Data Analysis

For functional data, the Fisher–Rao norm appears in the geometry of amplitude and phase separation. On spaces of absolutely continuous functions $f:[0,1]\to \mathbb{R}$, the metric on tangent vectors $v_1, v_2$ at $f$ is

$\langle v_1, v_2 \rangle_f = \frac{1}{4} \int_0^1 \frac{\dot{v}_1(t) \dot{v}_2(t)}{|\dot{f}(t)|} dt,$

where $\dot{v}$ denotes the derivative with respect to $t$, and the absolute value $|\dot{f}(t)|$ keeps the weight positive wherever $\dot{f} \neq 0$. The square-root velocity function (SRVF) $q=\dot{f}/\sqrt{|\dot{f}|}$ reduces the Fisher–Rao geometry to the $L^2$ metric, enabling efficient estimation, alignment, and mean computation in function spaces (Srivastava et al., 2011).
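A short numerical sketch shows why the SRVF is convenient: after the transform, distances are plain $L^2$ distances, and warping both functions by the same reparameterization leaves the distance unchanged. The grid size, the test functions, and the warp `gamma` below are assumptions chosen for illustration.

```python
import math

def srvf(f, n=2000):
    """Discrete SRVF q = f' / sqrt(|f'|) on the cells of a uniform grid over [0, 1]."""
    h = 1.0 / n
    q = []
    for i in range(n):
        d = (f((i + 1) * h) - f(i * h)) / h          # finite-difference f'
        q.append(math.copysign(math.sqrt(abs(d)), d))
    return q

def l2_distance(q1, q2):
    """L^2 distance between two SRVFs sampled on the same uniform grid."""
    h = 1.0 / len(q1)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q1, q2)) * h)

f1 = lambda t: t
f2 = lambda t: t * t
gamma = lambda t: 0.5 * (t + t * t)   # increasing warp with gamma(0) = 0, gamma(1) = 1

d_original = l2_distance(srvf(f1), srvf(f2))
d_warped = l2_distance(srvf(lambda t: f1(gamma(t))),
                       srvf(lambda t: f2(gamma(t))))
```

For this pair the distance has the closed form $\sqrt{2 - 4\sqrt{2}/3}$, and `d_warped` matches `d_original` up to discretization error.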

The Fisher–Rao metric is, up to a constant factor, the unique reparameterization-invariant metric of this type on function space, in the spirit of Čencov's theorem. This ensures that geodesics and Karcher means are well-defined and can be computed with standard $L^2$ tools after the SRVF transformation.

6. Gradient Flows, Kernelizations, and Learning Theory

The Fisher–Rao metric can be interpreted as defining a geometry on the cone of finite nonnegative measures. Its Riemannian structure provides explicit formulas for the norm of tangent vectors $u$ at $\mu$ : $\|u\|^2_{T_\mu} = \int \left( \frac{u}{\mu} \right)^2 d\mu.$ The Hellinger distance arises naturally as the geodesic distance in this geometry. Wasserstein or maximum mean discrepancy (MMD) metrics become particular kernelized limits or approximations of the Fisher–Rao norm's dissipation structure (Zhu et al., 2024). Gradient flows in the Fisher–Rao geometry correspond to reaction equations of the form $\partial_t \mu = -\mu \cdot \delta F/\delta \mu$ .
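The reaction equation can be simulated directly on a discrete measure. The sketch below assumes the energy $F(\mu) = \sum_i \left( \mu_i \log(\mu_i/\pi_i) - \mu_i + \pi_i \right)$, a generalized Kullback–Leibler divergence to a reference measure $\pi$, so that $\delta F/\delta\mu = \log(\mu/\pi)$. This choice of $F$, the explicit Euler scheme, and all names are illustrative assumptions rather than the method of the cited work.

```python
import math

def generalized_kl(mu, pi):
    """F(mu) = sum_i mu_i log(mu_i / pi_i) - mu_i + pi_i on positive measures."""
    return sum(m * math.log(m / p) - m + p for m, p in zip(mu, pi))

def tangent_norm_sq(u, mu):
    """||u||^2 in T_mu: the discrete version of int (u / mu)^2 d mu."""
    return sum(ui * ui / mi for ui, mi in zip(u, mu))

def fisher_rao_flow(mu0, pi, t_end, dt=1e-3):
    """Explicit Euler for d mu/dt = -mu * dF/dmu, with dF/dmu = log(mu / pi)."""
    mu = list(mu0)
    for _ in range(round(t_end / dt)):
        mu = [m - dt * m * math.log(m / p) for m, p in zip(mu, pi)]
    return mu

def closed_form(mu0, pi, t):
    """log(mu_t / pi) = exp(-t) * log(mu_0 / pi) solves the same equation."""
    return [p * (m / p) ** math.exp(-t) for m, p in zip(mu0, pi)]

mu0 = [0.2, 0.5, 1.3]    # total mass 2: the flow lives on the cone, not on Prob
pi = [0.5, 0.3, 0.2]
mu_T = fisher_rao_flow(mu0, pi, t_end=3.0)
exact_T = closed_form(mu0, pi, 3.0)

velocity0 = [-m * math.log(m / p) for m, p in zip(mu0, pi)]
dissipation0 = tangent_norm_sq(velocity0, mu0)   # equals -dF/dt at t = 0
```

Mass is created or destroyed pointwise rather than transported, which is what distinguishes this reaction dynamics from transport-type flows. Along the flow the energy decreases at the rate given by the squared tangent norm of the velocity.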

Kernelized Fisher–Rao and MMD flows allow nonparametric learning, generative modeling, and sampling frameworks previously grounded in optimal transport to be generalized using regularized versions of the FR geometry. The convergence of regularized approximations to true Fisher–Rao flows is characterized via evolutionary $\Gamma$-convergence (Zhu et al., 2024).

7. Applications in Generalization, Robustness, and Statistical Inference

The Fisher–Rao norm provides tight generalization error bounds and controls Rademacher complexity in deep learning, with explicit dependence on architectural parameters, logit margins, and functional invariances (Yin et al., 2024, Liang et al., 2017). It measures the effect of adversarial training regimes on model complexity and predicts the trade-off between robustness and generalization gap by examining margin-based surrogates for the Fisher–Rao norm.

In adversarially robust learning, the Fisher–Rao distance or its induced norm is used as a regularization penalty, either as a function-space geodesic distance between softmax outputs (as in FIRE regularization) or as a control parameter for parameter-space complexity (Picot et al., 2021, Yin et al., 2024). These strategies enable new Pareto-optimal trade-offs between accuracy and robustness not achievable with KL-based regularization alone and facilitate more efficient training.

In nonparametric inference, the covariate Fisher information matrix $G_f$ enables practical computation of variance lower bounds (covariate Cramér–Rao lower bound), objective measures of explainable information (G-entropy), and quantification of intrinsic dimensionality (information capture ratio) in high-dimensional statistical modeling (Cheng et al., 25 Dec 2025).

The Fisher–Rao norm is thus a unifying, invariant geometric tool that bridges mathematical statistics, information geometry, machine learning theory, and functional data analysis, combining analytic tractability with practical utility (Bruveris et al., 2016, Cheng et al., 25 Dec 2025, Srivastava et al., 2011, Liang et al., 2017, Yin et al., 2024, Picot et al., 2021, Brigant et al., 2019).