Fisher–Rao Gradient Flows
- Fisher–Rao gradient flows are continuous-time steepest descent methods on density manifolds that employ the Fisher–Rao metric to capture statistical distinguishability.
- They underpin natural gradient techniques and are pivotal in variational inference, robust sampling, and accelerated optimization in machine learning.
- The framework supports closed-form geodesics and operator splitting, yielding uniform convergence rates without reliance on log-concavity assumptions.
Fisher–Rao gradient flows are a class of continuous-time dynamical systems on the space of probability distributions, defined as steepest-descent flows with respect to the Fisher–Rao (FR) Riemannian metric of information geometry. Unlike Euclidean or Wasserstein metrics, the Fisher–Rao metric captures statistical distinguishability through a global, geometrically intrinsic structure. These flows are foundational to natural gradient methods, variational inference, robust and accelerated sampling, and operator splitting for measures with varying mass, and they give rise to nonlinear PDEs with closed-form solutions that are relevant to theoretical and algorithmic developments in machine learning, statistics, and stochastic control.
1. Fisher–Rao Metric: Geometry and Gradient Flows
The Fisher–Rao metric equips the manifold of smooth, positive probability densities (e.g., on $\mathbb{R}^d$) with the Riemannian inner product
$$\langle \phi_1, \phi_2 \rangle_{\mathrm{FR},\rho} \;=\; \int \phi_1(x)\,\phi_2(x)\,\rho(x)\,dx,$$
for tangent directions represented by functions $\phi_i$ with zero mean under $\rho$, i.e. $\int \phi_i\,\rho\,dx = 0$ (equivalently, $g_\rho(\sigma_1,\sigma_2)=\int \sigma_1\sigma_2/\rho\,dx$ for density perturbations $\sigma_i = \rho\,\phi_i$). Geodesics under this metric are available in closed form: for densities $\rho_0,\rho_1$ with Bhattacharyya angle $\theta = \arccos\!\big(\int \sqrt{\rho_0(x)\rho_1(x)}\,dx\big)$, the constant-speed curve is
$$\sqrt{\rho_t} \;=\; \frac{\sin\!\big((1-t)\theta\big)}{\sin\theta}\,\sqrt{\rho_0} \;+\; \frac{\sin(t\theta)}{\sin\theta}\,\sqrt{\rho_1}, \qquad t\in[0,1],$$
with the closed-form geodesic distance (Hellinger/FR distance)
$$d_{\mathrm{FR}}(\rho_0,\rho_1) \;=\; 2\arccos\!\Big(\int \sqrt{\rho_0(x)\,\rho_1(x)}\,dx\Big).$$
Under the square-root embedding $\rho \mapsto \sqrt{\rho}$, the FR metric becomes (up to a constant factor) the canonical $L^2$ inner product on the positive orthant of the unit sphere, and geodesics correspond to great-circle arcs (Crucinio et al., 22 Nov 2025, Halder et al., 2017, Carrillo et al., 22 Jul 2024).
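On a finite state space, where densities reduce to probability vectors, both formulas can be checked directly; the following minimal sketch (values and helper names are illustrative, not from the cited papers) implements the great-circle geodesic and the closed-form distance:

```python
import numpy as np

def fr_geodesic(p0, p1, t):
    """Constant-speed Fisher-Rao geodesic between discrete distributions p0, p1.

    Uses the square-root embedding: great-circle interpolation on the unit sphere.
    """
    a, b = np.sqrt(p0), np.sqrt(p1)
    theta = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))  # Bhattacharyya angle
    if theta < 1e-12:                      # distributions are (numerically) equal
        return p0.copy()
    s = (np.sin((1 - t) * theta) * a + np.sin(t * theta) * b) / np.sin(theta)
    return s ** 2                          # square back to a probability vector

def fr_distance(p0, p1):
    """Closed-form FR distance: 2 * arccos of the Bhattacharyya coefficient."""
    bc = np.sum(np.sqrt(p0 * p1))
    return 2.0 * np.arccos(np.clip(bc, -1.0, 1.0))

p0 = np.array([0.7, 0.2, 0.1])
p1 = np.array([0.1, 0.3, 0.6])
mid = fr_geodesic(p0, p1, 0.5)
# the midpoint sums to one and sits at equal FR distance from both endpoints
print(mid.sum(), fr_distance(p0, mid), fr_distance(mid, p1), 0.5 * fr_distance(p0, p1))
```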
For a functional $\mathcal{F}(\rho)$ with first variation $\frac{\delta\mathcal F}{\delta\rho}$, its FR gradient at $\rho$ is
$$\operatorname{grad}_{\mathrm{FR}}\mathcal F(\rho) \;=\; \rho\left(\frac{\delta\mathcal F}{\delta\rho} - \mathbb{E}_\rho\!\left[\frac{\delta\mathcal F}{\delta\rho}\right]\right),$$
yielding the flow PDE
$$\partial_t\rho_t \;=\; -\,\rho_t\left(\frac{\delta\mathcal F}{\delta\rho}(\rho_t) - \mathbb{E}_{\rho_t}\!\left[\frac{\delta\mathcal F}{\delta\rho}(\rho_t)\right]\right).$$
This is the “birth–death” or replicator/selection equation, nonlocal due to its mean-field normalization (Crucinio et al., 22 Nov 2025, Carrillo et al., 22 Jul 2024, Chen et al., 25 Jun 2024).
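A one-line computation makes the dissipation structure explicit: along the flow, the objective decreases at a rate given by the variance of its first variation under the current density,
$$\frac{d}{dt}\mathcal F(\rho_t) \;=\; \int \frac{\delta\mathcal F}{\delta\rho}(\rho_t)\,\partial_t\rho_t\,dx \;=\; -\,\operatorname{Var}_{\rho_t}\!\left(\frac{\delta\mathcal F}{\delta\rho}(\rho_t)\right) \;\le\; 0,$$
so stationary points are exactly the densities at which the first variation is $\rho_t$-almost surely constant.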
2. Gradient Flows of f-Divergences and KL: Convergence and Structural Results
Applied to convex $f$-divergences
$$D_f(\rho\,\|\,\pi) \;=\; \int f\!\left(\frac{\rho(x)}{\pi(x)}\right)\pi(x)\,dx, \qquad f \text{ convex},\; f(1)=0,$$
the first variation is $\frac{\delta D_f}{\delta\rho}(x) = f'\!\big(\rho(x)/\pi(x)\big)$ and the FR gradient flow is
$$\partial_t\rho_t \;=\; -\,\rho_t\left(f'\!\left(\frac{\rho_t}{\pi}\right) - \mathbb{E}_{\rho_t}\!\left[f'\!\left(\frac{\rho_t}{\pi}\right)\right]\right).$$
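For instance, the $\chi^2$-divergence corresponds to $f(r) = (r-1)^2$, so $f'(\rho/\pi) = 2(\rho/\pi - 1)$ and the flow becomes
$$\partial_t\rho_t \;=\; -\,2\,\rho_t\left(\frac{\rho_t}{\pi} - \mathbb{E}_{\rho_t}\!\left[\frac{\rho_t}{\pi}\right]\right),$$
since the constant $-1$ cancels in the centering.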
For the Kullback–Leibler divergence $\mathrm{KL}(\rho\,\|\,\pi) = \int \rho\log\frac{\rho}{\pi}\,dx$, one gets the explicit solution
$$\rho_t(x) \;\propto\; \rho_0(x)^{\,e^{-t}}\,\pi(x)^{\,1-e^{-t}},$$
a geometric tempering path from $\rho_0$ to $\pi$, and exponential decay in (symmetrized) KL:
$$\mathrm{KL}(\rho_t\,\|\,\pi) + \mathrm{KL}(\pi\,\|\,\rho_t) \;\le\; e^{-t}\,\big[\mathrm{KL}(\rho_0\,\|\,\pi) + \mathrm{KL}(\pi\,\|\,\rho_0)\big]$$
holds universally, with $\mathrm{KL}(\rho_t\,\|\,\pi)$ itself decaying at the asymptotic rate $e^{-2t}$ (Crucinio et al., 22 Nov 2025, Domingo-Enrich et al., 2023, Carrillo et al., 22 Jul 2024). The exponential rate is uniform: it does not depend on the log-concavity or log–Sobolev constant of $\pi$. This robustness is critical for posterior sampling beyond the log-concave regime and for problems with multimodal or poorly conditioned targets (Domingo-Enrich et al., 2023, Carrillo et al., 22 Jul 2024, Chen et al., 25 Jun 2024). See Table 1 for a comparison of key decay results; a numerical check follows the table.
Table 1. Decay of common objectives under the FR gradient flow.

| Functional | FR Gradient Flow Convergence Rate | Constraints on $\pi$ |
|---|---|---|
| KL divergence | $e^{-t}$ (symm. KL) or $e^{-2t}$ (asymptotics) | None; independent of log-concavity or LSI constant |
| General $f$-divergence | Exponential (via dual gradient dominance) | Uniform (rate depends on $f$, not $\pi$) |
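The closed-form solution and the $e^{-t}$ bound can be checked on a discretized state space; the sketch below (illustrative bimodal target and explicit Euler time stepping, both chosen here for concreteness) integrates the flow, compares it with the closed-form tempering solution, and reports the symmetrized-KL bound:

```python
import numpy as np

# Fisher-Rao gradient flow of KL(rho || pi) on a finite grid:
#   d/dt rho_i = -rho_i * (log(rho_i / pi_i) - E_rho[log(rho / pi)]).
# The target below is an illustrative bimodal mixture (not from the cited papers).
x = np.linspace(-4.0, 4.0, 400)
pi = 0.5 * np.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)
pi /= pi.sum()
rho0 = np.exp(-0.5 * x ** 2)
rho0 /= rho0.sum()

def sym_kl(p, q):
    """Symmetrized KL divergence between discrete distributions p and q."""
    return float(np.sum((p - q) * np.log(p / q)))

def closed_form(t):
    """Exact solution: rho_t proportional to rho0^(e^-t) * pi^(1 - e^-t)."""
    r = rho0 ** np.exp(-t) * pi ** (1.0 - np.exp(-t))
    return r / r.sum()

rho, dt = rho0.copy(), 1e-3
for step in range(1, 4001):
    g = np.log(rho / pi)
    rho = rho * (1.0 - dt * (g - np.sum(rho * g)))   # explicit Euler step of the flow
    rho /= rho.sum()
    if step % 1000 == 0:
        t = step * dt
        print(f"t={t:.0f}: symKL numeric={sym_kl(rho, pi):.3e}, "
              f"closed form={sym_kl(closed_form(t), pi):.3e}, "
              f"e^-t bound={np.exp(-t) * sym_kl(rho0, pi):.3e}")
```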
3. Numerical Methods: Splitting, Particle Systems, Kernelization
FR flows admit operator-splitting methods, which are crucial for approximating unbalanced optimal transport gradient flows such as Wasserstein–Fisher–Rao (WFR) or Kantorovich–Fisher–Rao (KFR). The splitting alternates exact FR (birth–death) and transport (Wasserstein) steps (Crucinio et al., 22 Nov 2025, Gallouët et al., 2016). The FR semigroup is available in closed form: the KL flow has the exponential-tempering solution above, and more generally the reaction step $\partial_t\rho = -\rho\,(V - \mathbb{E}_\rho[V])$ with a fixed potential $V$ is solved exactly by the reweighting $\rho_t(x) \propto \rho_0(x)\,e^{-tV(x)}$. This enables exact or semi-exact numerical updates and low-variance interacting particle systems (Crucinio et al., 22 Nov 2025, Chen et al., 25 Jun 2024, Maurais et al., 8 Jan 2024); a grid-based splitting sketch follows.
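A minimal grid-based sketch of such a splitting (the double-well potential, grid, and explicit finite-difference transport step are illustrative choices made here, not taken from the cited papers) alternates a Fokker–Planck transport substep with the exact FR reweighting:

```python
import numpy as np

# Operator-splitting sketch for a Wasserstein-Fisher-Rao flow of KL(rho || pi),
# with pi proportional to exp(-V). Per step of size h:
#   transport (Wasserstein) substep: explicit step of d_t rho = d_x(rho V') + d_xx rho
#   reaction (Fisher-Rao) substep:   exact closed-form semigroup rho <- rho^(e^-h) * pi^(1-e^-h)
x, dx = np.linspace(-4.0, 4.0, 161, retstep=True)
V = (x ** 2 - 4.0) ** 2 / 8.0                  # hypothetical double-well potential
pi = np.exp(-V); pi /= pi.sum() * dx
rho = np.exp(-0.5 * (x - 1.0) ** 2); rho /= rho.sum() * dx

def transport_step(rho, h):
    """One explicit finite-difference step of the Fokker-Planck (transport) part."""
    drift = np.gradient(rho * np.gradient(V, dx), dx)            # d_x(rho V')
    diff = (np.roll(rho, 1) - 2.0 * rho + np.roll(rho, -1)) / dx ** 2
    new = np.clip(rho + h * (drift + diff), 1e-300, None)
    return new / (new.sum() * dx)

def fisher_rao_step(rho, h):
    """Exact Fisher-Rao (birth-death) semigroup for KL(rho || pi)."""
    new = rho ** np.exp(-h) * pi ** (1.0 - np.exp(-h))
    return new / (new.sum() * dx)

h = 5e-4
for k in range(1, 6001):
    rho = fisher_rao_step(transport_step(rho, h), h)             # Lie splitting
    if k % 2000 == 0:
        kl = float(np.sum(rho * np.log(rho / pi)) * dx)
        print(f"t={k * h:.1f}: KL(rho_t || pi) = {kl:.3e}")
```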
RKHS-based kernel approximations offer nonparametric, high-dimensional discretizations in which the first variation driving the FR flow is replaced by its image under a kernel integral operator. Particle weights or function-valued ansätze may be updated via regression principles (Helmholtz–Rayleigh maximal dissipation), with the approximation error controlled via evolutionary $\Gamma$-convergence (Zhu et al., 27 Oct 2024, Maurais et al., 8 Jan 2024). This “kernel Fisher–Rao flow” and its interacting particle implementation yield finite-horizon, gradient-free, and efficiently parallelizable samplers (Maurais et al., 8 Jan 2024).
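The following sketch illustrates the weighted-particle view of the FR (reaction) dynamics in a gradient-free manner: particle positions are held fixed, the first variation $\log(\rho/\pi)$ is approximated with a plain Gaussian kernel density estimate (a crude stand-in for the RKHS operators of the cited works, chosen here for illustration), and only the weights evolve; the cited methods combine such reweighting with transport of the particles.

```python
import numpy as np

rng = np.random.default_rng(0)

def V(x):
    """Hypothetical 1D double-well potential; the target is pi proportional to exp(-V)."""
    return (x ** 2 - 1.0) ** 2

n, h, bw = 500, 0.05, 0.25                    # particles, step size, kernel bandwidth
x = rng.normal(loc=2.0, scale=0.5, size=n)    # fixed particle locations (off-mode start)
w = np.full(n, 1.0 / n)                       # uniform initial weights

for _ in range(100):
    # kernel density estimate of the current weighted ensemble, evaluated at the particles
    K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / bw ** 2) / (bw * np.sqrt(2.0 * np.pi))
    phi = np.log(K @ w) + V(x)                # approximate first variation log(rho/pi) + const
    phi -= np.sum(w * phi)                    # center under the current weights
    w *= np.exp(-h * phi)                     # Fisher-Rao (birth-death) reweighting
    w /= w.sum()

print("effective sample size:", 1.0 / np.sum(w ** 2))
```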
4. Fisher–Rao Flows in Machine Learning: Variational, Sampling, Minimax, RL
The Fisher–Rao (natural) gradient is central to variational inference, statistical learning, and modern optimization. For a statistical model $\{p_\theta\}$ and loss $L(\theta)$, the natural gradient is
$$\widetilde{\nabla} L(\theta) \;=\; G(\theta)^{-1}\,\nabla_\theta L(\theta),$$
where $G(\theta) = \mathbb{E}_{p_\theta}\!\big[\nabla_\theta\log p_\theta\,\nabla_\theta\log p_\theta^{\top}\big]$ is the Fisher information matrix. For the evidence lower bound (ELBO), the FR gradient flow yields coordinate-invariant, geodesically straight optimization, either in parameter space,
$$\dot\theta_t \;=\; -\,G(\theta_t)^{-1}\nabla_\theta L(\theta_t),$$
or directly in distribution space as the FR flow
$$\partial_t\rho_t \;=\; -\operatorname{grad}_{\mathrm{FR}}\mathcal F(\rho_t)$$
on the full simplex (Ay et al., 2023). Under suitable “cylindrical” model conditions, optimizing the ELBO by natural gradient is equivalent to minimizing the KL divergence.
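As a concrete low-dimensional sketch of the natural gradient (the categorical model, toy target, and step size are illustrative choices, with a plain KL objective standing in for an ELBO):

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

# Natural-gradient descent for a categorical model p_theta = softmax(theta),
# minimizing L(theta) = KL(p_theta || q) toward a fixed target q.
q = np.array([0.50, 0.25, 0.15, 0.10])         # hypothetical target distribution
theta = np.array([2.0, -1.0, 0.5, 1.0])        # arbitrary initialization
eta = 0.5

for it in range(1, 21):
    p = softmax(theta)
    g = np.log(p / q)
    grad = p * (g - np.sum(p * g))             # Euclidean gradient of KL in theta
    fisher = np.diag(p) - np.outer(p, p)       # Fisher information of the softmax model
    nat_grad = np.linalg.pinv(fisher) @ grad   # natural gradient G(theta)^+ grad
    theta = theta - eta * nat_grad
    if it % 5 == 0:
        p = softmax(theta)
        print(f"iter {it:2d}: KL(p_theta || q) = {np.sum(p * np.log(p / q)):.2e}")
```

For this particular objective, each iteration reduces, up to normalization, to the multiplicative update $p \leftarrow p^{\,1-\eta}\,q^{\,\eta}$, i.e., a discrete-time FR (mirror-descent) step toward $q$, which is why the iteration is insensitive to how the model is parametrized.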
In mean-field minimax games, FR flows ensure global convergence to mixed Nash equilibria under entropy regularization, with explicit Lyapunov rates and robustness to multimodality (Lascu et al., 24 May 2024). In Markov Decision Processes (MDPs), the FR policy gradient flow admits global existence and exponential convergence to entropy-regularized optima, even in uncountable Polish spaces (Kerimkulov et al., 2023, Müller et al., 28 Mar 2024). The FR flow is the continuous-time limit of the natural policy gradient, and it realizes the entropic central path of linear programs with sharp gap estimates (Müller et al., 28 Mar 2024).
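In the unregularized tabular softmax case, for instance, natural policy gradient (a time discretization of this flow) takes the familiar multiplicative form
$$\pi_{k+1}(a\mid s) \;\propto\; \pi_k(a\mid s)\,\exp\!\Big(\tfrac{\eta}{1-\gamma}\,A^{\pi_k}(s,a)\Big),$$
mirroring the birth–death reweighting of the FR flow at the level of the conditional distributions $\pi(\cdot\mid s)$.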
5. Structural Properties, Functional Inequalities, and Accelerated Flows
Functional inequalities specific to the FR metric enable rigorous convergence guarantees. For a wide class of $f$-divergences, dual gradient-dominance (Polyak–Łojasiewicz) inequalities hold with constants independent of the target $\pi$. This leads directly to uniform exponential decay of these $f$-divergences under the FR gradient flow, for generators $f$ satisfying suitable convexity and growth conditions; the KL case is excluded from such convexity conditions, though the symmetrized KL still decays exponentially (Carrillo et al., 22 Jul 2024).
Accelerated Fisher–Rao flows, in analogy with Nesterov acceleration, achieve improved rates, of order $e^{-\sqrt{\lambda}\,t}$ rather than $e^{-\lambda t}$, when the energy is $\lambda$-strongly convex in the FR geometry. Damping terms arise through auxiliary momentum potentials and “damped Hamiltonian” flows (Wang et al., 2019).
6. Applications and Impact in Modern Computational Sciences
Fisher–Rao gradient flows underpin algorithms in:
- Sampling: Birth–death samplers, kernelized and particle-based gradient flows, scalable to high dimensions and robust to multimodality (Maurais et al., 8 Jan 2024, Chen et al., 25 Jun 2024).
- Bayesian inverse problems: Efficient, derivative-free samplers (Gaussian Mixture Kalman Inversion) exploiting FR splitting and moment-matching (Chen et al., 25 Jun 2024).
- Filtering: Fisher–Rao proximal recursions recover the Kalman–Bucy filter in linear-Gaussian models and clarify the geometric interpretation of optimal filtering updates (Halder et al., 2017).
- Generative modeling: FR flows for MMD/χ² losses in flow-matching/score-based models, backward ODE-based generation, and accelerated loss landscape navigation (Zhu et al., 27 Oct 2024).
- Multi-objective optimization: Wasserstein–Fisher–Rao flows for Pareto-front particle methods combine rapid mass relocation (FR) and transport, crucial for navigating nonconvex solution sets (Ren et al., 2023).
- Learning parametric and nonparametric mixtures: Joint location/weight updates under WFR geometry yield optimally convergent mixture learning algorithms outperforming traditional heuristics (Yan et al., 2023).
7. Connections, Generalizations, and Future Directions
The FR geometry is interwoven with:
- Wasserstein and broader optimal transport frameworks (WFR/KFR metrics), enabling blending of reaction and transport for modeling “unbalanced” mass dynamics (Crucinio et al., 22 Nov 2025, Gallouët et al., 2016).
- Kernel methods, MMD, and Stein discrepancies, where kernelized FR flows provide bridges between information geometry and reproducing kernel Hilbert space (RKHS)-based statistics (Zhu et al., 27 Oct 2024).
- Functional approximation and neural parameterizations: Energy-dissipation balances (Helmholtz–Rayleigh principle) unify regression objectives, gradient flows, and neural-skewed updates in score/velocity learning (Zhu et al., 27 Oct 2024).
- Operator splitting and time discretization: Proximal algorithms, JKO-type schemes, and alternating minimization allow rigorously controlled and tractable numerical methods, with tightly characterized error and stability properties (Crucinio et al., 22 Nov 2025, Gallouët et al., 2016).
In summary, Fisher–Rao gradient flows constitute a foundational, algorithmically tractable, and theoretically robust framework for natural gradient learning, global sampling, non-Euclidean optimization, and informed operator splitting on the space of probability distributions. Their uniform convergence rates, geometric consistency, and flexibility in hybrid geometric settings (e.g., WFR, KFR) make them indispensable in modern statistical learning, applied mathematics, and computational sciences.