Wasserstein–Fisher–Rao Gradient Flows

Updated 23 March 2026

Wasserstein–Fisher–Rao gradient flows are a unified framework that combines optimal transport (Wasserstein) with mass variation (Fisher–Rao) to analyze positive measures.
They employ coupled transport-reaction PDEs and operator splitting methods to achieve rigorous convergence and robust performance in high-dimensional settings.
Applications include generative modeling, Bayesian inference, Gaussian mixture estimation, and multi-objective optimization, all supported by strong theoretical guarantees.

The Wasserstein–Fisher–Rao (WFR) gradient flow is a mathematical framework and algorithmic paradigm that unifies optimal transport (Wasserstein) geometry with mass variation (Fisher–Rao) for analysis and computation on the space of positive measures. WFR gradient flows are characterized by coupled transport-reaction partial differential equations (PDEs) whose geometry interpolates between classical Wasserstein transport and Fisher–Rao birth–death dynamics. These flows play foundational roles in generative modeling, Bayesian inference, multi-objective optimization, Gaussian mixture model estimation, and particle-based inference, and are supported by a rigorous geometric, operator-theoretic, and algorithmic literature (Rahimi, 19 Dec 2025, Ren et al., 2023, Crucinio et al., 6 Jun 2025, Crucinio et al., 22 Nov 2025, Zhu, 2024, Yan et al., 2023, Mao et al., 19 Jan 2026, Chemseddine et al., 2024).

1. Geometric Formulation and the WFR Distance

The WFR metric, also referred to as the spherical Hellinger–Kantorovich distance, arises from a dynamical (Benamou–Brenier) variational principle on the space of positive measures or densities. For curves of densities $(\rho_t)$ on $\mathbb{R}^d$, the squared WFR distance between endpoints $\rho_0$ and $\rho_1$ is defined as:

$D_{\rm WFR}^2(\rho_0, \rho_1) = \inf_{(\rho_t, v_t, \alpha_t)} \int_0^1 \int_{\mathbb{R}^d} \left( \|v_t(x)\|^2 + \beta \alpha_t(x)^2 \right) \rho_t(x) \, dx \, dt$

subject to the continuity–reaction constraint $\partial_t \rho_t(x) + \nabla \cdot (\rho_t(x) v_t(x)) = \rho_t(x) \alpha_t(x)$, with prescribed endpoints $\rho_{t=0} = \rho_0$, $\rho_{t=1} = \rho_1$, and with $\beta > 0$ setting the cost of mass variation (Rahimi, 19 Dec 2025, Ren et al., 2023, Yan et al., 2023). Here, $v_t(x)$ represents the transport (Wasserstein) velocity and $\alpha_t(x)$ encodes localized growth or decay (Fisher–Rao reaction rate).

If $\alpha_t\equiv 0$, the metric reduces to the $W_2$ Wasserstein metric; if $v_t\equiv 0$, it becomes the Fisher–Rao metric (Ren et al., 2023, Yan et al., 2023). Analogous dynamic and Riemannian constructions exist for discrete spaces, such as finite Markov chains (Mao et al., 19 Jan 2026).
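
As a worked illustration of the Fisher–Rao limit (a standard computation under stated assumptions, not taken from the cited papers), consider paths that only rescale a fixed unit-mass profile $\bar\rho$: $\rho_t = m(t)\,\bar\rho$ with $v_t \equiv 0$ and $m(0) = m_0$, $m(1) = m_1$. The constraint then forces a spatially uniform rate $\alpha_t = \dot m(t)/m(t)$, and the action becomes

$\int_0^1 \beta \left( \frac{\dot m}{m} \right)^2 m \, dt = \beta \int_0^1 \frac{\dot m^2}{m} \, dt = 4\beta \int_0^1 \dot s^2 \, dt, \qquad s = \sqrt{m}.$

The last integral is minimized by the linear interpolation $s(t) = (1-t)\sqrt{m_0} + t\sqrt{m_1}$, so the cost restricted to this family of paths is

$4\beta \left( \sqrt{m_1} - \sqrt{m_0} \right)^2,$

which shows the Hellinger-type dependence on the square root of the mass and the role of $\beta$ as the price of creating or destroying mass.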

2. PDEs and Gradient Flow Structure

Given a smooth energy functional $\mathcal{E}[\rho]$, its gradient flow in the WFR geometry couples a Wasserstein component with a Fisher–Rao component. In strong form:

$\partial_t \rho_t = \nabla \cdot \left( \rho_t \nabla \frac{\delta \mathcal{E}}{\delta \rho} \right) -\rho_t \left( \frac{\delta \mathcal{E}}{\delta \rho} - \int \frac{\delta \mathcal{E}}{\delta \rho} \rho_t \right)$

which cleanly separates into transport and reaction terms (Rahimi, 19 Dec 2025, Ren et al., 2023, Crucinio et al., 6 Jun 2025, Zhu, 2024, Crucinio et al., 22 Nov 2025, Yan et al., 2023). The PDE may equivalently be written as a continuity equation with source, $\partial_t\rho_t + \nabla \cdot (\rho_t v_t) = \rho_t \alpha_t$, where $v_t(x) = -\nabla \frac{\delta \mathcal{E}}{\delta \rho}(x)$ and $\alpha_t(x) = - \left( \frac{\delta \mathcal{E}}{\delta \rho}(x) - c_t \right)$ for a suitable mean adjustment $c_t$ (Yan et al., 2023).

For prominent functionals:

KL divergence $\mathrm{KL}(\rho \| \pi)$:

$\partial_t \rho_t = \nabla \cdot (\rho_t \nabla \log \tfrac{\rho_t}{\pi}) - \rho_t \left( \log \tfrac{\rho_t}{\pi} - \int \log \tfrac{\rho_t}{\pi} \rho_t \right)$

Negative entropy (discrete case):

$\partial_t \mu_t = b^{-2} \Delta \mu_t - a^{-2} \langle \log \mu_t, p \rangle_\phi \, p$

(Mao et al., 19 Jan 2026).

The coupled transport–reaction structure allows for both mass redistribution and mass change, vital for overcoming barrier-limited mixing and supporting unbalanced measure dynamics.
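
The reaction part of the KL flow above can be integrated in closed form: dropping the transport term, $\log(\rho_t/\pi)$ decays like $e^{-t}$ up to an additive constant, so $\rho_t \propto \rho_0^{e^{-t}} \pi^{1-e^{-t}}$. The following is a minimal sketch of that exact Fisher–Rao step on a 1D grid. The grid, the bimodal target, and the helper names (`normal_pdf`, `kl_divergence`, `fisher_rao_step`) are illustrative assumptions, not part of the cited works.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Gaussian density, used only to build the illustrative densities."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def kl_divergence(rho, pi, dx):
    """Riemann-sum approximation of KL(rho || pi) for strictly positive grids."""
    return float(np.sum(rho * np.log(rho / pi)) * dx)

def fisher_rao_step(rho, pi, tau, dx):
    """Exact time-tau solution of the pure reaction part of the KL flow:
    rho_new is proportional to rho^{exp(-tau)} * pi^{1 - exp(-tau)}."""
    w = np.exp(-tau)
    log_new = w * np.log(rho) + (1.0 - w) * np.log(pi)
    log_new -= log_new.max()            # stabilise before exponentiating
    new = np.exp(log_new)
    return new / (new.sum() * dx)       # renormalise to unit mass

# Assumed test problem: bimodal target, initial density sitting in one mode.
x = np.linspace(-8.0, 8.0, 801)
dx = x[1] - x[0]
pi = 0.5 * normal_pdf(x, -3.0, 0.7) + 0.5 * normal_pdf(x, 3.0, 0.7)
rho = normal_pdf(x, -3.0, 0.7)
rho /= rho.sum() * dx

kl_values = [kl_divergence(rho, pi, dx)]
for _ in range(10):
    rho = fisher_rao_step(rho, pi, tau=1.0, dx=dx)
    kl_values.append(kl_divergence(rho, pi, dx))
mass = rho.sum() * dx
```

In this sketch the second mode is populated by reweighting alone, with no mass having to cross the low-density region between the modes, which is the mechanism the paragraph above refers to.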

3. Numerical Algorithms and Particle Systems

The WFR gradient flow admits systematic discretization and stochastic implementations:

Operator splitting/Lie–Trotter discretization alternates pure Wasserstein (Langevin-type) and pure Fisher–Rao (birth–death/importance weighting) steps. The resulting schemes interpolate between overdamped diffusion and mass reweighting, with rigorous $O(\tau)$ first-order convergence as the step-size $\tau \to 0$ (Crucinio et al., 6 Jun 2025, Crucinio et al., 22 Nov 2025, Ren et al., 2023).
Weighted stochastic differential equations (SDEs) / Feynman–Kac representation: Particle ensembles $(X_t, \omega_t)$ evolve via

$dX_t = -\nabla \varphi_t(X_t) dt + \sqrt{2} dW_t, \quad d\omega_t = \alpha_t(X_t) \omega_t dt$

where $\omega_t$ is a multiplicative weight, and the mean empirical measure $\mathbb{E}[\omega_t \delta_{X_t}]$ recovers the WFR-flow solution (Rahimi, 19 Dec 2025).

Sequential Monte Carlo (SMC–WFR): Alternates Langevin moves and exact birth–death reweighting with statistical resampling, directly approximating the infinitesimal evolution. SMC–WFR provably outperforms more basic birth–death–Langevin algorithms in challenging multimodal settings (Crucinio et al., 6 Jun 2025).
Particle-based interacting systems: Each particle has both position and weight, updated by coupled ODEs induced by the flow, e.g., for the Gaussian mixture NPMLE,

$\dot{\mu}_t^{(j)} = -\nabla \delta \ell_N(\rho_t)(\mu_t^{(j)}), \quad \dot{\omega}_t^{(j)} = -[\delta\ell_N(\rho_t)(\mu_t^{(j)}) + 1] \omega_t^{(j)}$

(Yan et al., 2023, Ren et al., 2023).

Birth–death sampling / importance weighting: Pure reaction steps reweight or replicate particles, enabling rapid relocation of mass and efficient exploration of disconnected or barrier-separated regions.
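
The weighted SDE representation above can be sketched with an Euler–Maruyama step for the positions and an exponential update for the weights. This is a minimal illustration, not the scheme of any cited paper: the function names `weighted_sde` and `resample`, the callable signatures `grad_phi(t, x)` and `alpha(t, x)`, and the multinomial resampling rule are all assumptions made for the example.

```python
import numpy as np

def weighted_sde(x0, grad_phi, alpha, tau, n_steps, rng):
    """Simulate dX = -grad_phi dt + sqrt(2) dW and d(log w) = alpha dt.

    x0 has shape (n, d); grad_phi(t, x) returns (n, d); alpha(t, x) returns (n,).
    Returns final positions and log-weights.
    """
    x = np.array(x0, dtype=float)
    log_w = np.zeros(x.shape[0])
    for k in range(n_steps):
        t = k * tau
        log_w += tau * alpha(t, x)                      # reaction (Fisher-Rao) part
        noise = rng.standard_normal(x.shape)
        x = x - tau * grad_phi(t, x) + np.sqrt(2.0 * tau) * noise  # transport part
    return x, log_w

def resample(x, log_w, rng):
    """Multinomial resampling: duplicate heavy particles, drop light ones."""
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    idx = rng.choice(x.shape[0], size=x.shape[0], p=w)
    return x[idx], np.zeros(x.shape[0])

rng = np.random.default_rng(0)
x0 = np.full((4000, 1), 3.0)

# With alpha = 0 the scheme reduces to plain Langevin dynamics for phi(x) = |x|^2 / 2.
x_plain, log_w_plain = weighted_sde(
    x0, lambda t, x: x, lambda t, x: np.zeros(len(x)), tau=0.01, n_steps=500, rng=rng
)

# A hypothetical reaction rate that penalises particles far from the origin.
x_w, log_w = weighted_sde(
    x0, lambda t, x: x, lambda t, x: -0.5 * (x[:, 0] ** 2 - 1.0),
    tau=0.01, n_steps=50, rng=rng
)
x_res, log_w_res = resample(x_w, log_w, rng)
```

Working with log-weights avoids underflow when the reaction rate is strongly negative, and resampling converts the weighted ensemble back into an equally weighted one, which is the birth–death interpretation of the reaction step.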

4. Theoretical Guarantees and Operator Theory

WFR gradient flows inherit the contractivity and convergence guarantees of strongly geodesically convex energies:

Energy decay and exponential convergence: For convex or strictly convex energy functionals, WFR flows dissipate energy monotonically and converge exponentially fast to the minimizer (unique under strict convexity), even in the presence of reaction (Ren et al., 2023, Mao et al., 19 Jan 2026, Crucinio et al., 22 Nov 2025, Yan et al., 2023).
Spectral gap enhancement: Analysis of the linearized generator $\mathcal{A} = \mathcal{L} + \alpha - \overline{\alpha}$ (with $\mathcal{L}$ the diffusive component) reveals that mass-weighted reaction can strictly increase the spectral gap, leading to accelerated mixing when reaction is targeted to slow, metastable modes (Rahimi, 19 Dec 2025).
Log-concavity preservation: WFR flows preserve (and sometimes enhance) log-concavity along trajectories given suitable conditions on initial and target distributions (Crucinio et al., 22 Nov 2025).

For discrete spaces, the WFR-type geometry induces a reaction–diffusion equation with explicit exponential convergence to equilibrium, controlled by a discrete spectral gap and a Łojasiewicz inequality (Mao et al., 19 Jan 2026).
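
The monotone energy decay can be checked directly from the PDE of Section 2. As a sketch, assume $\rho_t$ is a smooth probability density, boundary terms vanish under integration by parts, and write $\xi_t = \frac{\delta \mathcal{E}}{\delta \rho}$ evaluated at $\rho_t$. Then

$\frac{d}{dt} \mathcal{E}[\rho_t] = \int \xi_t \, \partial_t \rho_t \, dx = - \int \|\nabla \xi_t\|^2 \rho_t \, dx - \left( \int \xi_t^2 \rho_t \, dx - \left( \int \xi_t \rho_t \, dx \right)^2 \right) \le 0.$

The first term is the usual Wasserstein dissipation and the second is the variance of $\xi_t$ under $\rho_t$, contributed by the Fisher–Rao reaction. Both are nonnegative, so adding the reaction term can only increase the dissipation rate relative to the pure Wasserstein flow at the same state.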

5. Algorithmic Splitting, Acceleration, and Practical Guidelines

Operator splitting theory provides a family of numerical schemes with direct practical significance:

Ordering and speed-up: Applying the Wasserstein substep before the Fisher–Rao (W–FR) accelerates convergence when the target is more diffuse; the reverse (FR–W) is preferable when the target is more concentrated. This "splitting bias" can yield up to 40% speed-up in model time on Gaussian problems (Crucinio et al., 22 Nov 2025).
Perturbed dynamics: Each splitting variant corresponds to a well-defined perturbed PDE, admitting explicit error analysis and sometimes beneficial bias for moderate step sizes.
Practical recommendations: For a fixed computational budget, select the splitting order according to the initial and target variances; step sizes of order $O(1)$ maximize the benefit of the splitting bias early on, while small steps recover the continuous-time WFR flow at leading order (Crucinio et al., 22 Nov 2025).
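
The effect of the splitting order can be explored in closed form for one-dimensional Gaussians. The sketch below assumes a standard normal target and tracks only the mean and variance of the current Gaussian. The Wasserstein substep is the exact Ornstein–Uhlenbeck flow, and the Fisher–Rao substep is the exact geometric interpolation $\rho^{e^{-\tau}} \pi^{1-e^{-\tau}}$. The function names and the initial condition are illustrative choices, not the experimental setup of the cited paper.

```python
import numpy as np

def kl_to_std_normal(m, s2):
    """KL( N(m, s2) || N(0, 1) )."""
    return 0.5 * (s2 + m ** 2 - 1.0 - np.log(s2))

def wasserstein_step(m, s2, tau):
    """Exact Fokker-Planck (Ornstein-Uhlenbeck) flow toward N(0, 1) for time tau."""
    return m * np.exp(-tau), 1.0 + (s2 - 1.0) * np.exp(-2.0 * tau)

def fisher_rao_step(m, s2, tau):
    """Exact Fisher-Rao flow of KL: geometric mixture of current density and target."""
    w = np.exp(-tau)
    precision = w / s2 + (1.0 - w)       # target precision is 1
    return (w * m / s2) / precision, 1.0 / precision

def run_splitting(order, m, s2, tau, n_steps):
    """Apply the two substeps in the given order and record KL after each full step."""
    substeps = {"W": wasserstein_step, "FR": fisher_rao_step}
    first, second = substeps[order[0]], substeps[order[1]]
    kls = [kl_to_std_normal(m, s2)]
    for _ in range(n_steps):
        m, s2 = first(m, s2, tau)
        m, s2 = second(m, s2, tau)
        kls.append(kl_to_std_normal(m, s2))
    return np.array(kls)

# Assumed initial condition: a concentrated Gaussian away from the target mean.
kl_w_fr = run_splitting(("W", "FR"), m=2.0, s2=0.25, tau=0.1, n_steps=100)
kl_fr_w = run_splitting(("FR", "W"), m=2.0, s2=0.25, tau=0.1, n_steps=100)
```

Comparing `kl_w_fr` and `kl_fr_w` entry by entry for different initial variances and step sizes `tau` gives a quick way to see how the ordering interacts with the relative spread of the initial and target distributions.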

6. Applications and Numerical Experiments

WFR gradient flows have been applied in diverse domains:

Generative modeling: Weighted SDE samplers outperform pure diffusion in non-log-concave, multimodal targets, with mass reweighting (via WFR) mitigating metastability and improving mixing (Rahimi, 19 Dec 2025).
Multi-objective optimization: Interacting particle WFR flows efficiently find Pareto fronts in nonconvex and disconnected multi-objective landscapes, surpassing weighted sum or SVGD methods (Ren et al., 2023).
Gaussian mixture NPMLE: WFR-flow-based algorithms evolve both particle positions and weights, outperforming expectation–maximization and pure-Wasserstein flow on difficult instances by robustly escaping local traps (Yan et al., 2023).
Bayesian inference and kernelized flows: Connections to Wasserstein MMD-flow, KSD-flow, and interaction-force transport algorithms, viewed as kernelized or particle approximations of WFR dynamics, provide a unified theoretical underpinning (Zhu, 2024).
Bayesian sampling and SMC: SMC–WFR algorithms demonstrate statistical superiority over previous birth–death–Langevin methods, combining accurate transport with resampling (Crucinio et al., 6 Jun 2025).
Discrete Markov settings: Benamou–Brenier with source structures yield a coherent discrete WFR geometry, providing geometric interpretation for generalized diffusion equations with source terms and sharp exponential convergence (Mao et al., 19 Jan 2026).
Continuous interpolation and sampling: WFR and Fisher–Rao curves in Wasserstein geometry avoid teleportation pathologies of naive linear interpolation and provide well-behaved flows amenable to neural optimization and normalizing flow style samplers (Chemseddine et al., 2024).
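
For the Gaussian mixture NPMLE item above, the coupled position and weight ODEs of Section 3 can be discretized with a simple explicit step. The sketch below makes several assumptions that are not spelled out in this article: one-dimensional data, a unit-variance Gaussian kernel $\phi$, the loss $\ell_N(\rho) = -\frac{1}{N}\sum_i \log (\rho * \phi)(X_i)$ with first variation $\delta\ell_N(\rho)(\mu) = -\frac{1}{N}\sum_i \phi(X_i - \mu)/(\rho * \phi)(X_i)$, and a final weight renormalization added as a numerical safeguard. All function names are hypothetical.

```python
import numpy as np

def gauss(z):
    """Unit-variance Gaussian kernel (assumed component density)."""
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)

def neg_log_lik(data, mu, w):
    """Average negative log-likelihood of the mixture sum_j w_j N(mu_j, 1)."""
    mix = gauss(data[:, None] - mu[None, :]) @ w
    return float(-np.mean(np.log(mix)))

def wfr_npmle_step(data, mu, w, eta):
    """One explicit step of the position (Wasserstein) and weight (Fisher-Rao) ODEs."""
    diff = data[:, None] - mu[None, :]          # shape (N, m)
    k = gauss(diff)
    mix = k @ w                                 # mixture density at each data point
    ratio = k / mix[:, None]
    first_var = -ratio.mean(axis=0)             # first variation at each particle
    grad = -(ratio * diff).mean(axis=0)         # its gradient in the particle position
    mu_new = mu - eta * grad
    w_new = w * np.exp(-eta * (first_var + 1.0))
    return mu_new, w_new / w_new.sum()          # renormalise (safeguard, an assumption)

# Synthetic data from two unit-variance clusters (illustrative only).
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(2.0, 1.0, 200)])

mu = np.linspace(-4.0, 4.0, 10)
w = np.full(10, 0.1)
nll_start = neg_log_lik(data, mu, w)
for _ in range(200):
    mu, w = wfr_npmle_step(data, mu, w, eta=0.1)
nll_end = neg_log_lik(data, mu, w)
```

The weight update is multiplicative, so weights stay positive, and particles in regions the data do not support lose mass instead of having to travel to a better location.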

7. Extensions, Variants, and Open Directions

Inclusive KL gradient flows: WFR geometry is also the natural setting for minimizing the inclusive KL divergence $\mathrm{KL}(\pi\| \mu)$ (forward mode), admitting unique gradient flows, particle approximations, and explicit exponential decay (Zhu, 2024).
Tempered flows and annealing: Tempering or annealing of the target in WFR does not, in continuous time, accelerate convergence relative to the untempered flow. Discrete SMC–WFR implementations using exact FR updates remain competitive (Crucinio et al., 6 Jun 2025).
Absolute continuity and geometry of curves: Recent results clarify the precise conditions under which Fisher–Rao curves of Boltzmann densities are absolutely continuous in Wasserstein geometry, with explicit velocity construction via global Poisson equations (Chemseddine et al., 2024).

Ongoing research explores robustness in nonconvex and high-dimensional regimes, theoretically optimal splitting schedules, and deeper connections to information geometry, sampling algorithms, and statistical learning.

Key References: