
Wasserstein-Fisher-Rao Gradient Flows

Updated 22 November 2025
  • Wasserstein-Fisher-Rao gradient flows form a unified framework that integrates mass transport with mass creation or destruction, enabling analysis in unbalanced settings.
  • They extend classical optimal transport by incorporating Fisher–Rao geometry, effectively handling measures with varying total mass through combined PDE dynamics.
  • State-of-the-art applications utilize splitting schemes and particle approximations, yielding exponential convergence and robust performance in high-dimensional inference and optimization tasks.

Wasserstein-Fisher-Rao (WFR) gradient flows constitute a mathematical and computational framework for optimization, sampling, and inference on the space of probability measures endowed with a metric that simultaneously accounts for mass transport (Wasserstein geometry) and mass creation or annihilation (Fisher–Rao geometry). The WFR distance—also known as the Hellinger–Kantorovich metric—extends the classical transport-based geometry to "unbalanced" settings where measures may have differing total mass, redistributing mass through both spatial and amplitude dynamics. This hybrid metric underpins a large and growing class of algorithms, theoretical analyses, and PDE-driven models for high-dimensional learning, multi-objective optimization, generative modeling, density estimation, and statistical inference.

1. Mathematical Definition of the Wasserstein-Fisher-Rao Metric

The WFR metric interpolates between the classical $L^2$-Wasserstein and Fisher–Rao geometries. Let $\mu_0$, $\mu_1$ be absolutely continuous probability densities on $\mathbb{R}^d$. The dynamic (Benamou–Brenier) form is:

$$\mathrm{WFR}^2(\mu_0,\mu_1) = \inf_{(\rho_t, v_t, \alpha_t)_{t\in[0,1]}} \int_0^1 \int_{\mathbb{R}^d} \left( \bigl|v_t(x)\bigr|^2 + \alpha_t(x)^2 \right) \rho_t(x)\,dx\,dt$$

subject to the unbalanced continuity equation:

$$\partial_t \rho_t + \nabla \cdot (\rho_t v_t) = \rho_t \alpha_t, \qquad \rho_{t=0} = \mu_0, \quad \rho_{t=1} = \mu_1$$

Transport is effected by $v_t$, while $\alpha_t$ permits pointwise mass creation/destruction. This metric induces a Riemannian structure: tangent vectors at $\rho$ are of the form $-\nabla\cdot(\rho v) + \rho\alpha$, with the inner product

$$\langle (v_1,\alpha_1), (v_2,\alpha_2)\rangle_{T_\rho} = \int v_1\cdot v_2\,\rho + \int \alpha_1 \alpha_2\,\rho$$
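As a concrete illustration (our own sketch, not from the cited papers), the tangent-space norm $\int (|v|^2 + \alpha^2)\,\rho\,dx$ can be approximated on a one-dimensional grid; the fields `v` and `alpha` below are arbitrary choices whose Gaussian moments are known in closed form:

```python
import numpy as np

# Discretize a 1-D standard Gaussian density rho together with a velocity
# field v and a reaction field alpha, then approximate the WFR tangent norm
#   ||(v, alpha)||^2_{T_rho} = ∫ (|v|^2 + alpha^2) rho dx
# by a Riemann sum on a uniform grid.
x = np.linspace(-5.0, 5.0, 2001)
dx = x[1] - x[0]
rho = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

v = x             # example transport component: E_rho[x^2] = 1
alpha = x**2 - 1  # example reaction component:  E_rho[(x^2 - 1)^2] = 2

wfr_norm_sq = np.sum((v**2 + alpha**2) * rho) * dx
print(wfr_norm_sq)  # ≈ 3.0
```

The two contributions separate cleanly: transport energy weighted by $\rho$ plus reaction energy weighted by $\rho$, which is exactly the inner product above applied to $(v,\alpha)$ with itself.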

These structures have been closely studied in the context of both abstract measure theory and applied optimization (Crucinio et al., 6 Jun 2025, Yan et al., 2023).

2. Gradient Flow PDEs in WFR Geometry

Given a functional $F$ on the space of measures, the steepest descent under WFR yields the evolution:

$$\partial_t \rho_t = -\operatorname{grad}_{\mathrm{WFR}} F(\rho_t) = \nabla \cdot \left(\rho_t \nabla \frac{\delta F}{\delta\rho}\right) - \rho_t \frac{\delta F}{\delta\rho}$$

For the Kullback–Leibler divergence $F(\rho) = \mathrm{KL}(\rho\,\|\,\pi)$, this produces the PDE:

$$\partial_t\mu_t = \nabla\cdot\left(\mu_t \nabla\log\frac{\mu_t}{\pi}\right) + \mu_t\left(\log\frac{\pi}{\mu_t} - \mathbb{E}_{\mu_t}\left[\log\frac{\pi}{\mu_t}\right]\right)$$

The first (diffusion) term is the Wasserstein gradient flow; the second (logistic) term is the Fisher-Rao birth-death component. The WFR flow thus combines mass transport (Wasserstein) with coordinated reaction (Fisher-Rao). These dynamics have been foundational in sampling, Bayesian computation, and learning algorithms (Crucinio et al., 6 Jun 2025, Yan et al., 2023, Zhu, 31 Oct 2024).
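A quick consistency check (our own, under standard decay-at-infinity assumptions): integrating the KL-flow PDE over $\mathbb{R}^d$ shows that the dynamics preserve total probability mass, since the transport term is a divergence and the mean-centered reaction term integrates to zero:

```latex
\frac{d}{dt}\int \mu_t\,dx
  = \int \nabla\cdot\Bigl(\mu_t \nabla\log\tfrac{\mu_t}{\pi}\Bigr)dx
  + \int \mu_t\Bigl(\log\tfrac{\pi}{\mu_t} - \mathbb{E}_{\mu_t}\log\tfrac{\pi}{\mu_t}\Bigr)dx
  = 0 + \Bigl(\mathbb{E}_{\mu_t}\log\tfrac{\pi}{\mu_t} - \mathbb{E}_{\mu_t}\log\tfrac{\pi}{\mu_t}\Bigr)
  = 0.
```

This is why the mean $\mathbb{E}_{\mu_t}\log\frac{\pi}{\mu_t}$ is subtracted: it projects the raw Fisher–Rao reaction onto the space of mass-preserving perturbations, keeping $\mu_t$ a probability density.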

3. Discretization and Particle Algorithms

Efficient numerical schemes for WFR flows utilize splitting (a "transport" and "reaction" step per iteration) and interacting particle approximations.

A prototypical splitting scheme (JKO/minimizing-movement) computes:

  • Wasserstein (OT) step: $\rho^{n+1/2} = \arg\min_\rho \left\{ \frac{1}{2\tau} W_2^2(\rho, \rho^n) + F(\rho) \right\}$
  • Fisher–Rao (reaction) step: $\rho^{n+1} = \arg\min_\rho \left\{ \frac{1}{2\tau} \mathrm{FR}^2(\rho, \rho^{n+1/2}) + F(\rho) \right\}$

In particle-based settings, $\rho_t$ is approximated by a weighted empirical measure and the updates alternate between moving particle locations (W-step) and updating weights (FR-step). For KL-divergence flows, location updates correspond to Langevin diffusion, while weights are adjusted according to local log-likelihood ratios (Crucinio et al., 6 Jun 2025, Yan et al., 2023).
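A minimal sketch of this alternating scheme, assuming a one-dimensional standard-Gaussian target $\pi$. The step size, particle count, crude Gaussian fit used to estimate $\log\mu_t$, and the resampling rule are our own illustrative choices, not the tuned algorithms of the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_pi(x):
    """Unnormalized log-density of the target (standard Gaussian)."""
    return -0.5 * x**2

def grad_log_pi(x):
    return -x

# Weighted particles initialized away from the target mode.
m = 2000
x = rng.normal(loc=2.0, scale=1.0, size=m)
w = np.full(m, 1.0 / m)

tau = 0.05  # splitting step size
for _ in range(200):
    # W-step: unadjusted Langevin move (discretized Wasserstein transport).
    x = x + tau * grad_log_pi(x) + np.sqrt(2.0 * tau) * rng.normal(size=m)

    # FR-step: reweight by the local log-ratio log(pi/mu) and renormalize;
    # this discretizes the mean-centered birth-death reaction term.
    # log mu is estimated crudely by a Gaussian fit to the current particles.
    mean = np.average(x, weights=w)
    var = max(np.average((x - mean) ** 2, weights=w), 1e-3)
    log_mu = -0.5 * (x - mean) ** 2 / var - 0.5 * np.log(var)
    w = w * np.exp(tau * (log_pi(x) - log_mu))
    w = w / w.sum()

    # Resample when the effective sample size degrades (as in SMC variants).
    if 1.0 / np.sum(w**2) < m / 2:
        x = rng.choice(x, size=m, replace=True, p=w)
        w = np.full(m, 1.0 / m)

# Weighted moments should approach those of pi: mean 0, second moment 1.
print(np.average(x, weights=w), np.average(x**2, weights=w))
```

The W-step moves particle locations and the FR-step shifts mass between them, mirroring the two terms of the KL-flow PDE; in practice the density estimate for $\log\mu_t$ is the delicate component (kernel or parametric estimators are common choices).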

Such approaches have proved effective for high-dimensional mixture modeling and for sampling from complex densities where pure transport or pure reaction methods fail due to mode collapse or insufficient exploration.

4. Theoretical Guarantees and Convergence Properties

For functionals $F$ that are $\lambda$-geodesically convex in $(\mathcal{P}(\mathbb{R}^d), D_{\mathrm{WFR}})$, flows exhibit exponential convergence: $D_{\mathrm{WFR}}(\rho_t, \rho^*) \leq e^{-\lambda t} D_{\mathrm{WFR}}(\rho_0, \rho^*)$, and $F(\rho_t) \searrow F(\rho^*)$ at the same rate (Yan et al., 2023, Ren et al., 2023). This exponential decay persists for the inclusive KL divergence $\mathrm{KL}(\pi\,\|\,\mu)$ under very mild assumptions; no log-concavity or log-Sobolev condition is needed for global convergence (Zhu, 31 Oct 2024).
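The reaction component alone already exhibits this fast decay. The following sketch (our own, with an arbitrary four-state target) runs an explicit-Euler discretization of the Fisher–Rao birth-death ODE on a finite state space and tracks $\mathrm{KL}(\mu_t\|\pi)$:

```python
import numpy as np

# Explicit Euler for the Fisher-Rao (birth-death) flow of F = KL(mu || pi)
# on a finite state space:
#   d mu_i / dt = mu_i * ( log(pi_i / mu_i) - sum_j mu_j log(pi_j / mu_j) )
pi = np.array([0.5, 0.3, 0.15, 0.05])
mu = np.array([0.05, 0.15, 0.3, 0.5])  # start far from the target

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

dt = 0.01
kls = [kl(mu, pi)]
for _ in range(1000):
    r = np.log(pi / mu)
    mu = mu * (1.0 + dt * (r - np.sum(mu * r)))
    mu = mu / mu.sum()  # guard against Euler drift in total mass
    kls.append(kl(mu, pi))

print(kls[0], kls[-1])  # KL decays rapidly toward zero along the flow
```

Since $\frac{d}{dt}\mathrm{KL}(\mu_t\|\pi) = -\mathrm{Var}_{\mu_t}\bigl(\log\frac{\mu_t}{\pi}\bigr)$ along this flow, the objective is monotonically decreasing, and the decay is geometric near the fixed point $\mu = \pi$.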

The WFR dissipation rate is never slower than either the Wasserstein or Fisher–Rao components. In particle algorithms, law-of-large-numbers convergence holds, and splitting-scheme discretization yields controlled error of order $O(\tau)$ in the step size and $O(m^{-1/2})$ in the particle count (Crucinio et al., 6 Jun 2025, Yan et al., 2023).

5. Applications: Inference, Sampling, and Optimization

WFR gradient flows have been applied in multiple settings:

  • Nonparametric Maximum Likelihood for Gaussian Mixtures: Alternating updates in locations and weights outperform both pure-EM (Fisher-Rao) and Wasserstein descent, avoiding mode-dropping and bad local minima (Yan et al., 2023).
  • Monte Carlo Sampling: SMC–WFR algorithms achieve superior performance over birth–death–Langevin competitors for strongly multimodal/posterior targets (Crucinio et al., 6 Jun 2025).
  • Multi-Objective Optimization (MOO): WFR geometry underpins birth–death and transport-based particle methods that relocate dominated particles and adaptively populate Pareto fronts, even when they are disconnected or nonconvex (Ren et al., 2023).
  • Inclusive KL Inference: WFR flows not only provide a rigorous foundation for "inclusive" KL minimization but explain the efficacy and limitations of kernel-based flows (MMD, KSD, IFT) and why birth–death augmentation is essential for global support-finding and dimension-free exponential convergence (Zhu, 31 Oct 2024).
  • Sampling from Boltzmann Densities: Neural sampling dynamics based on WFR address issues with linear interpolations, such as velocity blow-up and "teleportation-of-mass," and produce stable flows via Fokker–Planck PDEs (Chemseddine et al., 4 Oct 2024).

6. Analytical Phenomena and Numerical Considerations

WFR flows reveal nontrivial pathologies for naive linear interpolation between densities (notably, "teleportation-of-mass" and velocity explosion near endpoints), as analyzed in the context of Boltzmann density sampling. Gradient-flow interpolations—where $v_t$ is aligned with the Wasserstein gradient of KL—exhibit uniform boundedness of the velocity field and superior statistical efficiency (ESS, NLL, energy distance) in high-dimensional mixtures and rugged landscapes (Chemseddine et al., 4 Oct 2024).

Empirically, tempered or annealed WFR flows, where the target is replaced with geometric mixtures, do not improve convergence in continuous time (Crucinio et al., 6 Jun 2025). Instead, the essential benefit comes from the joint leverage of transport and reaction: the ability to move and reweight particles simultaneously.

7. Conceptual and Algorithmic Synthesis

The WFR geometry provides a unified Riemannian framework for balancing mass transport and reaction, bridging optimal transport theory with information geometry. Its gradient flows underlie new classes of algorithms that robustly solve problems inaccessible to classical transport or Fokker–Planck flows alone.

Summary Table: Key Features of WFR Gradient Flows

| Feature | Wasserstein ($W_2$) | Fisher–Rao (FR) | Wasserstein–Fisher–Rao (WFR) |
|---|---|---|---|
| Mass transport | Yes | No | Yes |
| Mass creation/destruction | No | Yes | Yes |
| Geometry preserves mass | Yes | No | No |
| Convergence rate | Geometry-dependent | Exponential (PL inequality) | At least as fast as either |
| Example PDEs | Fokker–Planck/Langevin | Birth–death/logistic ODE | Combined PDE (see above) |

A plausible implication is that WFR flows will continue to catalyze progress in fields where both adaptation of support and redistribution of mass are critical, such as Bayesian inference with multi-modal posteriors, adaptive Monte Carlo, and large-scale nonparametric estimation (Zhu, 31 Oct 2024, Crucinio et al., 6 Jun 2025, Yan et al., 2023, Ren et al., 2023, Chemseddine et al., 4 Oct 2024).
