Wasserstein Gradient Flows
- Wasserstein gradient flows are defined as the evolution of probability measures via the steepest descent of energy functionals in the Wasserstein space, offering a clear variational framework for nonlinear PDEs.
- Their numerical realization relies on the JKO scheme, entropic regularization, and dual methods, which translate the underlying optimal transport problems into tractable algorithms.
- Applications include advanced machine learning, robust inference, and reinforcement learning, supported by rigorous links to large deviations, Gamma-convergence, and metric measure theory.
Wasserstein gradient flows provide a geometric and variational perspective for the evolution of probability measures, characterizing a wide class of nonlinear partial differential equations as the steepest descent of energy functionals with respect to the Wasserstein metric. This variational viewpoint connects optimal transport theory, large deviations, thermodynamic limits, and the design of stable and efficient numerical schemes for high-dimensional inference, machine learning, and computational physics.
1. Variational Characterization and the JKO Scheme
The canonical example of a Wasserstein gradient flow is the Fokker–Planck equation
$$\partial_t \rho = \Delta \rho + \nabla\cdot(\rho\,\nabla V),$$
which can be written as the gradient flow of the free energy
$$\mathcal F(\rho) = \int \rho\log\rho\,dx + \int V\rho\,dx$$
in the space of probability measures equipped with the Wasserstein-2 metric (Duong et al., 2012, Santambrogio, 2016). The foundational Jordan–Kinderlehrer–Otto (JKO) scheme discretizes the evolution by solving, for time step $\tau>0$,
$$\rho_{k+1} \in \arg\min_{\rho}\Big\{\mathcal F(\rho) + \frac{1}{2\tau}\,W_2^2(\rho,\rho_k)\Big\}.$$
This recursive minimization steps forward in time by selecting the probability density that best balances a decrease of the free energy against the transportation cost away from the previous iterate, giving each step the precise geometric meaning of a minimizing movement.
The minimization can often be equivalently reformulated in terms of transport maps due to Brenier's theorem, using convex analysis, or—especially in the discrete case—via Kantorovich duality, c-concave functions, or, in high dimensions, parameterization with input convex neural networks (ICNNs) (Mokrov et al., 2021).
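As a toy illustration (not drawn from the cited references), the following Python sketch carries out a single JKO step for the free energy above on a one-dimensional grid, exploiting the fact that the 1D $W_2$ distance can be evaluated exactly through inverse CDFs; the grid, potential, time step, and optimizer are all assumed for the example.

```python
import numpy as np
from scipy.optimize import minimize

# Toy single JKO step in 1D:
#   rho_{k+1} = argmin_rho  F(rho) + W_2^2(rho, rho_k) / (2*tau),
#   F(rho) = \int rho log rho dx + \int V rho dx.
# The 1D W_2 distance is evaluated exactly via inverse CDFs (quantile functions).
# Grid, potential V, step tau, and the optimizer are illustrative choices.

x = np.linspace(-4.0, 4.0, 100); dx = x[1] - x[0]
V = 0.5 * x**2                                   # confining potential
tau = 0.1                                        # JKO time step
s = np.linspace(1e-3, 1.0 - 1e-3, 200)           # quantile levels

def w2_squared_1d(p, q):
    """Exact squared W_2 between two 1D densities via their quantile functions."""
    qp = np.interp(s, np.cumsum(p) * dx, x)      # quantile function of p
    qq = np.interp(s, np.cumsum(q) * dx, x)      # quantile function of q
    return np.mean((qp - qq) ** 2)

def free_energy(p):
    return np.sum(p * np.log(p + 1e-12) + V * p) * dx

def jko_objective(theta, rho_prev):
    p = np.exp(theta); p /= p.sum() * dx         # softmax-style reparameterization
    return free_energy(p) + w2_squared_1d(p, rho_prev) / (2.0 * tau)

rho0 = np.exp(-(x - 1.5) ** 2); rho0 /= rho0.sum() * dx      # previous iterate rho_k
res = minimize(jko_objective, np.log(rho0 + 1e-12), args=(rho0,), method="L-BFGS-B")
rho1 = np.exp(res.x); rho1 /= rho1.sum() * dx                # one JKO iterate rho_{k+1}
```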
2. Large Deviations, Thermodynamic Limits, and Gamma-Convergence
A rigorous macroscopic–microscopic link is established between the JKO variational scheme and the large deviation rate functionals of the underlying stochastic particle systems (Duong et al., 2012). Specifically, hydrodynamic scaling of symmetric stochastic particles yields a large deviation rate functional which, over a time step $\tau$, takes the form
$$J_\tau(\rho\,;\rho_k)\;\approx\;\frac{1}{4\tau}\,W_2^2(\rho,\rho_k)+\frac{1}{2}\mathcal F(\rho)-\frac{1}{2}\mathcal F(\rho_k)$$
in the small-time limit, i.e., the JKO functional up to rescaling and an additive constant. Gamma-convergence ensures that the minimizers of the discrete-time large deviation rate functional converge to minimizers of the continuum Wasserstein gradient flow. This explicit bridge means that both the dissipative macroscopic evolution and the exponential rate of rare fluctuations are governed by the same free energy landscape within the Wasserstein geometry (Duong et al., 2012).
3. Numerical Methodologies: Entropic, Lagrangian, and Dual Approaches
Eulerian and Entropic Regularization
The numerical realization of the JKO minimization faces a bottleneck due to the high computational cost of solving high-dimensional optimal transport problems. Entropic regularization replaces the exact transport cost by a strictly convex surrogate,
$$W_\gamma(\mu,\nu)=\min_{\pi\in\Pi(\mu,\nu)}\int c\,d\pi-\gamma\,H(\pi),$$
where $H(\pi)=-\int \pi(\log\pi-1)$ is the entropy functional and $\gamma>0$. This reformulation casts the proximal JKO step as a KL-proximal operator, which can be efficiently solved by iterative scaling (e.g., Sinkhorn's algorithm) or KL-Dykstra algorithms, and exploits fast convolution (Gaussian filtering for the quadratic cost) or efficient heat-kernel multiplication on complex domains (Peyré, 2015).
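The following Python sketch illustrates one entropically regularized JKO step, $\min_\rho W_\gamma(\rho_k,\rho)+\tau\mathcal F(\rho)$, solved by generalized Sinkhorn-type scalings; the closed-form KL-proximal corresponds to the entropy-plus-potential free energy, and the grid, potential, and parameter values are assumptions for illustration rather than the exact setup of (Peyré, 2015).

```python
import numpy as np

# One entropically regularized JKO step on a 1D grid:
#   min_rho  W_gamma(rho_k, rho) + tau * F(rho),   F(rho) = sum rho*(log rho - 1) + sum V*rho,
# solved by generalized Sinkhorn scalings of the coupling pi = diag(a) K diag(b).
# Grid, potential, gamma, tau, and the iteration count are illustrative.

x = np.linspace(-4.0, 4.0, 200); dx = x[1] - x[0]
V = 0.5 * x**2
gamma, tau = 1e-2, 0.1
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / gamma)        # Gibbs kernel exp(-c/gamma)

def kl_prox(z, sigma):
    """Closed-form argmin_rho  sigma*(rho*log(rho) - rho + V*rho) + KL(rho | z)."""
    return z ** (1.0 / (1.0 + sigma)) * np.exp(-sigma * V / (1.0 + sigma))

rho_k = np.exp(-(x - 1.5) ** 2); rho_k /= rho_k.sum() * dx  # previous iterate
a, b = np.ones_like(x), np.ones_like(x)
for _ in range(500):                                        # scaling iterations
    a = rho_k / (K @ b + 1e-300)                            # enforce first marginal = rho_k
    Kta = K.T @ a
    rho = kl_prox(Kta, tau / gamma)                         # KL-prox on second marginal
    b = rho / (Kta + 1e-300)
rho_next = rho                                              # approximate JKO iterate rho_{k+1}
```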
Example: Crowd Motion with Congestion
For energies incorporating congestion constraints (e.g., a hard density cap $\rho\le\kappa$), the KL-proximal operator can be evaluated exactly, allowing the evolution of densities under both geometric and physical constraints.
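As a hedged illustration of this exactness, if the energy combines a linear potential term with the hard bound $\rho\le\kappa$ (an assumed form of the congestion constraint, not a transcription of the cited paper's model), the corresponding KL-proximal used in the scaling iterations above reduces to an elementwise clip:

```python
import numpy as np

def kl_prox_congestion(z, sigma, V, kappa=1.0):
    """argmin_{rho <= kappa}  sigma * <V, rho> + KL(rho | z):
    the unconstrained minimizer z*exp(-sigma*V), clipped at the density cap kappa."""
    return np.minimum(z * np.exp(-sigma * V), kappa)
```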
Lagrangian Flow Dynamics
Rather than operating directly in Eulerian coordinates, one may reformulate the gradient flow in terms of particle trajectories. By introducing a flow map $\Phi(\cdot,t)$ that transports reference mass at $\xi$ to position $x=\Phi(\xi,t)$ at time $t$, and updating densities via the change-of-variable formula
$$\rho(\Phi(\xi,t),t)\,\det\nabla_\xi\Phi(\xi,t)=\rho_0(\xi),$$
one obtains schemes that naturally preserve positivity, mass, and energy dissipation, and efficiently resolve sharp interfaces with relatively few degrees of freedom (Cheng et al., 21 Jun 2024). The corresponding variational principle involves unconstrained minimization in the $W_2$ distance augmented with regularization terms, resulting in stable and accurate dynamics for various classes of nonlinear diffusion, aggregation, and chemotaxis equations.
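A minimal Lagrangian sketch, assuming a potential-plus-interaction energy $\mathcal E(\rho)=\int V\,d\rho+\tfrac12\iint W(x-y)\,d\rho(x)\,d\rho(y)$ for which the flow of an empirical measure reduces to an ODE system for particle positions, is given below; diffusion terms would additionally require a blob-type regularization as in the flow-map schemes above, and all potentials and parameters are illustrative.

```python
import numpy as np

# Lagrangian (particle) discretization of the Wasserstein gradient flow of
#   E(rho) = \int V d rho + 0.5 \int\int W(x - y) d rho(x) d rho(y).
# For the empirical measure rho_N = (1/N) sum_i delta_{x_i}, the flow is the ODE
#   dx_i/dt = -grad V(x_i) - (1/N) sum_j grad W(x_i - x_j).
# V, W, the step size, and the particle count are illustrative choices.

rng = np.random.default_rng(0)
N, dt, n_steps = 200, 1e-2, 500
x = rng.normal(loc=2.0, scale=0.5, size=(N, 2))     # particles in R^2

grad_V = lambda p: p                                 # V(p) = |p|^2 / 2
grad_W = lambda d: d                                 # W(z) = |z|^2 / 2 (attractive)

for _ in range(n_steps):
    diffs = x[:, None, :] - x[None, :, :]            # pairwise differences x_i - x_j
    interaction = grad_W(diffs).mean(axis=1)         # (1/N) sum_j grad W(x_i - x_j)
    x = x - dt * (grad_V(x) + interaction)           # explicit Euler step
```

Mass and positivity are preserved automatically, since the scheme only moves particle locations.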
Dual Formulations and Back-and-Forth Methods
Dualizing the JKO step via Kantorovich duality reformulates the implicit variational problem as a maximization over convex potentials. The back-and-forth method alternates between dual variables (potentials) and leverages efficient solvers (FFT-based, Legendre transforms) for large-scale simulation, handling both convex and non-convex energies (Jacobs et al., 2020).
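Schematically, and with sign and scaling conventions that vary between references (so the display below should be read as an assumed normalization rather than the exact formulation of (Jacobs et al., 2020)), the dual of the JKO step reads
$$
\min_{\rho}\Big\{\mathcal F(\rho)+\tfrac{1}{2\tau}W_2^2(\rho,\rho_k)\Big\}
=\sup_{\varphi}\Big\{\int \varphi^{c}\,d\rho_k-\mathcal F^{*}(-\varphi)\Big\},
\qquad
\varphi^{c}(x)=\inf_{y}\Big\{\tfrac{|x-y|^{2}}{2\tau}-\varphi(y)\Big\},
$$
where $\mathcal F^{*}$ denotes the convex conjugate of $\mathcal F$; the back-and-forth iterations perform gradient ascent alternately in $\varphi$ and its $c$-transform $\varphi^{c}$.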
4. Extensions: Regularized f-Divergence Flows and Kernels
Regularization and kernel techniques enable defining Wasserstein gradient flows for energies beyond the classical entropy or interaction energies. For instance, kernelized Maximum Mean Discrepancy (MMD) regularizations and Moreau envelopes extend f-divergences to be differentiable and well-posed even for measures with singular support (Stein et al., 7 Feb 2024, Duong et al., 14 Nov 2024). These formulations have analytical (links to RKHS structure and Moreau–Rockafellar theory), numerical (particle methods and dual maximizations), and statistical (robustness, improved convergence) benefits.
Gradient flows can be efficiently characterized in one dimension via quantile functions: since
$$W_2^2(\mu,\nu)=\int_0^1\big|F_\mu^{-1}(s)-F_\nu^{-1}(s)\big|^2\,ds,$$
the Wasserstein geometry reduces to the flat geometry of quantile functions in $L^2(0,1)$, allowing direct minimization and analysis of regularized functionals in that space (Duong et al., 14 Nov 2024). Sobolev regularization further ensures well-posedness and remedies mass dissipation anomalies in flows driven by nonconvex or singular kernels.
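A minimal sketch of this one-dimensional reduction, assuming the simple potential energy $\mathcal E(\mu)=\int V\,d\mu$ in place of the regularized functionals studied in the cited work, integrates the resulting $L^2(0,1)$ gradient flow on the quantile function by explicit Euler steps:

```python
import numpy as np

# 1D Wasserstein gradient flow written on the quantile function q(s), s in (0,1):
# for E(mu) = \int V d mu, the W_2 flow becomes the L^2(0,1) gradient flow
#   d q(s)/dt = -V'(q(s)),
# integrated here with explicit Euler steps.  V, the initial condition, and the
# step size are illustrative.

s = (np.arange(500) + 0.5) / 500                 # quantile levels
q = 3.0 * (s - 0.5)                              # initial quantile function (uniform law)
V_prime = lambda y: y**3 - y                     # V(y) = y^4/4 - y^2/2 (double well)
dt, n_steps = 1e-3, 5000

for _ in range(n_steps):
    q = q - dt * V_prime(q)                      # Euler step of the L^2(0,1) flow

# q now approximates the quantile function of the evolved measure: almost all
# quantile levels have moved toward the wells of V at y = -1 and y = +1.
```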
5. Applications: Machine Learning, Inference, and Beyond
The Wasserstein gradient flow framework is the unifying principle behind a broad array of modern computational techniques:
- Approximate Inference and Variational Methods: Time-discretized gradient flows compute updated densities as solutions to regularized optimal transport problems, approximated in high dimensions using dual RKHS representations, parameterized ICNNs, or saddle-point algorithms, with theoretical guarantees of convergence (Frogner et al., 2018, Fan et al., 2021).
- Policy Optimization in Reinforcement Learning: Interpreting policy updates as Wasserstein gradient flows over policy distribution spaces leads to algorithms (e.g., IP-WGF, DP-WGF) with provable advantages in exploration, robustness, and convergence relative to parameter-space optimization (Zhang et al., 2018).
- Learning over Distributions of Distributions: Wasserstein-over-Wasserstein (“WoW”) flows extend the OT geometry to spaces of probability distributions (e.g., labeled datasets seen as mixtures of class-conditional measures), supporting domain adaptation, dataset distillation, and transfer learning via tractable flows with sliced-Wasserstein based kernels (Bonet et al., 9 Jun 2025).
- Sampling and Bayesian Inference: The forward-backward (proximal gradient) discretization, in which one alternates an explicit gradient (transport) step on a smooth potential and an implicit proximal step for a nonsmooth entropy or divergence, generalizes Langevin and related sampling methods with interpretable rates and stability (Salim et al., 2020); a minimal sketch is given after this list.
- Metric Measure Theory: Wasserstein gradient flows extend the notions of heat flow, curvature, and differential calculus to abstract metric measure spaces (e.g., RCD spaces), providing a link between PDEs and metric geometry (Santambrogio, 2016).
- Optimal Frames and Signal Processing: Wasserstein gradient flows are used to variationally evolve frames in signal processing toward tight frames, exploiting geometric decay of suitable potentials in Wasserstein space (Wickman et al., 2018).
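Returning to the forward-backward discretization for sampling mentioned in the list above, the sketch below alternates an explicit transport step on the potential energy with a heat-flow (Gaussian noise) step used as a tractable surrogate for the implicit entropic proximal step; with this surrogate the iteration coincides with the unadjusted Langevin algorithm, and the target, step size, and particle count are illustrative assumptions rather than the exact scheme of (Salim et al., 2020).

```python
import numpy as np

# Forward-backward-style sampler for pi(x) ∝ exp(-V(x)):
#   forward:  explicit W_2 gradient step on the potential energy, x <- x - dt * grad V(x)
#   backward: a heat-flow step on the entropy, realized here by Gaussian noise,
#             used as a tractable surrogate for the exact entropic proximal step.
# With this surrogate the iteration is the unadjusted Langevin algorithm.

rng = np.random.default_rng(0)
grad_V = lambda x: x**3 - x                        # V(x) = x^4/4 - x^2/2 (double well)
dt, n_steps, N = 1e-2, 5000, 2000

x = rng.normal(size=N)                             # particle approximation of rho_0
for _ in range(n_steps):
    x = x - dt * grad_V(x)                         # forward (transport) step
    x = x + np.sqrt(2 * dt) * rng.normal(size=N)   # entropy / heat-flow step

# x now approximates samples from exp(-V); its histogram is bimodal around ±1.
```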
6. Theoretical Innovations and Connections
Temporal evolution of empirical distribution data (e.g., income or mortality) can be modeled by nonparametric local Fréchet regression within Wasserstein space, yielding well-posed estimates of temporal gradients with provable error rates (Chen et al., 2018). Homogenization of gradient flows in spatially inhomogeneous or oscillatory media preserves the geometric structure in the limit, albeit with possibly nontrivial effective metrics—distinct from what is obtained via Gromov–Hausdorff limits—highlighting the interplay between metric convergence and dynamical variational structures (Gao et al., 2023).
Gamma-convergence links large deviation theory and mean-field variational flows in the continuum, cementing the fundamental connection between stochastic particle dynamics, PDEs, and OT geometry (Duong et al., 2012).
7. Summary Table: Core Wasserstein Gradient Flow Concepts
| Concept/Technique | Variational/Geometric Quantity | Computational/Numerical Implication |
|---|---|---|
| JKO Scheme | Implicit minimizing movement in $(\mathcal{P}_2, W_2)$ | Recursive quadratic OT solver, stability |
| Entropic Regularization | KL divergence, Gibbs kernel | Fast Sinkhorn, parallelizability |
| Lagrangian Approach | Particle flow-map tracking | Automatic positivity, interface capture |
| Dual Methods (BFM, etc.) | Kantorovich dual (potentials $\varphi,\psi$) | Efficient gradient ascent, large scales |
| MMD/Kernel Regularized Flows | Moreau envelope in RKHS | Differentiable, tractable, robust |
| Sliced/Projected Flows | Sliced Wasserstein, quantile embeddings | Dimension-independent, generative modeling |
| Fibered/Mixture Flows | Multi-indexed/fibered $W_2$ | Learning on datasets of distributions |
8. Outlook and Future Directions
Emerging developments include extensions to coupled and continuum systems of gradient flows (e.g., fibered Wasserstein flows for multi-phase PDEs), unbalanced OT and flows on unnormalized measures, connections to non-Euclidean and network domains, and scalable algorithms using generative models. The field continues to diversify, integrating geometric, analytic, and computational tools for modeling, optimization, and simulation in spaces of probability measures.
Wasserstein gradient flows now serve as a central framework bridging rigorous analysis, efficient computation, and diverse physical, statistical, and engineering applications.