Wasserstein-2 Distance (W₂)
- Wasserstein-2 distance is a metric that quantifies the minimal squared Euclidean cost to morph one probability distribution into another.
- It features dual static (Kantorovich) and dynamic (Benamou–Brenier) formulations, establishing a robust Riemannian geometry on distribution spaces.
- Its practical applications span inverse problems, generative modeling, and geometric measure theory, with computational tools like Sinkhorn and neural solvers.
The Wasserstein-2 distance ($W_2$), also known as the quadratic Wasserstein distance, is a central metric in optimal transport theory, quantifying the minimal cost required to morph one probability distribution into another with respect to the squared Euclidean cost. $W_2$ induces a rich geometric structure on the space of probability measures with finite second moments, providing both static (Kantorovich) and dynamic (Benamou–Brenier) characterizations. Its properties enable detailed analysis of probability distributions, inform data-driven applications such as inverse problems, generative modeling, and manifold learning, and underlie deep connections to geometry, functional analysis, and partial differential equations.
1. Formal Definitions and Fundamental Properties
Let $\mathcal{P}_2(\mathbb{R}^d)$ denote the set of Borel probability measures on $\mathbb{R}^d$ with finite second moment. The quadratic Wasserstein distance between $\mu, \nu \in \mathcal{P}_2(\mathbb{R}^d)$ is defined via the Kantorovich formulation:
$$W_2^2(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|^2 \, d\pi(x, y),$$
where $\Pi(\mu, \nu)$ is the set of couplings of $\mu$ and $\nu$ (joint laws with marginals $\mu$ and $\nu$) (Wang et al., 2024, Ren et al., 2020, Engquist et al., 2019, Peyre, 2011).
For empirical measures, and more generally in one dimension, $W_2$ admits a quantile representation:
$$W_2^2(\mu, \nu) = \int_0^1 \big| F_\mu^{-1}(t) - F_\nu^{-1}(t) \big|^2 \, dt,$$
where $F_\mu^{-1}$ and $F_\nu^{-1}$ denote the quantile functions of $\mu$ and $\nu$ (Berthet et al., 2020).
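For two equally weighted samples of the same size, the quantile formula reduces to matching sorted samples. The following is a minimal sketch (assuming NumPy; the sample sizes and distributions are illustrative, not taken from the cited works):

```python
import numpy as np

def w2_1d(x, y):
    """W2 between two equally weighted 1-D samples of the same size.

    Sorting realizes the quantile coupling: the i-th smallest point of x
    is matched to the i-th smallest point of y.
    """
    x, y = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((x - y) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=5000)   # samples from N(0, 1)
y = rng.normal(2.0, 1.0, size=5000)   # samples from N(2, 1)
print(w2_1d(x, y))                    # close to the true value |2 - 0| = 2
```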
Core properties include:
- Metric structure: $W_2$ metrizes weak convergence together with convergence of second moments; $(\mathcal{P}_2(\mathbb{R}^d), W_2)$ is a complete separable geodesic space.
- Translation invariance: $W_2\big((T_a)_\#\mu, (T_a)_\#\nu\big) = W_2(\mu, \nu)$ for any translation $T_a(x) = x + a$, $a \in \mathbb{R}^d$ (Wang et al., 2024).
- Closed-form for Gaussians: For $\mu = \mathcal{N}(m_1, \Sigma_1)$ and $\nu = \mathcal{N}(m_2, \Sigma_2)$,
$$W_2^2(\mu, \nu) = \|m_1 - m_2\|^2 + \operatorname{tr}\!\Big(\Sigma_1 + \Sigma_2 - 2\big(\Sigma_2^{1/2} \Sigma_1 \Sigma_2^{1/2}\big)^{1/2}\Big)$$
(Oh et al., 2019, Engquist et al., 2019).
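The Gaussian closed form above can be evaluated directly with a matrix square root. A minimal sketch (assuming NumPy/SciPy; the means and covariances are illustrative):

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussian(m1, S1, m2, S2):
    """Closed-form W2 between N(m1, S1) and N(m2, S2)."""
    S2_half = sqrtm(S2)
    cross = np.real(sqrtm(S2_half @ S1 @ S2_half))  # drop round-off imaginary parts
    bures = np.trace(S1 + S2 - 2.0 * cross)
    return np.sqrt(np.sum((m1 - m2) ** 2) + bures)

m1, S1 = np.zeros(2), np.eye(2)
m2, S2 = np.array([1.0, 0.0]), np.diag([2.0, 0.5])
print(w2_gaussian(m1, S1, m2, S2))
```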
2. Duality, Gradient Flow, and Riemannian Formalism
The dual (Kantorovich) formulation is
$$W_2^2(\mu, \nu) = \sup_{(\varphi, \psi)} \left\{ \int \varphi \, d\mu + \int \psi \, d\nu \;:\; \varphi(x) + \psi(y) \le \|x - y\|^2 \right\},$$
where the supremum runs over pairs of continuous functions satisfying the constraint (Engquist et al., 2019, Huang et al., 2024, Korotin et al., 2019, Berthet et al., 2020). For $\mu$ absolutely continuous with respect to Lebesgue measure, the Monge formulation seeks a transport map $T$ pushing $\mu$ forward to $\nu$ (i.e., $T_\#\mu = \nu$) while minimizing $\int \|x - T(x)\|^2 \, d\mu(x)$.
Displacement interpolation yields constant-speed geodesics: with $T$ the optimal Monge map, $\mu_t = \big((1 - t)\,\mathrm{Id} + t\,T\big)_\#\mu$ for $t \in [0, 1]$. This underlies the formal Riemannian structure on $\mathcal{P}_2$: tangent vectors at $\mu$ are gradients $\nabla\phi$, with the inner product $\langle \nabla\phi_1, \nabla\phi_2 \rangle_\mu = \int \nabla\phi_1 \cdot \nabla\phi_2 \, d\mu$ (Hamm et al., 2023).
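In one dimension the optimal map is monotone, so the displacement interpolant can be approximated by linearly interpolating matched sorted samples. A minimal sketch (assuming NumPy; the two samples and interpolation times are illustrative):

```python
import numpy as np

def displacement_interp(x, y, t):
    """Samples from the displacement interpolant mu_t in 1-D.

    For equally weighted samples of the same size, sorting gives the
    optimal (monotone) matching, and mu_t is the law of (1 - t) X + t T(X).
    """
    x, y = np.sort(x), np.sort(y)
    return (1.0 - t) * x + t * y

rng = np.random.default_rng(1)
x = rng.normal(-2.0, 0.5, size=2000)
y = rng.normal(3.0, 1.0, size=2000)
for t in (0.0, 0.5, 1.0):
    z = displacement_interp(x, y, t)
    print(t, z.mean(), z.std())   # mean and spread move linearly in t
```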
In the dynamic (Benamou–Brenier) formulation,
$$W_2^2(\mu, \nu) = \inf_{(\rho_t, v_t)} \int_0^1 \int_{\mathbb{R}^d} \|v_t(x)\|^2 \, \rho_t(x) \, dx \, dt,$$
subject to the continuity equation $\partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0$, with $\rho_0 = \mu$ and $\rho_1 = \nu$ (Hamm et al., 2023, Peyre, 2011).
Gradient flows in $(\mathcal{P}_2, W_2)$ (e.g., minimizing a functional over $\mathcal{P}_2$) correspond, at the particle level, to solutions of ODEs of the form $\dot{x}_t = -\nabla \varphi_t(x_t)$, where $\varphi_t$ is the associated Kantorovich potential (Huang et al., 2024). Exponential convergence rates in Wasserstein space can be established under convexity assumptions (Ren et al., 2020).
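As a concrete toy case of such a flow (a minimal particle sketch, assuming NumPy; the quadratic potential and step size are illustrative and not the setting of the cited works): the Wasserstein gradient flow of the potential energy $\int V \, d\rho$ with $V(x) = \|x\|^2/2$ moves each particle along $\dot{x} = -\nabla V(x) = -x$, and $W_2(\rho_t, \delta_0)$ decays exponentially, illustrating the convergence statement above.

```python
import numpy as np

# Toy Wasserstein gradient flow of F(rho) = \int |x|^2 / 2 d(rho):
# each particle follows dx/dt = -x, and rho_t contracts to the Dirac
# mass at 0 with W2(rho_t, delta_0) decaying like exp(-t).
rng = np.random.default_rng(5)
X = rng.normal(0.0, 1.0, size=(2000, 2))   # particles representing rho_0
dt = 0.01

for step in range(1, 301):
    X = X - dt * X                          # explicit Euler step of dx/dt = -x
    if step % 100 == 0:
        w2_to_dirac = np.sqrt(np.mean(np.sum(X ** 2, axis=1)))
        print(step * dt, w2_to_dirac)       # roughly exp(-t) decay
```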
3. Linearization and Equivalence with Weighted Sobolev Norms
For a reference distribution $\mu$ and small perturbations $\nu = \mu + \delta\mu$, $W_2$ relaxes to a negative weighted Sobolev norm:
$$W_2(\mu, \nu) \approx \|\delta\mu\|_{\dot{H}^{-1}(\mu)}, \qquad \|f\|_{\dot{H}^{-1}(\mu)} = \sup\left\{ \int f \phi \, dx \;:\; \int |\nabla\phi|^2 \, d\mu \le 1 \right\}$$
(Peyre, 2011, Greengard et al., 2022, Engquist et al., 2019). For a smooth reference measure $\mu$, $W_2$ is equivalent (up to explicit factors) to this dual Sobolev norm, justifying its use in analytic and geometric arguments.
A quantitative version states that
$$W_2(\mu, \nu) = \|\nu - \mu\|_{\dot{H}^{-1}(\mu)}\,(1 + o(1))$$
for $\nu$ near $\mu$, where $\|\cdot\|_{\dot{H}^{-1}(\mu)}$ is the weighted norm arising from the linearized Monge–Ampère equation (Greengard et al., 2022).
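In one dimension the $\dot{H}^{-1}(\mu)$ norm has an explicit form: writing $F_f(x) = \int_{-\infty}^x f$ for a zero-mass perturbation $f$, one has $\|f\|_{\dot{H}^{-1}(\mu)}^2 = \int F_f(x)^2 / \rho(x) \, dx$ for a reference density $\rho$. The sketch below (assuming NumPy/SciPy; the reference density, shift size, and grid are illustrative) compares this linearized norm with the exact $W_2$ for a small mean shift of a standard Gaussian, where $W_2$ equals the shift exactly.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

eps = 0.05                          # small mean shift
xs = np.linspace(-10, 10, 20001)    # integration grid
rho = norm.pdf(xs)                  # reference density of mu = N(0, 1)

# Perturbation f = density of N(eps, 1) minus density of N(0, 1);
# its antiderivative is the difference of the two CDFs.
F_f = norm.cdf(xs - eps) - norm.cdf(xs)

# 1-D weighted negative Sobolev norm: ||f||^2 = \int F_f(x)^2 / rho(x) dx.
h_minus1 = np.sqrt(trapezoid(F_f ** 2 / rho, xs))

w2_exact = eps                      # W2(N(0,1), N(eps,1)) = eps exactly
print(h_minus1, w2_exact)           # the two agree to first order in eps
```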
Localization results show that the $W_2$ distance between measures localized by suitable bump functions can be bounded above by an explicit multiple of the global $W_2$ distance (Peyre, 2011).
4. Statistical and Computational Aspects
For empirical distributions built from $n$ i.i.d. samples, the mean $W_2$ distance to the true law converges at a rate slower than the classical parametric rate, due to large Gaussian tail fluctuations. For two correlated samples, the weak convergence rate reverts to the classical one (Berthet et al., 2020).
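As a rough empirical illustration of these convergence statements (not a reproduction of the precise rates in Berthet et al., 2020), the sketch below (assuming NumPy/SciPy; the sample sizes, Gaussian target, and repetition count are illustrative) estimates the mean $W_2$ distance between the empirical measure of $n$ standard Gaussian samples and the true law, using the one-dimensional quantile formula.

```python
import numpy as np
from scipy.stats import norm

def w2_empirical_to_gaussian(x):
    """Approximate W2 between the empirical measure of x and N(0, 1).

    Uses the 1-D quantile formula, evaluating the true quantile function
    at the midpoint of each of the n empirical quantile cells.
    """
    x = np.sort(x)
    n = len(x)
    t = (np.arange(n) + 0.5) / n
    return np.sqrt(np.mean((x - norm.ppf(t)) ** 2))

rng = np.random.default_rng(2)
for n in (100, 1000, 10000):
    vals = [w2_empirical_to_gaussian(rng.normal(size=n)) for _ in range(50)]
    print(n, np.mean(vals))   # mean distance shrinks as n grows
```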
Computational solvers include:
- Discrete linear programming (exact on finite supports)
- Entropic-regularized Sinkhorn algorithms for scalable approximations (Wang et al., 2024, Engquist et al., 2019); a minimal sketch appears after this list
- Neural network-based approaches (e.g., Input-Convex Neural Networks for Monge maps) in generative modeling (Korotin et al., 2019, Huang et al., 2024)
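The entropic-regularized approach can be prototyped in a few lines. The following is a minimal log-domain Sinkhorn sketch (assuming NumPy/SciPy; the regularization strength, iteration count, and test point clouds are illustrative, and this is a didactic sketch rather than a production solver):

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_w2(x, y, a, b, reg=0.05, n_iter=500):
    """Entropic approximation of W2^2 between weighted point clouds.

    Log-domain Sinkhorn iterations for numerical stability at small reg.
    x, y : (n, d) and (m, d) support points; a, b : probability weights.
    Returns <P, C>, the transport cost of the regularized plan P.
    """
    C = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)  # squared costs
    f, g = np.zeros(len(a)), np.zeros(len(b))                  # dual potentials
    for _ in range(n_iter):
        f = reg * np.log(a) - reg * logsumexp((g[None, :] - C) / reg, axis=1)
        g = reg * np.log(b) - reg * logsumexp((f[:, None] - C) / reg, axis=0)
    P = np.exp((f[:, None] + g[None, :] - C) / reg)            # transport plan
    return np.sum(P * C)

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=(200, 2))
y = rng.normal(1.0, 1.0, size=(300, 2))
a = np.full(200, 1.0 / 200)
b = np.full(300, 1.0 / 300)
# Roughly sqrt(2) (the true W2 between the underlying Gaussians),
# up to entropic bias and sampling error.
print(np.sqrt(sinkhorn_w2(x, y, a, b)))
```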
Algorithmic variants exploit the convexity structure induced by $W_2$ in parameter spaces: for instance, the $W_2$ loss over affine-Gaussian families is globally convex, and its gradient is a preconditioned version of the $L^2$ gradient, leading to smoother, better-conditioned optimization landscapes (Engquist et al., 2019).
5. Applications Across Fields
(A) Inverse Problems: $W_2$ provides robustness against high-frequency data noise, leading to smoothing effects in inversion but reduced spatial resolution. Compared with the $L^2$ misfit, $W_2$ yields more favorable convexity properties in parameter spaces of practical inverse problems (Engquist et al., 2019); a toy one-dimensional illustration of this convexity appears after item (E) below.
(B) Machine Learning and Generative Modeling: Wasserstein-2 metrics underpin algorithms in unsupervised learning, such as W2-GAN and non-minimax training of optimal transport maps via ICNNs; these models demonstrate advantages in image translation, style transfer, and domain adaptation tasks (Korotin et al., 2019, Huang et al., 2024, Oh et al., 2019).
(C) Stochastic Processes: $W_2$ is the natural metric for quantifying convergence of distributions in mean-field SDEs and McKean–Vlasov equations, and for bounding control errors in SDE parameter inference (Huang et al., 2024, Ren et al., 2020, Xia et al., 2024).
(D) Manifold and Geometric Learning: The 2-Wasserstein distance encodes a Riemannian geometry on the space of absolutely continuous measures, allowing the recovery of tangent spaces and geodesic structures in data-driven manifold learning (Hamm et al., 2023).
(E) Geometric Measure Theory: Localized variants lead to necessary and sufficient characterizations of rectifiability; square-integrability of the $\alpha_2$ numbers, which measure local flatness via optimal transport, provides a scale-invariant, transport-based criterion for rectifiability (Dąbrowski, 2019).
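Returning to the convexity point in (A): treating a shifted unit-mass Gaussian pulse as the "observed data", the $W_2^2$ misfit to a reference pulse grows quadratically with the shift, while the $L^2$ misfit between the densities saturates once the pulses no longer overlap. The sketch below is a minimal caricature of this behavior (assuming NumPy/SciPy; the pulse width, grid, and shifts are illustrative, and this is not the waveform-inversion setup of Engquist et al., 2019):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

xs = np.linspace(-20, 20, 4001)                   # spatial grid
t = np.linspace(1e-4, 1 - 1e-4, 2001)             # quantile grid (interior)
ref = norm.pdf(xs, loc=0.0, scale=0.5)            # reference unit-mass pulse

for shift in (0.5, 2.0, 5.0, 10.0):
    shifted = norm.pdf(xs, loc=shift, scale=0.5)  # "observed" shifted pulse
    l2 = trapezoid((shifted - ref) ** 2, xs)      # L2 misfit: saturates for large shifts
    w2_sq = trapezoid((norm.ppf(t, loc=shift, scale=0.5)
                       - norm.ppf(t, loc=0.0, scale=0.5)) ** 2, t)  # grows like shift^2
    print(shift, l2, w2_sq)
```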
6. Recent Extensions and Variants
Relative-translation invariant Wasserstein ($RW_2$):
$$RW_2(\mu, \nu) = \inf_{a \in \mathbb{R}^d} W_2\big((T_a)_\#\mu, \nu\big), \qquad T_a(x) = x + a,$$
where the infimum is taken over all translations. This provides a bias-variance decomposition of distribution shift, practical robustness to global translations, and enables efficient computation via a barycenter alignment step followed by standard Sinkhorn iterations (Wang et al., 2024).
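For the quadratic cost, $W_2^2$ decomposes into the squared distance between the means plus $W_2^2$ between the centered measures, so the optimal relative translation simply aligns the means. The sketch below illustrates this two-step scheme (assuming NumPy/SciPy and equally weighted point clouds of the same size, solved exactly via an assignment problem rather than the Sinkhorn step of Wang et al., 2024):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w2_exact(x, y):
    """Exact W2 between equally weighted clouds of the same size (assignment LP)."""
    C = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    i, j = linear_sum_assignment(C)
    return np.sqrt(C[i, j].mean())

def rw2(x, y):
    """Relative-translation invariant W2: align the means, then compute W2."""
    return w2_exact(x - x.mean(axis=0), y - y.mean(axis=0))

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, size=(200, 2))
y = x + np.array([5.0, -3.0])          # a purely translated copy of x
print(w2_exact(x, y), rw2(x, y))       # W2 sees the shift; RW2 is (near) zero
```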
Kernel Wasserstein Distance: For data in nonlinear feature spaces, $W_2$ may be computed in a reproducing kernel Hilbert space (RKHS) using empirical mean and covariance embeddings, with practical success in clustering of imaging data and artifact detection (Oh et al., 2019).
7. Theoretical and Practical Considerations
The $W_2$ metric supports:
- Stability: Small perturbations of the underlying measures, e.g., measured in $\dot{H}^{-1}$, correspond to small changes in $W_2$, with explicit constants under density and curvature conditions (Peyre, 2011, Greengard et al., 2022, Engquist et al., 2019).
- Localization: The distance is stable under restriction to subsets via smooth cutoff functions (Peyre, 2011).
- High-dimensional robustness: $W_2$ maintains interpretability and computational feasibility in high dimensions via regularized and neural approaches (Korotin et al., 2019, Engquist et al., 2019).
- Limitations: Local Gaussian or RKHS-based approximations lose fine multimodal structure; the computational cost of exact $W_2$ scales cubically in the number of support points but is mitigated by Sinkhorn and neural methods (Oh et al., 2019).
In summary, the Wasserstein-2 distance and its associated geometries form the backbone of modern optimal transport theory and its applications, offering precise metrics, algorithmic tractability, and a pathway to interpretability across modern data-driven disciplines (Berthet et al., 2020, Engquist et al., 2019, Greengard et al., 2022, Wang et al., 2024, Huang et al., 2024, Hamm et al., 2023, Ren et al., 2020, Oh et al., 2019, Dąbrowski, 2019, Peyre, 2011).