Wasserstein-2 Distance (W₂)
- Wasserstein-2 distance is a metric that quantifies the minimal squared Euclidean cost to morph one probability distribution into another.
- It features dual static (Kantorovich) and dynamic (Benamou–Brenier) formulations, establishing a robust Riemannian geometry on distribution spaces.
- Its practical applications span inverse problems, generative modeling, and geometric measure theory, with computational tools like Sinkhorn and neural solvers.
The Wasserstein-2 distance ($W_2$), also known as the quadratic Wasserstein distance, is a central metric in optimal transport theory, quantifying the minimal cost required to morph one probability distribution into another with respect to the squared Euclidean cost. $W_2$ induces a rich geometric structure on the space of probability measures with finite second moments, providing both static (Kantorovich) and dynamic (Benamou–Brenier) characterizations. Its properties enable detailed analysis of probability distributions, inform data-driven applications such as inverse problems, generative modeling, and manifold learning, and underlie deep connections to geometry, functional analysis, and partial differential equations.
1. Formal Definitions and Fundamental Properties
Let $\mathcal{P}_2(\mathbb{R}^d)$ denote the set of Borel probability measures on $\mathbb{R}^d$ with finite second moment. The quadratic Wasserstein distance between $\mu, \nu \in \mathcal{P}_2(\mathbb{R}^d)$ is defined via the Kantorovich formulation:
$$W_2^2(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|^2 \, d\pi(x, y),$$
where $\Pi(\mu, \nu)$ is the set of couplings of $\mu$ and $\nu$ (joint laws with marginals $\mu$ and $\nu$) (Wang et al., 2024, Ren et al., 2020, Engquist et al., 2019, Peyre, 2011).
For empirical measures, and more generally in one dimension, $W_2$ admits a quantile representation:
$$W_2^2(\mu, \nu) = \int_0^1 \big| F_\mu^{-1}(t) - F_\nu^{-1}(t) \big|^2 \, dt,$$
where $F_\mu^{-1}$ and $F_\nu^{-1}$ denote the quantile functions of $\mu$ and $\nu$ (Berthet et al., 2020).
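For two equally weighted samples of the same size, the quantile formula reduces to matching sorted samples. The following is a minimal sketch (assuming NumPy; the sample sizes and distributions are illustrative, not taken from the cited works):

```python
import numpy as np

def w2_1d(x, y):
    """W2 between two equally weighted 1-D samples of the same size.

    Sorting realizes the quantile coupling: the i-th smallest point of x
    is matched to the i-th smallest point of y.
    """
    x, y = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((x - y) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=5000)   # samples from N(0, 1)
y = rng.normal(2.0, 1.0, size=5000)   # samples from N(2, 1)
print(w2_1d(x, y))                    # close to the true value |2 - 0| = 2
```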
Core properties include:
- Metric structure: $W_2$ metrizes weak convergence together with convergence of second moments; $(\mathcal{P}_2(\mathbb{R}^d), W_2)$ is a complete separable geodesic space.
- Translation invariance: $W_2\big((T_a)_\#\mu, (T_a)_\#\nu\big) = W_2(\mu, \nu)$ for any translation $T_a(x) = x + a$, $a \in \mathbb{R}^d$ (Wang et al., 2024).
- Closed-form for Gaussians: For $\mu = \mathcal{N}(m_1, \Sigma_1)$ and $\nu = \mathcal{N}(m_2, \Sigma_2)$,
$$W_2^2(\mu, \nu) = \|m_1 - m_2\|^2 + \operatorname{tr}\!\Big(\Sigma_1 + \Sigma_2 - 2\big(\Sigma_2^{1/2} \Sigma_1 \Sigma_2^{1/2}\big)^{1/2}\Big)$$
(Oh et al., 2019, Engquist et al., 2019).
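The Gaussian closed form above can be evaluated directly with a matrix square root. A minimal sketch (assuming NumPy/SciPy; the means and covariances are illustrative):

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussian(m1, S1, m2, S2):
    """Closed-form W2 between N(m1, S1) and N(m2, S2)."""
    S2_half = sqrtm(S2)
    cross = np.real(sqrtm(S2_half @ S1 @ S2_half))  # drop round-off imaginary parts
    bures = np.trace(S1 + S2 - 2.0 * cross)
    return np.sqrt(np.sum((m1 - m2) ** 2) + bures)

m1, S1 = np.zeros(2), np.eye(2)
m2, S2 = np.array([1.0, 0.0]), np.diag([2.0, 0.5])
print(w2_gaussian(m1, S1, m2, S2))
```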
2. Duality, Gradient Flow, and Riemannian Formalism
The dual (Kantorovich) formulation is
$$W_2^2(\mu, \nu) = \sup_{(\varphi, \psi)} \left\{ \int \varphi \, d\mu + \int \psi \, d\nu \;:\; \varphi(x) + \psi(y) \le \|x - y\|^2 \right\},$$
where the supremum runs over pairs of continuous functions satisfying the constraint (Engquist et al., 2019, Huang et al., 2024, Korotin et al., 2019, Berthet et al., 2020). For $\mu$ absolutely continuous with respect to Lebesgue measure, the Monge formulation seeks a transport map $T$ pushing $\mu$ forward to $\nu$ (i.e., $T_\#\mu = \nu$) while minimizing $\int \|x - T(x)\|^2 \, d\mu(x)$.
Displacement interpolation yields constant-speed geodesics: with $T$ the optimal Monge map, $\mu_t = \big((1 - t)\,\mathrm{Id} + t\,T\big)_\#\mu$ for $t \in [0, 1]$. This underlies the formal Riemannian structure on $\mathcal{P}_2$: tangent vectors at $\mu$ are gradients $\nabla\phi$, with the inner product $\langle \nabla\phi_1, \nabla\phi_2 \rangle_\mu = \int \nabla\phi_1 \cdot \nabla\phi_2 \, d\mu$ (Hamm et al., 2023).
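In one dimension the optimal map is monotone, so the displacement interpolant can be approximated by linearly interpolating matched sorted samples. A minimal sketch (assuming NumPy; the two samples and interpolation times are illustrative):

```python
import numpy as np

def displacement_interp(x, y, t):
    """Samples from the displacement interpolant mu_t in 1-D.

    For equally weighted samples of the same size, sorting gives the
    optimal (monotone) matching, and mu_t is the law of (1 - t) X + t T(X).
    """
    x, y = np.sort(x), np.sort(y)
    return (1.0 - t) * x + t * y

rng = np.random.default_rng(1)
x = rng.normal(-2.0, 0.5, size=2000)
y = rng.normal(3.0, 1.0, size=2000)
for t in (0.0, 0.5, 1.0):
    z = displacement_interp(x, y, t)
    print(t, z.mean(), z.std())   # mean and spread move linearly in t
```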
In the dynamic (Benamou–Brenier) formulation,
$$W_2^2(\mu, \nu) = \inf_{(\rho_t, v_t)} \int_0^1 \int_{\mathbb{R}^d} \|v_t(x)\|^2 \, \rho_t(x) \, dx \, dt,$$
subject to the continuity equation $\partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0$, with $\rho_0 = \mu$ and $\rho_1 = \nu$ (Hamm et al., 2023, Peyre, 2011).
Gradient flows in $(\mathcal{P}_2, W_2)$ (e.g., minimizing a functional over $\mathcal{P}_2$) correspond, at the particle level, to solutions of ODEs of the form $\dot{x}_t = -\nabla \varphi_t(x_t)$, where $\varphi_t$ is the associated Kantorovich potential (Huang et al., 2024). Exponential convergence rates in Wasserstein space can be established under convexity assumptions (Ren et al., 2020).
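As a concrete toy case of such a flow (a minimal particle sketch, assuming NumPy; the quadratic potential and step size are illustrative and not the setting of the cited works): the Wasserstein gradient flow of the potential energy $\int V \, d\rho$ with $V(x) = \|x\|^2/2$ moves each particle along $\dot{x} = -\nabla V(x) = -x$, and $W_2(\rho_t, \delta_0)$ decays exponentially, illustrating the convergence statement above.

```python
import numpy as np

# Toy Wasserstein gradient flow of F(rho) = \int |x|^2 / 2 d(rho):
# each particle follows dx/dt = -x, and rho_t contracts to the Dirac
# mass at 0 with W2(rho_t, delta_0) decaying like exp(-t).
rng = np.random.default_rng(5)
X = rng.normal(0.0, 1.0, size=(2000, 2))   # particles representing rho_0
dt = 0.01

for step in range(1, 301):
    X = X - dt * X                          # explicit Euler step of dx/dt = -x
    if step % 100 == 0:
        w2_to_dirac = np.sqrt(np.mean(np.sum(X ** 2, axis=1)))
        print(step * dt, w2_to_dirac)       # roughly exp(-t) decay
```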
3. Linearization and Equivalence with Weighted Sobolev Norms
For a reference distribution $\mu$ and small perturbations $\nu = \mu + \delta\mu$, $W_2$ relaxes to a negative weighted Sobolev norm:
$$W_2(\mu, \nu) \approx \|\delta\mu\|_{\dot{H}^{-1}(\mu)}, \qquad \|f\|_{\dot{H}^{-1}(\mu)} = \sup\left\{ \int f \phi \, dx \;:\; \int |\nabla\phi|^2 \, d\mu \le 1 \right\}$$
(Peyre, 2011, Greengard et al., 2022, Engquist et al., 2019). For a smooth reference measure $\mu$, $W_2$ is equivalent (up to explicit factors) to this dual Sobolev norm, justifying its use in analytic and geometric arguments.
A quantitative version states that
$$W_2(\mu, \nu) = \|\nu - \mu\|_{\dot{H}^{-1}(\mu)}\,(1 + o(1))$$
for $\nu$ near $\mu$, where $\|\cdot\|_{\dot{H}^{-1}(\mu)}$ is the weighted norm arising from the linearized Monge–Ampère equation (Greengard et al., 2022).
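In one dimension the $\dot{H}^{-1}(\mu)$ norm has an explicit form: writing $F_f(x) = \int_{-\infty}^x f$ for a zero-mass perturbation $f$, one has $\|f\|_{\dot{H}^{-1}(\mu)}^2 = \int F_f(x)^2 / \rho(x) \, dx$ for a reference density $\rho$. The sketch below (assuming NumPy/SciPy; the reference density, shift size, and grid are illustrative) compares this linearized norm with the exact $W_2$ for a small mean shift of a standard Gaussian, where $W_2$ equals the shift exactly.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

eps = 0.05                          # small mean shift
xs = np.linspace(-10, 10, 20001)    # integration grid
rho = norm.pdf(xs)                  # reference density of mu = N(0, 1)

# Perturbation f = density of N(eps, 1) minus density of N(0, 1);
# its antiderivative is the difference of the two CDFs.
F_f = norm.cdf(xs - eps) - norm.cdf(xs)

# 1-D weighted negative Sobolev norm: ||f||^2 = \int F_f(x)^2 / rho(x) dx.
h_minus1 = np.sqrt(trapezoid(F_f ** 2 / rho, xs))

w2_exact = eps                      # W2(N(0,1), N(eps,1)) = eps exactly
print(h_minus1, w2_exact)           # the two agree to first order in eps
```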
Localization results show that the $W_2$ distance between measures localized by suitable bump functions can be bounded above by an explicit multiple of the global $W_2$ distance (Peyre, 2011).
4. Statistical and Computational Aspects
For empirical distributions built from $n$ i.i.d. samples, the mean $W_2$ distance to the true law converges at a rate slower than the classical parametric rate, due to large Gaussian tail fluctuations. For two correlated samples, the weak convergence rate reverts to the classical one (Berthet et al., 2020).
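As a rough empirical illustration of these convergence statements (not a reproduction of the precise rates in Berthet et al., 2020), the sketch below (assuming NumPy/SciPy; the sample sizes, Gaussian target, and repetition count are illustrative) estimates the mean $W_2$ distance between the empirical measure of $n$ standard Gaussian samples and the true law, using the one-dimensional quantile formula.

```python
import numpy as np
from scipy.stats import norm

def w2_empirical_to_gaussian(x):
    """Approximate W2 between the empirical measure of x and N(0, 1).

    Uses the 1-D quantile formula, evaluating the true quantile function
    at the midpoint of each of the n empirical quantile cells.
    """
    x = np.sort(x)
    n = len(x)
    t = (np.arange(n) + 0.5) / n
    return np.sqrt(np.mean((x - norm.ppf(t)) ** 2))

rng = np.random.default_rng(2)
for n in (100, 1000, 10000):
    vals = [w2_empirical_to_gaussian(rng.normal(size=n)) for _ in range(50)]
    print(n, np.mean(vals))   # mean distance shrinks as n grows
```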
Computational solvers include:
- Discrete linear programming (exact on finite supports)
- Entropic-regularized Sinkhorn algorithms for scalable approximations (Wang et al., 2024, Engquist et al., 2019); a minimal sketch appears after this list
- Neural network-based approaches (e.g., Input-Convex Neural Networks for Monge maps) in generative modeling (Korotin et al., 2019, Huang et al., 2024)
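The entropic-regularized approach can be prototyped in a few lines. The following is a minimal log-domain Sinkhorn sketch (assuming NumPy/SciPy; the regularization strength, iteration count, and test point clouds are illustrative, and this is a didactic sketch rather than a production solver):

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_w2(x, y, a, b, reg=0.05, n_iter=500):
    """Entropic approximation of W2^2 between weighted point clouds.

    Log-domain Sinkhorn iterations for numerical stability at small reg.
    x, y : (n, d) and (m, d) support points; a, b : probability weights.
    Returns <P, C>, the transport cost of the regularized plan P.
    """
    C = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)  # squared costs
    f, g = np.zeros(len(a)), np.zeros(len(b))                  # dual potentials
    for _ in range(n_iter):
        f = reg * np.log(a) - reg * logsumexp((g[None, :] - C) / reg, axis=1)
        g = reg * np.log(b) - reg * logsumexp((f[:, None] - C) / reg, axis=0)
    P = np.exp((f[:, None] + g[None, :] - C) / reg)            # transport plan
    return np.sum(P * C)

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=(200, 2))
y = rng.normal(1.0, 1.0, size=(300, 2))
a = np.full(200, 1.0 / 200)
b = np.full(300, 1.0 / 300)
# Roughly sqrt(2) (the true W2 between the underlying Gaussians),
# up to entropic bias and sampling error.
print(np.sqrt(sinkhorn_w2(x, y, a, b)))
```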
Algorithmic variants exploit the convexity structure induced by $W_2$ in parameter spaces: for instance, the $W_2$ loss over affine-Gaussian families is globally convex, and its gradient is a preconditioned version of the $L^2$ gradient, leading to smoother, better-conditioned optimization landscapes (Engquist et al., 2019).
5. Applications Across Fields
(A) Inverse Problems: $W_2$ provides robustness against high-frequency data noise, leading to smoothing effects in inversion but reduced spatial resolution. Compared with the $L^2$ misfit, $W_2$ yields more favorable convexity properties in parameter spaces of practical inverse problems (Engquist et al., 2019); a toy one-dimensional illustration of this convexity appears after item (E) below.
(B) Machine Learning and Generative Modeling: Wasserstein-2 metrics underpin algorithms in unsupervised learning, such as W2-GAN and non-minimax training of optimal transport maps via ICNNs; these models demonstrate advantages in image translation, style transfer, and domain adaptation tasks (Korotin et al., 2019, Huang et al., 2024, Oh et al., 2019).
(C) Stochastic Processes: $W_2$ is the natural metric for quantifying convergence of distributions in mean-field SDEs and McKean–Vlasov equations, and for bounding control errors in SDE parameter inference (Huang et al., 2024, Ren et al., 2020, Xia et al., 2024).
(D) Manifold and Geometric Learning: The 2-Wasserstein distance encodes a Riemannian geometry on the space of absolutely continuous measures, allowing the recovery of tangent spaces and geodesic structures in data-driven manifold learning (Hamm et al., 2023).
(E) Geometric Measure Theory: Localized variants lead to necessary and sufficient characterizations of rectifiability; square-integrability of the $\alpha_2$ numbers, which measure local flatness via optimal transport, provides a scale-invariant, transport-based criterion for rectifiability (Dąbrowski, 2019).
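Returning to the convexity point in (A): treating a shifted unit-mass Gaussian pulse as the "observed data", the $W_2^2$ misfit to a reference pulse grows quadratically with the shift, while the $L^2$ misfit between the densities saturates once the pulses no longer overlap. The sketch below is a minimal caricature of this behavior (assuming NumPy/SciPy; the pulse width, grid, and shifts are illustrative, and this is not the waveform-inversion setup of Engquist et al., 2019):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

xs = np.linspace(-20, 20, 4001)                   # spatial grid
t = np.linspace(1e-4, 1 - 1e-4, 2001)             # quantile grid (interior)
ref = norm.pdf(xs, loc=0.0, scale=0.5)            # reference unit-mass pulse

for shift in (0.5, 2.0, 5.0, 10.0):
    shifted = norm.pdf(xs, loc=shift, scale=0.5)  # "observed" shifted pulse
    l2 = trapezoid((shifted - ref) ** 2, xs)      # L2 misfit: saturates for large shifts
    w2_sq = trapezoid((norm.ppf(t, loc=shift, scale=0.5)
                       - norm.ppf(t, loc=0.0, scale=0.5)) ** 2, t)  # grows like shift^2
    print(shift, l2, w2_sq)
```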
6. Recent Extensions and Variants
Relative-translation invariant Wasserstein ($RW_2$):
$$RW_2(\mu, \nu) = \inf_{a \in \mathbb{R}^d} W_2\big((T_a)_\#\mu, \nu\big), \qquad T_a(x) = x + a,$$
where the infimum is taken over all translations. This provides a bias-variance decomposition of distribution shift, practical robustness to global translations, and enables efficient computation via a barycenter alignment step followed by standard Sinkhorn iterations (Wang et al., 2024).
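For the quadratic cost, $W_2^2$ decomposes into the squared distance between the means plus $W_2^2$ between the centered measures, so the optimal relative translation simply aligns the means. The sketch below illustrates this two-step scheme (assuming NumPy/SciPy and equally weighted point clouds of the same size, solved exactly via an assignment problem rather than the Sinkhorn step of Wang et al., 2024):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w2_exact(x, y):
    """Exact W2 between equally weighted clouds of the same size (assignment LP)."""
    C = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    i, j = linear_sum_assignment(C)
    return np.sqrt(C[i, j].mean())

def rw2(x, y):
    """Relative-translation invariant W2: align the means, then compute W2."""
    return w2_exact(x - x.mean(axis=0), y - y.mean(axis=0))

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, size=(200, 2))
y = x + np.array([5.0, -3.0])          # a purely translated copy of x
print(w2_exact(x, y), rw2(x, y))       # W2 sees the shift; RW2 is (near) zero
```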
Kernel Wasserstein Distance: For data in nonlinear feature spaces, $W_2$ may be computed in a reproducing kernel Hilbert space (RKHS) using empirical mean and covariance embeddings, with practical success in clustering of imaging data and artifact detection (Oh et al., 2019).
7. Theoretical and Practical Considerations
The $W_2$ metric supports:
- Stability: Small perturbations of the underlying measures, e.g., measured in $\dot{H}^{-1}$, correspond to small changes in $W_2$, with explicit constants under density and curvature conditions (Peyre, 2011, Greengard et al., 2022, Engquist et al., 2019).
- Localization: The distance is stable under restriction to subsets via smooth cutoff functions (Peyre, 2011).
- High-dimensional robustness: $W_2$ maintains interpretability and computational feasibility in high dimensions via regularized and neural approaches (Korotin et al., 2019, Engquist et al., 2019).
- Limitations: Local Gaussian or RKHS-based approximations lose fine multimodal structure; the computational cost of exact $W_2$ scales cubically in the number of support points but is mitigated by Sinkhorn and neural methods (Oh et al., 2019).
In summary, the Wasserstein-2 distance and its associated geometries form the backbone of modern optimal transport theory and its applications, offering precise metrics, algorithmic tractability, and a pathway to interpretability across modern data-driven disciplines (Berthet et al., 2020, Engquist et al., 2019, Greengard et al., 2022, Wang et al., 2024, Huang et al., 2024, Hamm et al., 2023, Ren et al., 2020, Oh et al., 2019, Dąbrowski, 2019, Peyre, 2011).