
Wasserstein-2 Distance (W₂)

Updated 13 March 2026
  • Wasserstein-2 distance is a metric on probability distributions, defined as the square root of the minimal expected squared Euclidean cost of morphing one distribution into another.
  • It admits both a static (Kantorovich) and a dynamic (Benamou–Brenier) formulation, endowing the space of distributions with a formal Riemannian geometry.
  • Its practical applications span inverse problems, generative modeling, and geometric measure theory, with computational tools such as Sinkhorn iterations and neural solvers.

The Wasserstein-2 distance ($W_2$), also known as the quadratic Wasserstein distance, is a central metric in optimal transport theory, quantifying the minimal cost required to morph one probability distribution into another with respect to the squared Euclidean cost. $W_2$ induces a rich geometric structure on the space of probability measures with finite second moments, providing both static (Kantorovich) and dynamic (Benamou–Brenier) characterizations. Its properties enable detailed analysis of probability distributions, inform data-driven applications such as inverse problems, generative modeling, and manifold learning, and underlie deep connections to geometry, functional analysis, and partial differential equations.

1. Formal Definitions and Fundamental Properties

Let $\mathcal{P}_2(\mathbb{R}^d)$ denote the set of Borel probability measures on $\mathbb{R}^d$ with finite second moment. The quadratic Wasserstein distance between $\mu, \nu \in \mathcal{P}_2(\mathbb{R}^d)$ is defined via the Kantorovich formulation:

$$W_2(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x-y\|_2^2\, d\pi(x, y) \right)^{1/2}$$

where $\Pi(\mu, \nu)$ is the set of couplings of $\mu$ and $\nu$ (joint laws with marginals $\mu$ and $\nu$) (Wang et al., 2024, Ren et al., 2020, Engquist et al., 2019, Peyre, 2011).

For empirical measures in one dimension, $W_2^2$ admits a quantile representation:

$$W_2^2(P, Q) = \int_0^1 \left[ F_P^{-1}(u) - F_Q^{-1}(u) \right]^2 du$$

where $F_P^{-1}, F_Q^{-1}$ denote the quantile functions (Berthet et al., 2020).
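
In one dimension the quantile formula is straightforward to evaluate numerically. The following minimal NumPy sketch (not taken from the cited papers) covers the case of two empirical measures with the same number of equally weighted atoms, where the formula reduces to matching sorted samples.

```python
import numpy as np

def w2_empirical_1d(x, y):
    """W2 between two 1-D empirical measures with the same sample size.

    With n equally weighted atoms on each side, the quantile formula
    reduces to matching sorted samples: W2^2 = mean_i (x_(i) - y_(i))^2.
    """
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert x.shape == y.shape, "this sketch assumes equal sample sizes"
    return np.sqrt(np.mean((x - y) ** 2))

# Example: W2 between N(0,1) and N(2,1) is 2 (pure mean shift)
rng = np.random.default_rng(0)
print(w2_empirical_1d(rng.normal(0, 1, 5000), rng.normal(2, 1, 5000)))  # ~2.0
```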

Core properties include:

  • Metric structure: $W_2$ metrizes weak convergence together with convergence of second moments; $(\mathcal{P}_2(\mathbb{R}^d), W_2)$ is a complete, separable geodesic space.
  • Translation invariance: $W_2(\mu(\cdot+s), \nu(\cdot+s)) = W_2(\mu, \nu)$ for any $s \in \mathbb{R}^d$ (Wang et al., 2024).
  • Closed-form for Gaussians: for $N(m_1, C_1)$ and $N(m_2, C_2)$,

$$W_2^2 = \|m_1 - m_2\|^2 + \operatorname{Tr}\left(C_1 + C_2 - 2\left(C_1^{1/2} C_2 C_1^{1/2}\right)^{1/2}\right)$$

(Oh et al., 2019, Engquist et al., 2019).
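
The Gaussian closed form is equally easy to evaluate. The sketch below is one possible implementation; the `psd_sqrt` helper (an eigendecomposition-based matrix square root) is an implementation choice, not part of the cited formulations.

```python
import numpy as np

def psd_sqrt(A):
    """Square root of a symmetric positive semidefinite matrix (via eigendecomposition)."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def w2_gaussian(m1, C1, m2, C2):
    """Closed-form W2 between N(m1, C1) and N(m2, C2)."""
    s1 = psd_sqrt(C1)
    cross = psd_sqrt(s1 @ C2 @ s1)
    w2_sq = float(np.sum((m1 - m2) ** 2) + np.trace(C1 + C2 - 2.0 * cross))
    return np.sqrt(max(w2_sq, 0.0))

# Example: isotropic case, where W2^2 = |m1 - m2|^2 + d * (sigma1 - sigma2)^2
m1, m2 = np.zeros(3), np.array([1.0, 0.0, 0.0])
print(w2_gaussian(m1, np.eye(3), m2, 4.0 * np.eye(3)))  # sqrt(1 + 3) = 2.0
```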

2. Duality, Gradient Flow, and Riemannian Formalism

The dual (Kantorovich) formulation is

$$W_2^2(\mu, \nu) = \sup_{\varphi \in L^1(\mu)} \left\{ \int \varphi\, d\mu + \int \varphi^c\, d\nu \right\}$$

where $\varphi^c(y) = \inf_{x} \{\|x-y\|^2 - \varphi(x)\}$ (Engquist et al., 2019, Huang et al., 2024, Korotin et al., 2019, Berthet et al., 2020). For $\mu$ absolutely continuous with respect to Lebesgue measure, the Monge formulation seeks a transport map $T$ pushing $\mu$ to $\nu$ while minimizing $\int \|x-T(x)\|^2\, d\mu(x)$.

Displacement interpolation yields constant-speed geodesics:

$$\rho_t = \big((1-t)\,\mathrm{id} + t\,T\big)_{\#}\mu, \qquad T = \nabla\varphi \ \text{(Brenier map)}.$$

This underlies the formal Riemannian structure on $\mathcal{P}_2$: tangent vectors at $\mu$ are gradients $\nabla\psi$, with the inner product $\int \langle \nabla\psi_1, \nabla\psi_2\rangle\, d\mu$ (Hamm et al., 2023).
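
As an illustration of displacement interpolation, the following sketch computes the interpolant $\rho_t$ between two one-dimensional empirical measures; it assumes equal sample sizes, in which case the Brenier (monotone) map is simply the sorted-sample matching.

```python
import numpy as np

def displacement_interpolation_1d(x, y, t):
    """Displacement (McCann) interpolant between two 1-D empirical measures.

    For equal-size, equally weighted samples the optimal (monotone) map sends
    the i-th order statistic of x to the i-th order statistic of y, so the
    interpolant rho_t is the empirical measure on (1-t) x_(i) + t y_(i).
    """
    xs, ys = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert xs.shape == ys.shape, "this sketch assumes equal sample sizes"
    return (1.0 - t) * xs + t * ys

rng = np.random.default_rng(1)
x = rng.normal(-2.0, 0.5, 1000)   # source sample
y = rng.normal(3.0, 1.5, 1000)    # target sample
mid = displacement_interpolation_1d(x, y, 0.5)
print(mid.mean(), mid.std())      # roughly 0.5 and roughly 1.0 (Gaussian geodesic)
```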

In the dynamic (Benamou–Brenier) formulation,

$$W_2^2(\mu, \nu) = \inf_{(\rho, v)} \int_0^1\!\!\int \|v_t(x)\|^2\, \rho_t(x)\, dx\, dt$$

subject to $\partial_t \rho + \nabla\cdot(v \rho) = 0$, $\rho_0 = \mu$, $\rho_1 = \nu$ (Hamm et al., 2023, Peyre, 2011).

Gradient flows in $(\mathcal{P}_2, W_2)$ (e.g., minimizing $\mu \mapsto W_2^2(\mu, \nu)$) correspond to solutions of ODEs of the form $dY_t = -\nabla\varphi_{\mu_t}^{\nu}(Y_t)\, dt$, where $\varphi_{\mu_t}^{\nu}$ is the Kantorovich potential (Huang et al., 2024). Exponential convergence rates in Wasserstein space can be established under convexity assumptions (Ren et al., 2020).
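
A rough particle-level sketch of such a flow in one dimension: identifying $\nabla\varphi_{\mu_t}^{\nu}(x)$ with $x - T_{\mu_t \to \nu}(x)$ (up to a constant depending on the cost normalization), each particle is pushed toward its current optimal-transport target by forward-Euler steps. This is an illustrative discretization, not an implementation from the cited works.

```python
import numpy as np

def monotone_map(particles, target):
    """1-D optimal map for equal sizes: i-th smallest particle -> i-th smallest target."""
    order = np.argsort(particles)
    mapped = np.empty_like(particles)
    mapped[order] = np.sort(target)
    return mapped

def w2_particle_flow(particles, target, step=0.2, n_steps=50):
    """Forward-Euler discretization of dY = -(Y - T(Y)) dt toward the target measure."""
    y = np.asarray(particles, float).copy()
    tgt = np.asarray(target, float)
    for _ in range(n_steps):
        y -= step * (y - monotone_map(y, tgt))
    return y

rng = np.random.default_rng(2)
mu0 = rng.normal(-3.0, 0.3, 2000)   # initial particle positions
nu = rng.normal(1.0, 1.0, 2000)     # sample of the target measure
out = w2_particle_flow(mu0, nu)
print(out.mean(), out.std())        # approaches roughly 1.0 and 1.0
```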

3. Linearization and Equivalence with Weighted Sobolev Norms

For a distribution $\mu$ and a small perturbation $\nu = \mu + \delta\mu$, $W_2$ linearizes to a negative weighted Sobolev norm: $W_2(\mu, \mu + \delta\mu) = \|\delta\mu\|_{H(\mu)} + o(\|\delta\mu\|)$, where

$$\|\nu\|_{H(\mu)} = \sup\left\{\int f\, d\nu \;:\; \int |\nabla f|^2\, d\mu \leq 1\right\}$$

(Peyre, 2011, Greengard et al., 2022, Engquist et al., 2019). For a smooth reference measure $\mu$, $W_2$ is equivalent (up to explicit factors) to this dual Sobolev norm, justifying its use in analytic and geometric arguments.

A quantitative version is

$$\left|\, W_2(\mu, \nu) - \epsilon \|\delta\mu\|_{\dot H^{-1}(d\mu)} \,\right| \leq C\epsilon^2$$

for densities $g = (1+\epsilon u)f$ near $f$, where $\|\cdot\|_{\dot H^{-1}(d\mu)}$ is the weighted norm arising from the linearized Monge–Ampère equation (Greengard et al., 2022).
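
A small numerical illustration of this linearization in one dimension: for densities $f$ and $g = (1+\epsilon u)f$, the weighted dual Sobolev norm admits the one-dimensional closed form $\|g - f\|^2_{\dot H^{-1}(f)} = \int (F_g - F_f)^2 / f\, dx$ (a standard identity assumed in this sketch), which can be compared directly with the exact $W_2$ from the quantile formula.

```python
import numpy as np

def trap(y, x):
    """Trapezoidal rule (kept explicit to avoid NumPy-version-specific helpers)."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

eps = 0.1
x = np.linspace(-6.0, 6.0, 8001)
f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)     # reference density (standard normal)
g = (1.0 + eps * np.sin(x)) * f                # perturbation; sin has zero mean under f

# CDFs on the grid
Ff = np.concatenate([[0.0], np.cumsum((f[1:] + f[:-1]) * np.diff(x) / 2)])
Fg = np.concatenate([[0.0], np.cumsum((g[1:] + g[:-1]) * np.diff(x) / 2)])

# Exact W2 via the 1-D quantile formula
u = np.linspace(1e-6, 1 - 1e-6, 20001)
w2 = np.sqrt(trap((np.interp(u, Ff, x) - np.interp(u, Fg, x)) ** 2, u))

# Weighted dual Sobolev norm, 1-D closed form: ||g-f||^2 = int (Fg - Ff)^2 / f dx
h_norm = np.sqrt(trap((Fg - Ff) ** 2 / f, x))

print(w2, h_norm)   # the two values agree up to O(eps^2)
```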

Localization results show that $W_2(\phi\mu, \phi\nu)$ can be bounded above by an explicit multiple of $W_2(\mu, \nu)$ for suitable bump functions $\phi$ (Peyre, 2011).

4. Statistical and Computational Aspects

For empirical distributions from i.i.d. samples $X_1, \ldots, X_n \sim \mathcal{N}(0,1)$, the rate of convergence of the mean $W_2$ distance to the true law is

$$\mathbb{E}\big[W_2(\mathbb{F}_n, \Phi)\big] = \sqrt{\frac{\log\log n}{n}}\,\big(1 + o(1)\big),$$

slower than the classical $n^{-1/2}$ rate, due to large Gaussian tail fluctuations. For two correlated samples, the weak convergence rate reverts to $1/\sqrt{n}$ (Berthet et al., 2020).
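
This behavior can be illustrated (only roughly, since the asymptotics set in slowly) by a small Monte Carlo experiment that evaluates $W_2(\mathbb{F}_n, \Phi)$ via the quantile formula on a fine grid; the grid size and replication count below are arbitrary choices, and SciPy is used only for the normal quantile function.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

def w2_empirical_vs_normal(sample, n_grid=20000):
    """W2 between the empirical measure of `sample` and N(0,1), via the quantile
    formula evaluated on a midpoint grid in (0,1) (illustrative accuracy only)."""
    xs = np.sort(np.asarray(sample, float))
    n = xs.size
    u = (np.arange(n_grid) + 0.5) / n_grid
    emp_q = xs[np.minimum((u * n).astype(int), n - 1)]   # empirical quantile function
    return np.sqrt(np.mean((emp_q - norm.ppf(u)) ** 2))

for n in (100, 1000, 10000):
    mean_w2 = np.mean([w2_empirical_vs_normal(rng.standard_normal(n)) for _ in range(50)])
    print(n, mean_w2, np.sqrt(np.log(np.log(n)) / n))    # same order of magnitude
```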

Computational solvers include entropy-regularized Sinkhorn iterations and neural approaches such as ICNN-based estimation of optimal transport maps (Korotin et al., 2019); a Sinkhorn sketch follows below.

Algorithmic variants exploit the convexity structure induced by $W_2$ in parameter spaces: for instance, the $W_2$ loss over affine-Gaussian families is globally convex, and its gradient is a preconditioned version of the $L^2$ gradient, leading to smoother, better-conditioned optimization landscapes (Engquist et al., 2019).
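
As a concrete example of the Sinkhorn approach mentioned above, the following is a self-contained log-domain Sinkhorn sketch for the squared-Euclidean cost; the regularization strength and iteration count are illustrative choices, and the entropic regularization biases the returned value relative to the exact $W_2^2$.

```python
import numpy as np

def sinkhorn_w2_sq(x, y, reg=0.5, n_iters=500):
    """Entropy-regularized approximation of W2^2 between two uniform point clouds.

    Log-domain Sinkhorn for the squared-Euclidean cost; returns <P, C> under the
    regularized plan P, typically slightly above the exact W2^2 because the
    regularized plan is a feasible but suboptimal coupling.
    """
    x, y = np.atleast_2d(np.asarray(x, float)), np.atleast_2d(np.asarray(y, float))
    n, m = x.shape[0], y.shape[0]
    log_a, log_b = np.full(n, -np.log(n)), np.full(m, -np.log(m))
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # squared Euclidean cost matrix
    f, g = np.zeros(n), np.zeros(m)                       # scaled dual potentials

    def lse(M, axis):                                      # stabilized log-sum-exp
        mx = M.max(axis=axis, keepdims=True)
        return (mx + np.log(np.exp(M - mx).sum(axis=axis, keepdims=True))).squeeze(axis)

    for _ in range(n_iters):                               # alternating dual updates
        f = reg * (log_a - lse((g[None, :] - C) / reg, axis=1))
        g = reg * (log_b - lse((f[:, None] - C) / reg, axis=0))
    P = np.exp((f[:, None] + g[None, :] - C) / reg)        # transport plan
    return float((P * C).sum())

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, (300, 2))
y = rng.normal(2.0, 1.0, (300, 2))
print(sinkhorn_w2_sq(x, y))   # near |mean shift|^2 = 8, up to sampling and entropic bias
```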

5. Applications Across Fields

(A) Inverse Problems: $W_2$ provides robustness against high-frequency data noise, leading to smoothing effects in inversion but reduced spatial resolution. Compared with $L^2$, $W_2$ yields more favorable convexity properties in parameter spaces of practical inverse problems (Engquist et al., 2019).

(B) Machine Learning and Generative Modeling: Wasserstein-2 metrics underpin algorithms in unsupervised learning, such as W2-GAN and non-minimax training of optimal transport maps via ICNNs; these models demonstrate advantages in image translation, style transfer, and domain adaptation tasks (Korotin et al., 2019, Huang et al., 2024, Oh et al., 2019).

(C) Stochastic Processes: $W_2$ is the natural metric for quantifying convergence of distributions in mean-field SDEs and McKean–Vlasov equations, and for bounding control errors in SDE parameter inference (Huang et al., 2024, Ren et al., 2020, Xia et al., 2024).

(D) Manifold and Geometric Learning: The 2-Wasserstein distance encodes a Riemannian geometry on the space of absolutely continuous measures, allowing the recovery of tangent spaces and geodesic structures in data-driven manifold learning (Hamm et al., 2023).

(E) Geometric Measure Theory: Localized variants lead to necessary and sufficient characterizations of rectifiability; the square-integrable $\alpha_2$ numbers, based on local $W_2$ flatness, provide a scale-invariant, transport-based criterion for $n$-rectifiability (Dąbrowski, 2019).

6. Recent Extensions and Variants

Relative-translation invariant Wasserstein ($RW_2$):

$$W_2^2(\mu, \nu) = RW_2^2(\mu, \nu) + \|\bar{\mu} - \bar{\nu}\|^2$$

where $RW_2$ is the Wasserstein distance minimized over all relative translations and $\bar{\mu}, \bar{\nu}$ denote the barycenters (means). This provides a bias-variance decomposition of distribution shift and practical robustness to global translations, and enables efficient computation via a barycenter-alignment step followed by standard Sinkhorn iterations (Wang et al., 2024).
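
The decomposition can be checked numerically in one dimension, where $RW_2$ may be computed as the $W_2$ distance between the mean-centered samples (for quadratic cost the minimizing translation aligns the barycenters); this toy check is not taken from the cited paper.

```python
import numpy as np

def w2_1d(x, y):
    """1-D W2 between equal-size empirical measures (sorted-sample matching)."""
    return np.sqrt(np.mean((np.sort(x) - np.sort(y)) ** 2))

rng = np.random.default_rng(5)
x = rng.gamma(2.0, 1.0, 4000)            # sample of mu
y = rng.gamma(5.0, 0.5, 4000) + 3.0      # sample of nu, with an extra global shift
rw2 = w2_1d(x - x.mean(), y - y.mean())  # RW2: center at the barycenters, then W2
mean_gap_sq = (x.mean() - y.mean()) ** 2
print(w2_1d(x, y) ** 2, rw2 ** 2 + mean_gap_sq)   # the two quantities coincide
```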

Kernel Wasserstein Distance: For data in nonlinear feature spaces, $W_2$ may be computed in a reproducing kernel Hilbert space (RKHS) using empirical mean and covariance embeddings, with practical success in imaging clustering and artifact detection (Oh et al., 2019).

7. Theoretical and Practical Considerations

The $W_2$ metric supports:

  • Stability: Small $L^2$ or $H^{-1}$ perturbations correspond to small changes in $W_2$, with explicit constants under density and curvature conditions (Peyre, 2011, Greengard et al., 2022, Engquist et al., 2019).
  • Localization: The $W_2$ distance is stable under restriction to subsets via smooth cutoff functions (Peyre, 2011).
  • High-dimensional robustness: $W_2$ maintains interpretability and computational feasibility in high dimensions via regularized and neural approaches (Korotin et al., 2019, Engquist et al., 2019).
  • Limitations: Local Gaussian or RKHS-based approximations lose fine multimodal structure; the computational cost of exact $W_2$ scales cubically in the number of points but is mitigated by Sinkhorn and neural methods (Oh et al., 2019).

In summary, the Wasserstein-2 distance and its associated geometries form the backbone of modern optimal transport theory and its applications, offering precise metrics, algorithmic tractability, and a pathway to interpretability across modern data-driven disciplines (Berthet et al., 2020, Engquist et al., 2019, Greengard et al., 2022, Wang et al., 2024, Huang et al., 2024, Hamm et al., 2023, Ren et al., 2020, Oh et al., 2019, Dąbrowski, 2019, Peyre, 2011).
