Wasserstein Distance: Theory & Applications

Updated 8 December 2025
  • Wasserstein distance is a metric that quantifies discrepancies between probability measures by solving an optimal mass transport problem.
  • It encompasses both deterministic (Monge) and relaxed (Kantorovich) formulations, providing a geometric framework for comparing distributions.
  • Its applications span imaging, machine learning, PDE analysis, and data science, addressing issues like mass fluctuations and stability.

The Wasserstein distance, also known as the optimal transport (OT) distance or Earth Mover's Distance (EMD) in specific cases, is a fundamental metric that quantifies the discrepancy between probability measures by solving a mass transportation problem. It provides a geometric framework for comparing distributions and has found wide application across mathematics, probability theory, statistics, machine learning, computer vision, and the analysis of partial differential equations.

1. Mathematical Definition and Foundational Principles

Let $(\mathcal{X}, d)$ be a complete separable metric space, and let $\mathcal{P}_p(\mathcal{X})$ denote the set of Borel probability measures on $\mathcal{X}$ with finite $p$-th moment. For $\mu, \nu \in \mathcal{P}_p(\mathcal{X})$, the $p$-Wasserstein distance is defined as

$$W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{\mathcal{X}\times\mathcal{X}} d(x, y)^p\, d\gamma(x, y) \right)^{1/p}$$

where $\Gamma(\mu, \nu)$ is the set of all transport plans (couplings) with marginals $\mu$ and $\nu$.
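To make the definition concrete, the following is a minimal sketch for the simplest setting only: two empirical measures on the real line with the same number of equally weighted samples. In that special case the optimal coupling pairs the sorted samples in order, so no optimization is needed. The function name `wasserstein_1d` is hypothetical and is not taken from any library.

```python
def wasserstein_1d(xs, ys, p=1.0):
    """p-Wasserstein distance between two empirical measures on the real line.

    Assumes both samples have the same size n and uniform weights 1/n;
    in that case the optimal coupling matches sorted points in order.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    if p < 1:
        raise ValueError("p must be at least 1")
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    # Average transport cost |x - y|^p under the monotone (sorted) coupling.
    cost = sum(abs(x - y) ** p for x, y in zip(xs_sorted, ys_sorted)) / len(xs)
    return cost ** (1.0 / p)


mu = [0.0, 1.0, 3.0]
nu = [5.0, 4.0, 6.0]  # sorted: 4, 5, 6
print(wasserstein_1d(mu, nu, p=1))  # mean of |0-4|, |1-5|, |3-6|
print(wasserstein_1d(mu, nu, p=2))
```

For general weights, unequal sample sizes, or higher dimensions, the infimum over couplings becomes a linear program and this shortcut no longer applies.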

The $p=1$ case admits the Kantorovich–Rubinstein dual representation:

$$W_1(\mu, \nu) = \sup_{\|f\|_{\text{Lip}} \le 1} \left\{ \int f\, d\mu - \int f\, d\nu \right\}$$

where $\|f\|_{\text{Lip}}$ denotes the Lipschitz constant of $f$ (Piccoli et al., 2013).
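The dual representation can be checked by hand on small examples. The sketch below assumes discrete probability measures on the real line, where $W_1$ equals the integral of the absolute difference of the two cumulative distribution functions, and compares that value with the dual objective for one particular 1-Lipschitz test function. Both function names are hypothetical.

```python
def w1_discrete_line(points_mu, weights_mu, points_nu, weights_nu):
    """W_1 between two discrete probability measures on the real line.

    Uses the one-dimensional identity W_1 = integral of |F_mu - F_nu|,
    where F denotes the cumulative distribution function.
    """
    # Signed mass of mu - nu at each support point.
    signed = {}
    for x, w in zip(points_mu, weights_mu):
        signed[x] = signed.get(x, 0.0) + w
    for y, w in zip(points_nu, weights_nu):
        signed[y] = signed.get(y, 0.0) - w
    support = sorted(signed)
    total, cdf_gap = 0.0, 0.0
    for left, right in zip(support, support[1:]):
        cdf_gap += signed[left]  # value of F_mu - F_nu on [left, right)
        total += abs(cdf_gap) * (right - left)
    return total


def dual_value(f, points_mu, weights_mu, points_nu, weights_nu):
    """Integral of f against mu minus integral of f against nu."""
    return (sum(w * f(x) for x, w in zip(points_mu, weights_mu))
            - sum(w * f(y) for y, w in zip(points_nu, weights_nu)))


pts_mu, wts_mu = [0.0, 2.0], [0.5, 0.5]
pts_nu, wts_nu = [1.0, 3.0], [0.5, 0.5]
w1 = w1_discrete_line(pts_mu, wts_mu, pts_nu, wts_nu)
# f(x) = -x is 1-Lipschitz, so its dual value is a lower bound on W_1.
lower = dual_value(lambda x: -x, pts_mu, wts_mu, pts_nu, wts_nu)
print(w1, lower)
```

In this example the test function $f(x) = -x$ already attains the supremum, because all mass moves to the right; in general a single test function only gives a lower bound.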

The Wasserstein distance is a bona fide metric on $\mathcal{P}_p(\mathcal{X})$, satisfying nonnegativity, symmetry, identity of indiscernibles, and the triangle inequality (Panaretos et al., 2018). Finiteness of the $p$-th moments of both measures guarantees that $W_p$ is finite.

2. Interpretation, Variants, and Duality

Monge and Kantorovich Formulations

The OT formulation seeks the least-cost way of transporting one distribution onto another. The Monge problem requires a deterministic transport map $T$ that pushes $\mu$ forward to $\nu$ (that is, $T_\# \mu = \nu$) and minimizes $\int |x-T(x)|^p\, d\mu(x)$, while Kantorovich's relaxation allows for couplings $\gamma$ and always attains its minimum (Piccoli et al., 2013).
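A Monge map need not exist: no map can push $\delta_0$ onto $\tfrac12\delta_{-1} + \tfrac12\delta_{1}$, since a map cannot split mass, whereas a coupling can. When both measures are uniform on $n$ points, however, a Monge map is simply a permutation, and the Kantorovich minimum is attained at one. The sketch below illustrates this special case by exhaustive search; the function name `monge_bruteforce` is hypothetical, and the approach is only feasible for very small $n$.

```python
from itertools import permutations


def monge_bruteforce(xs, ys, p=2):
    """Cheapest bijection between two equal-size point sets on the line.

    With uniform weights 1/n on both sides, a Monge map is a permutation
    sigma sending xs[i] to ys[sigma[i]]. Exhaustive search examines all
    n! candidates, so it is only practical for very small n.
    """
    n = len(xs)
    if n != len(ys) or n == 0:
        raise ValueError("point sets must be non-empty and of equal size")
    best_cost, best_map = None, None
    for sigma in permutations(range(n)):
        cost = sum(abs(xs[i] - ys[sigma[i]]) ** p for i in range(n)) / n
        if best_cost is None or cost < best_cost:
            best_cost, best_map = cost, sigma
    return best_cost, best_map


xs = [0.0, 1.0, 2.0]
ys = [2.5, 0.5, 1.5]
cost, sigma = monge_bruteforce(xs, ys, p=2)
print(cost, sigma)  # sigma[i] is the index in ys assigned to xs[i]
```

Here the optimal map is the monotone one (0 to 0.5, 1 to 1.5, 2 to 2.5), as expected for a convex cost on the line.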

Benamou–Brenier Dynamical Characterization

For $p=2$, there is a dynamic fluid-mechanical representation:

$$W_2^2(\mu_0, \mu_1) = \inf_{(\mu_t, v_t)} \int_0^1 \int_{\mathcal{X}} |v_t(x)|^2\, d\mu_t(x)\,dt$$

subject to the continuity equation

$$\partial_t \mu_t + \nabla \cdot (\mu_t v_t) = 0,\qquad \mu_{t=0} = \mu_0,~ \mu_{t=1} = \mu_1$$

(Piccoli et al., 2013).
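A worked example may help fix ideas. Take $\mathcal{X} = \mathbb{R}^d$ and move a unit Dirac mass from $0$ to $x$. The straight-line path at constant velocity has action $|x|^2$, which matches $W_2^2(\delta_0, \delta_x)$; any other time parametrization $s(t)$ of the same segment, with $s(0)=0$ and $s(1)=1$, costs at least as much by the Cauchy–Schwarz inequality. This illustrates the formula on one example and is not a proof of the general statement.

```latex
\begin{align*}
\mu_t &= \delta_{tx}, \qquad v_t \equiv x, \qquad t \in [0,1], \\
\int_0^1 \int_{\mathbb{R}^d} |v_t(y)|^2 \, d\mu_t(y)\, dt
  &= \int_0^1 |x|^2 \, dt = |x|^2 = W_2^2(\delta_0, \delta_x), \\
\mu_t = \delta_{s(t)x}: \quad \int_0^1 |s'(t)|^2\, |x|^2 \, dt
  &\ge |x|^2 \Bigl( \int_0^1 s'(t)\, dt \Bigr)^2 = |x|^2 .
\end{align*}
```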

The Flat Metric

For arbitrary (possibly unequal mass) Radon measures, the generalized Wasserstein distance $W_1^{1,1}$ coincides with the flat (bounded-Lipschitz) metric:

$$W_1^{1,1}(\mu, \nu) = \sup_{f \in C_c,~ \|f\|_\infty \le 1,~ \text{Lip}(f) \le 1} \int f\, d(\mu-\nu)$$

(Piccoli et al., 2013).

3. Extensions and Computational Methods

Generalized Wasserstein Distance

For measures $\mu, \nu$ of possibly differing total mass and parameters $a, b > 0$, the generalized Wasserstein distance is defined as

$$W_p^{a,b}(\mu, \nu) := \left( \inf_{\tilde\mu, \tilde\nu \in \mathcal{M}} a^p(|\mu - \tilde\mu| + |\nu - \tilde\nu|)^p + b^p W_p^p(\tilde\mu, \tilde\nu) \right)^{1/p}$$

where $|\mu - \tilde\mu|$ is the total variation of the "removed" mass, and the infimum is over pairs $(\tilde\mu, \tilde\nu)$ with equal total mass. The parameter $a$ penalizes creation and removal of mass, $b$ penalizes transport, and $p$ controls how the two costs are aggregated (Piccoli et al., 2013).

Generalized Benamou–Brenier Formula

A dynamic formulation extends to $W_2^{a,b}$:

$$W_2^{a, b}(\mu_0, \mu_1)^2 = \inf_{(\mu, v, h) \in \mathcal{V}(\mu_0, \mu_1)} a^2\left( \int_0^1 |h_t|(\mathbb{R}^d)\, dt \right)^2 + b^2 \int_0^1 \int_{\mathbb{R}^d} |v_t(x)|^2\, d\mu_t(x)\, dt$$

where $h$ encodes sources and sinks, and the continuity equation carries a source term: $\partial_t \mu_t + \nabla \cdot (\mu_t v_t) = h_t$. This subsumes pure mass transport and allows for creation and removal of mass (Piccoli et al., 2013).
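As a sanity check on this formula, one candidate path uses no transport at all and only a constant-in-time source. The sketch below assumes that the admissible class $\mathcal{V}(\mu_0, \mu_1)$ contains such a triple, which the text above does not spell out. Under that assumption, linear interpolation between the two measures yields an upper bound by $a$ times the total variation of $\mu_1 - \mu_0$.

```latex
\begin{align*}
\mu_t &= (1-t)\,\mu_0 + t\,\mu_1, \qquad v_t \equiv 0, \qquad h_t \equiv \mu_1 - \mu_0, \\
\partial_t \mu_t + \nabla \cdot (\mu_t v_t) &= \mu_1 - \mu_0 = h_t, \\
W_2^{a,b}(\mu_0, \mu_1)^2
  &\le a^2 \Bigl( \int_0^1 |\mu_1 - \mu_0|(\mathbb{R}^d)\, dt \Bigr)^2 + 0
   = a^2\, \bigl( |\mu_1 - \mu_0|(\mathbb{R}^d) \bigr)^2 .
\end{align*}
```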

Existence and Homogeneity

$W_p^{a,b}$ is a metric on the cone of nonnegative Radon measures, is positively homogeneous, $W_p^{a,b}(c\mu, c\nu) = c\, W_p^{a,b}(\mu, \nu)$ for any $c>0$, and attains its infimum for each pair of measures (Piccoli et al., 2013).

4. Analytical and Practical Properties

Mass Mismatch and Total Variation

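When $a \to \infty$ with $b > 0$ fixed, creating or removing mass becomes prohibitively expensive and, for measures of equal mass, $W_p^{a,b}$ reduces to $b\,W_p$; when $b \to \infty$ with $a$ fixed, transport becomes prohibitively expensive and $W_p^{a,b}$ reduces to $a$ times the total variation norm $|\mu - \nu|$. In the opposite regime $a \to 0$ the distance vanishes, since removing both measures entirely costs only $a(|\mu| + |\nu|)$. For $p=1$, $a=b=1$, $W_1^{1,1}$ coincides with the flat metric (Piccoli et al., 2013). Explicitly, for $\mu = \delta_0$, $\nu = \alpha \delta_x$,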

$$W_1^{a,b}(\mu, \nu) = \inf_{0 \le m \le \min(1,\alpha)} a(\alpha + 1 - 2m) + b\, m\, |x|$$

exhibiting the tradeoff between removal/addition and transportation costs.
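The two-Dirac formula above is easy to evaluate directly. The sketch below assumes a scalar position `x` and uses the fact that the objective is linear in the transported mass $m$, so a minimizer sits at an endpoint of the interval $[0, \min(1, \alpha)]$. The function name `generalized_w1_two_diracs` is hypothetical.

```python
def generalized_w1_two_diracs(alpha, x, a=1.0, b=1.0):
    """Value and optimal transported mass m for mu = delta_0, nu = alpha*delta_x.

    Minimizes a*(alpha + 1 - 2*m) + b*m*|x| over 0 <= m <= min(1, alpha).
    The objective is linear in m, so a minimizer sits at an endpoint.
    Assumes x is a real number (one-dimensional position).
    """
    if alpha < 0 or a <= 0 or b <= 0:
        raise ValueError("need alpha >= 0 and a, b > 0")
    m_max = min(1.0, alpha)

    def objective(m):
        return a * (alpha + 1.0 - 2.0 * m) + b * m * abs(x)

    # Compare the two endpoints; ties resolve to m = 0.
    best_m = min((0.0, m_max), key=objective)
    return objective(best_m), best_m


print(generalized_w1_two_diracs(alpha=1.0, x=0.5))  # short distance: transport
print(generalized_w1_two_diracs(alpha=1.0, x=3.0))  # long distance: remove and re-create
print(generalized_w1_two_diracs(alpha=2.0, x=0.5))  # unequal mass: transport 1, create 1
```

With $a = b = 1$ and $\alpha = 1$ the value equals $\min(2, |x|)$: once the points are more than two units apart, the distance saturates.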

Connection to Partial Differential Equations

Wasserstein distances and their generalizations are especially relevant for evolution equations such as the continuity equation with source, where one typically needs to compare measures of variable mass. The $W_p^{a,b}$ framework is adapted to these contexts and yields contraction or stability estimates even for solutions that do not preserve total mass (Piccoli et al., 2013).

Limits and Interpolations

$W_p^{a,b}$ provides a continuous interpolation between the total variation ($L^1$-type) distance (as $b \to \infty$, when transport becomes prohibitively expensive) and the classical Wasserstein distance (as $a \to \infty$, when creation and removal become prohibitively expensive). This is particularly valuable in applications such as comparing histograms of unequal mass, which are common in imaging and statistical data analysis.

5. Theoretical and Algorithmic Framework

Fenchel–Legendre Duality

The proof of the equivalence between $W_1^{1,1}$ and the flat metric relies on convex analysis and Fenchel–Legendre duality: the sum of convex indicators for $\|f\|_\infty \le 1$ and $\text{Lip}(f) \le 1$ leads, via a theorem of Rockafellar, to a dual representation that exactly matches the primal $W_1^{1,1}$ definition (Piccoli et al., 2013).

Algorithmic Considerations

For $p=2$, the dynamic Benamou–Brenier formulation yields an explicit minimization over velocity fields and source terms. The infimum is attained, and the action can be constructed explicitly through "sample-and-hold" schemes that alternate between mass removal, transport, and creation in small time intervals. Convexity and stability under flow are key technical lemmas supporting these constructions.

Examples of Computation

For measures concentrated on points with different masses, the optimal decomposition may entail only mass removal and addition, only transport, or a mixture, determined by the ratio $b|x|/(2a)$. If $b|x| > 2a$, it is optimal to remove and add all of the mass; otherwise, it pays to transport as much mass as possible and to remove or add only the excess.
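The threshold can be read off from a one-line rearrangement. For the two-point example $\mu = \delta_0$, $\nu = \alpha\delta_x$ with $p = 1$, the cost of transporting a mass $m$ and removing or adding the rest is linear in $m$, so the sign of $b|x| - 2a$ decides which endpoint is optimal (at equality every admissible $m$ gives the same cost):

```latex
\begin{align*}
a(\alpha + 1 - 2m) + b\,m\,|x| &= a(\alpha + 1) + m\,\bigl(b|x| - 2a\bigr),
  \qquad 0 \le m \le \min(1,\alpha), \\
b|x| > 2a &\;\Longrightarrow\; m^* = 0, \quad W_1^{a,b}(\mu,\nu) = a(\alpha + 1), \\
b|x| < 2a &\;\Longrightarrow\; m^* = \min(1,\alpha), \quad
  W_1^{a,b}(\mu,\nu) = a\,|\alpha - 1| + b\,\min(1,\alpha)\,|x| .
\end{align*}
```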

6. Applications and Implications

Imaging, Data Analysis, and Beyond

The generalized Wasserstein metric $W_p^{a,b}$ allows meaningful comparison of data distributions (histograms, point clouds) with mass fluctuations. This is essential in image processing and vision, where illumination or occlusion can alter total mass, and in statistical analysis of data sets with missing data or over-sampling.

PDE Theory and Contractivity

$W_p^{a,b}$ has enabled new existence and stability results for evolution equations with source terms, accommodating solutions where total mass is not preserved, and guaranteeing meaningful contractivity in this extended framework (Piccoli et al., 2013).

Hierarchical Relation to $W_p$

$W_p^{a,b}$ recovers $W_p$ and the total variation metric in limits and thus underlies a unifying theory for purely geometric transport and purely mass error terms.


References:

  • Piccoli, B. & Rossi, F. "On properties of the Generalized Wasserstein distance" (Piccoli et al., 2013)
  • Benamou, J.-D. & Brenier, Y. "A computational fluid mechanics solution to the Monge–Kantorovich mass transfer problem"
  • Villani, C. "Optimal Transport: Old and New," Springer

This summary covers the structure, properties, dualities, analytical formulations, and main application domains of the classical and generalized Wasserstein distances as developed in (Piccoli et al., 2013).
