
1-Wasserstein Distance Overview

Updated 26 November 2025
  • 1-Wasserstein Distance is a metric that measures the minimal transport cost required to morph one probability distribution into another.
  • It admits dual formulations and explicit closed-form expressions in one-dimensional and discrete settings, which facilitate optimal transport analysis.
  • Efficient computational methods, including linear programming, primal-dual schemes, and tree-based embeddings, enable scalable approximations for high-dimensional data.

The 1-Wasserstein distance is a foundational metric in optimal transport theory, quantifying the minimal effort required to morph one probability distribution into another with respect to a given ground cost, typically the Euclidean or $\ell_1$ distance. Recognized equivalently as the Earth Mover's Distance (EMD), $W_1$ is central across probability, statistics, machine learning, and computational geometry due to its ability to encode fine-grained geometric properties of distributions, support weak convergence analysis, and underpin practical algorithms for comparing and interpolating measures.

1. Formal Definitions and Dual Representations

Given a complete separable metric space $(\mathcal{X}, d)$ and probability measures $\mu, \nu$ with finite first moments, the 1-Wasserstein distance is defined as

$$W_1(\mu, \nu) = \inf_{\pi \in \Gamma(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{X}} d(x, y) \, \pi(dx, dy)$$

where $\Gamma(\mu, \nu)$ is the set of all couplings on $\mathcal{X} \times \mathcal{X}$ with marginals $\mu$ and $\nu$ (Panaretos et al., 2018). Probabilistically, this infimum is over joint laws of $(X, Y)$ with $X \sim \mu$ and $Y \sim \nu$.

Kantorovich–Rubinstein duality states:

$$W_1(\mu, \nu) = \sup_{\|f\|_{\mathrm{Lip}} \le 1} \left\{ \int f\,d\mu - \int f\,d\nu \right\}$$

where the supremum runs over all 1-Lipschitz real-valued functions on $\mathcal{X}$ (Panaretos et al., 2018, Coutin et al., 2019). This dual form underpins statistical applications and algorithmic relaxations (e.g., WGANs (Stéphanovitch et al., 2022)).
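
As a concrete illustration of the duality, the following is a minimal sketch (assuming a finite metric space; the function name `w1_dual` and the example data are illustrative, not taken from the cited papers) that solves the dual LP with `scipy` by imposing the Lipschitz constraint on every pair of points:

```python
import numpy as np
from scipy.optimize import linprog

def w1_dual(mu, nu, D):
    """Kantorovich-Rubinstein dual on a finite metric space:
    maximize sum_i f_i (mu_i - nu_i) s.t. |f_i - f_j| <= D[i, j]."""
    n = len(mu)
    rows, b = [], []
    for i in range(n):
        for j in range(i + 1, n):
            r = np.zeros(n)
            r[i], r[j] = 1.0, -1.0
            rows.append(r);  b.append(D[i, j])   # f_i - f_j <= d(i, j)
            rows.append(-r); b.append(D[i, j])   # f_j - f_i <= d(i, j)
    # linprog minimizes, so negate the objective; pin f_0 = 0 since the
    # dual objective is invariant under adding a constant to f.
    bounds = [(0, 0)] + [(None, None)] * (n - 1)
    res = linprog(-(mu - nu), A_ub=np.array(rows), b_ub=np.array(b),
                  bounds=bounds, method="highs")
    return -res.fun

# Three points on a line with ground metric d(i, j) = |i - j|.
D = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)
mu = np.array([0.5, 0.5, 0.0])
nu = np.array([0.0, 0.5, 0.5])
print(w1_dual(mu, nu, D))  # 1.0: half the mass moves two steps in total
```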

In continuous settings, the dynamic or flux formulation provides:

$$W_1(\mu, \nu) = \inf_{u} \left\{ \int_{\mathcal{X}} \|u(x)\|\,dx : \nabla \cdot u = \mu - \nu \right\}$$

reflecting the minimal transportation cost as a flow with prescribed divergence (Chen et al., 2017).
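
In one dimension the divergence constraint integrates explicitly: $u' = \mu - \nu$ forces $u(x) = F_\mu(x) - F_\nu(x)$, so the flux formulation collapses to the area between the CDFs (the formula of Section 2). A minimal numerical sketch of this, assuming densities tabulated on a uniform grid:

```python
import numpy as np

# 1D flux form: div u = u' = mu - nu integrates to u = F_mu - F_nu,
# so W1 = integral |u| dx, i.e. the area between the two CDFs.
x = np.linspace(-6, 6, 2401)
dx = x[1] - x[0]
mu = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)        # N(0, 1) density
nu = np.exp(-0.5 * (x - 1)**2) / np.sqrt(2 * np.pi)  # N(1, 1) density

u = np.cumsum(mu - nu) * dx    # discrete antiderivative: the optimal flux
print(np.sum(np.abs(u)) * dx)  # ~1.0, the mean shift between the Gaussians
```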

2. One-Dimensional and Discrete Closed-Form Expressions

When $\mathcal{X} = \mathbb{R}$, $W_1$ admits an explicit formula:

$$W_1(\mu, \nu) = \int_{0}^{1} |F_{\mu}^{-1}(u) - F_{\nu}^{-1}(u)|\,du = \int_{\mathbb{R}} |F_\mu(t) - F_\nu(t)|\,dt$$

where $F_\mu$ is the cumulative distribution function and $F_\mu^{-1}$ the quantile function (Angelis et al., 2021, Panaretos et al., 2018). Geometrically, this is the area between the two CDFs. The copula-theoretic derivation confirms that the optimal coupling, which pairs corresponding quantiles (the "comonotonic coupling"), achieves this minimum (Angelis et al., 2021).
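
A quick numerical check of the quantile form: with two equal-size samples, the comonotonic coupling simply pairs sorted values, and `scipy.stats.wasserstein_distance` computes the same quantity from the CDFs.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, 5000)
ys = rng.normal(1.0, 2.0, 5000)

# Comonotonic coupling of two equal-size empirical measures:
# pair order statistics, then average the absolute gaps.
w1_quantile = np.mean(np.abs(np.sort(xs) - np.sort(ys)))

print(w1_quantile, wasserstein_distance(xs, ys))  # the two values agree
```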

For the finite discrete simplex $\Omega = \Delta^{n-1}$ with ground metric $d(i, j) = |i - j|$, one obtains:

$$W_1(\mu, \nu) = \sum_{k=1}^n \left| \sum_{i=1}^k (\mu_i - \nu_i) \right|$$

which is the $\ell_1$ norm between their cumulative sums (discrete CDFs) (Frohmader et al., 2019).
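
In code this formula is a one-liner; the sketch below (with an illustrative function name) reproduces the dual-LP result from Section 1 on the same pair of measures:

```python
import numpy as np

def w1_simplex(mu, nu):
    """W1 on {1, ..., n} with ground metric |i - j|: the l1 norm
    between the cumulative sums (discrete CDFs) of mu and nu."""
    return np.abs(np.cumsum(mu - nu)).sum()

mu = np.array([0.5, 0.5, 0.0])
nu = np.array([0.0, 0.5, 0.5])
print(w1_simplex(mu, nu))  # 1.0, matching the dual LP above
```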

3. Multivariate, Matrix, and Semi-Discrete Generalizations

For multivariate $X, Y \in \mathbb{R}^d$, $W_1$ generalizes to

$$W_1(\mu, \nu) = \inf_{\pi \in \Gamma(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\| \, \pi(dx, dy)$$

and supports dual formulations via 1-Lipschitz test functions. On the space of $n \times n$ density matrices, a quantum extension of $W_1$ is obtained via noncommutative gradients and nuclear-norm minimization, yielding practical convex optimization problems for matrix-valued or power-spectral data (Chen et al., 2017).

The semi-discrete regime, encountered in WGANs, arises when one measure is continuous and the other atomic. The existence and structure of optimal transport maps can then be described via power diagrams/Voronoi partitioning, with minimizers corresponding to shortest paths or weighted cell equalization (Stéphanovitch et al., 2022).

4. Statistical Properties and Convergence Behavior

The empirical $W_1$ between i.i.d. samples and the parent law is almost surely consistent whenever $\mathbb{E}\,d(x_0, X) < \infty$. On the real line, the optimal expected convergence rate is $O(n^{-1/2})$, provided integrability conditions on $F_\mu$ are met. In higher dimensions $d$, the rate is $O(n^{-1/d})$, reflecting the curse of dimensionality (Panaretos et al., 2018, Stéphanovitch et al., 2022). For random matrix spectra, $W_1$ convergence can even be accelerated due to eigenvalue repulsion ($n^{-1/2}$ for Ginibre, compared to $(\log n / n)^{1/2}$ for i.i.d. points) (Jalowy, 2021).
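
The one-dimensional rate can be seen empirically. The sketch below approximates $W_1$ between an empirical measure and the standard Gaussian via the CDF-area formula on a truncated grid (truncation and grid spacing introduce a small, here negligible, discretization error):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
t = np.linspace(-6, 6, 4001)
dt = t[1] - t[0]
for n in [100, 400, 1600, 6400]:
    errs = []
    for _ in range(20):
        xs = np.sort(rng.normal(size=n))
        F_n = np.searchsorted(xs, t, side="right") / n       # empirical CDF
        errs.append(np.sum(np.abs(F_n - norm.cdf(t))) * dt)  # CDF-area formula
    print(n, np.mean(errs))  # the error shrinks roughly like n ** -0.5
```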

Explicit expressions for $W_1$ in location–scale families (e.g., Gaussian, Laplace) are available. For univariate $X_1 = \alpha_1 + \beta_1 Z$ and $X_2 = \alpha_2 + \beta_2 Z$,

$$W_1(X_1, X_2) = \mathbb{E}|Y|, \qquad Y = (\alpha_1 - \alpha_2) + (\beta_1 - \beta_2) Z,$$

with closed forms obtained for folded normal, Laplace, and other base distributions (Chhachhi et al., 2023).
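
For a Gaussian base, $Y$ is itself Gaussian and $\mathbb{E}|Y|$ is the folded-normal mean, which gives the closed form explicitly. A sketch under that assumption, with a Monte Carlo check through the shared-$Z$ (comonotonic) coupling:

```python
import numpy as np
from scipy.stats import norm

def w1_gaussian(a1, b1, a2, b2):
    """Closed-form W1 between N(a1, b1^2) and N(a2, b2^2): E|Y| for the
    folded normal with Y ~ N(m, s^2), m = a1 - a2, s = |b1 - b2|."""
    m, s = a1 - a2, abs(b1 - b2)
    if s == 0:
        return abs(m)  # pure location shift
    return (s * np.sqrt(2 / np.pi) * np.exp(-m**2 / (2 * s**2))
            + m * (2 * norm.cdf(m / s) - 1))

rng = np.random.default_rng(2)
z = rng.normal(size=200_000)
# Comonotonic coupling: both variables are driven by the same Z.
print(w1_gaussian(0, 1, 1, 2), np.mean(np.abs((0 + 1 * z) - (1 + 2 * z))))
```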

5. Computational Algorithms and Approximations

Linear Programming Methods

For discrete measures, $W_1$ is computable as a min-cost flow or transportation linear program of size $O(n^2)$, with classical solutions scaling as $O(n^3 \log n)$ (Panaretos et al., 2018).
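
A minimal version of this transportation LP, assuming small discrete measures supported on points in $\mathbb{R}^d$ (the function name is illustrative), can be written directly with `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

def w1_primal(x, y, mu, nu):
    """Kantorovich primal LP: minimize <C, P> over couplings P with
    row sums mu and column sums nu; C holds pairwise Euclidean costs."""
    n, m = len(mu), len(nu)
    C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0  # row i of P sums to mu[i]
    for j in range(m):
        A_eq[n + j, j::m] = 1.0           # column j of P sums to nu[j]
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([mu, nu]),
                  bounds=(0, None), method="highs")
    return res.fun

rng = np.random.default_rng(3)
x, y = rng.normal(size=(4, 2)), rng.normal(size=(5, 2))
print(w1_primal(x, y, np.full(4, 0.25), np.full(5, 0.2)))
```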

Primal-Dual and PDE-Based Solvers

For continuous densities on computational domains (e.g., images), primal-dual schemes (e.g., Chambolle–Pock) and PDE discretizations (e.g., Monge–Ampère for $W_2$) provide scalable solvers. Multilevel approaches drastically reduce computation time, achieving $O(N + N^2 \log N)$ complexity for 2D grids, and in practice they vastly outperform traditional flow algorithms for large $N$ (Liu et al., 2018, Snow et al., 2016).

Approximation via Trees and Embeddings

Tree-based embeddings (the tree-Wasserstein distance) yield $O(N)$-time approximate $W_1$ computations by fitting edge weights on tree metrics (via nonnegative Lasso) to match the underlying geometry of the data space, with strong empirical fidelity to exact $W_1$ even in NLP/CV settings (Yamada et al., 2022).
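
On a fixed tree the distance itself has a simple closed form: each edge contributes its weight times the net mass that must cross it. The sketch below shows only this $O(N)$ evaluation step, assuming a parent-array encoding with children indexed after their parents; fitting the edge weights to the data geometry, as in Yamada et al., is a separate step not shown here:

```python
import numpy as np

def tree_w1(parent, weight, mu, nu):
    """Exact W1 under a tree metric: edge (parent[i], i) carries
    weight[i] * |net subtree mass below it|. Assumes parent[i] < i."""
    diff = np.asarray(mu, float) - np.asarray(nu, float)
    cost = 0.0
    for i in range(len(parent) - 1, 0, -1):  # visit children before parents
        cost += weight[i] * abs(diff[i])     # mass that must cross this edge
        diff[parent[i]] += diff[i]           # push subtree mass to the parent
    return cost

# Path 0 - 1 - 2 with unit edge weights recovers the simplex formula.
parent = [0, 0, 1]        # parent[0] is unused (node 0 is the root)
weight = [0.0, 1.0, 1.0]  # weight[i] belongs to edge (parent[i], i)
print(tree_w1(parent, weight, [0.5, 0.5, 0.0], [0.0, 0.5, 0.5]))  # 1.0
```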

Near-Linear Time for Specialized Structures

For persistence diagrams, quadtree-based $L_1$-embedding and flowtree algorithms yield $O(n \log \Delta)$-time approximations within $O(\log \Delta)$ factors of the exact value, with high empirical accuracy relative to previous exact or auction-based methods. These embeddings enable fast nearest-neighbor search and compact representations in TDA pipelines (Chen et al., 2021).

6. Applications and Impact in Statistics and Machine Learning

The $W_1$ metric is fundamental for:

  • Evaluating generative models, including as the loss for Wasserstein GANs, which rely critically on the dual form for stable training and meaningful gradients (Stéphanovitch et al., 2022).
  • Goodness-of-fit, two-sample, and independence testing, where $W_1$-based test statistics exhibit greater sensitivity to global and local distributional differences than classical EDF-based approaches (Panaretos et al., 2018).
  • Image retrieval and classification, as $W_1$ captures geometric similarity and is robust under small deformations, outperforming Euclidean or tangent-space metrics for low-sample discriminative tasks (Snow et al., 2016).
  • Analysis of random matrices, quantifying spectral convergence to limiting distributions under nontrivial dependencies (Jalowy, 2021).
  • Differential privacy, providing closed-form and tight upper bounds for distributional shifts induced by Laplace or Gaussian noise mechanisms (Chhachhi et al., 2023).

7. Extensions, Limitations, and Future Directions

Extensions of $W_1$ include unbalanced transport (allowing creation or destruction of mass), noncommutative generalizations, and metric learning in embedded spaces (Chen et al., 2017, Yamada et al., 2022). Open directions concern high-dimensional scaling, entropic regularization for $W_1$, and further acceleration on specialized architectures (GPUs, parallelism).

The primary limitations are computational: naive methods are intractable for large $n$. Ongoing work leverages geometric, algebraic, and approximate optimization strategies to make $W_1$ viable for large-scale inference, geometry, and data analysis (Liu et al., 2018, Yamada et al., 2022).


References:

  • (Panaretos et al., 2018) Statistical Aspects of Wasserstein Distances
  • (Angelis et al., 2021) Why the 1-Wasserstein distance is the area between the two marginal CDFs
  • (Frohmader et al., 2019) 1-Wasserstein Distance on the Standard Simplex
  • (Liu et al., 2018) Multilevel Optimal Transport: a Fast Approximation of Wasserstein-1 distances
  • (Yamada et al., 2022) Approximating 1-Wasserstein Distance with Trees
  • (Chen et al., 2021) Approximation algorithms for 1-Wasserstein distance between persistence diagrams
  • (Snow et al., 2016) Monge's Optimal Transport Distance for Image Classification
  • (Chhachhi et al., 2023) On the 1-Wasserstein Distance between Location-Scale Distributions and the Effect of Differential Privacy
  • (Jalowy, 2021) The Wasserstein distance to the Circular Law
  • (Stéphanovitch et al., 2022) Optimal 1-Wasserstein Distance for WGANs
  • (Chen et al., 2017) Matricial Wasserstein-1 Distance
  • (Coutin et al., 2019) Donsker's theorem in Wasserstein-1 distance
