1-Wasserstein Distance Overview
- The 1-Wasserstein distance is a metric that measures the minimal transport cost required to morph one probability distribution into another.
- It leverages dual formulations and explicit closed-form expressions in one-dimensional and discrete settings to facilitate optimal transport analysis.
- Efficient computational methods, including linear programming, primal-dual schemes, and tree-based embeddings, enable scalable approximations for high-dimensional data.
The 1-Wasserstein distance is a foundational metric in optimal transport theory, quantifying the minimal effort required to morph one probability distribution into another with respect to a given ground cost, typically the Euclidean or another metric distance. Recognized equivalently as the Earth Mover's Distance (EMD), $W_1$ is central across probability, statistics, machine learning, and computational geometry due to its ability to encode fine-grained geometric properties of distributions, support weak convergence analysis, and underpin practical algorithms for comparing and interpolating measures.
1. Formal Definitions and Dual Representations
Given a complete separable metric space $(\mathcal{X}, d)$ and probability measures $\mu, \nu$ on $\mathcal{X}$ with finite first moments, the 1-Wasserstein distance is defined as
$$W_1(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{X}} d(x, y)\, \mathrm{d}\pi(x, y),$$
where $\Pi(\mu, \nu)$ is the set of all couplings on $\mathcal{X} \times \mathcal{X}$ with marginals $\mu$ and $\nu$ (Panaretos et al., 2018). Probabilistically, this infimum is over joint laws of pairs $(X, Y)$ with $X \sim \mu$ and $Y \sim \nu$.
Kantorovich–Rubinstein duality states:
$$W_1(\mu, \nu) = \sup_{\|f\|_{\mathrm{Lip}} \le 1} \left\{ \int_{\mathcal{X}} f \,\mathrm{d}\mu - \int_{\mathcal{X}} f \,\mathrm{d}\nu \right\},$$
where the supremum runs over all 1-Lipschitz (real-valued) functions $f$ on $\mathcal{X}$ (Panaretos et al., 2018, Coutin et al., 2019). This dual form underpins statistical applications and algorithmic relaxations (e.g., WGANs (Stéphanovitch et al., 2022)).
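To make the dual form concrete, the following sketch solves the discrete dual directly: maximize $\sum_i f(x_i)(\mu_i - \nu_i)$ over vectors $f$ obeying the pairwise Lipschitz constraints, posed as a small linear program. All data values are illustrative, and the result is cross-checked against SciPy's built-in one-dimensional $W_1$.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import wasserstein_distance

# Two discrete measures on shared 1-D support points x.
x = np.array([0.0, 1.0, 2.0, 4.0])
mu = np.array([0.4, 0.3, 0.2, 0.1])
nu = np.array([0.1, 0.2, 0.3, 0.4])
n = len(x)

# Dual LP: maximize (mu - nu) . f  subject to  f_i - f_j <= |x_i - x_j|.
# linprog minimizes, so negate the objective.
c = -(mu - nu)
rows, rhs = [], []
for i in range(n):
    for j in range(n):
        if i != j:
            row = np.zeros(n)
            row[i], row[j] = 1.0, -1.0     # encodes f_i - f_j <= d(x_i, x_j)
            rows.append(row)
            rhs.append(abs(x[i] - x[j]))

# Pin f_0 = 0: the dual objective is invariant to adding a constant to f.
bounds = [(0, 0)] + [(None, None)] * (n - 1)
res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs), bounds=bounds)

print("dual LP value:", -res.fun)                            # ~1.3
print("scipy check  :", wasserstein_distance(x, x, mu, nu))   # ~1.3
```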
In continuous settings, the dynamic or flux (Beckmann) formulation provides:
$$W_1(\mu, \nu) = \min_{m} \left\{ \int \|m(x)\| \,\mathrm{d}x \;:\; \nabla \cdot m = \mu - \nu \right\},$$
reflecting the minimal transportation cost as a flow $m$ with prescribed divergence (Chen et al., 2017).
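In one dimension the divergence constraint pins down the flux uniquely, so the flux formulation can be verified without any optimization: on a uniform grid the discrete flux is the running sum of the mass imbalance, and its weighted $\ell^1$ norm equals $W_1$. A minimal sketch (grid and densities are illustrative):

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Uniform 1-D grid carrying two probability vectors.
h = 0.1
x = np.arange(0.0, 1.0, h)
mu = np.exp(-((x - 0.3) ** 2) / 0.01); mu /= mu.sum()
nu = np.exp(-((x - 0.7) ** 2) / 0.01); nu /= nu.sum()

# The discrete divergence constraint m_k - m_{k-1} = mu_k - nu_k (m_{-1} = 0)
# forces m to be the cumulative imbalance; no minimization is needed in 1-D.
m = np.cumsum(mu - nu)[:-1]        # flux on the n-1 edges between grid points
flux_cost = h * np.abs(m).sum()    # weighted L1 norm of the flux

print("flux formulation:", flux_cost)
print("scipy check     :", wasserstein_distance(x, x, mu, nu))
```

In two or more dimensions the constraint no longer determines the flux, and one genuinely minimizes over divergence-feasible vector fields, which is what the primal-dual solvers of Section 5 exploit.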
2. One-Dimensional and Discrete Closed-Form Expressions
When $\mathcal{X} = \mathbb{R}$, $W_1$ admits an explicit formula:
$$W_1(\mu, \nu) = \int_{\mathbb{R}} |F_\mu(t) - F_\nu(t)| \,\mathrm{d}t = \int_0^1 |F_\mu^{-1}(q) - F_\nu^{-1}(q)| \,\mathrm{d}q,$$
where $F$ is the cumulative distribution function and $F^{-1}$ the quantile function (Angelis et al., 2021, Panaretos et al., 2018). Geometrically, this coincides with the area between the two CDFs. The copula-theoretic derivation confirms that the optimal coupling, which pairs corresponding quantiles (the "comonotonic coupling"), achieves this minimum (Angelis et al., 2021).
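For samples of equal size, the comonotonic coupling simply pairs order statistics, so the empirical $W_1$ is the mean absolute difference of sorted samples. A quick check against SciPy's implementation of the CDF-area formula (sample sizes and distributions are illustrative):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, size=5000)
ys = rng.normal(1.0, 2.0, size=5000)

# Comonotonic coupling: the i-th smallest x is matched with the i-th smallest
# y, realizing the quantile form of the 1-D closed formula.
w1_sorted = np.mean(np.abs(np.sort(xs) - np.sort(ys)))

print(w1_sorted)                       # identical, up to floating point,
print(wasserstein_distance(xs, ys))    # to the CDF-area computation
```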
For probability vectors $u, v$ on the finite discrete simplex with ground metric $d(i, j) = |i - j|$, one obtains:
$$W_1(u, v) = \sum_{k=1}^{n-1} \left| \sum_{i=1}^{k} (u_i - v_i) \right|,$$
which is the $\ell^1$ norm between their cumulative sums (their CDFs) (Frohmader et al., 2019).
3. Multivariate, Matrix, and Semi-Discrete Generalizations
For multivariate $\mu, \nu$ on $\mathbb{R}^d$, $W_1$ generalizes to
$$W_1(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\| \,\mathrm{d}\pi(x, y),$$
and supports dual formulations via 1-Lipschitz test functions. On the space of density matrices, $W_1$ is extended via noncommutative gradients and nuclear norm minimization, yielding practical convex optimization problems for matrix-valued or power-spectral data (Chen et al., 2017).
The semi-discrete regime, as encountered in WGANs, arises when one measure is continuous and the other atomic. The existence and structure of optimal transport maps can then be described via power diagrams/Voronoi partitioning, with minimizers corresponding to shortest paths or weighted cell equalization (Stéphanovitch et al., 2022).
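A one-dimensional analogue of this cell structure is easy to exhibit: when transporting a continuous measure to $k$ atoms, the optimal (monotone) map partitions the line into $k$ consecutive intervals, the 1-D counterparts of the cells above, delimited by quantiles at the cumulative atom weights. A sketch under these assumptions, with illustrative positions and weights:

```python
import numpy as np
from scipy.stats import norm

# Continuous source: standard Gaussian. Atomic target: 3 weighted atoms.
atoms = np.array([-1.0, 0.5, 2.0])
weights = np.array([0.2, 0.5, 0.3])

# In 1-D the optimal map is monotone: atom k absorbs the quantile band
# (P_{k-1}, P_k], i.e., the x-interval between consecutive source quantiles.
P = np.concatenate([[0.0], np.cumsum(weights)])   # 0, 0.2, 0.7, 1.0
print("cell boundaries:", norm.ppf(P))            # -inf, q(0.2), q(0.7), +inf

# Semi-discrete W1 via the quantile formula: average |F^{-1}(q) - T(q)| over
# q in (0, 1), where T is the piecewise-constant quantile function of atoms.
q = np.linspace(1e-6, 1 - 1e-6, 200001)
T = atoms[np.searchsorted(P[1:-1], q, side="left")]
print("semi-discrete W1 ~", np.mean(np.abs(norm.ppf(q) - T)))
```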
4. Statistical Properties and Convergence Behavior
The empirical $W_1$ distance between i.i.d. samples $\hat{\mu}_n$ and the parent law $\mu$ satisfies almost sure consistency, $W_1(\hat{\mu}_n, \mu) \to 0$, whenever $\mu$ has a finite first moment. On the real line, the optimal expected convergence rate is $O(n^{-1/2})$, provided integrability conditions on $\mu$ are met. In higher dimensions $d \ge 3$, the rate degrades to $O(n^{-1/d})$, reflecting the curse of dimensionality (Panaretos et al., 2018, Stéphanovitch et al., 2022). For random matrix spectra, convergence can even be accelerated due to eigenvalue repulsion (order $n^{-1/2}$ for the Ginibre ensemble, compared to order $\sqrt{\log n}\, n^{-1/2}$ for i.i.d. points in the plane) (Jalowy, 2021).
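The one-dimensional $n^{-1/2}$ rate is easy to probe numerically: the sketch below estimates $\mathbb{E}\, W_1(\hat{\mu}_n, \mu)$ for a standard Gaussian by integrating $|F_n - \Phi|$ over a grid; quadrupling $n$ should roughly halve the distance. The helper name and all parameters are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
t = np.linspace(-6, 6, 2001)      # integration grid for the CDF-area formula
dt = t[1] - t[0]

def emp_w1(n, trials=50):
    """Average W1 between the empirical measure of n N(0,1) draws and N(0,1),
    via W1 = integral of |F_n(t) - Phi(t)| dt."""
    vals = []
    for _ in range(trials):
        xs = np.sort(rng.normal(size=n))
        Fn = np.searchsorted(xs, t, side="right") / n   # empirical CDF at t
        vals.append(np.sum(np.abs(Fn - norm.cdf(t))) * dt)
    return np.mean(vals)

for n in (100, 400, 1600, 6400):
    print(n, emp_w1(n))   # each 4x increase in n roughly halves the value
```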
Explicit expressions for $W_1$ in location–scale families (e.g., Gaussian, Laplace) are available. For univariate $F_1, F_2$ in the same family with locations $\mu_1, \mu_2$ and scales $\sigma_1, \sigma_2$,
$$W_1(F_1, F_2) = \mathbb{E}\left|\Delta\mu + \Delta\sigma\, Z\right|,$$
with $\Delta\mu = \mu_1 - \mu_2$, $\Delta\sigma = \sigma_1 - \sigma_2$, and $Z$ following the standardized base distribution; closed forms are obtained for folded normal, Laplace, and other base distributions (Chhachhi et al., 2023).
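For the Gaussian base distribution this expectation is the mean of a folded normal, giving a fully explicit formula. The sketch below evaluates that closed form, derived here from the comonotone coupling above rather than copied from the cited paper, and verifies it against numerical integration of the quantile formula; all names and values are illustrative.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import erf

def w1_gaussian(mu1, s1, mu2, s2):
    """Closed-form W1 between N(mu1, s1^2) and N(mu2, s2^2): the mean of
    |dmu + dsig * Z| with Z ~ N(0,1), i.e., a folded-normal mean."""
    dmu, dsig = mu1 - mu2, s1 - s2
    if dsig == 0.0:
        return abs(dmu)
    s = abs(dsig)
    return (s * np.sqrt(2 / np.pi) * np.exp(-dmu**2 / (2 * s**2))
            + dmu * erf(dmu / (s * np.sqrt(2))))

# Numerical check via W1 = integral over (0,1) of |F1^{-1}(q) - F2^{-1}(q)| dq.
q = np.linspace(1e-6, 1 - 1e-6, 200001)
mu1, s1, mu2, s2 = 0.0, 1.0, 1.5, 2.0
num = np.mean(np.abs((mu1 + s1 * norm.ppf(q)) - (mu2 + s2 * norm.ppf(q))))

print(w1_gaussian(mu1, s1, mu2, s2), num)   # should agree to ~3 decimals
```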
5. Computational Algorithms and Approximations
Linear Programming Methods
For discrete measures supported on $n$ points, $W_1$ is computable as a min-cost flow or transportation linear program with $O(n^2)$ variables, with classical solvers scaling roughly as $O(n^3 \log n)$ (Panaretos et al., 2018).
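For small instances the transportation LP can be handed directly to an off-the-shelf solver. The sketch below builds the coupling LP and solves it with SciPy's HiGHS backend, cross-checking against the cumulative-sum closed form of Section 2 (support, masses, and sizes are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

x = np.array([0.0, 1.0, 2.0, 4.0])     # shared 1-D support for easy checking
mu = np.array([0.4, 0.3, 0.2, 0.1])
nu = np.array([0.1, 0.2, 0.3, 0.4])
n = len(x)

# Cost c_ij = |x_i - x_j|; the n^2 variables are the coupling entries pi_ij.
C = np.abs(x[:, None] - x[None, :]).ravel()

# Equality constraints: row sums of pi equal mu, column sums equal nu.
A_eq = np.zeros((2 * n, n * n))
for k in range(n):
    A_eq[k, k * n:(k + 1) * n] = 1.0    # sum_j pi_kj = mu_k
    A_eq[n + k, k::n] = 1.0             # sum_i pi_ik = nu_k
b_eq = np.concatenate([mu, nu])

res = linprog(C, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
print("LP value      :", res.fun)                               # 1.3

# Closed form on the line: weighted L1 norm of the CDF difference.
gap = np.cumsum(mu - nu)[:-1]
print("cumsum formula:", np.sum(np.abs(gap) * np.diff(x)))      # 1.3
```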
Primal-Dual and PDE-Based Solvers
For continuous densities on computational domains (e.g., images), primal-dual schemes (e.g., Chambolle–Pock) and PDE discretizations (e.g., of Monge–Ampère type) provide scalable solutions. Multilevel approaches drastically reduce computational time, achieving empirically near-linear complexity in the number of grid points for 2D grids, with real-world performance vastly outperforming traditional flow algorithms on large problems (Liu et al., 2018, Snow et al., 2016).
Approximation via Trees and Embeddings
Tree-based embeddings (the tree-Wasserstein distance) yield approximate computations in time linear in the number of tree nodes, by fitting edge weights (via nonnegative Lasso) on tree metrics to match the underlying geometry of the data space, with strong empirical fidelity to the exact $W_1$ even in NLP/CV settings (Yamada et al., 2022).
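Once a tree metric is fixed, the tree-Wasserstein distance has a simple closed form: a sum over edges of the edge weight times the absolute difference of subtree masses. A minimal sketch on a hand-built tree follows; the structure, weights, and masses are illustrative, and the nonnegative-Lasso weight fitting of Yamada et al. is not shown.

```python
import numpy as np

# Tree on nodes 0..5 with parent[v] < v, so a reverse sweep visits children
# before parents. Node 0 is the root; w[v] is the weight of edge (v, parent).
parent = np.array([-1, 0, 0, 1, 1, 2])
w = np.array([0.0, 1.0, 2.0, 1.0, 3.0, 1.0])

# Two probability distributions supported on the tree's nodes.
mu = np.array([0.0, 0.1, 0.2, 0.3, 0.2, 0.2])
nu = np.array([0.2, 0.2, 0.1, 0.1, 0.1, 0.3])

# W_T(mu, nu) = sum over non-root v of w[v] * |subtree mass difference at v|.
subtree = (mu - nu).copy()            # per-node imbalance, accumulated upward
total = 0.0
for v in range(len(parent) - 1, 0, -1):
    total += w[v] * abs(subtree[v])
    subtree[parent[v]] += subtree[v]  # push the subtree imbalance to parent

print("tree-Wasserstein distance:", total)
```

On a path graph with unit edge weights this reduces exactly to the cumulative-sum formula of Section 2, which is one way to see why tree embeddings approximate $W_1$ well when the tree respects the data geometry.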
Near-Linear Time for Specialized Structures
For persistence diagrams, quadtree-based $\ell_1$-embedding and flowtree algorithms yield near-linear-time approximations within small multiplicative factors of the exact value, with high empirical accuracy relative to previous exact or auction-based methods. These embeddings allow fast nearest-neighbor searches and compact representations in TDA pipelines (Chen et al., 2021).
6. Applications and Impact in Statistics and Machine Learning
The metric is fundamental for:
- Evaluating generative models, including as the loss for Wasserstein GANs (WGANs), which rely critically on the dual form for stable training and meaningful gradients (Stéphanovitch et al., 2022).
- Goodness-of-fit, two-sample, and independence testing, where $W_1$-based test statistics exhibit greater sensitivity to global and local distributional differences than classical EDF-based approaches (a minimal permutation-test sketch appears at the end of this list) (Panaretos et al., 2018).
- Image retrieval and classification, as $W_1$ captures geometric similarity and is robust under small deformations, outperforming Euclidean or tangent-space metrics for low-sample discriminative tasks (Snow et al., 2016).
- Analysis of random matrices, quantifying spectral convergence to limiting distributions under nontrivial dependencies (Jalowy, 2021).
- Differential privacy, providing closed-form and tight upper bounds for distributional shifts induced by Laplace or Gaussian noise mechanisms (Chhachhi et al., 2023).
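As noted in the testing bullet above, a minimal $W_1$-based two-sample permutation test can be assembled in a few lines; this is a generic recipe rather than a procedure from the cited papers, and all parameters are illustrative.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(42)
xs = rng.normal(0.0, 1.0, size=200)
ys = rng.normal(0.3, 1.0, size=200)    # small location shift to detect

observed = wasserstein_distance(xs, ys)
pooled = np.concatenate([xs, ys])

# Permutation null: reshuffle group labels and recompute the W1 statistic.
n_perm, exceed = 2000, 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)
    if wasserstein_distance(perm[:200], perm[200:]) >= observed:
        exceed += 1

print("observed W1        :", observed)
print("permutation p-value:", (exceed + 1) / (n_perm + 1))
```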
7. Extensions, Limitations, and Future Directions
Extensions of $W_1$ include unbalanced transport (allowing creation or destruction of mass), noncommutative generalizations, and metric learning in embedded spaces (Chen et al., 2017, Yamada et al., 2022). Open directions concern high-dimensional scaling, entropic regularization for $W_1$, and further acceleration on specialized architectures (GPUs, parallelism).
The primary limitations are computational: naive methods are intractable for large $n$; however, ongoing work leverages geometric, algebraic, and approximate optimization strategies to make $W_1$ viable for large-scale inference, geometry, and data analysis (Liu et al., 2018, Yamada et al., 2022).
References:
- (Panaretos et al., 2018) Statistical Aspects of Wasserstein Distances
- (Angelis et al., 2021) Why the 1-Wasserstein distance is the area between the two marginal CDFs
- (Frohmader et al., 2019) 1-Wasserstein Distance on the Standard Simplex
- (Liu et al., 2018) Multilevel Optimal Transport: a Fast Approximation of Wasserstein-1 distances
- (Yamada et al., 2022) Approximating 1-Wasserstein Distance with Trees
- (Chen et al., 2021) Approximation algorithms for 1-Wasserstein distance between persistence diagrams
- (Snow et al., 2016) Monge's Optimal Transport Distance for Image Classification
- (Chhachhi et al., 2023) On the 1-Wasserstein Distance between Location-Scale Distributions and the Effect of Differential Privacy
- (Jalowy, 2021) The Wasserstein distance to the Circular Law
- (Stéphanovitch et al., 2022) Optimal 1-Wasserstein Distance for WGANs
- (Chen et al., 2017) Matricial Wasserstein-1 Distance
- (Coutin et al., 2019) Donsker's theorem in Wasserstein-1 distance