
Empirical Wasserstein-1 Distance Overview

Updated 26 November 2025
  • Empirical Wasserstein-1 distance is a metric measuring discrepancies between distributions via optimal transport with linear cost.
  • It provides precise asymptotic and finite-sample convergence rates that depend on support geometry and moment conditions.
  • Practical computation leverages 1D sorting, multivariate linear programming, and approximations like tree-based methods and neural networks.

The empirical Wasserstein-1 distance, also known as the empirical earth mover's distance, quantifies the discrepancy between empirical and population distributions, or between two empirical distributions, via optimal transport with linear cost. This metric plays a central role in probability, statistical inference, machine learning, and high-dimensional data analysis due to its mathematical tractability and direct connection to geometry and coupling. Both the theory and practice of the empirical Wasserstein-1 distance are well developed, with precise asymptotics, non-asymptotic deviation bounds, and efficient computational methods in a variety of settings.

1. Formal Definition and One-Dimensional Characterization

For probability measures $\mu, \nu \in \mathcal{P}_1(E)$ on a Polish metric space $(E,d)$, the 1-Wasserstein distance is defined as

$$W_1(\mu,\nu) = \inf_{\gamma \in \Gamma(\mu,\nu)} \int_{E \times E} d(x,y)\, \gamma(dx,dy),$$

where $\Gamma(\mu,\nu)$ is the set of all couplings of $\mu$ and $\nu$. The Kantorovich–Rubinstein duality gives

$$W_1(\mu, \nu) = \sup_{\|f\|_{\mathrm{Lip}} \leq 1} \left| \int f\, d\mu - \int f\, d\nu \right|.$$

For empirical measures $\mu_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i}$ with $X_i \sim \mu$, $W_1(\mu_n, \mu)$ captures the optimal cost of transporting the empirical distribution to the true law.

On $\mathbb{R}$, a fundamental identity is

$$W_1(F, G) = \int_{-\infty}^{\infty} |F(x) - G(x)|\, dx = \int_0^1 |F^{-1}(u) - G^{-1}(u)|\, du$$

for cumulative distribution functions $F, G$ and quantile functions $F^{-1}, G^{-1}$ (Angelis et al., 2021). This quantile formula underpins both practical computation (requiring only sorting) and theoretical analysis.
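As a concrete illustration, here is a minimal Python sketch of the quantile formula for two equal-size samples, where both empirical quantile functions are step functions and the integral reduces to a mean of sorted differences; SciPy's `wasserstein_distance` (which also handles unequal sample sizes) serves as a cross-check:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1000)
y = rng.normal(0.5, 1.0, size=1000)

# Equal-size samples: the quantile formula discretizes to the mean
# absolute difference of order statistics.
w1_sorted = np.mean(np.abs(np.sort(x) - np.sort(y)))

# Cross-check against SciPy's CDF-based implementation.
w1_scipy = wasserstein_distance(x, y)

print(w1_sorted, w1_scipy)  # agree up to floating-point error
```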

2. Asymptotic and Finite-Sample Rates of Convergence

The rate at which $W_1(\mu_n, \mu) \to 0$ as $n \to \infty$ depends on the geometry of the support and on moment conditions. For measures on $\mathbb{R}^d$ with sufficiently many finite moments:

$$\mathbb{E}[W_1(\mu_n, \mu)] \asymp \begin{cases} n^{-1/d}, & d > 2, \\ n^{-1/2} \log n, & d = 2, \\ n^{-1/2}, & d < 2. \end{cases}$$

The optimality of these rates is established via dyadic partition couplings and metric entropy arguments (Fournier et al., 2013; Weed et al., 2017).
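A quick Monte Carlo check of the $d = 1$ rate; this is a sketch assuming $\mu = \mathrm{Uniform}(0,1)$ and using a large auxiliary sample as a stand-in for the true law (which introduces a small bias floor):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
reference = rng.uniform(size=200_000)  # proxy for mu; adds a small bias floor

for n in [100, 1_000, 10_000]:
    est = np.mean([wasserstein_distance(rng.uniform(size=n), reference)
                   for _ in range(50)])
    # sqrt(n) * E[W1] should stay roughly constant under the n^{-1/2} rate.
    print(f"n={n:>6}  E[W1]~{est:.4f}  sqrt(n)*E[W1]~{est * n**0.5:.3f}")
```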

3. Limit Distributions and Weak Convergence

In dimension one, with smooth density and regular tails, the plug-in statistic $W_1(F_n, G_n)$ satisfies functional central limit theorems:

$$\sqrt{n}\left( W_1(F_n, G_n) - W_1(F, G) \right) \Rightarrow \mathcal{N}(0, \sigma^2),$$

where $\sigma^2$ is an explicit quadratic form involving Brownian bridges and the quantile process.

  • Goodness-of-fit case ($F = G$): The standard $\sqrt{n}$-CLT fails. The limiting distribution is non-Gaussian, and the scaling rate is slower (controlled by the regular variation at zero of the cost function $\rho(x) = |x|$); specifically,

$$\frac{n^{1/2}}{\log n}\, W_1(F_n, F) \Rightarrow \int_0^1 |\mathcal{B}(u)|\, du,$$

where $\mathcal{B}$ is a standard Brownian bridge (Berthet et al., 2019); the limit law is simulated in the sketch following this list.

  • Finite metric spaces: For discrete support, $W_1$ is the value of a random linear program. The asymptotic distribution is the maximum of linear forms in a Gaussian random vector over the dual constraint set, yielding a non-classical limit; the naive bootstrap fails, and valid alternatives are derived (Sommerfeld et al., 2016).
  • General dimensions: The functional delta method and empirical process theory remain technically challenging; for $d \geq 2$, strong regularity is often required for CLTs.
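The limit law $\int_0^1 |\mathcal{B}(u)|\, du$ has no standard closed form, but it is straightforward to tabulate by simulation; a minimal sketch, discretizing the bridge on a uniform grid (grid resolution and replicate count are arbitrary choices here):

```python
import numpy as np

# Simulate L = \int_0^1 |B(u)| du, with B a standard Brownian bridge.
rng = np.random.default_rng(2)
m, reps = 2_000, 20_000                  # grid resolution, replicates

t = np.arange(1, m + 1) / m
W = np.cumsum(rng.normal(scale=np.sqrt(1 / m), size=(reps, m)), axis=1)
B = W - t * W[:, [-1]]                   # bridge: B(t) = W(t) - t * W(1)
L = np.mean(np.abs(B), axis=1)           # Riemann sum of |B(u)| du

# Sanity check: E[L] = \int_0^1 sqrt(2u(1-u)/pi) du = sqrt(2*pi)/8 ~ 0.313.
print(L.mean(), np.quantile(L, 0.95))    # mean and a 95% critical value
```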

4. Non-Asymptotic Deviation, Concentration, and Sample Complexity

Sharp deviation inequalities for $W_1(\mu_n, \mu)$ are available under moment or transport-entropy assumptions:

  • Deviation bounds: If $\mu$ is sub-Gaussian, then for any $\epsilon > 0$,

$$\Pr\bigl( W_1(\mu_n, \mu) \geq \epsilon \bigr) \leq C \exp\bigl( -c n \epsilon^2 \bigr)$$

with explicit constants. For sub-exponential and heavy-tailed distributions, similar, though possibly polynomial, concentration rates hold (A. et al., 2019; Boissard, 2011; Fournier et al., 2013).

  • Transport-entropy inequalities: If $\mu$ satisfies a $T_1(C)$ inequality (Gaussian-type concentration in $W_1$), McDiarmid's bounded-differences inequality yields, for all $t > 0$,

$$\mathbb{P}\bigl( W_1(\mu_n, \mu) \geq \mathbb{E}[W_1(\mu_n, \mu)] + t \bigr) \leq \exp\bigl( -n t^2 / (8C) \bigr)$$

(Boissard, 2011).

  • Sample complexity: Achieving $W_1(\mu_n, \mu) \leq \epsilon$ with probability $1 - \delta$ requires $n \gtrsim \epsilon^{-2} \log(1/\delta)$ in sub-Gaussian settings (A. et al., 2019), as the short calculation below illustrates.
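A back-of-the-envelope instance of this bound; note that the distribution-dependent constant is omitted, so the result is a scaling guide rather than an exact sample-size prescription:

```python
import math

# n >~ eps^{-2} * log(1/delta), up to a distribution-dependent constant.
eps, delta = 0.05, 0.01
n_min = eps ** -2 * math.log(1 / delta)
print(f"n >~ {n_min:.0f}")  # ~ 1842 samples for eps = 0.05, delta = 0.01
```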

5. Computational Methods and Approximations

Efficient exact and approximate computation of the empirical $W_1$ is critical in large-scale applications:

  • 1D exact computation: $O(n \log n)$ time via sorting, using either the empirical CDF or the quantile formula (Angelis et al., 2021).
  • Multivariate exact computation: $W_1$ reduces to a linear program of size $n \times n$; the complexity is typically $O(n^3)$. A dense-LP sketch follows the table below.
  • Tree-based approximation (TWD): Embeds the data in a tree metric and solves a convex Lasso problem (nonnegative $\ell_1$-regularized regression) for optimal edge weights, yielding linear-time approximate computation with quantifiable accuracy. Variance is reduced via tree-slicing (averaging over trees) (Yamada et al., 2022).
  • Deep Network Approximation: In high-dimensional settings, the Lipschitz function class is approximated by 1-Lipschitz neural networks; the supremum in the dual representation is optimized over networks, enabling scalable hypothesis tests and confidence intervals via Gaussian multiplier bootstrap (Imaizumi et al., 2019).
| Method | Dimension | Computational Cost |
| --- | --- | --- |
| 1D sort + pairing | 1 | $O(n \log n)$ |
| LP solver | $d$ | $O(n^3)$ (network flow) |
| Tree-Wasserstein | $d$ | $O(N)$ ($N$ tree nodes) |
| ReLU network dual | $d$ | $O(nT)$ (SGD/Adam; $T$ bootstraps) |
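For reference, a minimal dense-LP sketch of the multivariate computation using SciPy's HiGHS backend; this naive formulation is fine for small point clouds, while dedicated network-flow solvers achieve the $O(n^3)$ cost cited in the table:

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def w1_lp(x, y):
    """Exact W1 between uniform empirical measures on point clouds
    x (n, d) and y (m, d), via the Kantorovich linear program."""
    n, m = len(x), len(y)
    cost = cdist(x, y).ravel()              # c_ij = ||x_i - y_j||, row-major

    # Marginal constraints: sum_j g_ij = 1/n, sum_i g_ij = 1/m.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0    # row-sum constraints
    for j in range(m):
        A_eq[n + j, j::m] = 1.0             # column-sum constraints
    b_eq = np.concatenate([np.full(n, 1 / n), np.full(m, 1 / m)])

    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

rng = np.random.default_rng(3)
print(w1_lp(rng.normal(size=(30, 2)), rng.normal(loc=0.5, size=(30, 2))))
```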

6. Statistical Inference, Hypothesis Testing, and Confidence Bands

Empirical Wasserstein-1 distance underpins a variety of inference schemes:

  • Hypothesis testing: Empirical $W_1$-based one- and two-sample tests with Gaussian-process bootstrap calibration attain correct Type I error and comparable or superior power relative to alternatives, even on singular supports (Imaizumi et al., 2019); a lightweight permutation-based variant is sketched after this list.
  • Confidence intervals: Bootstrap quantiles of the supremum of Gaussian processes (approximating the empirical process indexed by 1-Lipschitz functions) yield valid CIs for $W_1$ and for functionals that are $W_1$-Lipschitz (Imaizumi et al., 2019; Sommerfeld et al., 2016; A. et al., 2019).
  • Applications: Used to rigorously quantify inter/intra-group distances in metagenomics and other high-dimensional histogram data, providing interpretable intervals and robust significance estimates even in challenging regimes (e.g., partially overlapping supports) (Sommerfeld et al., 2016).
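The Gaussian-process bootstrap above requires optimizing over a Lipschitz function class; as a lightweight one-dimensional stand-in (a common practical choice, not the procedure of Imaizumi et al.), a permutation test can calibrate the two-sample $W_1$ statistic:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def w1_permutation_test(x, y, n_perm=2_000, seed=0):
    """Two-sample test on R: calibrate the W1 statistic by permutation.
    A simple stand-in for Gaussian-process bootstrap calibration."""
    rng = np.random.default_rng(seed)
    observed = wasserstein_distance(x, y)
    pooled, n = np.concatenate([x, y]), len(x)
    null = [wasserstein_distance(*np.split(rng.permutation(pooled), [n]))
            for _ in range(n_perm)]
    p_value = np.mean(np.array(null) >= observed)
    return observed, p_value

rng = np.random.default_rng(4)
stat, p = w1_permutation_test(rng.normal(size=200), rng.normal(0.3, 1.0, 200))
print(stat, p)  # test statistic and permutation p-value
```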

7. Optimality, Quantization, and Theoretical Extensions

Empirical measures are, up to polylogarithmic factors, as effective as optimal uniform quantizers for 1-Wasserstein approximation:

  • Quantization error: The expected empirical Wasserstein-1 distance nearly matches the minimal error over all $n$-point uniform quantizers, up to a factor $O(\log n)$; the gap is sharply characterized via multiscale decompositions, chaining arguments, and metric entropy bounds (Boedihardjo, 4 Aug 2025).
  • Non-uniform quantizers: In many settings (e.g., absolutely continuous measures on $\mathbb{R}^d$), the empirical quantization rate matches that of optimal non-uniform quantizers. However, for measures with small-mass fine structure, polynomial factors may appear (Boedihardjo, 4 Aug 2025).
  • Transport-entropy connections: $W_1$-concentration unifies the analysis of risk measures, generalizing CVaR and other quantile-related risk bounds to arbitrary $L$-Lipschitz functionals (A. et al., 2019).
  • Open questions: The necessity of polylogarithmic rate gaps in general spaces, a precise characterization for the $p$-Wasserstein distance, and performance for strongly singular or heavy-tailed distributions remain subjects of active research (Boedihardjo, 4 Aug 2025).

References

  • (Angelis et al., 2021) Why the 1-Wasserstein distance is the area between the two marginal CDFs.
  • (Berthet et al., 2019) Weak convergence of empirical Wasserstein type distances.
  • (Weed et al., 2017) Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance.
  • (Fournier et al., 2013) On the rate of convergence in Wasserstein distance of the empirical measure.
  • (Boedihardjo, 4 Aug 2025) Optimality of empirical measures as quantizers.
  • (A. et al., 2019) A Wasserstein distance approach for concentration of empirical risk estimates.
  • (Boissard, 2011) Simple bounds for the convergence of empirical and occupation measures in 1-Wasserstein distance.
  • (Yamada et al., 2022) Approximating 1-Wasserstein Distance with Trees.
  • (Imaizumi et al., 2019) Hypothesis Test and Confidence Analysis with Wasserstein Distance on General Dimension.
  • (Sommerfeld et al., 2016) Inference for Empirical Wasserstein Distances on Finite Spaces.
  • (Dedecker et al., 2018) Behavior of the empirical Wasserstein distance in $\mathbb{R}^d$ under moment conditions.
  • (Divol, 2021) A short proof on the rate of convergence of the empirical measure for the Wasserstein distance.