Earth Mover's Distance (EMD)

Updated 2 May 2026

Earth Mover's Distance (EMD) is a robust metric from optimal transport theory that measures the minimal cost to convert one probability distribution into another.
It underpins applications in computer vision, machine learning, and signal processing, driving tasks like image retrieval and generative modeling.
Recent advancements include efficient approximation algorithms and differentiable implementations that enable scalable integration with deep learning architectures.

The Earth Mover's Distance (EMD), also called the first Wasserstein distance or 1-Wasserstein metric, quantifies the minimal "cost" of transforming one probability distribution into another, where cost is defined via a ground metric on the underlying space. EMD originated in mass-transportation theory (Monge–Kantorovich framework) and is fundamentally related to optimal transport problems, wherein distributions are viewed as piles of mass that need to be reshaped optimally. EMD has become a fundamental dissimilarity measure in computer vision, machine learning, computational geometry, and statistics due to its sensitivity to the underlying ground distance and its meaningful geometric structure.

1. Mathematical Definitions and Generalizations

Let $X = \{u_1, \ldots, u_m\}$ and $Y = \{v_1, \ldots, v_n\}$ be two sets of points in a metric space (typically $\mathbb{R}^d$), equipped with non-negative weights $w^{U}$ and $w^{V}$, each normalized to total mass 1, and let $C_{ij} = d(u_i, v_j)$ denote the ground cost of moving a unit of mass from $u_i$ to $v_j$. The EMD is the optimal value of the transportation linear program:

$\mathrm{EMD}(X, Y) = \min_{F \ge 0} \sum_{i=1}^m \sum_{j=1}^n F_{ij} C_{ij}$

subject to: $\begin{cases} \sum_{j=1}^n F_{ij} = w^{U}_i, & i=1,\dots,m \\ \sum_{i=1}^m F_{ij} = w^{V}_j, & j=1,\dots,n \\ \sum_{i=1}^m \sum_{j=1}^n F_{ij} = 1 \end{cases}$
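The linear program above can be handed directly to a general-purpose LP solver. The following is a minimal sketch, assuming SciPy is available; the function name `emd_lp` and the row-major flattening of $F$ are choices made here for illustration, and dedicated network-flow solvers would be preferable for large inputs.

```python
import numpy as np
from scipy.optimize import linprog


def emd_lp(w_u, w_v, C):
    """Solve the transportation LP; return (optimal cost, flow matrix F)."""
    w_u = np.asarray(w_u, dtype=float)
    w_v = np.asarray(w_v, dtype=float)
    C = np.asarray(C, dtype=float)
    m, n = C.shape

    # F is flattened row-major, so F[i, j] sits at index i * n + j.
    # Row constraints: sum_j F[i, j] = w_u[i]
    A_rows = np.kron(np.eye(m), np.ones((1, n)))
    # Column constraints: sum_i F[i, j] = w_v[j]
    A_cols = np.kron(np.ones((1, m)), np.eye(n))

    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([w_u, w_v])

    # The total-mass constraint is implied when both weight vectors sum to 1.
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.fun, res.x.reshape(m, n)


# Points u = (0, 1) and v = (0, 2) on the line, each carrying mass 1/2.
C = np.array([[0.0, 2.0],
              [1.0, 1.0]])
cost, F = emd_lp([0.5, 0.5], [0.5, 0.5], C)
print(cost)
```

In this small example the cheapest plan keeps the mass at 0 in place and moves the mass at 1 to 2, so the optimal cost is 0.5.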

For the one-dimensional case with the ground distance $|i - j|$, EMD admits a closed form via cumulative distributions: $\mathrm{EMD}(p, q) = \sum_{i=1}^{n} |P_i - Q_i|$, where $P_i = \sum_{k \le i} p_k$ and $Q_i = \sum_{k \le i} q_k$ are the cumulative sums of the two histograms $p$ and $q$ over bins $1, \ldots, n$ (Erickson, 2024, Bourn et al., 2019).
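For histograms on bins $1, \ldots, n$ with ground distance $|i - j|$, the closed form amounts to summing the absolute gaps between the two cumulative sums. A minimal sketch follows; the function name `emd_1d` is chosen here for illustration, and the inputs are assumed to carry equal total mass.

```python
import numpy as np


def emd_1d(p, q):
    """EMD between two histograms on the same ordered bins (unit bin spacing)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if p.shape != q.shape:
        raise ValueError("histograms must have the same number of bins")
    if not np.isclose(p.sum(), q.sum()):
        raise ValueError("histograms must have equal total mass")
    # Sum of absolute differences between the cumulative distributions.
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())


print(emd_1d([1, 0, 0], [0, 0, 1]))          # one unit of mass moved two bins
print(emd_1d([0.5, 0.5, 0], [0, 0.5, 0.5]))  # every unit of mass shifts one bin
```

The first call gives 2 and the second gives 1, matching the intuition that the cost is mass times distance moved.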

EMD naturally generalizes to more than two distributions, leading to the "multiway" EMD or multi-marginal optimal transport, with explicit formulas for expected values and new Cayley–Menger-type relations between multiway and pairwise EMDs (Erickson, 2024, Erickson, 2023, Erickson, 2020).

2. Algorithmic and Computational Properties

Exact EMD computation requires solving a network flow or assignment problem; for $m = n$, complexity is typically $O(n^3)$ (Hungarian algorithm) or, with recent improvements, closer to quadratic (Sinha et al., 2023, Rohatgi, 2019). The fundamental hardness of EMD stems from this quadratic bottleneck in the size of the histograms or point clouds involved. The problem is harder still in high dimensions $d$, where even approximate computation becomes hard: known algorithms need time that is either superlinear in $n$ or exponential in $d$, and substantially better bounds would contradict fine-grained complexity conjectures (Orthogonal Vectors or Hitting Set) (Rohatgi, 2019).

To address these costs, large-scale or approximate EMD algorithms have been developed:

Sinkhorn Distance (entropic regularization): replaces the original optimal transport LP with an entropically regularized version solved by iterative matrix scaling. Each iteration costs $O(n^2)$ for dense cost matrices, but the method trades accuracy against numerical stability when the regularization parameter is large, and GPU implementations can become unstable at large scale (Martinez et al., 2016).
Approximate and Parallel Schemes: Linear-complexity variants based on nearest-neighbor search (NNS-EMD), iterative constrained transfers (ICT/ACT), and overlapping mass reduction (OMR) obtain substantial empirical wall-clock speedups with error rates below 1% for image, point cloud, and histogram comparison tasks (Meng et al., 2024, Atasu et al., 2018).
Data-Dependent Locality-Sensitive Hashing: Achieves an improved approximation factor for nearest neighbor search under EMD in high-dimensional metric spaces, reducing the previous quadratic dependence to nearly linear (Jayaram et al., 2024).
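The matrix-scaling iteration behind the Sinkhorn approach is short enough to sketch in NumPy. This is an illustrative version only, not the implementation from any cited paper: the function name `sinkhorn` and the parameter names `reg`, `n_iters`, and `tol` are assumptions made here. `reg` is the entropic weight $\varepsilon$ in the kernel $K = \exp(-C/\varepsilon)$; in the alternative parameterization $K = \exp(-\lambda C)$, a small $\varepsilon$ corresponds to a large $\lambda$, which is where underflow and instability appear.

```python
import numpy as np


def sinkhorn(a, b, C, reg=0.05, n_iters=1000, tol=1e-9):
    """Entropic-regularized transport via alternating row/column scaling."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    C = np.asarray(C, dtype=float)

    # Gibbs kernel; entries underflow to 0 when C / reg is very large.
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    v = np.ones_like(b)

    for _ in range(n_iters):
        u_prev = u
        v = b / (K.T @ u)   # rescale so column sums match b
        u = a / (K @ v)     # rescale so row sums match a
        if np.max(np.abs(u - u_prev)) < tol:
            break

    P = u[:, None] * K * v[None, :]      # approximate transport plan
    return float(np.sum(P * C)), P       # transport cost of that plan


C = np.array([[0.0, 2.0],
              [1.0, 1.0]])
cost, P = sinkhorn([0.5, 0.5], [0.5, 0.5], C)
print(cost)
```

Each iteration is two dense matrix-vector products, which is where the quadratic per-iteration cost comes from. For small `reg`, a log-domain formulation is the usual remedy for underflow, at the price of more expensive iterations.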

3. Differentiable and Deep Learning Integrations

Recent years have seen the development of differentiable EMD implementations, enabling scalable end-to-end optimization for deep learning:

DeepEMD models embed a differentiable EMD computation as a neural network layer by differentiating through the KKT conditions of the LP or by training a deep neural network (CNN or transformer) surrogate using optimal transport matches as supervision (Zhang et al., 2020, Sinha et al., 2023, Shenoy et al., 2023). These approaches yield practical EMD surrogates that are orders of magnitude faster than the exact Hungarian method and outperform pixelwise (MSE) or Chamfer distances as loss functions for tasks such as few-shot classification, point-cloud generative modeling, or calorimeter data compression.
Explicit and Closed-form Gradients: For chain- or tree-connected output spaces, closed-form expressions for EMD and its gradient enable direct computation and stable backpropagation, especially useful for small-data or hierarchical label scenarios (Martinez et al., 2016).
Structure-Aware Losses: Differentiable EMD is leveraged to encode semantic or hierarchical structure in output spaces, yielding better generalization in classification with semantic hierarchies (Martinez et al., 2016). In BERT compression, EMD is central to many-to-many mapping between teacher and student layers during distillation, leading to empirically improved transfer and faster convergence (Li et al., 2020).
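For the chain-connected case, the closed-form gradient can be illustrated with the one-dimensional cumulative formula $\mathrm{EMD}(p, q) = \sum_i |P_i - Q_i|$, where $P$ and $Q$ are cumulative sums. The sketch below is a plain NumPy illustration under assumptions made here, not the formulation of the cited work: the name `emd_chain_loss_and_grad` is hypothetical, the last cumulative gap is dropped because it is identically zero for normalized inputs, and a subgradient is used where a gap is exactly zero.

```python
import numpy as np


def emd_chain_loss_and_grad(p, q):
    """1D EMD between normalized histograms p, q and its (sub)gradient in p."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)

    # Cumulative gaps for bins 1..n-1; the n-th gap vanishes when both sum to 1.
    diff = (np.cumsum(p) - np.cumsum(q))[:-1]
    loss = float(np.abs(diff).sum())

    # p[k] enters every cumulative sum P_i with i >= k, so
    # d loss / d p[k] = sum over i >= k of sign(P_i - Q_i).
    grad = np.zeros_like(p)
    grad[:-1] = np.cumsum(np.sign(diff)[::-1])[::-1]
    return loss, grad


p = np.array([0.1, 0.6, 0.3])
q = np.array([0.5, 0.3, 0.2])
loss, grad = emd_chain_loss_and_grad(p, q)
print(loss, grad)
```

In a training loop, `p` would typically be a softmax output, and this gradient would be chained with the softmax Jacobian by the framework's automatic differentiation or by hand.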

4. Theoretical Properties and Extensions

EMD is a metric on probability distributions provided the ground cost is a metric, and it can be extended to measures with unequal mass by introducing "dummy" nodes or by penalizing total mass differences. The set-theoretic perspective reveals EMD as the minimum cost to transform the support of one set into another, interpolating between set difference (for indicator functions and discrete cost) and general histogram comparison. The Earth Mover’s Intersection (EMI) is a positive-definite analog to EMD for the kernel learning context, and positive-definite kernels are constructed by exponentiating EMD or composing it with intersection kernels or Jaccard-type normalizations (Gardner et al., 2015).
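The dummy-node construction for unequal masses can be sketched as follows. This is one illustrative variant under assumptions made here: the function name `emd_with_dummy` and the `dummy_cost` parameter (a flat per-unit penalty for surplus mass) are hypothetical, and the balanced problem is solved with SciPy's general LP solver.

```python
import numpy as np
from scipy.optimize import linprog


def emd_with_dummy(w_u, w_v, C, dummy_cost=0.0):
    """Transport cost between unequal masses, balanced with a dummy node."""
    w_u = np.asarray(w_u, dtype=float)
    w_v = np.asarray(w_v, dtype=float)
    C = np.asarray(C, dtype=float)

    gap = w_u.sum() - w_v.sum()
    if gap > 0:
        # Source side is heavier: a dummy sink absorbs the surplus.
        w_v = np.append(w_v, gap)
        C = np.hstack([C, np.full((C.shape[0], 1), dummy_cost)])
    elif gap < 0:
        # Sink side is heavier: a dummy source supplies the deficit.
        w_u = np.append(w_u, -gap)
        C = np.vstack([C, np.full((1, C.shape[1]), dummy_cost)])

    m, n = C.shape
    # Row-sum and column-sum equality constraints on the flattened flow.
    A_eq = np.vstack([np.kron(np.eye(m), np.ones((1, n))),
                      np.kron(np.ones((1, m)), np.eye(n))])
    b_eq = np.concatenate([w_u, w_v])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.fun


# Two unit masses at 0 and 1 versus a single unit mass at 0.
C = np.array([[0.0],
              [1.0]])
print(emd_with_dummy([1.0, 1.0], [1.0], C))                  # surplus discarded for free
print(emd_with_dummy([1.0, 1.0], [1.0], C, dummy_cost=0.5))  # surplus penalized
```

With `dummy_cost=0` the surplus mass is simply ignored, giving a partial-matching cost; a positive `dummy_cost` penalizes the mass difference, which is the second option mentioned above.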

Generalizations to multiway EMD (comparing more than two histograms) connect to geometric volume formulas: for three distributions, the three-way EMD equals half the sum of the pairwise EMDs, an analogue of Heron's formula, while for larger numbers of distributions the multiway EMD obeys a Cayley–Menger-type determinant structure in terms of edge (pairwise) and facet (lower-dimensional simplex) distances (Erickson, 2024, Erickson, 2023).
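Written out, the three-distribution statement reads as below. The symbols $\mu_1, \mu_2, \mu_3$ are notation introduced here for illustration. The second line is an elementary identity for three real numbers, offered only as a plausibility check under the assumption that the three-way ground cost on the line is the spread $\max - \min$; it is not quoted from the cited papers.

```latex
\mathrm{EMD}(\mu_1, \mu_2, \mu_3)
  = \tfrac{1}{2}\,\bigl[\,\mathrm{EMD}(\mu_1, \mu_2)
      + \mathrm{EMD}(\mu_1, \mu_3)
      + \mathrm{EMD}(\mu_2, \mu_3)\,\bigr]

\max(a, b, c) - \min(a, b, c)
  = \tfrac{1}{2}\,\bigl(\,|a - b| + |a - c| + |b - c|\,\bigr)
  \qquad \text{for all } a, b, c \in \mathbb{R}
```

The right-hand side of the first line plays the role of the semi-perimeter $s = \tfrac{1}{2}(a + b + c)$ in Heron's formula, with pairwise EMDs in place of side lengths.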

5. Efficient Indexing, Approximate Search, and Large-Scale Data Applications

Large databases require fast sublinear search and retrieval methods based on EMD. Classical approaches use projections to one-dimensional spaces, Gaussian (normal) approximations for cumulative distributions, and lower bounds via dominance spaces to organize database objects in specialized multidimensional trees for scalable K-nearest neighbor queries (Ruttenberg et al., 2011). Data-dependent hashing achieves optimal sketching for EMD-aware nearest neighbor search (Jayaram et al., 2024).
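The one-dimensional projection idea can be sketched as a cheap lower bound for filter-and-refine search. The sketch assumes a Euclidean ground distance, under which projecting points onto a unit direction cannot increase distances, so the 1D EMD of the projections never exceeds the full EMD. The function name `projected_lower_bound` is hypothetical; `scipy.stats.wasserstein_distance` computes the 1D distance.

```python
import numpy as np
from scipy.stats import wasserstein_distance


def projected_lower_bound(X, w_x, Y, w_y, directions):
    """Largest 1D EMD over the given projection directions.

    Valid as a lower bound on the full EMD for Euclidean ground distance.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    best = 0.0
    for theta in np.atleast_2d(np.asarray(directions, dtype=float)):
        theta = theta / np.linalg.norm(theta)  # unit vector: 1-Lipschitz projection
        d = wasserstein_distance(X @ theta, Y @ theta,
                                 u_weights=w_x, v_weights=w_y)
        best = max(best, d)
    return best


# Two single-point distributions whose exact EMD is the distance 5.
X, Y = [[0.0, 0.0]], [[3.0, 4.0]]
print(projected_lower_bound(X, [1.0], Y, [1.0], [[1.0, 0.0], [0.0, 1.0]]))
```

In a K-nearest-neighbor query, a database object whose lower bound already exceeds the current K-th best exact distance can be skipped without solving the full transport problem.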

Diffusion-based variants embed distributions on a metric graph and use the multiscale diffusion of mass, leveraging spectral graph operators to compute EMD surrogates that approximate the geodesic EMD in $\tilde{O}(n)$ time, with strong empirical and theoretical guarantees on sampled manifolds and biological data (Tong et al., 2021).

Practical large-scale image retrieval, document matching, and point cloud analysis all benefit from these methods' robustness and geometric fidelity, achieving state-of-the-art accuracy and efficiency by integrating data-parallel search, diffusion embeddings, and deep learning surrogates (Meng et al., 2024, Tong et al., 2021, Atasu et al., 2018).

6. Applications and Theoretical Insights

EMD is prominent across:

Computer vision: robust image retrieval, object tracking, and color transfer between images (Sinha et al., 2023, Yao et al., 2018).
Machine learning: loss functions for generative models, few-shot and hierarchical classification, optimal transport-based regularizations (autoencoders, GANs, dictionary learning) (Zhang et al., 2020, Shenoy et al., 2023, Martinez et al., 2016, Fan et al., 2016).
Signal processing and physics: misfit functions for waveform inversion in geophysics, exploiting EMD's convexity and sensitivity to event shift (Yong et al., 2018).
Data analysis and clustering: as a metric for comparing empirical histograms (e.g., grade distributions across classes), with strong connections to the Segre embedding and algebraic geometry (Bourn et al., 2019, Erickson, 2020, Erickson, 2024).
Topological data analysis: multiway EMD induces filtrations on cubes of distributions, facilitating the analysis of higher-order relationships (Erickson, 2023).

7. Geometric, Statistical, and Combinatorial Structure

EMD's underlying geometry provides a natural metric on the space of probability measures, aligning well with the intrinsic structure of datasets sampled from manifolds. Analytic results include explicit expected values of EMD under Dirichlet or uniform laws (via generating functions related to Hilbert series), combinatorial algorithms for multiway comparisons, and geometric analogies such as the earth mover's simplex, further linking EMD to classical volume formulas and invariants beyond pairwise relationships (Erickson, 2024, Erickson, 2023, Bourn et al., 2019). Statistical applications use the expected EMD as a baseline for empirical comparisons and null models.

References (arXiv IDs for further reading): (Zhang et al., 2020, Erickson, 2024, Shenoy et al., 2023, Li et al., 2020, Erickson, 2020, Meng et al., 2024, Erickson, 2023, Ruttenberg et al., 2011, Martinez et al., 2016, Tong et al., 2021, Sinha et al., 2023, Jayaram et al., 2024, Treleaven et al., 2013, Atasu et al., 2018, Gardner et al., 2015, Bourn et al., 2019, Rohatgi, 2019, Fan et al., 2016, Yao et al., 2018, Yong et al., 2018).