Optimal Transport Distances
- Optimal transport distances are metrics that quantify the minimal cost required to morph one probability distribution into another, formalized through the Monge and Kantorovich formulations.
- Algorithmic approaches like Sinkhorn regularization enable scalable, GPU-accelerated computations in high-dimensional settings, balancing efficiency with controlled bias.
- Extensions such as partial, directed, and graph-based optimal transport broaden applications in physics, machine learning, and network analysis.
Optimal transport distances quantify the minimal "effort" required to morph one probability distribution into another according to a specified cost function, under mass conservation constraints. Their theoretical framework, rooted in the Monge and Kantorovich formulations, underpins a diverse family of metrics—prominently, the Wasserstein distances—that facilitate geometric, statistical, and structural comparisons of distributions, signals, and even structured objects such as graphs and Markov models. Recent advances have extended their scope to regularized, partial, directed, and causal variants, and brought scalable solvers suitable for high-dimensional and large-scale settings.
1. Mathematical Foundations
Optimal transport distances are fundamentally defined through the minimization of total transportation cost. The classical Kantorovich formulation, for probability measures $\mu$ and $\nu$ on a metric space $(X, d)$ with cost function $c : X \times X \to \mathbb{R}_{\ge 0}$, seeks a coupling $\pi$ (a joint measure with marginals $\mu$ and $\nu$) that minimizes

$$ \mathrm{OT}(\mu, \nu) \;=\; \inf_{\pi \in \Pi(\mu, \nu)} \int_{X \times X} c(x, y)\, \mathrm{d}\pi(x, y). $$

For $c(x, y) = d(x, y)^p$ with $p \ge 1$, the $p$-th root of this cost yields the $p$-Wasserstein distance $W_p$ (Levy et al., 2017). The Monge formulation is map-based and generally non-convex; Kantorovich's relaxation allows mass splitting.
Duality principles connect these primal problems to maximization over pairs of potential functions, leading to connections with $1$-Lipschitz functions for $p = 1$ (Kantorovich-Rubinstein duality) and convex potentials for $p = 2$ (Brenier's theorem) (Levy et al., 2017). These formulations induce a Riemannian structure on spaces of measures and enable displacement interpolation (so-called geodesics in Wasserstein space).
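To make the one-dimensional case concrete, the following minimal sketch (our own illustration, using only NumPy) computes $W_p$ between two equal-size empirical measures; in 1-D the optimal coupling is monotone, so matching sorted samples (i.e., quantiles) solves the problem exactly.

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """Exact p-Wasserstein distance between two equal-size 1-D empirical measures.

    In 1-D the optimal Monge map is monotone, so matching sorted samples
    (equivalently, matching quantiles) solves the transport problem exactly.
    """
    xs, ys = np.sort(x), np.sort(y)
    return float(np.mean(np.abs(xs - ys) ** p) ** (1.0 / p))

rng_x, rng_y = np.random.default_rng(0), np.random.default_rng(1)
x = rng_x.normal(0.0, 1.0, size=5000)
y = rng_y.normal(0.5, 1.0, size=5000)
print(wasserstein_1d(x, y, p=2))  # ~0.5: W2 between N(0,1) and N(0.5,1) equals the mean shift
```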
2. Algorithmic Approaches and Complexity
Several algorithmic regimes have emerged for computing optimal transport distances, driven by application scale and regularity requirements (Dong et al., 2020, Blanchet et al., 2018):
- Combinatorial exact solvers: The network simplex and augmenting-path (Hungarian/Kuhn-Munkres) methods are exact and strongly polynomial, and practical for small-to-moderate sample sizes $n$, but their running time scales roughly cubically in $n$ or worse, ruling out very large problems.
- Entropic regularization: The addition of an entropy penalty (Sinkhorn regularization) leads to strictly convex problems solvable via matrix scaling (Sinkhorn-Knopp iterations), admitting GPU acceleration and differentiability, but incurring a bias controlled by the regularization parameter $\varepsilon$. Iteration counts grow as $\varepsilon$ shrinks; the scheme is numerically stable for moderate $\varepsilon$, while the entropic bias grows with $\varepsilon$ and underflow threatens as $\varepsilon \to 0$ (Dong et al., 2020). A minimal sketch appears after this list.
- Nearly-linear or quantization-based methods: Reductions to packing LPs and matrix balancing have enabled solvers whose running time is nearly linear in the size of the cost matrix for additive-$\varepsilon$ error (Blanchet et al., 2018). Quantization (via $k$-means or clustering) achieves accelerated rates for clusterable data, and sliced or minibatch strategies enable tractable large-scale or streaming computations (Beugnot et al., 2021, Fatras et al., 2021).
- Primal-dual and multilevel methods: For Wasserstein-1 and other continuous cases, multilevel primal-dual schemes with coarse-to-fine acceleration dramatically reduce iteration counts in high-resolution grids (Liu et al., 2018).
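The Sinkhorn scheme from the second bullet admits a compact implementation. The sketch below (ours; NumPy only, without log-domain stabilization) alternates the two matrix-scaling updates; for very small $\varepsilon$ a log-domain variant is needed to avoid underflow.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iter=500):
    """Entropy-regularized OT via Sinkhorn-Knopp matrix scaling.

    a, b : marginal weights (each summing to 1); C : cost matrix.
    Returns the transport plan P and the transport cost <P, C>.
    """
    K = np.exp(-C / eps)                    # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)                   # scale columns toward marginal b
        u = a / (K @ v)                     # scale rows toward marginal a
    P = u[:, None] * K * v[None, :]         # primal transport plan
    return P, float(np.sum(P * C))

# Two point clouds in R^2 with squared-Euclidean ground cost.
rng = np.random.default_rng(0)
x, y = rng.normal(size=(50, 2)), rng.normal(loc=1.0, size=(60, 2))
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
a, b = np.full(50, 1 / 50), np.full(60, 1 / 60)
P, cost = sinkhorn(a, b, C)
print(cost)  # biased toward the entropic optimum; shrink eps to approach exact OT
```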
The table summarizes key algorithmic strategies and their properties:
| Method | Complexity (indicative, $n$ samples) | Differentiable | Typical Use Cases |
|---|---|---|---|
| Network simplex / Hungarian | $O(n^3 \log n)$ or worse | No | Exact small-scale OT |
| Sinkhorn (entropy-regularized) | $O(n^2)$ per iteration | Yes | ML, deep learning, large $n$ |
| Packing LP / matrix scaling | nearly linear in cost-matrix size | No | High-accuracy OT |
| Quantization + OT | clustering cost + OT on $k \ll n$ centroids | Yes/No | Large $n$, structured data |
| Multilevel primal-dual | nearly linear in grid size in 2D | Yes/No | Images, PDEs, grids |
3. Generalizations and Structural Extensions
Optimal transport theory supports a variety of extensions to accommodate application requirements:
- Entropic Regularization and Sinkhorn Divergence: Adding an $\varepsilon$-scaled entropy term enables differentiable, stable metrics suited for optimization within machine learning. The Sinkhorn divergence debiases the regularized cost, restoring metric properties and improving behavior in the small-$\varepsilon$ regime (Fatras et al., 2021); its standard form is given after this list.
- Partial and Unbalanced Transport: Partial transport $L^p$ (PTL$^p$) distances interpolate between $L^p$ and OT distances by penalizing unmatched mass with a parameter $\lambda$, and extend naturally to signed and multi-channel signals via graph-lifting strategies. Sliced versions enable scalability and practical high-dimensional metric learning (Liu et al., 2023).
- Directed and Causal OT: Directed OT distances, including those defined via directed quantile divergences and for causal models structured by directed acyclic graphs, generalize classical OT by incorporating asymmetry, conditional independence or directionality. These enable causal Wasserstein metrics controlling transport under latent variable models or in dynamical settings (Stummer, 2021, Cheridito et al., 2023).
- Structural and Graph-Based OT: Fused Gromov-Wasserstein (FGW) and Gromov-Wasserstein (GW) distances enable comparison of attributed and (possibly directed) graphs by jointly optimizing structural and feature alignment, supporting applications in graph embedding, classification, and structured data analysis. These distances admit efficient conditional-gradient and Sinkhorn-type solvers (Vincent-Cuaz et al., 2022, Nagai et al., 2023).
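For reference, the debiasing mentioned in the first bullet above takes the standard form: writing $\mathrm{OT}_\varepsilon$ for the entropy-regularized cost,

$$ S_\varepsilon(\mu, \nu) \;=\; \mathrm{OT}_\varepsilon(\mu, \nu) \;-\; \tfrac{1}{2}\,\mathrm{OT}_\varepsilon(\mu, \mu) \;-\; \tfrac{1}{2}\,\mathrm{OT}_\varepsilon(\nu, \nu), $$

which vanishes at $\mu = \nu$ and interpolates between exact OT (as $\varepsilon \to 0$) and a kernel/MMD-type discrepancy (as $\varepsilon \to \infty$).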
4. Practical Applications and Diagnostic Use
Optimal transport distances are employed as diagnostics, loss functions, or geometric metrics in a wide variety of settings:
- Physics and Chemistry: Sinkhorn-divergence-based OT measures, together with their length-scale normalizations, are used to characterize electronic excitation delocalization, charge transfer, and Rydberg nature, overcoming overlap-based diagnostic plateaus and providing continuous, length-scale-aware signatures (Lieberherr et al., 2023).
- Machine Learning and Signal Processing: OT loss functions underpin generative adversarial training (as in Wasserstein GANs), domain adaptation, and barycenter computation. Protocols such as Online Sinkhorn allow for unbiased gradient estimation in streaming settings (Mensch et al., 2020, Laschos et al., 2019).
- Graph and Network Analysis: Graph OT distances integrate node and structural features for graph classification, while targeted variants efficiently compare directed cell-cell communication networks, integrating domain-specific structural metrics (Vincent-Cuaz et al., 2022, Nagai et al., 2023).
- Clustering, Retrieval, and Visualization: OT-based distances—especially when parameterized for robustness via partial transport or regularized for differentiability—are leveraged for clustering, shape retrieval, and multidimensional scaling (MDS) in structured and high-dimensional data (Liu et al., 2023, Calo et al., 2024); a sliced-Wasserstein sketch suited to such pipelines follows this list.
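As an example of the sliced strategy referenced above, the following sketch (ours; NumPy only) estimates the sliced Wasserstein distance by averaging closed-form 1-D distances over random projections, which is what makes OT-based clustering and MDS tractable at scale.

```python
import numpy as np

def sliced_wasserstein(x, y, n_proj=200, p=2, seed=0):
    """Monte Carlo sliced p-Wasserstein between equal-size point clouds in R^d.

    Each random direction reduces the problem to 1-D, where sorting gives
    the exact transport distance; results are averaged over projections.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=x.shape[1])
        theta /= np.linalg.norm(theta)          # uniform random direction on the sphere
        px, py = np.sort(x @ theta), np.sort(y @ theta)
        total += np.mean(np.abs(px - py) ** p)  # exact 1-D W_p^p on the projection
    return float((total / n_proj) ** (1.0 / p))

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 10))
y = rng.normal(loc=0.3, size=(500, 10))
print(sliced_wasserstein(x, y))
```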
5. Advanced Topics: Chain Rule, Causal, and Bisimulation Metrics
Chain-Rule OT distances generalize traditional OT to statistical models with latent variables by optimizing over couplings between mixture component indices, with ground costs defined on the conditional distributions. When the ground cost is a jointly convex divergence, this provides an upper bound on the divergence between the induced marginals; the formulation unifies mixture matching with classical OT and admits efficient entropic-regularized solvers (SCROT Sinkhorn) (Nielsen et al., 2018).
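As a minimal illustration of this idea (our own construction, not the SCROT solver itself): couple two 1-D Gaussian mixtures by solving an exact OT problem over component indices, with the closed-form $W_2^2$ between Gaussian components as ground cost; the resulting value upper-bounds the squared $W_2$ distance between the mixtures.

```python
import numpy as np
from scipy.optimize import linprog

def gauss_w2_sq(m1, s1, m2, s2):
    # Closed-form squared 2-Wasserstein distance between 1-D Gaussians.
    return (m1 - m2) ** 2 + (s1 - s2) ** 2

def mixture_ot(w1, comps1, w2, comps2):
    """Exact OT over mixture component indices (ground cost: W2^2 between components)."""
    k1, k2 = len(w1), len(w2)
    C = np.array([[gauss_w2_sq(*c1, *c2) for c2 in comps2] for c1 in comps1])
    A_eq = np.zeros((k1 + k2, k1 * k2))
    for i in range(k1):
        A_eq[i, i * k2:(i + 1) * k2] = 1.0      # row i of the plan sums to w1[i]
    for j in range(k2):
        A_eq[k1 + j, j::k2] = 1.0               # column j of the plan sums to w2[j]
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([w1, w2]),
                  bounds=(0, None))
    return res.fun

# Two Gaussian mixtures given as weights and (mean, std) components.
w1, comps1 = np.array([0.7, 0.3]), [(0.0, 1.0), (4.0, 0.5)]
w2, comps2 = np.array([0.5, 0.5]), [(0.5, 1.0), (5.0, 1.0)]
print(mixture_ot(w1, comps1, w2, comps2))  # upper-bounds W2^2 between the mixtures
```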
G-causal Wasserstein distances formalize transport between joint distributions respecting specified dependency graphs, enabling optimal transport distances that are sensitive to the underlying causal structure. The associated interpolation schemes guarantee preservation of -compatibility along geodesics, and provide Lipschitz control of functionals such as average treatment effects (Cheridito et al., 2023).
Bisimulation metrics, fundamentally OT distances between Markov chains, are shown to coincide exactly with discounted optimal transport costs in an extended occupancy measure space. Recent advances leverage entropy regularization and Sinkhorn Value Iteration (SVI) to solve these bisimulation OT LPs efficiently, achieving practical scalability to large Markov models and supporting interpretability via transport-plan analysis and embedding (Calo et al., 2024).
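To ground the connection, one classical fixed-point characterization (the Ferns-style recursion, stated here for state rewards; the occupancy-measure LP above is an exact reformulation of such objects) reads

$$ d(s, s') \;=\; (1 - \gamma)\,\bigl| r(s) - r(s') \bigr| \;+\; \gamma\, W_1^{d}\!\bigl( P(\cdot \mid s),\, P(\cdot \mid s') \bigr), $$

so each fixed-point update itself solves an OT problem whose ground cost is the current metric $d$, which is exactly where entropic regularization and Sinkhorn iterations offer leverage.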
6. Future Directions and Methodological Innovations
Key future directions include:
- Combining length-scale-normalized OT measures with overlap or higher-order diffuseness features to robustly classify physical or electronic state types.
- Integrating other OT metrics (standard, sliced-Wasserstein, or with alternative cost functions) with advanced classifiers to approach performance ceilings beyond 95% in demanding classification contexts (Lieberherr et al., 2023).
- Exploiting low-dimensional, Lipschitz-parameterized embeddings and approximate computation (by quantization, slicing, or minibatching) for scalable, high-dimensional OT (Fulop et al., 2021, Beugnot et al., 2021).
- Extending causality- and structure-aware OT formulations to dynamical, signed, or multi-modal data, and further developing efficient solvers that bridge differentiability and combinatorial performance.
Collectively, optimal transport distances now constitute a flexible toolkit for defining, computing, and deploying geometric, statistical, and structural metrics across the physical, biological, and data sciences, with continuing theoretical, algorithmic, and application-driven growth.