Statistical Optimal Transport
- Statistical optimal transport is a framework for estimating optimal transport distances and maps from finite sample data, integrating inference, convex analysis, and empirical process theory.
- Methodologies such as plug-in estimators, semidual approaches, and entropic regularization mitigate high-dimensional challenges and, in regularized or structured settings, achieve dimension-free convergence rates.
- Applications span machine learning, generative modeling, domain adaptation, and robust inference, with rigorous performance guarantees and uncertainty quantification.
Statistical optimal transport (SOT) is the field concerned with inference, estimation, and computational analysis of optimal transport (OT) quantities—distances, couplings, and maps—given finite samples from unknown probability distributions. SOT lies at the intersection of mathematical statistics, high-dimensional probability, convex analysis, empirical process theory, and computational mathematics, with applications spanning machine learning, biology, generative modeling, and information geometry. The central statistical problem is to consistently and efficiently recover population-level OT functionals (e.g., Wasserstein distances or Monge maps) based solely on sample data, and to provide rigorous uncertainty quantification, rates of convergence, and efficient algorithms adapted to high-dimensional or structured regimes.
1. Formulations and Theoretical Foundations
Let $\mu, \nu$ be Borel probability measures on $\mathbb{R}^d$ and let $c : \mathbb{R}^d \times \mathbb{R}^d \to [0, \infty]$ be a lower semicontinuous cost (commonly, $c(x, y) = \|x - y\|^p$ for $p \ge 1$).
- Monge Problem: Find a measurable map $T : \mathbb{R}^d \to \mathbb{R}^d$ with $T_\# \mu = \nu$ minimizing $\int c(x, T(x)) \, d\mu(x)$.
- Kantorovich Problem: Minimize $\int c(x, y) \, d\pi(x, y)$ over couplings $\pi \in \Pi(\mu, \nu)$ with marginals $\mu$, $\nu$. This form always admits a solution and defines the $p$-Wasserstein distance:
$$W_p(\mu, \nu) = \Big( \inf_{\pi \in \Pi(\mu, \nu)} \int \|x - y\|^p \, d\pi(x, y) \Big)^{1/p}.$$
- Duality: The dual problem involves maximizing $\int \varphi \, d\mu + \int \psi \, d\nu$ subject to $\varphi(x) + \psi(y) \le c(x, y)$, with $\varphi, \psi$ in suitable function spaces. The optimal map exists and equals the gradient of a convex function (the Brenier map) when $c(x, y) = \tfrac{1}{2}\|x - y\|^2$ and $\mu$ is absolutely continuous (Chewi et al., 2024, Balakrishnan et al., 23 Jun 2025).
Statistical optimal transport is concerned with estimating these quantities using the empirical measures $\hat\mu_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i}$ and $\hat\nu_n = \frac{1}{n} \sum_{i=1}^n \delta_{Y_i}$ formed from i.i.d. samples.
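In the discrete case the Kantorovich problem is a finite linear program, which makes the formulation above concrete. A minimal sketch in Python, assuming uniform weights on two small one-dimensional point clouds and the cost $c(x, y) = |x - y|^p$; `scipy.optimize.linprog` serves as a generic LP solver:

```python
# Discrete Kantorovich problem solved as a linear program.
# Illustrative sketch: uniform marginals, cost c(x, y) = |x - y|^p.
import numpy as np
from scipy.optimize import linprog

def kantorovich_lp(x, y, p=2):
    """Solve min_pi <C, pi> s.t. pi 1 = a, pi^T 1 = b, pi >= 0; return W_p."""
    n, m = len(x), len(y)
    a = np.full(n, 1.0 / n)                    # source marginal
    b = np.full(m, 1.0 / m)                    # target marginal
    C = np.abs(x[:, None] - y[None, :]) ** p   # cost matrix
    # Equality constraints on the flattened plan: row sums = a, column sums = b.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0       # row-sum constraints
    for j in range(m):
        A_eq[n + j, j::m] = 1.0                # column-sum constraints
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.fun ** (1.0 / p)

print(kantorovich_lp(np.array([0.0, 1.0]), np.array([1.0, 2.0])))
```

The LP has $n \times m$ variables, which is exactly why the scalable methods of Section 2 (Sinkhorn iterations, low-rank couplings) matter beyond toy sizes.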
2. Statistical Estimation and Computational Methodologies
2.1 Plug-in and Semidual Approaches
The plug-in estimator replaces $\mu$ and $\nu$ by their empirical distributions in the OT problem, yielding
$$\hat{W}_p = W_p(\hat\mu_n, \hat\nu_n),$$
and analogously for couplings and maps (Chewi et al., 2024, Balakrishnan et al., 23 Jun 2025). However, in high dimensions, this approach suffers from the classical curse of dimensionality, with convergence rates of order $n^{-1/d}$ for $d \ge 3$ (Forrow et al., 2018, Ding et al., 2024).
The semidual approach estimates the Kantorovich (Brenier) potential by solving the empirical semidual problem
$$\hat\varphi \in \operatorname*{arg\,min}_{\varphi \ \text{convex}} \int \varphi \, d\hat\mu_n + \int \varphi^* \, d\hat\nu_n,$$
where $\varphi^*$ denotes the convex conjugate. The estimated OT map is then $\hat T = \nabla \hat\varphi$, provided $c(x, y) = \tfrac{1}{2}\|x - y\|^2$ (Chewi et al., 2024, Balakrishnan et al., 17 Feb 2025).
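On the real line the plug-in estimator is especially simple: the optimal coupling of two equal-size samples matches sorted order statistics, so the empirical $W_p$ reduces to a sort. A minimal illustration; the Gaussian test case and sample size are arbitrary choices:

```python
# One-dimensional plug-in estimator: for equal-size samples, the optimal
# empirical coupling pairs sorted order statistics (quantile coupling).
import numpy as np

def plugin_w_p(x, y, p=2):
    """Empirical W_p between the uniform empirical measures of two n-point samples."""
    xs, ys = np.sort(x), np.sort(y)
    return (np.mean(np.abs(xs - ys) ** p)) ** (1.0 / p)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=2000)
y = rng.normal(1.0, 1.0, size=2000)
# Population W_2 between N(0,1) and N(1,1) is exactly 1 (a pure shift).
print(plugin_w_p(x, y, p=2))
```

In dimension one this estimator already attains fast rates; the curse of dimensionality discussed above only bites for $d \ge 3$.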
2.2 Regularization and Dimension-Free Methods
Because plug-in estimation scales poorly in high dimensions, several regularized and structural approaches have been proposed:
- Entropic Regularization: The entropic OT problem adds an entropy penalty $\varepsilon \, \mathrm{KL}(\pi \,\|\, \mu \otimes \nu)$ to the Kantorovich objective, leading to computationally tractable Sinkhorn algorithms and dimension-independent statistical rates of order $n^{-1/2}$ for fixed $\varepsilon > 0$ (Goldfeld et al., 2022, Chewi et al., 2024).
- Factored Couplings/Transport Rank: Low-rank structure is imposed on couplings; the FactoredOT algorithm constructs transport plans with low transport rank via alternating minimization over cluster centers and transport plans using entropic Sinkhorn regularization. This breaks the curse, achieving parametric rates dependent only on the transport rank $k$, rather than the ambient dimension $d$ (Forrow et al., 2018).
- Kernel Mean Embedding: OT is reformulated as learning a kernel mean embedding of the transport plan, regularized by the maximum mean discrepancy (MMD), yielding dimension-free sample complexity (Nath et al., 2020).
- RKHS/Infinite-Dimensional SOS Methods: For smooth densities, sum-of-squares representations and kernel embeddings yield estimators with sample and time exponents independent of the dimension for sufficiently high smoothness, circumventing the curse at the expense of exponentially large constants (Vacher et al., 2021).
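The Sinkhorn iteration underlying entropic regularization alternately rescales the Gibbs kernel $\exp(-C/\varepsilon)$ to match the two marginals. A minimal NumPy sketch; the grid, marginals, and $\varepsilon$ are illustrative choices, not tuned values:

```python
# Sinkhorn iterations for entropic OT: the optimal entropic plan is a
# diagonal scaling u, v of the Gibbs kernel K = exp(-C / eps).
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iter=2000):
    """Return the entropic OT plan and its transport cost <C, pi>."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)            # rescale to match column marginals
        u = a / (K @ v)              # rescale to match row marginals
    pi = u[:, None] * K * v[None, :]
    return pi, float(np.sum(pi * C))

x = np.linspace(0.0, 1.0, 50)
a = np.full(50, 1.0 / 50)
C = (x[:, None] - x[None, :]) ** 2   # squared-distance cost on a 1D grid
pi, cost = sinkhorn(a, a, C, eps=0.01)
print(cost)
```

After the final $u$-update the row marginals of the plan are matched exactly, while the column marginals converge linearly; with identical marginals the plan concentrates near the diagonal, and the residual cost reflects the entropic bias at the chosen $\varepsilon$.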
2.3 Neural and Adversarial OT Solvers
Dual potentials and transport maps are parameterized using input-convex neural networks (ICNNs). Adversarial or minimax (semidual) neural optimization schemes provide end-to-end statistical guarantees for the learned map, with errors controlled by Rademacher complexity and architecture capacity. Convergence rates of order $n^{-1/2}$ plus an approximation bias are obtained under strong convexity and boundedness assumptions (Tarasov et al., 3 Feb 2025, Ding et al., 2024).
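As a sketch of the structural idea only (not a trained OT solver), the following NumPy snippet builds a one-hidden-layer input-convex network and numerically checks convexity along a segment; the architecture and the random weights are purely illustrative:

```python
# Minimal input-convex network forward pass in NumPy. Convexity in the
# input holds because the second-layer weights acting on hidden units are
# constrained nonnegative and the activation (ReLU) is convex nondecreasing.
# Weights are random illustrative values, not a fitted Brenier potential.
import numpy as np

rng = np.random.default_rng(1)
d, h = 2, 16
W_x0 = rng.normal(size=(h, d))            # first layer: any sign allowed
W_z1 = np.abs(rng.normal(size=(1, h)))    # must be >= 0 for convexity
W_x1 = rng.normal(size=(1, d))            # direct input link: any sign

def icnn(x):
    z1 = np.maximum(W_x0 @ x, 0.0)        # ReLU of affine map: convex in x
    return float(W_z1 @ z1 + W_x1 @ x)    # nonneg mix of convex fns + affine

# Numerical convexity check along a random segment:
x0, x1 = rng.normal(size=d), rng.normal(size=d)
for t in np.linspace(0.0, 1.0, 11):
    lhs = icnn(t * x1 + (1 - t) * x0)
    rhs = t * icnn(x1) + (1 - t) * icnn(x0)
    assert lhs <= rhs + 1e-9
print("convex along segment")
```

The convexity constraint is what lets the gradient of such a network represent a candidate Brenier map in semidual neural OT solvers.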
3. Convergence Rates, Stability, and Minimax Theory
The convergence rates depend critically on the regularity of the underlying distributions, the complexity of the function class, and imposed structure:
- General Distributions: Plug-in estimators for $W_p$ satisfy $\mathbb{E}\,|W_p(\hat\mu_n, \hat\nu_n) - W_p(\mu, \nu)| \lesssim n^{-1/d}$ for $d \ge 3$, with similar dimension-dependent rates for couplings and maps (Chewi et al., 2024).
- Smooth Densities / Log-Concave: If both $\mu$ and $\nu$ have smooth densities supported on convex bodies, the minimax-optimal rate for the $L^2$-risk of OT map estimation improves with the smoothness $s$, approaching the parametric rate $n^{-1/2}$ in the high-smoothness regime (Balakrishnan et al., 17 Feb 2025), while for density estimation, faster rates are achievable when the densities are sufficiently smooth (Balakrishnan et al., 23 Jun 2025).
- Local Poincaré Inequalities: Novel local Poincaré-type inequalities allow variance control for differences of smooth potentials under only local density and mild topological conditions, giving parametric or near-parametric rates in Donsker regimes for broad function classes (Ding et al., 2024).
- Low Transport Rank: Estimation errors depend only on the transport rank $k$ and not directly on the dimension, e.g., $n^{-1/2}$-type rates with constants depending on $k$ (Forrow et al., 2018).
- Entropic and Sliced OT: Regularization (entropic, slicing, kernel smoothing) yields dimension-free rates in many settings (Goldfeld et al., 2022).
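Sliced OT illustrates how such regularizations trade exactness for dimension-robust computation: each random projection reduces to a one-dimensional sort. A sketch of the sliced Wasserstein distance; the number of projections and the Gaussian test case are arbitrary choices:

```python
# Sliced Wasserstein: average the 1D W_2 over random projection directions.
# Each slice costs O(n log n) via sorting; the Monte Carlo average over
# slices does not degrade with the ambient dimension d.
import numpy as np

def sliced_w2(x, y, n_proj=200, seed=0):
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)           # uniform direction on sphere
        px, py = np.sort(x @ theta), np.sort(y @ theta)
        total += np.mean((px - py) ** 2)         # exact 1D W_2^2 on the slice
    return np.sqrt(total / n_proj)

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=(1000, 10))
y = rng.normal(0.0, 1.0, size=(1000, 10)) + 1.0  # shift every coordinate by 1
print(sliced_w2(x, y))
```

For the pure shift by the vector $(1, \ldots, 1)$ in $d = 10$, the averaged squared projection of the shift is $\|m\|^2 / d = 1$, so the statistic concentrates near 1 regardless of $d$.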
Non-asymptotic stability bounds further show that estimation error in the OT map can be upper-bounded by a function of the errors of the input distributions, plus moment and smoothness constants (Balakrishnan et al., 17 Feb 2025, Ding et al., 2024, Balakrishnan et al., 23 Jun 2025).
4. Distributional Limit Theory, Inference, and Robustness
Classical and recent results provide precise distributional descriptions for OT functionals:
- Central Limit Theorems: For one-dimensional Wasserstein distances, $\sqrt{n}\,\big(W_p(\hat\mu_n, \nu) - W_p(\mu, \nu)\big)$ converges to a Gaussian law under mild moment and smoothness conditions (Barrio et al., 25 May 2025, Ponnoprat et al., 2023). In higher dimensions, the limiting variance is given by the variance of the optimal Kantorovich potential evaluated at a sample from the underlying distribution.
- Non-Gaussian Limits: For discrete or semi-discrete distributions, directional Hadamard differentiability yields limit distributions that may not be Gaussian (Sadhu et al., 2023).
- Bootstrap and Confidence Bands: Uniform confidence bands for OT maps on the real line and robust bootstrap procedures (e.g., the $m$-out-of-$n$ bootstrap for directionally Hadamard differentiable cases) have been developed and theoretically validated (Ponnoprat et al., 2023, Hundrieser et al., 2021).
- Robust and Outlier-Resistant OT: The $\varepsilon$-outlier-robust Wasserstein distance allows for trimming of distributions, yielding minimax-optimal rates under Huber-type contamination and practical dual forms for robust inference (Nietert et al., 2021).
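A sketch of a plain $n$-out-of-$n$ bootstrap interval for the one-dimensional $W_1$, using SciPy's exact 1D solver; this assumes the regular, Gaussian-limit regime (in directionally differentiable cases the $m$-out-of-$n$ variant is needed instead), and the sample sizes and replication count are illustrative:

```python
# Bootstrap confidence interval for the 1D W_1 distance.
# scipy.stats.wasserstein_distance computes the exact 1D W_1.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=1500)
y = rng.normal(1.0, 1.0, size=1500)   # pure shift: population W_1 = 1

point = wasserstein_distance(x, y)    # plug-in point estimate
boot = []
for _ in range(300):                  # resample both samples with replacement
    xb = rng.choice(x, size=len(x), replace=True)
    yb = rng.choice(y, size=len(y), replace=True)
    boot.append(wasserstein_distance(xb, yb))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"W1 = {point:.3f}, 95% bootstrap CI [{lo:.3f}, {hi:.3f}]")
```

The percentile interval here is the naive variant; validity hinges on Hadamard differentiability of the OT functional at the population distributions, which is exactly what the theory cited above addresses.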
5. Generalizations, Applications, and Related Measures
5.1 Generalizations of Statistical OT
- Statistical Manifold Embeddings: OT with cumulant-generating costs induces a geometry on the space of probability distributions, connecting OT with information geometry of exponential families (Pal, 2017).
- Chain-Rule Optimal Transport (CROT): OT distances defined on marginals and conditionals yield metrics on mixture models, upper-bounding $f$-divergences in mixture families, and supporting fast Sinkhorn-type computation for hierarchical OT problems (Nielsen et al., 2018).
- Transport Dependency: Transport dependency and its normalized forms provide statistically consistent and flexible correlation-like dependence measures with formal properties analogous to distance correlation, adaptive to the intrinsic metric structure (Nies et al., 2021).
5.2 Applications
- Domain Adaptation: Low-rank statistical OT has demonstrated superior accuracy in transferring biological labels across single-cell RNA-seq protocols (Forrow et al., 2018).
- Generative Modeling: Adversarial neural OT solvers align complex data distributions and enable sample-efficient generative models (Tarasov et al., 3 Feb 2025).
- Dependency and Graphical Modeling: Transport-based dependency coefficients allow general-purpose, high-power independence testing and network construction in genomics (Nies et al., 2021).
- Shape and Mixture Comparison: CROT and related composite OT metrics support principled comparison of complex mixtures and learning simplified Gaussian mixture models (Nielsen et al., 2018).
- Robust Generative Architectures: Plugging robust OT distances into Wasserstein GANs and related architectures yields resilience to contamination without extensive hyperparameter tuning (Nietert et al., 2021).
6. Open Problems and Future Directions
While the theoretical and practical landscape of statistical optimal transport has advanced substantially, several challenging avenues remain:
- Curse of Dimensionality: Characterizing settings where smoothing, low-dimensional structure, or regularization defeat the curse.
- Nonconvex and Nonquadratic Costs: Extending limit theory for maps, plans, and costs to general ground costs and non-Euclidean geometries (Balakrishnan et al., 23 Jun 2025).
- General Weak Convergence: Process-level CLTs for OT maps and potentials in high dimensions are not available outside of the most regular settings.
- Adaptive and Data-Driven Tuning: Optimal, principled selection of smoothing or regularization parameters in Sinkhorn, kernel-smoothed, or sliced OT.
- Unified Computational–Statistical Analysis: The joint behavior of estimation error and iteration complexity for scalable OT algorithms across sample sizes and problem parameters remains an open field.
- Inference with Dependent Data and Time Series: Extending SOT to dependent samples, time-varying distributions, or online/sequential data streams.
Statistical optimal transport thus stands as a central field in modern data-driven mathematics, integrating deep convex-analytic, probabilistic, and algorithmic concepts, and serving as a foundation for robust inference, machine learning, and scientific modeling in high-dimensional and structured domains (Chewi et al., 2024, Balakrishnan et al., 23 Jun 2025, Barrio et al., 25 May 2025, Forrow et al., 2018, Ding et al., 2024).