Statistical Optimal Transport

Updated 30 June 2026

Statistical Optimal Transport is a framework that integrates optimal transport with statistical inference to accurately estimate distances and couplings from finite samples.
It employs regularization techniques such as entropic, sliced, and kernel-based methods to improve convergence rates and mitigate the curse of dimensionality.
SOT provides scalable computational strategies including streaming algorithms and projection methods, making it practical for high-dimensional applications in machine learning and graphics.

Statistical Optimal Transport (SOT) encompasses a family of methodologies and theoretical frameworks that study optimal transport (OT)—the infimum of a transport cost over couplings between probability distributions—in the presence of sampling uncertainty. This area unifies statistical inference, computational optimal transport, and regularization theory to address the fundamental challenge that in practical applications, underlying measures are known only through finite samples. SOT investigates rates of convergence, asymptotic and non-asymptotic limit theorems, regularized and robust estimators, and computational strategies for reliable OT-based distances, couplings, and maps in moderate and high dimensions. The field includes developments such as sliced and smoothed OT distances, entropic and kernel-based regularizations, semiparametric statistical theory, and generalizations to unbalanced and stochastic settings, forming a rigorous and scalable toolkit for statistical inference and learning with distributions (Chewi et al., 2024, Goldfeld et al., 2022).

1. Mathematical Foundations and Formal Problem

Statistical Optimal Transport is grounded in the classical Monge and Kantorovich formulations of OT. Given two Borel probability measures $\mu, \nu$ on $\mathbb{R}^d$ and a lower semi-continuous cost $c(x, y)$, the Kantorovich OT problem seeks

$W_c(\mu,\nu) = \inf_{\pi\in\Pi(\mu,\nu)} \int c(x,y)\, d\pi(x,y),$

where $\Pi(\mu, \nu)$ comprises couplings with marginals $\mu$, $\nu$ (Chewi et al., 2024). In statistical settings, empirical measures $\mu_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$ and $\nu_m= \frac{1}{m}\sum_{j=1}^m \delta_{Y_j}$ are constructed from i.i.d. samples, and the empirical OT cost $W_c(\mu_n, \nu_m)$ is studied.

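For $c(x, y) = \|x - y\|^p$ with $p \ge 1$, one obtains the $p$-Wasserstein distance $W_p(\mu,\nu) = W_c(\mu,\nu)^{1/p}$. The dual formulation involves maximizing

$\int \varphi\, d\mu + \int \psi\, d\nu \quad \text{subject to} \quad \varphi(x) + \psi(y) \le c(x,y) \ \text{for all } x, y.$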

Empirical OT is interpreted as a U-statistic of order two, with the associated concentration and stability theory underpinning the statistical analysis of plug-in or regularized estimators.
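
As a concrete instance of the empirical cost $W_c(\mu_n, \nu_m)$, the following minimal sketch computes the empirical $p$-Wasserstein distance in one dimension. It assumes $d = 1$, equal sample sizes, and the cost $|x - y|^p$ with $p \ge 1$, in which case pairing the sorted samples is an optimal coupling. The function name `wasserstein_1d` is illustrative and not taken from any cited work.

```python
def wasserstein_1d(xs, ys, p=2):
    """Empirical p-Wasserstein distance between two equal-size 1D samples.

    Assumes uniform weights 1/n on each sample and cost |x - y|**p, p >= 1.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    if p < 1:
        raise ValueError("p must be at least 1")
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    # In 1D with a convex cost, matching order statistics is an optimal coupling.
    cost = sum(abs(x - y) ** p for x, y in zip(xs_sorted, ys_sorted)) / len(xs)
    return cost ** (1.0 / p)


if __name__ == "__main__":
    print(wasserstein_1d([0.0, 1.0, 3.0], [5.0, 4.0, 2.0], p=1))
```

In higher dimensions no such closed form exists, and the empirical cost requires solving a linear program or an assignment problem.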

2. Statistical Rates, Limit Theorems, and Efficiency

A central focus of SOT is understanding the convergence and fluctuation properties of OT distances and related estimators. Classical results yield the following minimax rates for empirical OT distances:

For $\mathbb{E}[W_p(\mu_n, \mu)]$, when $\mu$ is compactly supported on $\mathbb{R}^d$, the classical bounds are, up to constants:
- $d < 2p$: $n^{-1/(2p)}$
- $d = 2p$: $n^{-1/(2p)} (\log n)^{1/p}$
- $d > 2p$: $n^{-1/d}$
For $c(x, y)$ 2, $c(x, y)$ 3 (Chewi et al., 2024).

Plug-in estimators suffer from the curse of dimensionality, motivating structural regularization. For estimators with low transport rank (FactoredOT), the empirical process bound improves to a parametric $n^{-1/2}$ rate (with constants depending on the transport rank and the dimension) uniformly over low-complexity sets, breaking the $n^{-1/d}$ curse (Forrow et al., 2018).
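
To make the curse of dimensionality concrete, the arithmetic below compares the sample size needed to reach a target error under a rate of the form $n^{-1/d}$ with the sample size needed under a parametric $n^{-1/2}$ rate. It is a back-of-the-envelope sketch that ignores constants and logarithmic factors. The helper name `log10_samples_needed` is hypothetical.

```python
import math


def log10_samples_needed(eps, inverse_exponent):
    """Return log10(n) for the n solving n ** (-1 / inverse_exponent) == eps.

    Constants and log factors in the rate are ignored (an assumption of this sketch).
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    # n = eps ** (-inverse_exponent), so log10(n) = inverse_exponent * log10(1 / eps).
    return inverse_exponent * math.log10(1.0 / eps)


if __name__ == "__main__":
    eps = 0.1
    parametric = log10_samples_needed(eps, 2)  # rate n^(-1/2)
    for d in (3, 5, 10, 20):
        cursed = log10_samples_needed(eps, d)  # rate n^(-1/d)
        print(f"d={d}: log10(n) = {cursed:.1f} at n^(-1/d), {parametric:.1f} at n^(-1/2)")
```

Under these assumptions the required sample size grows exponentially in $d$ for the plug-in rate, while it does not depend on $d$ for a parametric rate.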

Statistical optimal transport further encompasses central limit theorems (CLTs), bootstrap consistency, and semiparametric efficiency for regularized OT distances (sliced, smoothed, entropic). When the empirical functional is Hadamard-differentiable and the class of dual potentials is Donsker, efficient estimators are achievable and the bootstrap is valid (Goldfeld et al., 2022).

3. Regularization and Dimension-Free Approaches

Numerous regularization schemes have been developed to control the bias-variance tradeoff and improve computational tractability:

Entropic Regularization (Sinkhorn): Adds $\varepsilon\,\mathrm{KL}(\pi \,\|\, \mu\otimes\nu)$ to the primal cost. Solved efficiently via matrix scaling (Sinkhorn), and retains stability and CLT properties under regularity assumptions (Chewi et al., 2024, Goldfeld et al., 2022).
Sliced Optimal Transport (sliced OT): Projects measures onto 1D lines, computes 1D OT, and integrates over directions. Achieves $n^{-1/2}$ rates and mitigates the curse of dimension (Nguyen, 11 May 2025). Sliced-regularized OT (SROT) introduces a reference sliced OT plan as a prior for entropic OT, yielding lower bias and improved finite-sample performance over classical EOT (Nguyen, 27 Apr 2026).
Kernel-based Estimators: Recast OT as learning a kernel mean embedding of the transport plan, with MMD regularization conferring dimension-free sample complexity (Nath et al., 2020). A dimension-free rate of $O(n^{-1/2})$ is attained for plan and barycentric map estimation.
Transport-Rank Regularization: Promotes low-complexity couplings, robustly overcoming the high-dimensional curse by restricting the feasible set to factored or low-rank couplings (Forrow et al., 2018).
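
The matrix-scaling idea behind entropic regularization can be sketched in a few lines. The code below is a minimal, unstabilized Sinkhorn iteration for two discrete measures, written with the standard library only. The function name `sinkhorn`, the default `eps`, and the fixed iteration count are assumptions made for illustration. Practical implementations typically work in the log domain and use a convergence test.

```python
import math


def sinkhorn(a, b, cost, eps=1.0, n_iter=500):
    """Entropic OT coupling between discrete measures with weights a and b.

    cost[i][j] is the transport cost between atom i of a and atom j of b.
    Returns the coupling as a list of rows. No log-domain stabilization is used,
    so very small eps may underflow (a known limitation of this sketch).
    """
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-c_ij / eps).
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


if __name__ == "__main__":
    xs, ys = [0.0, 1.0, 2.0], [0.5, 1.5, 2.5]
    cost = [[(x - y) ** 2 for y in ys] for x in xs]
    plan = sinkhorn([1 / 3] * 3, [1 / 3] * 3, cost)
    for row in plan:
        print([round(p, 4) for p in row])
```

Each iteration touches every kernel entry once, and larger values of `eps` give smoother couplings at the price of more bias relative to unregularized OT.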

Streaming algorithms (Stream-SW) further allow online computation of sliced distances with polylogarithmic memory footprint and statistical rates matching the batch regime (Nguyen, 11 May 2025).

4. OT Maps: Statistical Estimation and Generalizations

Estimating the optimal transport map $T$ is essential for interpreting OT as a geometry-driven transformation, but strong regularity (e.g., Brenier's theorem for absolutely continuous $\mu$) is often unattainable in applications. SOT theory covers:

Plug-in and Barycentric Map Estimation: Empirical couplings $\hat\pi_n$ yield barycentric projections $\hat T(x) = \mathbb{E}_{\hat\pi_n}[Y \mid X = x]$, which are extended out of sample by 1-nearest-neighbor (1-NN) interpolation; minimax rates are available for such estimators (Balakrishnan et al., 23 Jun 2025).
Dual and Semi-dual Estimation: Empirical risk minimization over potentials (in RKHS or neural parameterizations) is used to recover maps as gradients of the estimated potential, $\hat T = \nabla\hat\varphi$.
Entropic and Kernel-based Maps: Entropic OT yields smooth maps via Sinkhorn potentials; kernel or MMD-based frameworks allow out-of-sample generalization and dimension-independent risk guarantees (Nietert et al., 10 Dec 2025, Nath et al., 2020).
Stochastic OT Maps: For cases lacking classical determinism, map estimation is reframed in terms of Markov kernels, evaluated with an error functional capturing optimality and feasibility gaps, and yielding robust rates under minimal moment or tail assumptions (Nietert et al., 10 Dec 2025).
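
The barycentric projection mentioned in the list above has a simple discrete form: each source atom is sent to the coupling-weighted average of the target atoms. The sketch below assumes a coupling given as a list of rows with strictly positive row mass, and the name `barycentric_projection` is illustrative.

```python
def barycentric_projection(coupling, targets):
    """Map each source atom to the conditional mean of the targets under a coupling.

    coupling[i][j] is the mass sent from source atom i to target atom j;
    targets[j] is a list of coordinates. Every row is assumed to carry positive mass.
    """
    dim = len(targets[0])
    projected = []
    for row in coupling:
        mass = sum(row)
        if mass <= 0.0:
            raise ValueError("each source atom must carry positive mass")
        # T(x_i) = sum_j pi_ij * y_j / sum_j pi_ij, computed coordinate-wise.
        projected.append(
            [sum(w * y[k] for w, y in zip(row, targets)) / mass for k in range(dim)]
        )
    return projected


if __name__ == "__main__":
    coupling = [[0.5, 0.0], [0.25, 0.25]]
    targets = [[0.0, 0.0], [2.0, 4.0]]
    print(barycentric_projection(coupling, targets))
```

The result is defined only at the source atoms, which is why an out-of-sample extension (nearest-neighbor, kernel, or potential-based) is needed to obtain a map on the whole space.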

Special cases—such as 1D transport, semi-discrete problems, or Gaussian measures—admit explicit rates and limiting distributions. Convolutional robust estimators accommodate adversarial contamination.

5. Extensions: Unbalanced, Sliced, and Stochastic OT

Statistical OT incorporates broad generalizations:

Unbalanced OT: Lifts the marginal constraint, allowing positive (possibly unequal mass) measures and divergences (e.g., KL, GHK) for penalizing marginal mismatch. Sliced-Unbalanced OT (SUOT/USOT) efficiently combines slicing with unbalanced relaxation, achieving robustness to outliers and mass discrepancies with scalable Frank–Wolfe solvers (Bonet et al., 2023).
Non-Euclidean and Manifold OT: Sliced methods are extended to domains such as spheres, hyperbolic spaces, and projective spaces for intrinsic sampling and geometry-processing tasks (Genest et al., 2024).
Stochastic OT (SOT, Editor’s term): Poses transport as a stochastic control problem, interpolating between deterministic (classical OT) and entropy-dominated (e.g., Schrödinger) regimes. Mean-field statistical mechanics elucidate the transition from entropy- to cost-dominated structures, with full analytical characterization of sub-optimal regimes (Mikami, 2023, Piombo et al., 4 Feb 2026).

6. Geometrization, Bayesian Inference, and Random Measure Extensions

Recent advances lift optimal transport theory to the setting of random probability measures, especially in Bayesian learning contexts:

Optimal Transport over Wasserstein Spaces: A Wasserstein-type metric on random probability measures inherits the Riemannian geometry of Wasserstein space, supporting geodesics, lifted (random) gradient flows, and statistical consistency for empirical and posterior measures (Passeggeri et al., 20 May 2026).
Posterior Consistency: Wasserstein versions of Schwartz's theorem guarantee posterior contraction in this metric; this framework unifies empirical, Bayesian, and gradient-flow analyses.
Applications in Modern ML: The lifted theory is used to analyze Transformer self-attention dynamics as Wasserstein gradient flows on the sphere under random sampling, quantifying the stability of learned representations under sampling and model uncertainty.

7. Computational Considerations and Empirical Performance

Statistical OT balances statistical rates with algorithmic scalability:

Sinkhorn and Entropic OT: $O(n^2)$ per iteration for $n$ support points; scalable GPU implementations.
Sliced/Projected OT: $O(L n \log n)$ for $L$ random projections; inherently parallelizable and dimension-agnostic.
Kernel and Streaming Methods: Achieve near-linear complexity in sample size and are amenable to hardware acceleration or data streams (Nguyen, 11 May 2025, Lin et al., 2023).
FactoredOT: Alternating minimization with barycentric updates; efficient for small transport rank (Forrow et al., 2018).
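
The cost profile of sliced methods is easiest to see in code: each random direction requires one projection pass over the data and one sort. The following is a minimal Monte Carlo sketch of a sliced $p$-Wasserstein distance, assuming equal sample sizes and uniform weights. The names `random_direction` and `sliced_wasserstein`, the default number of projections, and the fixed seed are assumptions made for illustration.

```python
import math
import random


def random_direction(d, rng):
    """Draw a direction uniformly on the unit sphere by normalizing a Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]


def sliced_wasserstein(X, Y, n_proj=200, p=2, seed=0):
    """Monte Carlo estimate of the sliced p-Wasserstein distance between two samples.

    X and Y are equal-length lists of points (lists of floats) in the same dimension.
    """
    if len(X) != len(Y):
        raise ValueError("this sketch assumes equal sample sizes")
    rng = random.Random(seed)
    d = len(X[0])
    total = 0.0
    for _ in range(n_proj):
        theta = random_direction(d, rng)
        # Project both samples onto the line spanned by theta, then sort.
        px = sorted(sum(t * c for t, c in zip(theta, x)) for x in X)
        py = sorted(sum(t * c for t, c in zip(theta, y)) for y in Y)
        # 1D OT between the projections: match order statistics.
        total += sum(abs(a - b) ** p for a, b in zip(px, py)) / len(X)
    return (total / n_proj) ** (1.0 / p)


if __name__ == "__main__":
    rng = random.Random(1)
    X = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(100)]
    Y = [[c + 1.0 for c in x] for x in X]
    print(sliced_wasserstein(X, Y))
```

The projections are independent of one another, so the outer loop can be distributed across workers or vectorized on an accelerator.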

Empirically, regularized SOT methods outperform classical OT in high dimensions, especially with low-structure or outlier-prone data, and adapt to real-world tasks such as domain adaptation, clustering, color transfer, geophysical barycenters, and 3D mesh sampling (Nguyen, 27 Apr 2026, Bonet et al., 2023, Genest et al., 2024).

In summary, Statistical Optimal Transport provides a mathematically rigorous, computationally scalable, and statistically efficient framework for extracting geometric information from finite samples of probability measures. Leveraging structural regularization, projection, kernel methods, and random measure theory, SOT methods have become core tools for distributional inference in high-dimensional statistics, machine learning, graphics, and beyond (Chewi et al., 2024, Goldfeld et al., 2022, Passeggeri et al., 20 May 2026, Nguyen, 27 Apr 2026, Forrow et al., 2018, Nietert et al., 10 Dec 2025).