
Regularized Optimal Transport

Updated 21 April 2026
  • Regularized Optimal Transport is a framework that augments classical OT with convex regularizers, ensuring unique, smooth, and numerically stable solutions.
  • It incorporates various schemes such as entropic, quadratic, Tsallis, and Bregman divergences to enable robust and scalable algorithms.
  • ROT finds practical applications in deep learning, computational geometry, statistical estimation, and domain adaptation, particularly in large-scale data analysis.

Regularized Optimal Transport (ROT) generalizes the classical optimal transport (OT) problem by augmenting the basic cost minimization over couplings with a convex regularization term. This modification yields uniqueness, smoothness, and numerical stability of solutions, accelerates computation, and, in many regimes, improves their statistical or interpretative properties. ROT encompasses a variety of regularization schemes, including entropic, quadratic, Tsallis, and general Bregman divergences. The field has seen rapid developments in theory, algorithms, and applications, including deep learning, large-scale data analysis, computational geometry, and geometric mechanics.

1. Mathematical Formulations and Key Regularizers

Let $\mu$ and $\nu$ be probability measures (or nonnegative vectors in the discrete case) and $c$ a measurable cost function. The classical Kantorovich OT problem is

$$\min_{\pi \in \Pi(\mu, \nu)} \int c(x, y)\, d\pi(x, y),$$

where $\Pi(\mu, \nu)$ is the set of couplings with marginals $\mu$ and $\nu$. The ROT problem adds a convex regularizer:
$$\min_{\pi \in \Pi(\mu, \nu)} \int c(x, y)\, d\pi(x, y) + \varepsilon R(\pi),$$
where $R$ is proper, convex, and typically lower semicontinuous.

Common regularization choices include:

  • Entropic: $R(\pi) = \int \log\big(\tfrac{d\pi}{d(\mu \otimes \nu)}\big)\, d\pi$, the relative entropy of the coupling with respect to the product measure.
  • Quadratic: $R(\pi) = \tfrac{1}{2}\|\pi\|_2^2$ (in the discrete case, the squared Frobenius norm of the coupling matrix).
  • Tsallis entropies, a one-parameter family interpolating between the entropic and quadratic penalties.
  • General Bregman divergences to a reference coupling.

The regularization parameter $\varepsilon$ modulates the interpolation between the degenerate, possibly nonunique classical solution and a fully regularized (typically unique, smooth, and dense) solution.
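For concreteness, in the discrete setting with cost matrix $C$, marginal weight vectors $a$ and $b$, and the entropic penalty, the problem and the well-known scaling form of its solution read as follows (a standard statement from the Sinkhorn literature, with notation chosen here for illustration):

$$\min_{P \in U(a,b)} \langle C, P \rangle + \varepsilon \sum_{i,j} P_{ij} \left(\log P_{ij} - 1\right), \qquad U(a,b) = \{ P \geq 0 : P\mathbf{1} = a, \; P^{\top}\mathbf{1} = b \},$$

whose unique optimizer factorizes as $P^{\star} = \operatorname{diag}(u)\, K \operatorname{diag}(v)$ with Gibbs kernel $K_{ij} = e^{-C_{ij}/\varepsilon}$ and positive scaling vectors $u$, $v$; this is the "Gibbs kernel" structure referenced in Section 2.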

2. Geometric, Dual, and Variational Perspectives

ROT admits several equivalent variational and geometric perspectives:

  • Dynamical/geometric formulations: In the Benamou–Brenier dynamic formulation, regularized OT corresponds to "geodesics with potential" in the Wasserstein space. Notably, the Schrödinger bridge problem is a dynamical variant of entropic ROT, with the entropy term interpreted as a diffusive regularizer on mass flux. The geometric Hopf–Cole transformation provides an explicit means to decouple the transport PDEs into heat flows in specific settings (Léger, 2017).
  • Duality and adversarial cost: Fenchel duality reveals that any convex regularization of the coupling can be interpreted as an adversarial (robust) OT problem in which the ground cost is perturbed, penalized via the convex conjugate of the regularizer (Paty et al., 2020). For example, in entropic regularization, the conjugate is the log-partition function; for the quadratic case, it is a scaled Euclidean norm. The entropic dual is written out after this list.
  • Matrix-nearness/Bregman projection: For separable regularizers, ROT can be framed as a matrix-nearness problem (with respect to the Bregman divergence associated to the regularizer $R$), seeking the nearest coupling to a "Gibbs kernel" under the constraints (Dessein et al., 2016).
  • Unbalanced and dual regularization: Extensions allow mass creation or destruction, either via softening the marginal constraints (adding convex discrepancy penalties) or directly regularizing dual potentials (DROT) to control sparsity and directionality of mass variation (Sonthalia et al., 2020).
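To make the duality viewpoint concrete, the Fenchel dual of the discrete entropic problem displayed in Section 1 is the unconstrained smooth program (a standard identity, stated here with the same notation):

$$\max_{u \in \mathbb{R}^{m},\, v \in \mathbb{R}^{n}} \; \langle u, a \rangle + \langle v, b \rangle - \varepsilon \sum_{i,j} e^{(u_i + v_j - C_{ij})/\varepsilon},$$

where the $\varepsilon$-term is the convex conjugate of the (scaled) entropy evaluated at the perturbed cost: exactly the exponential partition-function structure behind the adversarial interpretation above.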

3. Computational Algorithms

ROT leads to highly scalable and robust computational schemes, adapted to the choice of regularizer and the desired applications:

  • Sinkhorn-Knopp and Bregman projection methods: For entropic/Bregman regularization, Sinkhorn scaling (matrix balancing) provides rapidly convergent, parallel algorithms with global convergence guarantees; a minimal sketch appears after this list. Overrelaxed and Anderson-accelerated variants further increase convergence rates, especially in low-$\varepsilon$ regimes (1711.01851, Lindbäck et al., 2023).
  • Splitting and projection algorithms: Douglas–Rachford splitting and ADMM-based methods handle arbitrary convex regularizers (including group-sparsity, quadratic, and nuclear-norm penalties), with strong guarantees on support identification and linear convergence in the active set (support) regime (Lindbäck et al., 2023).
  • Newton and quasi-Newton methods: Quadratic regularization permits efficient second-order optimization, exploiting the structure of (possibly sparse) Laplacians, leading to rapid local convergence even for large-scale problems on graphs and in imaging (Essid et al., 2017, Lorenz et al., 2019).
  • Stochastic and online optimization: For large-scale statistical settings (e.g., learning Wasserstein estimators or barycenters), ROT enables direct stochastic gradient methods on the dual, with per-iteration cost sublinear in the problem dimensions (Ballu et al., 2020).
  • Domain decomposition: In geometric or imaging applications, domain decomposition allows parallel solution of local ROT subproblems, leveraging adaptive sparsity and multiscale schemes for scalability (Bonafini et al., 2020).
  • Deep learning architectures: ROT is the basis for parameterized pooling layers and implicit layers in neural networks, unifying mean, max, and attention as special cases of entropically-regularized OT (Xu et al., 2022).
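As an illustration of the Sinkhorn bullet above, the following is a minimal numpy sketch of vanilla Sinkhorn scaling for the discrete entropic problem; the function name and toy data are ours, and the sketch omits the log-domain stabilization and acceleration tricks needed in practice for small $\varepsilon$:

```python
import numpy as np

def sinkhorn(a, b, C, eps, n_iter=1000, tol=1e-9):
    """Vanilla Sinkhorn-Knopp scaling for entropic ROT (a sketch).

    a, b : marginal weight vectors (each summing to 1)
    C    : cost matrix of shape (len(a), len(b))
    eps  : entropic regularization strength
    """
    K = np.exp(-C / eps)                 # Gibbs kernel
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u_prev = u
        u = a / (K @ v)                  # enforce row marginals
        v = b / (K.T @ u)                # enforce column marginals
        if np.max(np.abs(u - u_prev)) < tol:
            break
    P = u[:, None] * K * v[None, :]      # plan = diag(u) K diag(v)
    return P, float(np.sum(P * C))       # plan and unregularized cost

# Toy usage: squared-distance cost between two small point clouds.
rng = np.random.default_rng(0)
x, y = rng.normal(size=(5, 2)), rng.normal(size=(7, 2))
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
a, b = np.full(5, 1 / 5), np.full(7, 1 / 7)
P, cost = sinkhorn(a, b, C, eps=0.1)
assert np.allclose(P.sum(0), b) and np.allclose(P.sum(1), a, atol=1e-6)
```

Each iteration costs only two matrix–vector products with the Gibbs kernel, which is what makes the scheme parallel and GPU-friendly.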

4. Theoretical Properties: Sparsity, Convergence, and Statistical Behavior

The choice and strength of the regularizer dictates key properties:

  • Sparsity and smoothness: Entropic regularization produces fully dense transport plans for any positive $\varepsilon$, whereas quadratic and Tsallis-type regularizers yield solutions whose support shrinks to that of unregularized OT as $\varepsilon \to 0$; precise scaling laws for the support radius in $\varepsilon$ are now established (González-Sanz et al., 1 Apr 2026). Quadratic and dual regularization can directly enforce sparsity (Sonthalia et al., 2020, Lorenz et al., 2019).
  • Convergence to OT and bias: For entropic regularization and quadratic cost in $d$ dimensions, the bias decays as $\varepsilon \log(1/\varepsilon)$, with a dimension-dependent constant (Eckstein et al., 2022); see the expansion sketched after this list. For quadratic and Tsallis penalties, the convergence is algebraic rather than logarithmic, with explicit exponents depending on the Tsallis parameter $q$ (Suguro et al., 2023).
  • Super-exponential error decay: For general Bregman-type divergences, the decrease of the residual error is super-exponential in $1/\varepsilon$, providing guidance for selecting $\varepsilon$ vis-à-vis computational stability (Morikuni et al., 2023).
  • Statistical asymptotics: ROT plans (with strictly convex regularizers such as entropy) depend smoothly (differentiably) on the input measures. Both the transport plan and the ROT value obey central limit theorems with explicitly computable covariances, and the naive plug-in bootstrap is consistent for inference (Klatt et al., 2018). This enables construction of finite-sample Gaussian confidence bands for statistics derived from regularized couplings.
  • Finite-support identification and convexity: Under mild assumptions, the support of the regularized plan exactly identifies that of the OT plan within finitely many iterations of many algorithms, and uniform strong convexity propagates to the dual optimization variables (González-Sanz et al., 1 Apr 2026, Lindbäck et al., 2023).
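To make these rate statements concrete, one frequently cited form of the entropic bias, for quadratic cost and suitably smooth densities on $\mathbb{R}^d$, is (constants and assumptions vary across the literature, so this should be read as orientation rather than the sharpest known statement):

$$\mathrm{OT}_{\varepsilon}(\mu, \nu) = \mathrm{OT}(\mu, \nu) + \frac{d}{2}\, \varepsilon \log(1/\varepsilon) + O(\varepsilon), \qquad \varepsilon \to 0,$$

whereas for quadratic or Tsallis regularization the leading correction is instead a power of $\varepsilon$.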

5. Applications and Interpretative Frameworks

The flexibility of ROT has catalyzed a wide array of applications and methodological innovations:

  • Kernel design and deep implicit layers: By learning regularization parameters and cost structures, ROT generalizes mean, max, and attention-pooling in neural architectures, and enables their continuous interpolation via learnable parameters. Hierarchical structures and end-to-end differentiability are realized via deep implicit layers based on Sinkhorn or ADMM optimization (Xu et al., 2022).
  • Domain adaptation and generative modeling: Structured regularizers (e.g., group-lasso or nuclear norm) allow for interpretable feature/axis selection and low-rank mappings in domain adaptation, single-cell RNA-seq and spatial transcriptomics, and generative adversarial networks (Sebbouh et al., 2023, Lindbäck et al., 2023); a barycentric-mapping sketch in this spirit follows this list.
  • Imitation learning and inverse problems: ROT-augmented objective functions provide trajectory-matching rewards in imitation learning, accelerating convergence and stabilizing policy learning with differentiable, data-driven rewards (Haldar et al., 2022).
  • Geometric and Riemannian contexts: On Riemannian manifolds, dynamical analogues of Schrödinger bridges and Yasue-type mechanics extend ROT theory to curved spaces, with analogues of entropy-potentials and Hopf–Cole transforms for algorithmic and theoretical analyses (Léger, 2017).
  • Statistical estimation and high-dimensional shrinkage: For Gaussian and $q$-normal distributions, closed-form solutions for entropy-regularized OT give insight into geometric and shrinkage properties of statistical estimators, including the construction of regularized barycenters and covariance estimators (Tong et al., 2020).
  • Large-scale computation: GPU-accelerated and parallelizable algorithms enable scaling ROT solvers to very large problem instances (Lindbäck et al., 2023), and domain decomposition plus multiscale strategies further facilitate application to gigapixel image transport and high-dimensional data (Bonafini et al., 2020).
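As a small, self-contained illustration of the domain-adaptation use case, the sketch below computes an entropic plan and then maps each source sample to its barycentric projection onto the target samples. This is the generic barycentric-mapping recipe commonly used with OT plans, not the specific method of any cited paper; the function names and toy data are ours:

```python
import numpy as np

def sinkhorn_plan(a, b, C, eps, n_iter=500):
    # Condensed Sinkhorn scaling (see the fuller sketch in Section 3).
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def barycentric_map(P, y_tgt):
    """Send each source point to the P-weighted mean of the target points."""
    return (P @ y_tgt) / P.sum(axis=1, keepdims=True)

# Toy adaptation: transport a source cloud onto a shifted target cloud.
rng = np.random.default_rng(1)
x_src = rng.normal(loc=0.0, size=(50, 2))       # source samples
y_tgt = rng.normal(loc=3.0, size=(60, 2))       # shifted target samples
C = ((x_src[:, None] - y_tgt[None]) ** 2).sum(-1)
a, b = np.full(50, 1 / 50), np.full(60, 1 / 60)
P = sinkhorn_plan(a, b, C, eps=1.0)
x_adapted = barycentric_map(P, y_tgt)           # now lies near the target
print(x_src.mean(axis=0), "->", x_adapted.mean(axis=0))
```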

6. Extensions, Limitations, and Ongoing Research Directions

Active research continues on multiple axes:

  • Generalization of regularizers: Orlicz spaces (via Young functions), Tsallis and other non-Shannon entropies, capacity-constrained and group-structured regularizations are being investigated both for theoretical implications (e.g., metricity, bias–variance tradeoff) and for novel application demands (Lorenz et al., 2019, Dessein et al., 2016).
  • Unbalanced OT: DROT provides a regularized framework for mass creation/destruction with explicit interpretability and sparsity, bridging traditional unbalanced OT and classical ROT (Sonthalia et al., 2020).
  • Accelerated algorithms: Overrelaxed Bregman projections, Anderson acceleration, and finite identification methods are being studied for their capacity to achieve O(1/ε) or better iteration rates, especially in the low-regularization regime (1711.01851); a minimal overrelaxation sketch appears after this list.
  • Statistical bias and rate selection: The algebraic and logarithmic rates of convergence to OT, now sharply characterized for various regularizers, inform principled tuning of $\varepsilon$ to balance computational tractability and statistical accuracy in high dimensions (Eckstein et al., 2022, Suguro et al., 2023).
  • Robust optimization and adversarial perspectives: Fenchel duality links regularization to ground-cost-robustness, underpinning developments in robust distance measure design (Paty et al., 2020).
  • Open problems: Outstanding challenges include fully characterizing metric properties under general regularization, designing adaptive regularization parameter selection schemes for large data and learning, and extending finite-sample concentration results to functional data, infinite-dimensional settings, and multi-marginal contexts (Klatt et al., 2018, Morikuni et al., 2023, Eckstein et al., 2022).
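As one concrete accelerated scheme, the overrelaxed Sinkhorn iteration of (1711.01851) replaces each scaling update with a multiplicative overrelaxation. The sketch below (our naming) shows only the core update; the safeguarding rules the paper uses to guarantee convergence for aggressive $\omega$ are omitted:

```python
import numpy as np

def overrelaxed_sinkhorn(a, b, C, eps, omega=1.5, n_iter=1000):
    """Sinkhorn scaling with multiplicative overrelaxation.

    omega = 1 recovers plain Sinkhorn; omega in (1, 2) can markedly
    accelerate convergence for small eps (convergence safeguards from
    the paper are omitted in this sketch).
    """
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = u ** (1 - omega) * (a / (K @ v)) ** omega    # overrelaxed row step
        v = v ** (1 - omega) * (b / (K.T @ u)) ** omega  # overrelaxed column step
    return u[:, None] * K * v[None, :]
```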

Regularized Optimal Transport thus represents a mature and rapidly evolving field at the interface of convex analysis, statistical estimation, geometric mechanics, algorithmic optimization, and high-dimensional data science, with a rich foundation and a variety of active research frontiers.
