Regularized Optimal Transport
- Regularized Optimal Transport is a framework that integrates smooth convex penalties into classical OT to ensure uniqueness, stability, and improved computational properties.
- It enables a tunable interpolation between Earth Mover's Distance and maximum entropy solutions, balancing data fidelity with regularization for robust transport plans.
- Efficient algorithms like alternate scaling and Newton–Raphson methods leverage Bregman projections, making ROT applicable in machine learning, pattern recognition, and large-scale data analysis.
Regularized Optimal Transport (ROT) constitutes a broad class of modifications to the classical optimal transport problem that introduces a smooth convex penalty, typically via a function of the transport plan, to confer strict convexity, improved statistical and computational properties, and algorithmic tractability. This framework subsumes entropic regularization, quadratic (Euclidean) regularization, f-divergence penalties, and more general matrix or function space Bregman divergences. Theoretical advances demonstrate that with suitable regularization, one achieves well-posedness of the transport plan (often strictly positive or belonging to some Orlicz space), efficient primal–dual optimization algorithms, interpolations between Earth Mover’s Distance (EMD) and minimal-information solutions, and highly favorable statistical rates. The concept extends naturally to settings with additional constraints, as in the Rot Mover’s Distance (RMD) framework and its algorithmic variants.
1. Mathematical Foundations and Matrix Nearness Perspective
At its core, the regularized optimal transport problem augments the classical finite-dimensional OT objective with a convex, typically smooth penalty $\phi$:

$$ d_{\gamma,\lambda}(p, q) = \min_{\pi \in \Pi(p, q)} \; \langle \pi, \gamma \rangle + \frac{1}{\lambda}\,\phi(\pi), $$

where $\gamma$ is the ground cost matrix, $\lambda > 0$ the regularization trade-off, and $\Pi(p, q) = \{\pi \in \mathbb{R}_{+}^{m \times n} : \pi \mathbf{1} = p,\ \pi^{\top}\mathbf{1} = q\}$ denotes the transport polytope.
A central result is that this problem is equivalent to finding the Bregman projection of a reference matrix $\xi = \nabla\phi^{*}(-\lambda\gamma)$ onto $\Pi(p, q)$ with respect to the Bregman divergence $B_{\phi}$ generated by $\phi$:

$$ \pi^{*} = \operatorname*{arg\,min}_{\pi \in \Pi(p, q)} \; B_{\phi}(\pi \,\|\, \xi). $$

For example, if $\phi$ is the negative Boltzmann–Shannon entropy, then $B_{\phi}$ is the Kullback–Leibler divergence and the reference matrix is the exponentiated negative cost matrix, $\xi = \exp(-\lambda\gamma)$.
This equivalence reveals deep connections to matrix nearness problems, exponential family/statistical estimation of contingency tables, and information geometry, which are exploited for both analysis and algorithm development (Dessein et al., 2016).
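To make the equivalence concrete, here is a minimal NumPy sketch of the entropic case (the cost matrix, sizes, and values are illustrative, not from the paper): it checks numerically that the ROT objective and the scaled KL projection objective differ only by a constant independent of the plan, so they share the same minimizer over the transport polytope.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 5.0
C = rng.random((4, 4))           # hypothetical ground cost matrix
xi = np.exp(-lam * C)            # entropic reference matrix xi = exp(-lam * C)

def phi(P):
    """Negative Boltzmann-Shannon entropy: sum of p_ij * (log p_ij - 1)."""
    return np.sum(P * (np.log(P) - 1.0))

def kl(P, Q):
    """Generalized Kullback-Leibler divergence between positive matrices."""
    return np.sum(P * np.log(P / Q) - P + Q)

# The ROT objective and the Bregman (KL) projection objective differ only by
# a constant that does not depend on the plan P, so they share minimizers.
for _ in range(3):
    P = rng.random((4, 4)) + 0.1
    rot_obj = np.sum(P * C) + phi(P) / lam
    proj_obj = kl(P, xi) / lam
    print(rot_obj - proj_obj)    # constant -xi.sum() / lam for every P
```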
2. Interpolation of Transport Distances: The Rot Mover’s Distance
The regularized cost defines the "rot mover's distance" (RMD), which serves as an interpolation between the fully regularized transport plan minimizing the Bregman information and the unregularized earth mover's distance (EMD) underlying classical OT. Specifically,

$$ d_{\gamma,\lambda}(p, q) = \langle \pi^{*}_{\lambda}, \gamma \rangle, \qquad \pi^{*}_{\lambda} = \operatorname*{arg\,min}_{\pi \in \Pi(p, q)} \; \langle \pi, \gamma \rangle + \frac{1}{\lambda}\,\phi(\pi). $$

As $\lambda \to +\infty$, the RMD converges to the EMD; as $\lambda \to 0^{+}$, the solution is the unique plan on $\Pi(p, q)$ with minimal Bregman information (maximal entropy in the entropic case) (Dessein et al., 2016). The parameter $\lambda$ controls the tradeoff between data fidelity (measured by transportation cost) and regularization (measured by $\phi$).
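This interpolation can be observed numerically. The following sketch assumes the POT library (`pip install pot`), whose entropic solver `ot.sinkhorn` corresponds to the entropic instance of ROT; its `reg` parameter plays the role of $1/\lambda$ here, so the transport cost of the regularized plan approaches the EMD value as `reg` shrinks.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(1)
n = 20
a = np.full(n, 1.0 / n)                  # uniform source marginal
b = np.full(n, 1.0 / n)                  # uniform target marginal
x = rng.random((n, 1))
y = rng.random((n, 1))
M = ot.dist(x, y)                        # squared Euclidean ground cost

emd_cost = ot.emd2(a, b, M)              # unregularized EMD value
for eps in [1.0, 0.1, 0.05, 0.01]:       # eps plays the role of 1 / lambda
    P = ot.sinkhorn(a, b, M, reg=eps)    # entropic ROT plan
    gap = np.sum(P * M) - emd_cost       # transport-cost gap to the EMD
    print(eps, gap)                      # gap shrinks as eps -> 0
```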
3. Algorithmic Schemes: Bregman Projections and Newton–Raphson Steps
Computing ROT plans efficiently hinges on structured projection algorithms:
- Alternate Scaling Algorithm (ASA): For regularizers defined on the positive orthant, an alternating projection onto the marginal constraints (row and column sums) can be implemented, reducing to the Sinkhorn–Knopp algorithm in the entropic case (see the sketch after this list). Each projection seeks Lagrange multipliers through analytic or iterative (Newton–Raphson) root-finding to enforce the marginal constraints.
- Non-negative Alternate Scaling Algorithm (NASA): If the regularizer’s domain does not guarantee nonnegativity, Dykstra’s algorithm is employed, combining projections onto margins and the non-negative orthant, with correction steps ensuring convergence to the Bregman projection.
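As a concrete instance of ASA, the following NumPy sketch implements Sinkhorn–Knopp scaling for the entropic regularizer (problem sizes and values are illustrative); each half-iteration is the closed-form KL projection onto one set of marginal constraints.

```python
import numpy as np

def sinkhorn_knopp(p, q, C, lam, n_iter=500):
    """Entropic instance of the alternate scaling algorithm (ASA):
    alternating KL/Bregman projections onto the row- and column-sum
    constraints, realized as diagonal scalings of xi = exp(-lam * C)."""
    xi = np.exp(-lam * C)
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (xi.T @ u)    # project onto the column-sum constraint
        u = p / (xi @ v)      # project onto the row-sum constraint
    return u[:, None] * xi * v[None, :]

# Hypothetical toy problem: uniform marginals, random cost.
rng = np.random.default_rng(2)
C = rng.random((5, 5))
p = np.full(5, 0.2)
q = np.full(5, 0.2)
P = sinkhorn_knopp(p, q, C, lam=10.0)
print(P.sum(axis=1), P.sum(axis=0))  # both approximately equal p and q
```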
Both algorithms operate within the dual parameterization $\pi = \nabla\phi^{*}(\theta)$, with updates governed by the Fenchel conjugate $\phi^{*}$. The algorithms rely on closed-form or root-finding solutions to the marginal constraints, which may be solved efficiently when $\phi$ is separable (Dessein et al., 2016). Newton steps solve scalar equations of the form

$$ f(\mu) = \sum_{j} \nabla\phi^{*}(\theta_{ij} + \mu) - p_{i} = 0, $$

with iterated updates $\mu \leftarrow \mu - f(\mu)/f'(\mu)$.
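For regularizers without closed-form projections, the scalar Newton iteration above can be sketched as follows. The Fermi–Dirac case, where $\nabla\phi^{*}$ is the logistic sigmoid, is used purely as an illustrative assumption; the function and variable names are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def newton_row_update(theta_row, p_i, n_steps=20):
    """Find mu such that sum_j grad_phi_star(theta_ij + mu) = p_i.
    Here grad_phi_star is the sigmoid (Fermi-Dirac regularizer), for
    which the row multiplier has no closed form."""
    mu = 0.0
    for _ in range(n_steps):
        s = sigmoid(theta_row + mu)
        f = s.sum() - p_i                # constraint residual f(mu)
        fprime = (s * (1.0 - s)).sum()   # f'(mu) > 0: f strictly increasing
        mu -= f / fprime                 # Newton-Raphson update
    return mu

rng = np.random.default_rng(3)
theta_row = rng.normal(size=6)
mu = newton_row_update(theta_row, p_i=2.0)
print(sigmoid(theta_row + mu).sum())     # approximately 2.0
```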
A sparse extension reduces computational and storage requirements by restricting iteration to the "active" indices where the ground cost is finite, enabling ROT methods to scale to high-dimensional distributions without prohibitive memory overhead.
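A minimal sketch of this sparse strategy for the entropic regularizer, using SciPy's sparse matrices; the banded cost pattern is a hypothetical example chosen so that the restricted problem remains feasible.

```python
import numpy as np
from scipy import sparse

def sparse_sinkhorn(p, q, rows, cols, costs, shape, lam, n_iter=500):
    """Sinkhorn scaling restricted to the 'active' index set where the
    ground cost is finite; entries with infinite cost would satisfy
    xi = exp(-lam * inf) = 0 and are simply never stored."""
    vals = np.exp(-lam * costs)
    xi = sparse.csr_matrix((vals, (rows, cols)), shape=shape)
    u = np.ones(shape[0])
    v = np.ones(shape[1])
    for _ in range(n_iter):
        v = q / (xi.T @ u)
        u = p / (xi @ v)
    return sparse.diags(u) @ xi @ sparse.diags(v)  # sparse transport plan

# Hypothetical banded cost: only |i - j| <= 1 is finite (feasible here,
# since uniform marginals can be routed along the band).
n = 6
rows, cols = zip(*[(i, j) for i in range(n) for j in range(n) if abs(i - j) <= 1])
rows, cols = np.array(rows), np.array(cols)
costs = np.abs(rows - cols).astype(float)
p = q = np.full(n, 1.0 / n)
P = sparse_sinkhorn(p, q, rows, cols, costs, (n, n), lam=5.0)
print(P.toarray().sum(axis=1))  # approximately p
```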
4. Diversity of Regularizers and Associated Divergences
The unified ROT formulation encompasses a broad class of strictly convex regularizers and their associated Bregman divergences, yielding a rich variety of transport plan geometries. The principal cases instantiated in the framework include:
| Regularizer | Bregman Divergence | Plan Properties / Algorithmic Implications |
|---|---|---|
| Boltzmann–Shannon entropy | Kullback–Leibler | Dense plans; Sinkhorn–Knopp scaling |
| Burg entropy | Itakura–Saito | Nonnegative data; "spiky" plan profiles |
| Fermi–Dirac entropy | Logistic loss | Bounded plans; nonlinear scaling |
| $\ell_p$ (quasi-)norms | Associated $\ell_p$ Bregman divergences | Variable sparsity; lozenge/diamond shapes |
| Quadratic (Euclidean) norm | Mahalanobis / Euclidean | Sparse, interpretable plans |
| Hellinger and other $f$-divergence potentials | Hellinger and related divergences | Controlled spread; robust marginals |
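The contrast between dense entropic plans and sparse quadratic plans in this table can be checked empirically. The sketch below assumes POT's `ot.smooth` module (quadratic regularization via `smooth_ot_dual` with `reg_type='l2'`), which implements smooth OT in this sense rather than the paper's own solver; all sizes and values are illustrative.

```python
import numpy as np
import ot
import ot.smooth  # smooth/quadratic OT solvers (assuming a recent POT)

rng = np.random.default_rng(4)
n = 10
a = b = np.full(n, 1.0 / n)
M = ot.dist(rng.random((n, 1)), rng.random((n, 1)))

P_ent = ot.sinkhorn(a, b, M, reg=0.05)                             # entropic: dense
P_l2 = ot.smooth.smooth_ot_dual(a, b, M, reg=0.05, reg_type='l2')  # quadratic: sparse
print("entropic nonzeros:", np.sum(P_ent > 1e-8))
print("quadratic nonzeros:", np.sum(P_l2 > 1e-8))
```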
The Bregman geometry—especially the dually flat structure and the generalized Pythagorean theorem—illuminates how mass is spread or concentrated in the optimal plan, and provides a bridge to classical and modern information geometry.
5. Applications and Empirical Results
The practical utility of ROT is illustrated in both synthetic experiments and real-world pattern recognition tasks:
- Synthetic evaluations: By solving ROT with varying regularizers and penalty levels over cost matrices (e.g., squared Euclidean cost), the paper demonstrates a smooth transition from EMD-like sparse structures at low regularization to high-entropy (spread-out) plans at high regularization. The structure of the plan adapts to the divergence, yielding, e.g., elliptical or diamond-shaped mass distributions.
- Audio scene recognition: Using the TUT Acoustic Scenes 2016 database, class-conditional GMMs are compared using RMD kernels constructed from regularized plans with different values of $\lambda$. An SVM trained with an exponential kernel based on RMD distances achieves competitive or superior classification accuracy compared to a GMM baseline. The Hellinger-divergence-related regularizer provides the best empirical performance, illustrating that the choice of regularizer can dramatically impact recognition outcomes (Dessein et al., 2016).
6. Statistical and Modeling Implications
The introduction of smooth convex regularization yields unique, strictly positive, and information-theoretically meaningful transport plans. This determinacy is crucial for sensitivity analysis, statistical inference, and the application of plug-in and bootstrapped estimators; it stands in marked contrast to the sparsity and instability of classic OT plans. Regularized versions are strictly differentiable with respect to marginals, and their kernelization (e.g., Sinkhorn divergences) inherits properties desirable for kernel methods and learning theory.
ROT thereby enables trade-offs between computational efficiency, statistical stability, and modeling fidelity through the selection of $\phi$ and $\lambda$, supporting both dense/flexible and sparse/interpretable solution regimes. This flexibility is essential for applications across machine learning, information geometry, and large-scale data analysis.
7. Broader Impact and Outlook
The unifying Bregman projection framework for ROT encapsulates and systematically generalizes all major trends in modern discrete regularized optimal transport, including but not limited to entropy-based, Mahalanobis, and power-divergence regularizations, and offers a principled path to integrating new divergences and algorithmic enhancements. The availability of efficient projection algorithms and sparse extensions means that ROT is scalable to high dimensions and applicable in large data and kernel settings. The framework not only provides a rigorous theoretical foundation but also establishes direct pipelines to robust, competitive performance in practical tasks, highlighting the critical importance of the regularizer and projection algorithm choices in the real-world success of regularized optimal transport (Dessein et al., 2016).