Divergence Regularized OT
- Divergence regularized optimal transport is a framework that augments classical OT with convex divergence penalties to improve computational tractability and induce sparsity.
- It leverages diverse divergence measures such as KL, Tsallis, and Rényi to balance regularization strength with data fidelity, ensuring strong statistical guarantees.
- The approach supports scalable algorithms with provable convergence rates, enabling robust applications in machine learning, statistics, signal processing, and beyond.
Divergence regularized optimal transport (DROT) refers to a class of optimal transport (OT) problems in which the classical linear programming formulation is modified by incorporating a divergence-based regularization term. This approach generalizes entropic OT, allowing the use of a wide variety of convex divergences—most commonly those generated by strictly convex or Legendre-type functions—to enhance computational tractability, induce smoothing or sparsity in transport plans, and control the statistical behavior of empirical estimators. The unification and systematic exploration of DROT has led to powerful algorithmic frameworks, sharp statistical guarantees, new geometric and kernelized metrics, and an expanded application range in machine learning, statistics, signal processing, and operations research.
1. Mathematical Framework and Formulation
Let $\mu$ and $\nu$ be probability measures on measurable spaces $\mathcal{X}$ and $\mathcal{Y}$, and let $c:\mathcal{X}\times\mathcal{Y}\to[0,\infty)$ be a cost function. The standard OT problem seeks a coupling $\pi$ minimizing $\int c\,d\pi$ among all $\pi\in\Pi(\mu,\nu)$ with marginals $\mu$ and $\nu$. In DROT, one augments this objective by a divergence penalty:

$$\mathrm{OT}_{\varepsilon}(\mu,\nu) \;=\; \inf_{\pi\in\Pi(\mu,\nu)} \int_{\mathcal{X}\times\mathcal{Y}} c\,d\pi \;+\; \varepsilon\, D_{f}(\pi \,\|\, \mu\otimes\nu).$$

Here, $D_f$ is a general $f$-divergence regularizer induced by a convex function $f$, evaluated against a reference measure (taken here as the product $\mu\otimes\nu$). The parameter $\varepsilon>0$ balances fidelity to the cost and the effect of the divergence. Replacing the Kullback–Leibler (KL) divergence with other $f$-divergences or even non-$f$-divergence functionals (such as Rényi divergences of order $\alpha\in(0,1)$) changes the analytical and computational behavior of the regularized OT formulation (Dessein et al., 2016, Marino et al., 2020, Terjék et al., 2021, Bresch et al., 29 Apr 2024).
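For concreteness, two common choices of the generator $f$ are sketched below (normalizations vary slightly across the cited works; the Tsallis family recovers KL in the limit $q\to 1$):

```latex
D_f(\pi \,\|\, \mu\otimes\nu) \;=\; \int_{\mathcal{X}\times\mathcal{Y}}
  f\!\Big(\tfrac{d\pi}{d(\mu\otimes\nu)}\Big)\, d(\mu\otimes\nu),
\qquad
f_{\mathrm{KL}}(t) = t\log t - t + 1,
\qquad
f_{q}(t) = \frac{t^{q}-t}{q-1}\ \ (\text{Tsallis},\ q\neq 1).
```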
Key features:
- The KL divergence recovers the classical entropic OT (Sinkhorn) setting.
- Tsallis and Rényi divergences interpolate between entropy-regularized and unregularized OT (Suguro et al., 2023, Tong et al., 2020, Bresch et al., 29 Apr 2024).
- $f$-divergence regularizers may be chosen to induce desired properties, such as sparsity, robustness, or statistical efficiency (Terjék et al., 2021, González-Sanz et al., 7 May 2025).
2. Key Theoretical Properties
Strong Convexity and Uniqueness
If $f$ is strictly convex and suitably smooth (for instance, of Legendre type on the interior of its domain), the DROT problem is strictly convex and admits a unique solution $\pi^{*}$. This provides stability of the optimal coupling under perturbations of the marginals and enables efficient gradient-based optimization (Dessein et al., 2016, Marino et al., 2020, Terjék et al., 2021, Bayraktar et al., 2022).
Interpolation and Limit Behavior
By varying the divergence parameter (e.g., the Rényi order $\alpha$, or the regularization strength $\varepsilon$), DROT interpolates between classical unregularized OT (as $\varepsilon\to 0$, or $\alpha\to 0$ for Rényi) and an "information projection" (as $\varepsilon\to\infty$, or $\alpha\to 1$ for Rényi, which recovers KL regularization) (Bresch et al., 29 Apr 2024, Suguro et al., 2023). For certain classes of divergences (including KL and Rényi), the unique minimizer of the DROT functional converges weakly to the unregularized OT minimizer as the regularization vanishes, with explicit quantitative rates (Eckstein et al., 2022, Suguro et al., 2023, Morikuni et al., 2023).
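Schematically, for the KL-regularized case with reference measure $\mu\otimes\nu$ as above (a sketch of the two limiting regimes):

```latex
\lim_{\varepsilon \to 0} \mathrm{OT}_{\varepsilon}(\mu,\nu) = \mathrm{OT}(\mu,\nu),
\qquad
\pi_{\varepsilon} \;\xrightarrow{\;\varepsilon \to \infty\;}\;
\operatorname*{arg\,min}_{\pi \in \Pi(\mu,\nu)} D_{\mathrm{KL}}(\pi \,\|\, \mu \otimes \nu)
\;=\; \mu \otimes \nu .
```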
Sample Complexity and Statistical Guarantees
A foundational result is that for a broad family of divergences (including non-smooth $f$-divergences such as Tsallis), the empirical DROT cost converges to the population cost at the parametric rate $n^{-1/2}$, independent of the ambient dimension, provided the cost and divergence are "regular" (e.g., bounded cost, differentiable dual). This extends the known sample-complexity improvements of entropy-regularized OT to much more general settings and refutes the previously held belief that only infinitely smooth regularizers avoid the curse of dimensionality (González-Sanz et al., 7 May 2025, Yang et al., 2 Oct 2025, Bayraktar et al., 2022).
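In schematic form, writing $\hat\mu_n, \hat\nu_n$ for empirical measures on $n$ i.i.d. samples, and with a constant depending on the cost, the divergence, and $\varepsilon$ but not on the dimension:

```latex
\mathbb{E}\,\bigl| \mathrm{OT}_{\varepsilon}(\hat\mu_n, \hat\nu_n) - \mathrm{OT}_{\varepsilon}(\mu, \nu) \bigr|
\;\le\; \frac{C(c, f, \varepsilon)}{\sqrt{n}} .
```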
Central Limit Theorems
General central limit theorems have been established for the optimal cost, the empirical coupling, and dual potentials in DROT problems with sufficiently smooth duals. This allows uncertainty quantification and supports inferential procedures in statistical applications (Klatt et al., 2018, Yang et al., 2 Oct 2025, González-Sanz et al., 7 May 2025).
3. Algorithms and Computation
Generalized Sinkhorn and Mirror Descent
DROT problems can often be solved by iterative scaling algorithms generalizing the Sinkhorn–Knopp iterations. In the case of a Legendre-type regularizer, the dual problem can be formulated (under strong duality) as a maximization over potentials constrained via the Fenchel conjugate of $f$, leading to alternating scaling (and possibly correction) steps that enforce the marginal constraints (Dessein et al., 2016, Marino et al., 2020, Terjék et al., 2021).
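As a minimal sketch of the KL special case (classical Sinkhorn scaling on discrete marginals; the array names, tolerance, and toy data below are illustrative, not taken from any of the cited implementations):

```python
import numpy as np

def sinkhorn_plan(a, b, C, eps, n_iters=1000, tol=1e-9):
    """Entropic (KL-regularized) OT between histograms a and b with cost matrix C."""
    K = np.exp(-C / eps)                  # Gibbs kernel
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iters):
        u_prev = u
        u = a / (K @ v)                   # scaling step enforcing row marginals
        v = b / (K.T @ u)                 # scaling step enforcing column marginals
        if np.max(np.abs(u - u_prev)) < tol:
            break
    return u[:, None] * K * v[None, :]    # transport plan pi = diag(u) K diag(v)

# Toy usage: two random point clouds with uniform weights.
rng = np.random.default_rng(0)
x, y = rng.normal(size=(50, 2)), rng.normal(size=(60, 2)) + 1.0
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # squared Euclidean cost
a, b = np.full(50, 1 / 50), np.full(60, 1 / 60)
pi = sinkhorn_plan(a, b, C, eps=0.1)
print(pi.sum(axis=1)[:3], pi.sum(axis=0)[:3])        # approx. the marginals a, b
```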
For Rényi-regularized OT, which is not an -divergence nor a Bregman divergence (Bresch et al., 29 Apr 2024), a nested mirror descent algorithm is used. The mirror map is derived from (neg-)entropy or similar convex functions, and each mirror step corresponds to a Bregman projection—often efficiently implementable via scaling-like updates (e.g., Sinkhorn projections).
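Schematically, with $F(\pi)$ denoting the regularized objective (cost plus Rényi penalty), step sizes $\eta_k$, and negative entropy as the mirror map (so the Bregman term is a KL divergence), each outer iteration is a KL-Bregman projection onto the coupling polytope, which can itself be carried out by Sinkhorn-type updates. This is a sketch of the general pattern rather than the exact scheme of Bresch et al.:

```latex
\pi^{k+1} \;=\; \operatorname*{arg\,min}_{\pi \in \Pi(\mu,\nu)}
\Big\{ \big\langle \nabla F(\pi^{k}), \pi \big\rangle
  \;+\; \tfrac{1}{\eta_k}\, D_{\mathrm{KL}}\!\big(\pi \,\|\, \pi^{k}\big) \Big\}.
```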
Table: Algorithmic Approaches

| Regularizer | Algorithmic Scheme | Key Properties |
|---|---|---|
| KL (entropy) | Sinkhorn, dual scaling | Fast; plan always has full support |
| Legendre-type $f$-divergence | Generalized Sinkhorn | Can induce sparsity; dual potentials exist |
| Tsallis | Nonlinear scaling / mirror descent | Sparser plans; possibly non-differentiable dual |
| Rényi ($\alpha\in(0,1)$) | Mirror descent + Sinkhorn | Interpolates EOT/OT; numerically stable |
Sparsity and Memory Efficiency
By choosing less-smooth regularizers (e.g., Tsallis-type divergences), DROT can produce transport plans that are sparse—that is, supported on a small subset of the product space—a property desirable in many applications and not achievable with entropic regularization, which enforces full support (Dessein et al., 2016, Terjék et al., 2021, González-Sanz et al., 7 May 2025).
Scalability and Parallelization
Efficient algorithms (e.g., domain decomposition (Bonafini et al., 2020), distributed ADMM (Mokhtari et al., 7 Oct 2024)) have been developed that allow DROT to be solved for massive graphs or images, leveraging the strict convexity (when present) and separable structure of the divergence term. Coarse-to-fine schemes and adaptive sparsity are often employed to make large-scale problems tractable.
4. Generalizations and Metric Properties
Beyond $f$-divergences
Not all useful divergences are $f$-divergences or Bregman distances. Rényi divergences—a key focus in recent research—provide a family of regularizers in which $\alpha\to 1$ recovers KL regularization and $\alpha\to 0$ recovers unregularized OT, without the numerical instabilities that arise in the KL setting as the regularization parameter becomes small (Bresch et al., 29 Apr 2024).
Metricity and Sinkhorn-Type Divergences
One focus is whether the divergence-regularized cost defines a "pseudo-distance" or a genuine metric (e.g., positivity, symmetry, triangle inequality). Debiased versions, such as Sinkhorn divergences and their unbalanced analogues, are designed to be zero if and only if the measures coincide (Séjourné et al., 2019, Dessein et al., 2016).
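The standard debiasing construction in the entropic case (analogous constructions exist for other regularizers and for unbalanced transport) is:

```latex
S_{\varepsilon}(\mu, \nu) \;=\; \mathrm{OT}_{\varepsilon}(\mu, \nu)
 \;-\; \tfrac{1}{2}\,\mathrm{OT}_{\varepsilon}(\mu, \mu)
 \;-\; \tfrac{1}{2}\,\mathrm{OT}_{\varepsilon}(\nu, \nu),
\qquad
S_{\varepsilon}(\mu,\nu) \ge 0 \ \text{ with equality iff } \mu = \nu .
```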
Homogeneity
Some regularized OT models respect homogeneity in the input masses (e.g., the HUROT model (Lacombe, 2022)), which is important in physical and geometric applications and sometimes lost in standard regularized OT schemes.
5. Applications and Empirical Performance
Model Selection and Inference
DROT with a suitable divergence allows practitioners to interpolate between highly regularized (smooth, full-support) and nearly unregularized (sparse, map-like) transport plans by tuning the divergence parameter and regularization strength. For example, Rényi-regularized plans with intermediate $\alpha$ track the unregularized OT plan closely and surpass KL- or Tsallis-regularized OT in recovering true conditional migration tables in practical inference tasks (Bresch et al., 29 Apr 2024).
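A minimal sketch of this tuning along the regularization-strength axis in the KL case, using the POT library (assuming `pip install pot`; Rényi solvers are not part of POT, so this only illustrates the $\varepsilon$ sweep):

```python
import numpy as np
import ot  # Python Optimal Transport (POT)

rng = np.random.default_rng(0)
x, y = rng.normal(size=(80, 2)), rng.normal(size=(80, 2)) + 0.5
a, b = ot.unif(80), ot.unif(80)
C = ot.dist(x, y)            # squared Euclidean cost matrix
C /= C.max()                 # normalize costs for numerical stability

pi_exact = ot.emd(a, b, C)   # unregularized OT plan (linear program)
for eps in [0.5, 0.1, 0.05]:
    pi_eps = ot.sinkhorn(a, b, C, reg=eps)
    gap = np.abs(pi_eps - pi_exact).sum()
    print(f"eps={eps:5.2f}  ||pi_eps - pi_exact||_1 = {gap:.3f}")
```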
Statistical Testing and Bootstrap
The parametric convergence rates and central limit theorems for DROT distances facilitate the construction of hypothesis tests and confidence intervals in high-dimensional statistics (Bigot et al., 2017, Klatt et al., 2018, Yang et al., 2 Oct 2025), as well as bootstrap procedures for assessing the variability of empirical OT values.
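A minimal sketch of such a bootstrap confidence interval for the empirical entropic OT loss (again using POT; the resampling scheme, regularization level, and number of replicates are illustrative, not those of the cited works):

```python
import numpy as np
import ot  # Python Optimal Transport (POT)

def entropic_loss(x, y, eps=1.0):
    """Empirical entropic OT loss between two samples with uniform weights."""
    C = ot.dist(x, y)
    a, b = ot.unif(len(x)), ot.unif(len(y))
    return ot.sinkhorn2(a, b, C, reg=eps)

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 3))
y = rng.normal(size=(200, 3)) + 0.3

point_est = entropic_loss(x, y)
boot = []
for _ in range(200):                                  # bootstrap replicates
    xb = x[rng.integers(0, len(x), len(x))]           # resample with replacement
    yb = y[rng.integers(0, len(y), len(y))]
    boot.append(entropic_loss(xb, yb))
lo, hi = np.percentile(boot, [2.5, 97.5])             # percentile 95% interval
print(f"estimate = {point_est:.4f}, 95% CI = ({lo:.4f}, {hi:.4f})")
```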
Machine Learning and Signal Processing
DROT forms the backbone of kernels for SVMs over complex data (Dessein et al., 2016), barycenter and barycentric interpolation in signal processing and genomics (Manupriya et al., 2020), robust generative modeling and distributionally robust optimization (Birrell et al., 2023, Baptista et al., 17 May 2025), high-dimensional image and point cloud matching (Séjourné et al., 2019, Bonafini et al., 2020, Mokhtari et al., 7 Oct 2024), and even OT-based prompt ensembling for vision-language models (Manupriya et al., 2020).
Empirical benchmarks consistently indicate that non-entropic regularizers—such as Rényi, Tsallis, or MMD-based penalties—can provide better approximations to ground-truth couplings or improve power in two-sample and domain adaptation tasks, with practical advantages in numerical stability and sparsity (Bresch et al., 29 Apr 2024, Manupriya et al., 2020, González-Sanz et al., 7 May 2025).
6. Open Problems and Future Directions
- Non-$f$-divergence Regularizers: Full extension of the theory and algorithms to encompass all useful divergence types (e.g., Rényi, MMD) while preserving fast rates and dual feasibility is an active area (Bresch et al., 29 Apr 2024, Manupriya et al., 2020).
- Strong Duality and Generalized Algorithms: While strong duality holds for a wide class of Legendre-type divergences, identifying conditions under which generalized scaling algorithms converge for other divergence types remains an ongoing line of research (Terjék et al., 2021, Marino et al., 2020).
- Infinite-Dimensional and Functional Transport: Extensions to continuous measure spaces, as well as generalization to multi-marginal or dynamic settings (mean-field games, time-dependent flows) are increasingly tractable due to recent advances in variational analysis and dynamic programming (Baptista et al., 17 May 2025).
- Robustness and Interpretability: How divergence regularization interacts with outlier robustness, interpretability of transport plans, and learning under high-noise regimes is not yet fully resolved and is being actively investigated (Séjourné et al., 2019, Manupriya et al., 2020, Birrell et al., 2023).
- Empirical Process Theory: Statistical Z-estimation theory and non-Donsker techniques are beginning to yield rigorous results on finite-sample performance and uncertainty quantification for DROT estimators, with ongoing work on more general classes and under weaker assumptions (González-Sanz et al., 7 May 2025, Yang et al., 2 Oct 2025, Bayraktar et al., 2022).
7. Summary Table of Representative Divergences
| Divergence Type | Key Mathematical Feature | Implementation/Algorithm | Plan Support | Statistical Rate |
|---|---|---|---|---|
| KL (Entropic) | Legendre type | Sinkhorn, scaling | Full support | $n^{-1/2}$ |
| Tsallis | Non-differentiable at $0$ | Nonlinear scaling / mirror descent | Sparse/partial | $n^{-1/2}$ |
| Rényi ($\alpha\in(0,1)$) | Neither $f$-divergence nor Bregman | Mirror descent + Sinkhorn | Tunable: interpolates | |
| MMD | RKHS metric-based | Convex QP, APGD | Sample-supported | |
| Bregman (general) | Strictly convex, barrier | Scaling, projections, Dykstra | Flexible | |
In summary, divergence regularized optimal transport encompasses a theoretically and practically rich set of methodologies generalizing OT via flexible, convex divergence terms. These methods offer unique control over regularity, sparsity, and statistical performance of the transport plan; underpin efficient, scalable computational routines; and are well supported by recent theoretical developments establishing dimension-free convergence rates, central limit theorems, and robust duality properties over a wide family of divergences. This framework supports a broad and growing array of applications in contemporary data science, statistics, and optimization (Dessein et al., 2016, González-Sanz et al., 7 May 2025, Yang et al., 2 Oct 2025, Bresch et al., 29 Apr 2024).