
Sinkhorn Discrepancy in Regularized OT

Updated 16 December 2025
  • Sinkhorn Discrepancy is a regularized OT divergence that adds an entropy penalty to the classical transport cost, yielding a symmetric, scalable measure for comparing probability distributions.
  • It uses efficient Sinkhorn iterations with acceleration methods like low-rank approximations to compute entropic optimal transport plans.
  • By tuning the regularization parameter, it interpolates between the Wasserstein distance and MMD, balancing computational efficiency and statistical performance.

The Sinkhorn discrepancy, also referred to as the Sinkhorn divergence, is a regularized optimal transport (OT) metric used to measure the dissimilarity between probability measures. It modifies the classical Wasserstein distance by adding a convex entropy penalty, enabling scalable, numerically stable computations and bridging the gap between OT and kernel-based methods. This regularized approach is central in modern data science, generative modeling, robust optimization, and high-dimensional statistical inference (2002.01189, Genevay et al., 2017, Patrini et al., 2018, Wang, 14 Dec 2025, Cescon et al., 31 Aug 2025, Séjourné et al., 2019).

1. Mathematical Definition and Debiasing

Given two probability measures $\mu$, $\nu$ on a Polish space (typically $\mathbb{R}^d$) and a continuous, non-negative cost function $c(x, y)$, the entropically regularized OT cost (also called the Sinkhorn cost) is

$$W_\varepsilon(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \int c(x, y)\, d\pi(x, y) + \varepsilon\, \mathrm{KL}(\pi \,\|\, \mu \otimes \nu),$$

where $\Pi(\mu, \nu)$ is the set of couplings with marginals $\mu$ and $\nu$, $\mathrm{KL}$ is the Kullback-Leibler divergence, and $\varepsilon > 0$ is the entropy regularization parameter.

However, $W_\varepsilon(\mu, \nu)$ is biased: it does not vanish when $\mu = \nu$, because of the regularization term. The Sinkhorn divergence (also referred to as the Sinkhorn discrepancy) is defined to remove this bias:

$$S_\varepsilon(\mu, \nu) = W_\varepsilon(\mu, \nu) - \tfrac{1}{2} W_\varepsilon(\mu, \mu) - \tfrac{1}{2} W_\varepsilon(\nu, \nu).$$

This construction yields a divergence that is symmetric, non-negative, and vanishes if and only if $\mu = \nu$ under mild conditions (Patrini et al., 2018, Genevay et al., 2017, 2002.01189, Séjourné et al., 2019).

2. Computational Algorithms: Sinkhorn Iterations and Variants

The Sinkhorn divergence is evaluated by solving the entropy-regularized OT problem via Sinkhorn's matrix scaling algorithm. For discrete measures $\mu = \sum_{i=1}^n a_i \delta_{x_i}$, $\nu = \sum_{j=1}^n b_j \delta_{y_j}$ and cost matrix $C_{ij} = c(x_i, y_j)$, one forms the Gibbs kernel $K_{ij} = \exp(-C_{ij}/\varepsilon)$ and seeks positive scaling vectors $u, v > 0$ such that $P^* = \mathrm{diag}(u)\, K\, \mathrm{diag}(v)$ matches the marginals, $P^* \mathbf{1} = a$ and $(P^*)^\top \mathbf{1} = b$. The classic Sinkhorn–Knopp iterations are

$$u \leftarrow a / (K v), \qquad v \leftarrow b / (K^\top u),$$

with entrywise division. Theoretical results guarantee convergence under broad conditions (Genevay et al., 2017, Patrini et al., 2018, Altschuler et al., 2018, Pichler et al., 2021).
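For concreteness, the following is a minimal sketch of these updates together with the debiased divergence from Section 1, written in JAX-style Python; the squared Euclidean cost, function names, and fixed iteration count are illustrative choices rather than the reference implementations of the cited papers.

```python
import jax.numpy as jnp

def sq_dists(X, Y):
    """Cost matrix C_ij = ||x_i - y_j||^2 (squared Euclidean cost)."""
    return jnp.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)

def sinkhorn_cost(a, X, b, Y, eps, n_iters=200):
    """Entropic OT cost W_eps between discrete measures (a, X) and (b, Y)."""
    C = sq_dists(X, Y)
    K = jnp.exp(-C / eps)                 # Gibbs kernel (underflows for tiny eps;
                                          # see the log-domain variant in Section 6)
    u, v = jnp.ones_like(a), jnp.ones_like(b)
    for _ in range(n_iters):              # Sinkhorn-Knopp scaling updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]       # transport plan diag(u) K diag(v)
    kl = jnp.sum(P * jnp.log(P / jnp.outer(a, b) + 1e-30))
    return jnp.sum(P * C) + eps * kl

def sinkhorn_divergence(a, X, b, Y, eps):
    """Debiased Sinkhorn divergence S_eps(mu, nu) from Section 1."""
    return (sinkhorn_cost(a, X, b, Y, eps)
            - 0.5 * sinkhorn_cost(a, X, a, X, eps)
            - 0.5 * sinkhorn_cost(b, Y, b, Y, eps))
```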

Acceleration strategies include:

  • Low-rank/Nyström approximation: reduces the per-iteration cost from $O(n^2)$ to $O(nr)$ via low-rank positive decompositions of $K$ (Altschuler et al., 2018, Scetbon et al., 2020); a sketch of the idea follows this list.
  • Hierarchical and Kronecker product methods: for separable costs, algorithms achieve $O(n \log^3 n)$ complexity via fast matrix-vector multiplies (Motamed, 2020).
  • Newton–type and sparse-Hessian algorithms: exploiting local sparsity in the OT plan, super-exponential convergence in Newton’s method is achieved after Sinkhorn warm-up (Tang et al., 20 Jan 2024).
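A minimal sketch of the low-rank idea, assuming a Gaussian Gibbs kernel (squared Euclidean cost) and a given set of landmark points; the landmark selection, positivity safeguards, and error control used in the cited papers are omitted:

```python
import jax.numpy as jnp

def gibbs(A, B, eps):
    """Gaussian Gibbs kernel exp(-||a - b||^2 / eps) between point sets."""
    C = jnp.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return jnp.exp(-C / eps)

def nystrom_factors(X, Y, landmarks, eps):
    """Nystrom factors so that K approx= L_x @ W @ L_y.T (rank r = #landmarks)."""
    L_x = gibbs(X, landmarks, eps)                           # (n, r)
    L_y = gibbs(Y, landmarks, eps)                           # (m, r)
    W = jnp.linalg.pinv(gibbs(landmarks, landmarks, eps))    # (r, r)
    return L_x, W, L_y

def sinkhorn_lowrank(a, b, L_x, W, L_y, n_iters=200):
    """Sinkhorn-Knopp updates using only factored matrix-vector products."""
    u, v = jnp.ones_like(a), jnp.ones_like(b)
    for _ in range(n_iters):
        u = a / (L_x @ (W @ (L_y.T @ v)))      # approximates a / (K v)
        v = b / (L_y @ (W @ (L_x.T @ u)))      # approximates b / (K^T u)
    return u, v
```

Each update then touches only the $n \times r$ and $r \times r$ factors, so the per-iteration cost drops from $O(n^2)$ to roughly $O(nr)$; the same factored matrix-vector product underlies the positive-feature approach discussed in Section 7.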

Recent advances provide fully differentiable implementations, admitting backpropagation through unrolled Sinkhorn steps, crucial in deep learning and generative modeling settings (Genevay et al., 2017, Patrini et al., 2018).
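As a concrete illustration (reusing the `sinkhorn_divergence` sketch from Section 2, so this snippet is not standalone; the toy data and $\varepsilon = 1.0$ are arbitrary choices): every Sinkhorn step is a smooth array operation, so an autodiff framework can differentiate through the unrolled loop directly.

```python
import jax
import jax.numpy as jnp

# Toy data: 64 "generated" samples X and 64 "target" samples Y in R^2.
X = jax.random.normal(jax.random.PRNGKey(0), (64, 2))
Y = jax.random.normal(jax.random.PRNGKey(1), (64, 2)) + 2.0
a = jnp.ones(64) / 64
b = jnp.ones(64) / 64

# Gradient of S_eps with respect to the generated samples, obtained by
# backpropagating through the unrolled iterations of the Section 2 sketch.
loss_grad = jax.grad(lambda X_: sinkhorn_divergence(a, X_, b, Y, eps=1.0))
g = loss_grad(X)   # same shape as X; would feed a generator/parameter update
```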

3. Interpolation Properties and Statistical Trade-offs

The Sinkhorn divergence continuously interpolates between the Wasserstein distance and the maximum mean discrepancy (MMD) as the regularization parameter $\varepsilon$ varies:

  • As $\varepsilon \to 0$: $S_\varepsilon(\mu, \nu) \to W_c(\mu, \nu)$, the classical unregularized OT distance.
  • As $\varepsilon \to \infty$: $S_\varepsilon(\mu, \nu)$ converges to an MMD with kernel $-c$, specifically $\frac{1}{2}\,\mathrm{MMD}_K^2(\mu, \nu)$ for $c(x, y) = -K(x, y)$ (2002.01189, Genevay et al., 2017).
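A one-line heuristic for this limit: as $\varepsilon \to \infty$ the KL penalty drives the optimal plan toward the independent coupling $\mu \otimes \nu$, so $W_\varepsilon(\mu, \nu) \to \iint c \, d\mu \, d\nu$; substituting $c = -K$ into the debiased divergence gives

$$S_\infty(\mu, \nu) = -\iint K \, d\mu\, d\nu + \tfrac{1}{2} \iint K \, d\mu\, d\mu + \tfrac{1}{2} \iint K \, d\nu\, d\nu = \tfrac{1}{2}\, \mathrm{MMD}_K^2(\mu, \nu),$$

in agreement with the formula above (the rigorous statement appears in Genevay et al., 2017).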

This interpolation has major implications:

  • For small $\varepsilon$, the Sinkhorn divergence inherits OT’s geometric interpretability but suffers from the curse of dimensionality in empirical estimation.
  • For large $\varepsilon$, statistical rates approach $O(n^{-1/2})$ (MMD-style) instead of the poor $O(n^{-1/d})$ rate of high-dimensional OT (Genevay et al., 2017, Patrini et al., 2018).
  • Tuning $\varepsilon$ lets practitioners trade off computational efficiency, sample complexity, bias, and robustness (Chizat et al., 2020, Patrini et al., 2018).

4. Theoretical Guarantees, Error Bounds, and Regularity

Key theoretical properties include:

  • Strong convexity of the regularized objective: ensures a unique optimal plan and guarantees fast (linear) convergence of the Sinkhorn iterations.
  • Debiasing: $S_\varepsilon$ removes the leading-order entropic bias: for sufficiently smooth $\mu, \nu$, the bias is $O(\varepsilon^2)$ versus $O(\varepsilon \log(1/\varepsilon))$ for $W_\varepsilon$ (Chizat et al., 2020).
  • Sample complexity: for empirical (plug-in) estimators, optimal rates are achieved by balancing variance and bias through the choice $\varepsilon \sim n^{-1/(d+4)}$ (Chizat et al., 2020). Richardson extrapolation in $S_\varepsilon$ can further reduce the bias (Chizat et al., 2020).
  • Regularity: Sinkhorn divergence is differentiable, positive definite, convex (in each argument), and metrizes weak convergence for several choices of cost and regularization (Séjourné et al., 2019, Genevay et al., 2017). Dual potentials and the variational structure admit Hadamard differentiability in RKHS settings, enabling precise control for second-order coreset compression (Kokot et al., 28 Apr 2025).

5. Application Domains and Empirical Evidence

Applications span:

  • Generative modeling: Sinkhorn divergences serve as loss functions in autoencoders and GANs, providing improved geometry, tractable gradients, and robust sample complexity. Empirical studies confirm that generative models trained with Sinkhorn losses achieve state-of-the-art performance on standard benchmarks (MNIST, CelebA, CIFAR-10), with stable trade-offs between FID and reconstruction (Patrini et al., 2018, Genevay et al., 2017, Scetbon et al., 2020).
  • Distributionally robust optimization (DRO): Ambiguity sets based on Sinkhorn distances yield tractable robust programs with improved generalization, smoother adversarial perturbations, and closed-form worst-case measures (e.g., mixtures of tilted Gaussians) (Wang, 14 Dec 2025, Cescon et al., 31 Aug 2025).
  • High-dimensional numerical integration and sampling: Fast stabilized algorithms (e.g., via hierarchical Kronecker or Nyström compressions) permit Sinkhorn discrepancy computation at massive scale, adapting to low-dimensional manifold structure and yielding exponential speedups over naive approaches (Altschuler et al., 2018, Motamed, 2020, Scetbon et al., 2020).
  • Coreset and dataset compression: The logarithmic-sample coreset theory leverages the second-order smoothness of Sinkhorn divergence for lossless kernel-mean compression, enabling dataset selection with near-optimal sample efficiency for OT purposes (Kokot et al., 28 Apr 2025).

A summary of empirical findings is given below:

  • Large-$n$ OT: Nyström/Sinkhorn with $O(nr)$ or $O(n \log^3 n)$ cost; $100\times$ to $1000\times$ speedups with near-zero loss of accuracy (Altschuler et al., 2018, Motamed, 2020).
  • Deep learning: backpropagation through Sinkhorn with GPU execution; stable end-to-end training with unbiased gradients (Genevay et al., 2017).
  • DRO/classification: Sinkhorn robust optimization; improved worst-case risk under adversarial shifts (Wang, 14 Dec 2025).

6. Limitations and Numerical Considerations

Principal limitations and numerical issues include:

  • Approximation error: the Sinkhorn discrepancy converges to the true Wasserstein distance only as $\varepsilon \to 0$; nonzero regularization induces a systematic bias, albeit mitigated by debiasing (Lu et al., 2018, Chizat et al., 2020).
  • Instability for small $\varepsilon$: Sinkhorn matrix scaling may hit zero denominators when $\varepsilon$ is too small, causing the scalings to fail or oscillate (Lu et al., 2018, Patrini et al., 2018). Keeping the support of the optimal plan strictly positive is essential for stable convergence.
  • Trade-off selection: the hyperparameter $\varepsilon$ requires careful tuning. Values that are too large induce excessive smoothing (underestimating the OT cost), while values that are too small render the computation numerically unstable or slow. Empirically, moderate values, e.g., $\varepsilon \approx 10^{-1}$ to $10^{0}$ in latent-space learning, provide a practical compromise (Patrini et al., 2018).
  • Implementation details: efficient linear algebra, stabilization in the log domain (sketched below), and GPU acceleration are necessary for practical compute times at scale (Genevay et al., 2017, Altschuler et al., 2018, Scetbon et al., 2020).
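A minimal sketch of the log-domain stabilization mentioned above (a standard softmin rewriting of the scaling updates in terms of dual potentials; function and variable names are illustrative):

```python
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def sinkhorn_cost_logdomain(a, b, C, eps, n_iters=200):
    """Entropic OT cost W_eps computed entirely in log-space, so that small
    eps does not underflow the Gibbs kernel exp(-C / eps)."""
    log_a, log_b = jnp.log(a), jnp.log(b)
    f, g = jnp.zeros_like(a), jnp.zeros_like(b)
    for _ in range(n_iters):
        # Softmin form of u <- a/(Kv), v <- b/(K^T u); the scalings relate to
        # the potentials via u = a * exp(f / eps), v = b * exp(g / eps).
        f = -eps * logsumexp((g[None, :] - C) / eps + log_b[None, :], axis=1)
        g = -eps * logsumexp((f[:, None] - C) / eps + log_a[:, None], axis=0)
    # Recover the plan and the entropic cost from the potentials.
    log_P = (f[:, None] + g[None, :] - C) / eps + log_a[:, None] + log_b[None, :]
    P = jnp.exp(log_P)
    kl = jnp.sum(P * (log_P - log_a[:, None] - log_b[None, :]))
    return jnp.sum(P * C) + eps * kl
```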

7. Extensions: Unbalanced and Generalized Sinkhorn Divergences

The Sinkhorn framework generalizes to unbalanced optimal transport, accommodating measures of differing mass via additional Csiszár-divergence penalties on the marginals. The corresponding unbalanced Sinkhorn divergence admits efficient generalized matrix-scaling and preserves convexity, differentiability, and statistical robustness properties (Séjourné et al., 2019).
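A minimal sketch of the generalized scaling for KL marginal penalties with weight $\rho$ (a proxdiv-style damping of the balanced updates, as used in the unbalanced-OT literature; the names and fixed iteration count are illustrative, and other marginal divergences lead to different update rules):

```python
import jax.numpy as jnp

def unbalanced_sinkhorn_plan(a, b, C, eps, rho, n_iters=200):
    """Generalized Sinkhorn scaling for unbalanced OT: hard marginal
    constraints are replaced by KL penalties of weight rho, which turns
    the updates into damped versions of the balanced ones."""
    K = jnp.exp(-C / eps)
    w = rho / (rho + eps)                  # damping exponent; w -> 1 as rho -> inf
    u, v = jnp.ones_like(a), jnp.ones_like(b)
    for _ in range(n_iters):
        u = (a / (K @ v)) ** w             # balanced case: u = a / (K v)
        v = (b / (K.T @ u)) ** w
    return u[:, None] * K * v[None, :]     # plan; its marginals only
                                           # approximately equal a and b
```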

Further, feature-map approaches (e.g., positive-feature kernels) allow Sinkhorn divergences to be computed in linear time per iteration under suitable cost parameterizations, enabling scalable training in adversarial networks and kernel methods (Scetbon et al., 2020).


References:

(Patrini et al., 2018) Sinkhorn AutoEncoders
(Genevay et al., 2017) Learning Generative Models with Sinkhorn Divergences
(Séjourné et al., 2019) Sinkhorn Divergences for Unbalanced Optimal Transport
(2002.01189) From Optimal Transport to Discrepancy
(Chizat et al., 2020) Faster Wasserstein Distance Estimation with the Sinkhorn Divergence
(Altschuler et al., 2018) Massively scalable Sinkhorn distances via the Nyström method
(Motamed, 2020) Hierarchical Low-Rank Approximation of Regularized Wasserstein Distance
(Kokot et al., 28 Apr 2025) Coreset selection for the Sinkhorn divergence and generic smooth divergences
(Scetbon et al., 2020) Linear Time Sinkhorn Divergences using Positive Features
(Wang, 14 Dec 2025) Iterative Sampling Methods for Sinkhorn Distributionally Robust Optimization
(Tang et al., 20 Jan 2024) Accelerating Sinkhorn Algorithm with Sparse Newton Iterations
(Cescon et al., 31 Aug 2025) On the Global Optimality of Linear Policies for Sinkhorn Distributionally Robust Linear Quadratic Control
(Lu et al., 2018) Brenier approach for optimal transportation between a quasi-discrete measure and a discrete measure
(Pichler et al., 2021) Nested Sinkhorn Divergence To Compute The Nested Distance
