
Sinkhorn Distance for Optimal Transport

Updated 24 March 2026
  • Sinkhorn distance is an entropic regularization of the classical optimal transport metric, providing a smooth and scalable approximation to OT problems.
  • It uses an iterative matrix-scaling algorithm that enforces row and column constraints through fixed-point updates, ensuring numerical stability.
  • Advanced variants, such as Sparse Newton and low-rank approximations, significantly boost convergence and scalability for large-scale transport tasks.

The Sinkhorn distance is an entropic regularization of the classical optimal transport (OT) distance, defined between discrete or continuous probability distributions. By introducing a strongly convex entropy term in the transportation problem, the Sinkhorn distance enables numerically stable, smooth approximations of OT and leads to scalable matrix-scaling algorithms. The Sinkhorn algorithm and its advanced variants underpin a large body of contemporary research and applications in computational optimal transport, machine learning, data science, and stochastic modeling.

1. Formal Definition and Variational Structure

Given two discrete probability vectors $a, b \in \mathbb{R}^n$ and a ground cost matrix $C \in \mathbb{R}^{n \times n}$, the entropic-regularized OT problem, or Sinkhorn distance, is formulated as

$$\min_{P \in U(a, b)} \; \langle P, C \rangle - \varepsilon H(P),$$

where

$$U(a, b) = \{ P \ge 0 : P \mathbf{1} = a, \; P^T \mathbf{1} = b \}$$

and

$$H(P) = -\sum_{i,j} P_{ij} (\log P_{ij} - 1)$$

is the (Shannon) entropy of $P$. The parameter $\varepsilon > 0$ controls the regularization strength.

By Lagrange duality, the entropic OT can equivalently be written in a dual (variational) form as

$$\max_{x, y \in \mathbb{R}^n} f(x, y)$$

with

$$f(x, y) = -\frac{1}{\eta} \sum_{i,j} \exp\left(\eta(-C_{ij} + x_i + y_j) - 1\right) + \sum_i a_i x_i + \sum_j b_j y_j$$

where $\varepsilon = 1/\eta$. The optimal transport plan is recovered via

$$P^* = \exp\left(\eta(-C + x^* \mathbf{1}^T + \mathbf{1}(y^*)^T) - 1\right).$$

This duality establishes a Lyapunov potential that is central to both the convergence analysis and algorithmic development of Sinkhorn-type algorithms (Tang et al., 2024).
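As a brief consistency check on this duality (a derivation sketched directly from the definitions above, rather than taken from the cited works), setting the partial derivatives of $f$ to zero recovers exactly the marginal constraints defining $U(a,b)$:

$$\frac{\partial f}{\partial x_i} = a_i - \sum_j \exp\!\left(\eta(-C_{ij}+x_i+y_j)-1\right) = a_i - (P\mathbf{1})_i = 0, \qquad \frac{\partial f}{\partial y_j} = b_j - (P^T\mathbf{1})_j = 0.$$

This stationarity condition is precisely what the alternating row/column updates of the Sinkhorn algorithm in the next section enforce, one block of dual variables at a time.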

2. The Classical Sinkhorn Algorithm

The basic numerical method for solving the entropic OT problem is the Sinkhorn (or Sinkhorn-Knopp) matrix-scaling algorithm, whose key steps are:

  • Define the Gibbs kernel $K = \exp(-C/\varepsilon)$.
  • Seek positive scaling vectors $u, v$ such that the optimal plan is

$$P = \operatorname{diag}(u)\, K\, \operatorname{diag}(v)$$

and satisfies $P \mathbf{1} = a$, $P^T \mathbf{1} = b$.

  • Perform the fixed-point (coordinate ascent) iterations:

$$u \leftarrow a / (K v), \quad v \leftarrow b / (K^T u)$$

where all operations are entrywise.

Each iteration enforces the row or column sum constraints. The per-iteration complexity is $O(n^2)$ for general dense cost matrices; convergence is linear in theory but often much faster in practice for moderate $\varepsilon$ (Cuturi, 2013, Tang et al., 2024).
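The iteration above translates directly into a few lines of NumPy. The following is a minimal illustrative sketch, not a reference implementation: the function name, defaults, and stopping rule are our own, and production code would typically work in the log domain for small $\varepsilon$.

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.05, n_iter=500, tol=1e-9):
    """Entropic OT via Sinkhorn matrix scaling; returns the plan P and the cost <P, C>."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    P = u[:, None] * K * v[None, :]
    for _ in range(n_iter):
        u = a / (K @ v)                  # enforce row marginals   P 1   = a
        v = b / (K.T @ u)                # enforce column marginals P^T 1 = b
        P = u[:, None] * K * v[None, :]  # diag(u) K diag(v)
        if np.abs(P.sum(axis=1) - a).max() < tol:
            break
    return P, (P * C).sum()

# Example: two random histograms on a 1-D grid with squared-distance cost.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
C = (x[:, None] - x[None, :]) ** 2
a = rng.random(200); a /= a.sum()
b = rng.random(200); b /= b.sum()
P, cost = sinkhorn(C, a, b)
```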

3. Accelerated and Sparse Newton Variants

Despite the efficiency of standard Sinkhorn, the total number of iterations may be substantial, especially for large problem sizes or very small regularization $\varepsilon$. Recent advances exploit the variational structure and empirical sparsity of the optimal plans:

  • The Hessian of the Lyapunov potential $f(x,y)$ is block-structured:

$$\nabla^2 f(x, y) = -\eta \begin{pmatrix} \operatorname{diag}(P \mathbf{1}) & P \\ P^T & \operatorname{diag}(P^T \mathbf{1}) \end{pmatrix}$$

For typical problems with large $\eta$ (small regularization $\varepsilon$), $P$ becomes nearly sparse, leading to approximate sparsity of the Hessian (Tang et al., 2024).

  • The Sinkhorn-Newton-Sparse (SNS) algorithm leverages this structure:
    • Stage I: Early-stopped Sinkhorn warm-start to produce a sparse PP.
    • Stage II: Sparse-Newton steps, in which only the largest $\lambda n^2$ entries of the Hessian are retained (with $\lambda = O(1/n)$), and each Newton update is performed using conjugate gradients.
  • SNS achieves per-iteration complexity $O(n^2)$, but with super-exponential local convergence in the Newton phase, reducing total iteration counts by up to 2–3 orders of magnitude compared to classical Sinkhorn on practical tasks (e.g., MNIST, large random assignment problems) (Tang et al., 2024); a schematic two-stage sketch follows this list.
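The sketch below illustrates the two-stage SNS idea described above. It is built from this description, not taken from the implementation of Tang et al. (2024): the helper names, thresholding rule, and defaults are assumptions, and a practical version would use log-domain updates and more careful sparsity-pattern selection.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags, bmat, identity
from scipy.sparse.linalg import cg

def plan(C, x, y, eta):
    """Transport plan implied by dual variables (x, y): P = exp(eta(-C + x_i + y_j) - 1)."""
    return np.exp(eta * (-C + x[:, None] + y[None, :]) - 1.0)

def sinkhorn_newton_sparse(C, a, b, eta=10.0, n_warm=50, n_newton=10, keep_frac=0.05):
    n, m = C.shape
    x, y = np.zeros(n), np.zeros(m)
    # Stage I: early-stopped Sinkhorn on the dual variables (warm start).
    for _ in range(n_warm):
        x += (np.log(a) - np.log(plan(C, x, y, eta).sum(axis=1))) / eta
        y += (np.log(b) - np.log(plan(C, x, y, eta).sum(axis=0))) / eta
    # Stage II: Newton steps with a sparsified Hessian, solved by conjugate gradients.
    for _ in range(n_newton):
        P = plan(C, x, y, eta)
        grad = np.concatenate([a - P.sum(axis=1), b - P.sum(axis=0)])  # gradient of f
        # Keep only the largest entries of P, so the off-diagonal Hessian blocks become sparse.
        thresh = np.quantile(P, 1.0 - keep_frac)
        Ps = csr_matrix(np.where(P >= thresh, P, 0.0))
        M = bmat([[diags(P.sum(axis=1)), Ps], [Ps.T, diags(P.sum(axis=0))]], format="csr")
        # Newton direction for maximizing f: solve (eta * M) d = grad; tiny ridge since M is singular.
        d, _ = cg(eta * M + 1e-10 * identity(n + m), grad, maxiter=200)
        x, y = x + d[:n], y + d[n:]
    return plan(C, x, y, eta)
```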

4. Computational and Algorithmic Scalability

Several strategies make Sinkhorn-based distances scalable to massive datasets:

  • Low-rank kernel approximation: The Nyström method approximates the kernel $K$ as $K \approx V A^{-1} V^T$ with rank $r \ll n$, reducing memory and matrix-vector costs to $O(nr)$ per iteration for problems with suitable kernel structure (Altschuler et al., 2018); see the sketch at the end of this section.
  • Manifold and sparse structure: On data with graph or manifold geometry, the Geodesic Sinkhorn method replaces the Gaussian kernel with sparse heat kernels, using Chebyshev polynomial expansions to achieve per-iteration costs near $O(n \log n)$ (Huguet et al., 2022).
  • Parallelization and specialization: Highly parallel, batched, and sparse implementations (e.g., for Sinkhorn Word Movers Distance) achieve near-peak bandwidth and linear or superlinear scaling on modern CPU and memory hardware by exploiting workload structure and fused sparse-dense matmul kernels (Tithi et al., 2021).
  • Screened algorithms: The Screenkhorn algorithm identifies negligible dual variables via KKT analysis, prunes them, and solves a much smaller subproblem, further reducing both arithmetic and memory costs with provable control of approximation error (Alaya et al., 2019).

Empirically, these methods enable the computation of Sinkhorn distances on datasets with up to millions of points or high-dimensional feature spaces.
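As an illustration of the low-rank idea above, the following sketch runs the Sinkhorn scaling while keeping the kernel in factored form $K \approx V A^{-1} V^T$. The interface and names are assumptions; a real Nyström routine would also select landmark points and guard against small negative matvec values introduced by the approximation.

```python
import numpy as np

def sinkhorn_nystrom(V, A, a, b, n_iter=300):
    """Sinkhorn scaling with K ~ V A^{-1} V^T kept in factored form (O(nr) per matvec)."""
    A_inv = np.linalg.inv(A)
    kernel_mv = lambda w: V @ (A_inv @ (V.T @ w))   # K w, never forming the n x n kernel
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = a / kernel_mv(v)   # row-marginal update
        v = b / kernel_mv(u)   # column-marginal update (K symmetric here, so K^T = K)
    return u, v                # the plan diag(u) K diag(v) stays implicit in the factors
```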

5. Theoretical Properties and Extensions

The entropic Sinkhorn distance inherits several important properties:

  • Regularized metric structure: For any $\varepsilon > 0$, it is smooth (infinitely differentiable) in the input histograms, metrizes weak convergence, and satisfies a regularized triangle inequality. It converges to the true OT distance as $\varepsilon \to 0$ (Cuturi, 2013, Luise et al., 2018).
  • Gradient and duality: The gradient of the Sinkhorn loss with respect to input distributions admits a closed form via implicit differentiation of the dual potentials, enabling efficient integration in end-to-end learning frameworks (Luise et al., 2018); a small sketch follows this list.
  • Riemannian geometry: The Hessian of the Sinkhorn divergence (debiased entropic OT) induces a metric tensor on the space of probability measures, providing a Riemannian geometry analogous to but distinct from the classical Wasserstein metric (Lavenant et al., 2024).
  • Generalizations: The entropic regularization can be replaced by more general $f$-divergences, yielding generalized Sinkhorn distances that handle support mismatch and provide uncertainty models for robust optimization (Yang et al., 29 Mar 2025). Chain-rule variants define distances between mixture models, where the ground cost is itself a divergence between conditionals (Nielsen et al., 2018).
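To make the gradient property concrete, the following hedged sketch returns the dual potentials produced by Sinkhorn scaling; by the envelope theorem, the potential $x^*$ acts (up to an additive constant) as the gradient of the entropic OT value with respect to $a$. Function names and defaults are illustrative.

```python
import numpy as np

def sinkhorn_potentials(C, a, b, eps=0.05, n_iter=2000):
    """Run Sinkhorn and return the dual potentials (x*, y*) = (eps log u*, eps log v*)."""
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return eps * np.log(u), eps * np.log(v)

# grad_a OT_eps(a, b) equals x* up to an additive constant, so the centered
# potential x* - x*.mean() can serve as a gradient over histograms a in learning pipelines.
```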

Table: Key Theoretical Properties

| Property            | Sinkhorn (ε > 0)       | Wasserstein (ε = 0)    |
|---------------------|------------------------|------------------------|
| Convexity           | Strictly convex        | Convex                 |
| Differentiability   | C^∞ (in the interior)  | Typically non-smooth   |
| Metric structure    | Pseudo-metric          | Metric                 |
| Bias relative to OT | O(ε)                   | Unbiased               |
| Computability       | Fast (matrix scaling)  | Slow (linear program)  |

6. Applications and Empirical Findings

Sinkhorn distances have been widely adopted in machine learning and related fields:

  • Distributional matching and domain adaptation: Used as a differentiable loss for comparing empirical distributions (e.g., in imitation learning with occupancy measures, barycenter computation, and structured data interpolation) (Papagiannis et al., 2020, Huguet et al., 2022).
  • Robust optimization (DRO): Generalized Sinkhorn distances enable uncertainty modeling over distributions with non-overlapping supports and robust regularization in adversarial settings, with provable convergence rates for stochastic-gradient schemes (Yang et al., 29 Mar 2025).
  • Computational geometry and scientific computing: Near-linear algorithms allow Sinkhorn computations in high dimensions, on manifolds, and for large-scale problems in signal processing and shape analysis (Motamed, 2020, Berman, 2017).

Empirical results consistently demonstrate orders-of-magnitude speedups in wall-clock time and/or sample size over classical OT solvers. For instance, sparse Newton methods have reduced OT solve times on MNIST $W_2$ transport from 18.8 s (2041 Sinkhorn iterations) to 2.33 s (53 total iterations combining Sinkhorn and Newton phases) (Tang et al., 2024). In barycenter and biological data analyses, Sinkhorn distances incorporating geodesic structure markedly outperform Euclidean versions in both accuracy and computational efficiency (Huguet et al., 2022).

7. Limitations and Ongoing Research

Despite broad success, Sinkhorn distances entail several notable analytic and algorithmic subtleties:

  • The Sinkhorn divergence (debiased entropic OT) is symmetric and smoother than Wasserstein but fails to satisfy joint convexity and the triangle inequality in general. Its geometrically motivated Riemannian version remedies these but at the cost of increased analytic complexity (Lavenant et al., 2024).
  • The regularization parameter $\varepsilon$ yields a bias-variance tradeoff: larger $\varepsilon$ improves numerical stability and speed but induces more bias in approximating unregularized OT; smaller $\varepsilon$ reduces bias but increases iteration counts and can cause numerical instability (Cuturi, 2013, Chizat et al., 2020).
  • On specific data structures (e.g., highly manifold-structured or low-rank cost), specialized algorithms (e.g., Chebyshev expansions, hierarchical matrices, Nyström approximations) are required to achieve theoretical computational gains (Altschuler et al., 2018, Motamed, 2020, Huguet et al., 2022).
  • For very small $\varepsilon$, the solution approaches unregularized OT but the matrix-scaling algorithm may suffer from slow mixing and numerical issues; alternative rounding and stabilization procedures, as well as two-phase or online methods, are active areas of development (Tang et al., 2024, Mensch et al., 2020). A log-domain variant is sketched below.
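One standard stabilization for small $\varepsilon$ is to run the updates entirely in the log domain, iterating on the dual potentials rather than the scaling vectors. A minimal sketch (illustrative names and defaults, not any particular library's API) follows:

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_log(C, a, b, eps=1e-3, n_iter=1000):
    """Log-domain Sinkhorn: iterate on dual potentials to avoid under/overflow of exp(-C/eps)."""
    f = np.zeros_like(a)
    g = np.zeros_like(b)
    log_a, log_b = np.log(a), np.log(b)
    for _ in range(n_iter):
        f = eps * (log_a - logsumexp((g[None, :] - C) / eps, axis=1))  # row-marginal step
        g = eps * (log_b - logsumexp((f[:, None] - C) / eps, axis=0))  # column-marginal step
    # Exponentiate once at the end to recover the transport plan.
    return np.exp((f[:, None] + g[None, :] - C) / eps)
```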

Continued research focuses on higher-order optimization, adaptive screening, advanced divergence-based regularization, integration with deep learning, and geometric interpretations of entropic OT. These developments collectively anchor the Sinkhorn distance as the central computational tool for scalable, regularized optimal transport across statistics, machine learning, and computational mathematics.
