Sinkhorn’s Algorithm Overview

Updated 27 February 2026

Sinkhorn’s Algorithm is a foundational iterative method that transforms nonnegative matrices into doubly stochastic matrices via alternating row and column scaling.
It underpins computational optimal transport and entropy-regularized models by recasting the scaling process as mirror (Bregman) descent, offering precise convergence guarantees.
Recent developments include accelerated sparse Newton methods, continuous-time flows, and generalizations to operator scaling, enhancing applicability across diverse fields.

Sinkhorn’s algorithm is a foundational iterative method for transforming a nonnegative matrix (or, more generally, a kernel or coupling) into a doubly stochastic matrix or marginal-matching coupling via a sequence of alternating row and column scalings. Originally developed for matrix normalization, the method has become central in computational optimal transport (OT), entropy-regularized transport, probabilistic inference, scalable generative modeling, and operator scaling. Recent progress has recast Sinkhorn’s iterations as mirror (Bregman) descent in various metric geometries, revealed precise convergence thresholds, established explicit PDE and stochastic-process limits of the dynamics, and generalized the approach to high-dimensional, operator, or online settings.

1. Mathematical Formulation and Classical Algorithm

The classic Sinkhorn–Knopp iterative proportional fitting procedure (IPFP) addresses the problem of finding strictly positive diagonal matrices $X, Y$ such that $X A Y$ is doubly stochastic for a given entrywise positive matrix $A \in \mathbb{R}^{n \times n}$ . The prototypical algorithm alternately rescales rows and columns: $\begin{align*} A^{(k+1/2)}_{ij} &= \frac{A^{(k)}_{ij}}{r^{(k)}_i},\quad r^{(k)}_i = \sum_j A^{(k)}_{ij}, \ A^{(k+1)}_{ij} &= \frac{A^{(k+1/2)}_{ij}}{c^{(k)}_j},\quad c^{(k)}_j = \sum_i A^{(k+1/2)}_{ij}. \end{align*}$ Iteration converges to a doubly stochastic $S(A)$ with $S(A) = X A Y$ , and this solution is unique up to scaling. When extended to the entropy-regularized OT problem, the Sinkhorn algorithm applies to $P \in \mathbb{R}^{n \times n}_{\ge 0}$ minimizing

$\langle C, P \rangle + \varepsilon \sum_{i,j} P_{ij} \log P_{ij}$

subject to $P 1 = a$ , $P^T 1 = b$ , with a cost matrix $C$ and marginals $a$ , $b$ .

The algorithm operates on the Gibbs kernel $K = \exp(-C/\varepsilon)$ : $P = \operatorname{diag}(u) K \operatorname{diag}(v),$ updating via

$u \leftarrow a./(K v), \quad v \leftarrow b./(K^T u),$

where $./$ denotes elementwise division.

2. Convergence Analysis, Rates, and Phase Transition

Convergence proofs for positive matrices (Sinkhorn–Knopp theorem) rely on either contraction in the Hilbert projective metric or the monotonic descent of a Bregman (e.g., Kullback-Leibler) potential. In the general rectangular matrix scaling problem, the number of alternations $T$ necessary to reach $\ell_1$ error $\epsilon$ is $T=O(n^2\ln(1/\epsilon)/\epsilon^2)$ , with sharper bounds for $\ell_2$ error using refined inequalities linking KL divergence to $\ell_2$ norms (Chakrabarty et al., 2018).

Recent analysis (He, 13 Jul 2025) revealed a sharp phase transition for convergence rates at the matrix density threshold $\gamma=1/2$ . When the normalized matrix is $(\gamma,\rho)$ -dense for $\gamma>1/2$ (i.e., every row and column has at least a fraction $\gamma$ of entries bounded below), convergence is achieved in $O(\log n - \log \epsilon)$ iterations; for $\gamma<1/2$ , lower bounds of $\Omega(n/\epsilon)$ iterations under $\ell_1$ and $\Omega(\sqrt{n}/\epsilon)$ under $\ell_2$ error apply. Fast, near-optimal convergence is thus characteristic mainly for sufficiently "dense" inputs.

In degenerate cases (zero entries, incompatible marginals), the algorithm does not converge to a unique solution but oscillates between two extremal limit points; recent work provides combinatorial support-reduction methods to recover fast convergence in these settings (Baradat et al., 2022).

3. Variational and Mirror-Descent Perspectives

Sinkhorn iterations are exact instances of block-coordinate ascent in the dual of entropic OT. The dual potentials formulation expresses the equilibrium as

$\phi(x, y) = -\varepsilon \sum_{i,j} \exp\left(\frac{x_i + y_j - C_{ij}}{\varepsilon} - 1\right) + \langle a, x \rangle + \langle b, y \rangle,$

with the maximization over $x, y$ yielding $P^* = \exp((x^* + y^{*T} - C)/\varepsilon - 1)$ .

Conceptually, Sinkhorn’s algorithm is a mirror descent (Bregman gradient descent) on the Kullback-Leibler divergence, with each scaling step corresponding to a KL projection onto affine marginals. This mirror-descent viewpoint leads to sublinear $O(1/n)$ convergence rates that are robust to zeros in the kernel or reference, and forms the basis for stochastic and continuous-time variants (2002.03758, Mishchenko, 2019). For entropic-regularized OT, the continuous limit is called the "Sinkhorn flow," a gradient flow in the space of measures endowed with the Fisher–Rao or a mirror-geometry (Karimi et al., 2023, Srinivasan et al., 14 Oct 2025).

4. Algorithmic Acceleration, Sparse Newton Methods, and Extensions

For large-scale problems and small regularization $\varepsilon$ , standard Sinkhorn can require many iterations. The "Sinkhorn–Newton–Sparse" (SNS) method (Tang et al., 2024) accelerates convergence by blending conventional Sinkhorn warmup with a sparse Newton phase. The Hessian of the dual potential—the key operator in second-order methods—is shown to be approximately sparse for practical OT plans. This allows $O(n^2)$ per Newton iteration, matching the cost of Sinkhorn. Empirically, SNS achieves orders-of-magnitude speedup: e.g., on $n = 500$ MNIST OT instances, SNS may iterate 53 times (2.33s), versus 2041 steps (18.84s) for classic Sinkhorn.

Online, stochastic, and compressed Sinkhorn schemes have been proposed for streaming or large-data settings (Wang et al., 2023), utilizing sparse measure-representations and compression via quadrature or Fourier moments, with improved complexity–accuracy trade-offs.

Generalizations include Sinkhorn methods for Gaussian OT (explicit Riccati-type iterated updates), multi-marginal and operator scaling—where row/column sums are replaced by spectral or more complex moment constraints—and the operator Sinkhorn algorithm used for noncommutative rank and matroid optimization (Akyildiz et al., 2024, Franks et al., 2022).

5. Geometric, Dynamical, and PDE Limits

In the scaling limit as $\varepsilon \to 0$ with the iteration count rescaled as $k \sim t/\varepsilon$ , Sinkhorn's algorithm converges to a nonlinear parabolic Monge–Ampère equation (PMA),

$\partial_t u_t(x) = f(x) - g(x^{u_t}) + \log\det(\partial x^{u_t}/\partial x),$

where $x^{u_t} = \nabla u_t(x)$ and $u_t$ is the transport potential. This PDE describes a mirror gradient-flow for KL divergence with the mirror $U(\rho) = \frac{1}{2}W_2^2(\rho, \nu)$ , interpolating between Wasserstein and Fisher–Rao geometries and connecting to Ricci flow, Schrödinger bridges, and stochastic diffusion limits (Deb et al., 2023, Berman, 2017, Modin, 2023). Exponential convergence holds under a log-Sobolev (LSI) or displacement-convexity condition.

At the measure-theoretic level, the Beurling theorem for product measures is a functional-analytic precursor to Sinkhorn's theorem, establishing uniqueness and homeomorphic parameterizations for the Schrödinger bridge coupling (Modin, 2023).

6. Continuous-Time and Stochastic Sinkhorn Flows

The continuous-time limit of Sinkhorn is a mirror descent on the space of couplings, with the key flow equations: $\partial_t g_t(y) = -\log\left(\frac{(\pi_t)_Y(y)}{\nu(y)}\right), \quad \pi_t = \text{Cpl}[g_t],$ where $\pi_t$ is the coupling at time $t$ , and $g_t$ is the dual potential. Discretizing this ODE recovers classical and "undamped" Sinkhorn iterates as explicit Euler schemes (Karimi et al., 2023, Srinivasan et al., 14 Oct 2025). Mirror descent provides convergence and robustness properties for noisy or biased gradient estimators.

Further, the Sinkhorn flow induces a reversible Markov dynamic on the marginals, equipped with a nonlocal Onsager gradient–Dirichlet form and spectral gap–Poincaré inequalities. Exponential decay of relative entropy under a logarithmic Sobolev inequality is necessary and sufficient for exponential convergence in the continuous flow.

7. Algorithmic Generalizations and Practical Implications

Many variants extend Sinkhorn’s method beyond square nonnegative matrices and doubly stochastic scaling. The S-D/I (Sinkhorn with Deletion/Insertion) algorithm enables partial matching and differentiable loss for variable-cardinality assignments, permitting seamless integration with deep learning frameworks (Brun et al., 2021).

Operator Sinkhorn generalizes the procedure to completely positive maps (quantum channels), with the minimal scaling enforcing prescribed marginal spectra via alternating minimization in geodesically convex domains (Franks et al., 2022). The convergence rate and computational complexity for such generalized scaling is polynomial in the matrix dimension, marginal support, input bit-size, and $1/\epsilon$ for approximation accuracy.

Applications span generative modeling, domain adaptation, high-dimensional Gaussian inference, combinatorial optimization, network analysis, and quantum information. The theorem of exact homeomorphic scaling underlies unique existence of the Schrödinger bridge coupling in both discrete and continuous settings, and mirror-solvers provide a meta-tool for algorithmic innovation across geometry, optimization, and learning.