
Sinkhorn–Knopp Algorithm

Updated 5 January 2026
  • Sinkhorn–Knopp Algorithm is an iterative method that scales matrices to target row and column sums, achieving doubly stochastic properties under suitable conditions.
  • It alternates between row and column rescaling, leveraging convex optimization and mirror descent principles to ensure convergence with provable error bounds.
  • Accelerated variants using overrelaxation and Krylov-type methods improve efficiency, particularly for high-dimensional and sparse matrix problems.

The Sinkhorn–Knopp (SK) algorithm is a foundational iterative procedure for scaling the rows and columns of a nonnegative matrix to achieve prescribed sums. Originally developed to construct doubly stochastic matrices, it has evolved into an indispensable tool in matrix analysis, entropic optimal transport, numerical linear algebra, economic complexity, and computational geometry. The method alternates between rescaling rows and columns and admits elegant connections to convex optimization, information geometry, and dynamical systems. Its convergence guarantees, phase complexity, extensions to structural constraints, and diverse algorithmic accelerations have produced a rich and actively developing body of literature.

1. Formal Definition, Iterative Scheme, and Existence

Given a nonnegative matrix $A \in \mathbb{R}_+^{n \times m}$ and target row sum vector $r \in \mathbb{R}_{>0}^n$, column sum vector $c \in \mathbb{R}_{>0}^m$ with $\|r\|_1 = \|c\|_1 = h$, the matrix scaling problem is to find diagonal matrices $R, S$ such that $R A S$ has row sums $r$ and column sums $c$. A matrix is said to be $(r,c)$-scalable if such $R, S$ exist. The classical Sinkhorn–Knopp iteration operates as follows (Chakrabarty et al., 2018):

  • Initialization (column scaling):

$$A^{(0)}_{ij} := A_{ij} \cdot \frac{c_j}{\sum_{i'} A_{i'j}}$$

  • Row-scaling (step $t$):

$$B^{(t)}_{ij} := A^{(t)}_{ij} \cdot \frac{r_i}{\sum_{j'} A^{(t)}_{ij'}}$$

  • Column-scaling (step $t$):

$$A^{(t+1)}_{ij} := B^{(t)}_{ij} \cdot \frac{c_j}{\sum_{i'} B^{(t)}_{i'j}}$$

This process alternates until both marginal constraints are met within a specified tolerance.
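
A minimal sketch of this alternating rescaling, assuming dense arrays; the function name, tolerance, and iteration cap are illustrative, not taken from the cited papers:

```python
import jax.numpy as jnp

def sinkhorn_knopp(A, r, c, tol=1e-6, max_iter=10_000):
    """Rescale the nonnegative matrix A so its row sums approach r and its column sums approach c."""
    A = A * (c / A.sum(axis=0))                       # initialization: column scaling
    for _ in range(max_iter):
        A = A * (r / A.sum(axis=1))[:, None]          # row scaling
        A = A * (c / A.sum(axis=0))                   # column scaling
        if (jnp.abs(A.sum(axis=1) - r).max() < tol    # stop once both marginals match
                and jnp.abs(A.sum(axis=0) - c).max() < tol):
            break
    return A

# Example: scale a strictly positive 3x3 matrix to doubly stochastic form (r = c = 1)
A = jnp.array([[1.0, 2.0, 1.0],
               [0.5, 1.0, 3.0],
               [2.0, 1.0, 1.0]])
P = sinkhorn_knopp(A, jnp.ones(3), jnp.ones(3))
```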

For strictly positive $A$, convergence to the unique scaling is guaranteed (up to normalization) by the Sinkhorn theorem and follows from Perron–Frobenius theory (Modin, 2023). For matrices with zeros, $(r,c)$-scalability must be verified on the support graph.

2. Interpretation as Alternating Bregman Projections and Mirror Descent

The SK iteration is a manifestation of alternating Bregman projections (specifically, I-projections for KL-divergence) onto the affine spaces defined by the row and column sum constraints. In the context of entropic optimal transport, the problem may be expressed as (Mishchenko, 2019, Modin, 2023):

$$\min_{X \ge 0} \; \langle C, X \rangle + \epsilon \sum_{i,j} X_{ij} \log X_{ij} \quad \text{s.t.} \quad X\mathbf{1} = r, \; X^\top \mathbf{1} = c$$

Defining the Gibbs kernel $K_{ij} = \exp(-C_{ij}/\epsilon)$, the unique minimizer admits the factorization $X^* = \operatorname{diag}(u)\, K \operatorname{diag}(v)$, with $u, v > 0$.

Under the mirror-descent framework, with negative entropy as the mirror map and the KL-divergence as the penalty, the SK updates are recovered as stochastic mirror descent steps. Each iteration corresponds to a Bregman projection onto the respective marginal constraint, yielding closed-form multiplicative rescalings (Mishchenko, 2019). The process generalizes to multidimensional arrays and admits acceleration by considering full gradients or alternating orderings of projections.
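
The closed-form multiplicative updates on $u$ and $v$ can be sketched in a few lines; the function name, iteration count, and regularization value below are illustrative, not taken from the cited papers:

```python
import jax.numpy as jnp

def entropic_ot_sinkhorn(C, r, c, eps=0.1, n_iter=500):
    """Entropic OT via diagonal scaling of the Gibbs kernel K = exp(-C / eps)."""
    K = jnp.exp(-C / eps)                 # Gibbs kernel
    v = jnp.ones_like(c)
    for _ in range(n_iter):
        u = r / (K @ v)                   # Bregman (I-)projection onto the row-sum constraint
        v = c / (K.T @ u)                 # Bregman (I-)projection onto the column-sum constraint
    return u[:, None] * K * v[None, :]    # X* = diag(u) K diag(v)
```

Each line of the loop is one of the alternating projections; truncating the loop early returns a coupling whose marginals are only approximately $r$ and $c$.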

3. Convergence Guarantees, Error Bounds, and Complexity Phase Transition

The convergence of the SK algorithm is quantified via KL-divergence between iterates and the target marginals. The key result for $(r,c)$-scalable $A$ is (Chakrabarty et al., 2018):

  • For any $\delta > 0$, after at most

$$T = \left\lceil \frac{\ln(1 + 2 \Delta \rho / \nu)}{\delta} \right\rceil$$

alternating steps, at least one of the KL-divergences $D_{KL}(\tilde r^* \,\|\, \tilde r^{(t)})$ or $D_{KL}(\tilde c^* \,\|\, \tilde c^{(t)})$ falls below $\delta$. Here, $\rho = \max_i r_i$, $\nu = \min_{A_{ij} > 0} A_{ij} / \max_{ij} A_{ij}$, and $\Delta = \max_j |\{ i : A_{ij} > 0 \}|$.

  • $\ell_1$ and $\ell_2$ error bounds are obtained via Pinsker's inequality and a new KL–$\ell_1/\ell_2$ bound. For $\ell_1$, $T = O(h^2 \ln(\Delta\rho/\nu) / \epsilon_1^2)$ suffices for $\| r^{(t)} - r \|_1 \leq \epsilon_1$. For $\ell_2$, using the stronger bound, $T = O(\rho h \ln(\Delta\rho/\nu) (1/\epsilon_2 + 1/\epsilon_2^2))$ suffices for $\| r^{(t)} - r \|_2 \leq \epsilon_2$.

A fundamental discovery is the phase transition in complexity based on the matrix density $\gamma$ (He, 13 Jul 2025). For normalized $A$ with density $\gamma > 1/2$ (i.e., each row and column has at least $\gamma n$ significant entries), SK achieves

$$O(\log n - \log \varepsilon)$$

iterations and $\tilde O(n^2)$ total time to reach marginal error $\varepsilon$. For $\gamma < 1/2$, pathological constructions require

$$\tilde \Omega(\sqrt{n} / \varepsilon)$$

iterations for $\ell_2$ error, with a sharp threshold at $\gamma = 1/2$. Thus, on dense matrices SK rapidly converges, while in the sparse regime it is intrinsically slow.

4. Algorithmic Variants and Accelerations

Overrelaxation

Overrelaxed SK algorithms introduce a relaxation parameter $\omega > 1$ to accelerate local convergence (Lehmann et al., 2020, 1711.01851):

$$u_{k+1} = u_k^{1 - \omega} \odot (a / (K v_k))^{\omega}, \quad v_{k+1} = v_k^{1 - \omega} \odot (b / (K^\top u_{k+1}))^{\omega}$$

with $0 < \omega < 2/(1 + \Lambda(K))$ ensuring global convergence. The optimal value for local acceleration is

$$\omega^{\text{opt}} = 2/(1 + \sqrt{1 - \mu_2^2})$$

where $\mu_2$ is the subdominant eigenvalue of the local Jacobian. Online adaptive strategies estimate $\mu_2$ via residual tracking (Lehmann et al., 2020).

A Lyapunov-based global convergence guarantee for overrelaxed SK is established by monitoring the KL-divergence to the true solution and ensuring the monotonic decrease of this functional with a locally chosen $\omega$ (1711.01851).
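
A sketch of the overrelaxed updates with a fixed $\omega$ (the cited works choose or adapt $\omega$ from local spectral estimates; the fixed value here is only illustrative):

```python
import jax.numpy as jnp

def overrelaxed_sinkhorn(K, a, b, omega=1.5, n_iter=500):
    """Overrelaxed Sinkhorn: each multiplicative update is raised to the power omega."""
    u = jnp.ones_like(a)
    v = jnp.ones_like(b)
    for _ in range(n_iter):
        u = u ** (1.0 - omega) * (a / (K @ v)) ** omega      # overrelaxed row update
        v = v ** (1.0 - omega) * (b / (K.T @ u)) ** omega    # overrelaxed column update
    return u[:, None] * K * v[None, :]
```

Setting $\omega = 1$ recovers the standard SK updates; the cited bound $0 < \omega < 2/(1 + \Lambda(K))$ is what guarantees global convergence.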

Arnoldi and Krylov-Type Methods

For matrices with clustered dominant singular values (nearly reducible structure), the standard fixed-point SK iteration slows dramatically. Arnoldi-type accelerations and "power-$\ell$" schemes apply high-order or block Krylov iterations to the associated local nonlinear eigenvalue problem for the scaling factors (Aristodemo et al., 2018). These methods achieve superlinear convergence rates and are particularly effective in the nearly decomposable regime.

5. Extensions to Structure and Large-Scale Computation

Constrained Optimal Transport and Zero Patterns

SK naturally accommodates structural constraints by encoding forbidden (zero) assignments directly into the kernel. Let $\mathcal{Z} \subset \{1, \dots, m\} \times \{1, \dots, n\}$ enforce $P_{ij} = 0$ for $(i,j) \in \mathcal{Z}$. The constrained kernel $K$ is defined as in the unconstrained case but with $K_{ij} = 0$ for forbidden pairs. The scaling iterates are unchanged in form (Corless et al., 2024):

$$u \leftarrow a / (K v), \quad v \leftarrow b / (K^\top u)$$

The induced structure is preserved at every step, and convergence to the unique solution is guaranteed if a strictly positive feasible coupling exists.
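
A brief sketch of this masking, using a hypothetical 0/1 matrix `forbidden` that marks the excluded pairs (the mask name and values are illustrative):

```python
import jax.numpy as jnp

def sinkhorn_with_zero_pattern(C, a, b, forbidden, eps=0.1, n_iter=500):
    """Sinkhorn with structural zeros: K_ij = 0 on forbidden pairs, updates unchanged in form."""
    K = jnp.exp(-C / eps) * (1.0 - forbidden)   # forbidden is a 0/1 matrix; zero out those entries
    u, v = jnp.ones_like(a), jnp.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]          # the coupling inherits the zero pattern from K
```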

High-Dimensional and Structured Fast Solvers

In high dimensions, or where $K$ is a Gaussian or translation-invariant kernel, per-iteration complexity can be reduced to $O(n \log n)$ using the non-equispaced fast Fourier transform (NFFT) or related convolutional accelerations (Lakshmanan et al., 2022). For measures supported on generic grids, the NFFT approximates the matrix-vector products required in SK at nearly linear cost.
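
As a simplified illustration of the idea: on an equispaced 1-D grid the Gaussian kernel matrix is symmetric Toeplitz, so its matrix-vector product reduces to a zero-padded convolution computable with the ordinary FFT in $O(n \log n)$; the cited work uses the NFFT to handle non-equispaced supports. The routine below is a sketch under that equispaced assumption, with illustrative names:

```python
import jax.numpy as jnp

def toeplitz_kernel_matvec_fft(g, v):
    """Compute K @ v where K[i, j] = g[|i - j|] (symmetric kernel on an equispaced grid).

    g: kernel samples, e.g. g[k] = exp(-(k * dx) ** 2 / eps) for k = 0..n-1.
    """
    n = v.shape[0]
    L = 2 * n                                          # any length >= 2n - 1 avoids wrap-around
    g_full = jnp.zeros(L).at[:n].set(g).at[L - n + 1:].set(g[1:][::-1])
    v_pad = jnp.zeros(L).at[:n].set(v)
    conv = jnp.fft.irfft(jnp.fft.rfft(g_full) * jnp.fft.rfft(v_pad), n=L)
    return conv[:n]                                    # first n entries give the Toeplitz matvec
```

Inside the Sinkhorn loop, `K @ v` and `K.T @ u` are then replaced by this matvec (the kernel is symmetric, so the same routine serves both), without ever forming the dense $n \times n$ kernel.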

In applications with extreme sparsity, such as Word Mover's Distance for massive vocabularies, SK is efficiently parallelized using sparse kernel fusion (sparse-dense-dense matrix multiplication compositions) on shared-memory architectures, with demonstrated $O(V^2 t)$ scaling and strong parallel efficiency (Tithi et al., 2020).

6. Analytical Connections and Generalizations

  • Geometric and PDE interpretation: In the continuous setting, and under entropic regularization, the discrete SK iterates approximate the solution to a fully nonlinear parabolic Monge–Ampère PDE in the large sample limit (Berman, 2017). This establishes a deep connection to geometric flows such as (twisted) Ricci flows and modern transport theory.
  • Energy-minimization viewpoint: Both SK and related iterative scaling approaches are recovered as block coordinate descent algorithms for a convex logarithmic-barrier, or energy, functional in the "potential" variables (log-scalings) (Mazzilli et al., 2022).
  • Mirror descent and Bregman geometry: SK is a concrete instance of mirror descent with the negative entropy regularizer and, more generally, represents alternating Bregman projections in the space of row- and column-sum constraints (Mishchenko, 2019).
  • Equivalence to Fitness–Complexity: In economic complexity, the Fitness–Complexity algorithm is algebraically equivalent to SK with uniform marginals; both admit the same convergence and scale invariance properties (Mazzilli et al., 2022).

7. Derivative Convergence, Sensitivity, and Automatic Differentiation

It is established that not only the primal iterates of the SK algorithm but also their derivatives (w.r.t. all problem data: marginals, costs, regularization) converge linearly to the derivatives of the entropic optimal transport solution (Pauwels et al., 2022). This justifies automatic differentiation and backpropagation through truncated SK iterations in machine learning pipelines, guaranteeing accurate sensitivity and gradient information once a sufficient iteration depth is reached.
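
A hedged JAX sketch of this practice: unroll a fixed number of Sinkhorn updates and differentiate the resulting (truncated) entropic transport cost with respect to the cost matrix. The loss, iteration count, and regularization value are illustrative:

```python
import jax
import jax.numpy as jnp

def truncated_sinkhorn_cost(C, a, b, eps=0.05, n_iter=200):
    """Entropic transport cost <C, X> computed from a truncated, unrolled Sinkhorn loop."""
    K = jnp.exp(-C / eps)
    v = jnp.ones_like(b)
    for _ in range(n_iter):                    # fixed unrolled loop: differentiable end to end
        u = a / (K @ v)
        v = b / (K.T @ u)
    X = u[:, None] * K * v[None, :]
    return jnp.sum(C * X)

# Gradient w.r.t. the cost matrix; by the cited result it approaches the true
# sensitivity of the entropic OT solution linearly as n_iter grows.
grad_cost = jax.grad(truncated_sinkhorn_cost)
```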


The SK algorithm is thus a central, robust, and theoretically well-understood primitive for matrix scaling and entropic transport, adaptable to a wide variety of constraints, domains, and large-scale computational settings.
