
Sinkhorn–Knopp Algorithm

Updated 5 January 2026
  • Sinkhorn–Knopp Algorithm is an iterative method that scales matrices to target row and column sums, achieving doubly stochastic properties under suitable conditions.
  • It alternates between row and column rescaling, leveraging convex optimization and mirror descent principles to ensure convergence with provable error bounds.
  • Accelerated variants using overrelaxation and Krylov-type methods improve efficiency, particularly for high-dimensional and sparse matrix problems.

The Sinkhorn–Knopp (SK) algorithm is a foundational iterative procedure for scaling the rows and columns of a nonnegative matrix to achieve prescribed sums. Originally developed to construct doubly stochastic matrices, it has evolved into an indispensable tool in matrix analysis, entropic optimal transport, numerical linear algebra, economic complexity, and computational geometry. The method alternates between rescaling rows and columns and admits elegant connections to convex optimization, information geometry, and dynamical systems. Its convergence guarantees, phase complexity, extensions to structural constraints, and diverse algorithmic accelerations have produced a rich and actively developing body of literature.

1. Formal Definition, Iterative Scheme, and Existence

Given a nonnegative matrix $A \in \mathbb{R}_+^{n \times m}$ and target row sum vector $r \in \mathbb{R}_{>0}^n$, column sum vector $c \in \mathbb{R}_{>0}^m$ with $\|r\|_1 = \|c\|_1 = h$, the matrix scaling problem is to find diagonal matrices $R, S$ such that $R A S$ has row sums $r$ and column sums $c$. A matrix is said to be $(r,c)$-scalable if such $R, S$ exist. The classical Sinkhorn–Knopp iteration operates as follows (Chakrabarty et al., 2018):

  • Initialization (column scaling):

$$A^{(0)}_{ij} := A_{ij} \cdot \frac{c_j}{\sum_{i'} A_{i'j}}$$

  • Row-scaling (step $t$):

$$B^{(t)}_{ij} := A^{(t)}_{ij} \cdot \frac{r_i}{\sum_{j'} A^{(t)}_{ij'}}$$

  • Column-scaling (step $t$):

$$A^{(t+1)}_{ij} := B^{(t)}_{ij} \cdot \frac{c_j}{\sum_{i'} B^{(t)}_{i'j}}$$

This process alternates until both marginal constraints are met within a specified tolerance.
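
A minimal sketch of this alternating rescaling, assuming dense arrays; the function name, tolerance, and iteration cap are illustrative, not taken from the cited papers:

```python
import jax.numpy as jnp

def sinkhorn_knopp(A, r, c, tol=1e-6, max_iter=10_000):
    """Rescale the nonnegative matrix A so its row sums approach r and its column sums approach c."""
    A = A * (c / A.sum(axis=0))                       # initialization: column scaling
    for _ in range(max_iter):
        A = A * (r / A.sum(axis=1))[:, None]          # row scaling
        A = A * (c / A.sum(axis=0))                   # column scaling
        if (jnp.abs(A.sum(axis=1) - r).max() < tol    # stop once both marginals match
                and jnp.abs(A.sum(axis=0) - c).max() < tol):
            break
    return A

# Example: scale a strictly positive 3x3 matrix to doubly stochastic form (r = c = 1)
A = jnp.array([[1.0, 2.0, 1.0],
               [0.5, 1.0, 3.0],
               [2.0, 1.0, 1.0]])
P = sinkhorn_knopp(A, jnp.ones(3), jnp.ones(3))
```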

For strictly positive $A$, convergence to the unique scaling is guaranteed (up to normalization) by the Sinkhorn theorem and follows from Perron–Frobenius theory (Modin, 2023). For matrices with zeros, $(r,c)$-scalability must be verified on the support graph.

2. Interpretation as Alternating Bregman Projections and Mirror Descent

The SK iteration is a manifestation of alternating Bregman projections (specifically, I-projections for KL-divergence) onto the affine spaces defined by the row and column sum constraints. In the context of entropic optimal transport, the problem may be expressed as (Mishchenko, 2019, Modin, 2023):

$$\min_{X \ge 0} \; \langle C, X \rangle + \epsilon \sum_{i,j} X_{ij} \log X_{ij} \quad \text{s.t.} \quad X\mathbf{1} = r, \; X^\top \mathbf{1} = c$$

Defining the Gibbs kernel $K_{ij} = \exp(-C_{ij}/\epsilon)$, the unique minimizer admits the factorization $X^* = \operatorname{diag}(u)\, K \operatorname{diag}(v)$, with $u, v > 0$.

Under the mirror-descent framework, with negative entropy as the mirror map and the KL-divergence as the penalty, the SK updates are recovered as stochastic mirror descent steps. Each iteration corresponds to a Bregman projection onto the respective marginal constraint, yielding closed-form multiplicative rescalings (Mishchenko, 2019). The process generalizes to multidimensional arrays and admits acceleration by considering full gradients or alternating orderings of projections.
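
The closed-form multiplicative updates on $u$ and $v$ can be sketched in a few lines; the function name, iteration count, and regularization value below are illustrative, not taken from the cited papers:

```python
import jax.numpy as jnp

def entropic_ot_sinkhorn(C, r, c, eps=0.1, n_iter=500):
    """Entropic OT via diagonal scaling of the Gibbs kernel K = exp(-C / eps)."""
    K = jnp.exp(-C / eps)                 # Gibbs kernel
    v = jnp.ones_like(c)
    for _ in range(n_iter):
        u = r / (K @ v)                   # Bregman (I-)projection onto the row-sum constraint
        v = c / (K.T @ u)                 # Bregman (I-)projection onto the column-sum constraint
    return u[:, None] * K * v[None, :]    # X* = diag(u) K diag(v)
```

Each line of the loop is one of the alternating projections; truncating the loop early returns a coupling whose marginals are only approximately $r$ and $c$.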

3. Convergence Guarantees, Error Bounds, and Complexity Phase Transition

The convergence of the SK algorithm is quantified via KL-divergence between iterates and the target marginals. The key result for $(r,c)$-scalable $A$ is (Chakrabarty et al., 2018):

  • For any $\delta > 0$, after at most

$$T = \left\lceil \frac{\ln(1 + 2 \Delta \rho / \nu)}{\delta} \right\rceil$$

alternating steps, at least one of the KL-divergences $D_{KL}(\tilde r^* \,\|\, \tilde r^{(t)})$ or $D_{KL}(\tilde c^* \,\|\, \tilde c^{(t)})$ falls below $\delta$. Here, $\rho = \max_i r_i$, $\nu = \min_{A_{ij} > 0} A_{ij} / \max_{ij} A_{ij}$, and $\Delta = \max_j |\{ i : A_{ij} > 0 \}|$.

  • $\ell_1$ and $\ell_2$ error bounds are obtained via Pinsker's inequality and a new KL–$\ell_1/\ell_2$ bound. For $\ell_1$, $T = O(h^2 \ln(\Delta\rho/\nu) / \epsilon_1^2)$ suffices for $\| r^{(t)} - r \|_1 \leq \epsilon_1$. For $\ell_2$, using the stronger bound, $T = O(\rho h \ln(\Delta\rho/\nu) (1/\epsilon_2 + 1/\epsilon_2^2))$ suffices for $\| r^{(t)} - r \|_2 \leq \epsilon_2$.

A fundamental discovery is the phase transition in complexity based on the matrix density $\gamma$ (He, 13 Jul 2025). For normalized $A$ with density $\gamma > 1/2$ (i.e., each row and column has at least $\gamma n$ significant entries), SK achieves

$$O(\log n - \log \varepsilon)$$

iterations and $\tilde O(n^2)$ total time to reach marginal error $\varepsilon$. For $\gamma < 1/2$, pathological constructions require

$$\tilde \Omega(\sqrt{n} / \varepsilon)$$

iterations for $\ell_2$ error, with a sharp threshold at $\gamma = 1/2$. Thus, on dense matrices SK rapidly converges, while in the sparse regime it is intrinsically slow.

4. Algorithmic Variants and Accelerations

Overrelaxation

Overrelaxed SK algorithms introduce a relaxation parameter $\omega > 1$ to accelerate local convergence (Lehmann et al., 2020, 1711.01851):

$$u_{k+1} = u_k^{1 - \omega} \odot (a / (K v_k))^{\omega}, \quad v_{k+1} = v_k^{1 - \omega} \odot (b / (K^\top u_{k+1}))^{\omega}$$

with $0 < \omega < 2/(1 + \Lambda(K))$ ensuring global convergence. The optimal value for local acceleration is

$$\omega^{\text{opt}} = 2/(1 + \sqrt{1 - \mu_2^2})$$

where $\mu_2$ is the subdominant eigenvalue of the local Jacobian. Online adaptive strategies estimate $\mu_2$ via residual tracking (Lehmann et al., 2020).

A Lyapunov-based global convergence guarantee for overrelaxed SK is established by monitoring the KL-divergence to the true solution and ensuring the monotonic decrease of this functional with a locally chosen $\omega$ (1711.01851).
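
A sketch of the overrelaxed updates with a fixed $\omega$ (the cited works choose or adapt $\omega$ from local spectral estimates; the fixed value here is only illustrative):

```python
import jax.numpy as jnp

def overrelaxed_sinkhorn(K, a, b, omega=1.5, n_iter=500):
    """Overrelaxed Sinkhorn: each multiplicative update is raised to the power omega."""
    u = jnp.ones_like(a)
    v = jnp.ones_like(b)
    for _ in range(n_iter):
        u = u ** (1.0 - omega) * (a / (K @ v)) ** omega      # overrelaxed row update
        v = v ** (1.0 - omega) * (b / (K.T @ u)) ** omega    # overrelaxed column update
    return u[:, None] * K * v[None, :]
```

Setting $\omega = 1$ recovers the standard SK updates; the cited bound $0 < \omega < 2/(1 + \Lambda(K))$ is what guarantees global convergence.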

Arnoldi and Krylov-Type Methods

For matrices with clustered dominant singular values (nearly reducible structure), the standard fixed-point SK iteration slows dramatically. Arnoldi-type accelerations and "power-$\ell$" schemes apply high-order or block Krylov iterations to the associated local nonlinear eigenvalue problem for the scaling factors (Aristodemo et al., 2018). These methods achieve superlinear convergence rates and are particularly effective in the nearly decomposable regime.

5. Extensions to Structure and Large-Scale Computation

Constrained Optimal Transport and Zero Patterns

SK naturally accommodates structural constraints by encoding forbidden (zero) assignments directly into the kernel. Let $\mathcal{Z} \subset \{1, \dots, m\} \times \{1, \dots, n\}$ enforce $P_{ij} = 0$ for $(i,j) \in \mathcal{Z}$. The constrained kernel $K$ is defined as in the unconstrained case but with $K_{ij} = 0$ for forbidden pairs. The scaling iterates are unchanged in form (Corless et al., 2024):

$$u \leftarrow a / (K v), \quad v \leftarrow b / (K^\top u)$$

The induced structure is preserved at every step, and convergence to the unique solution is guaranteed if a strictly positive feasible coupling exists.
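
A brief sketch of this masking, using a hypothetical 0/1 matrix `forbidden` that marks the excluded pairs (the mask name and values are illustrative):

```python
import jax.numpy as jnp

def sinkhorn_with_zero_pattern(C, a, b, forbidden, eps=0.1, n_iter=500):
    """Sinkhorn with structural zeros: K_ij = 0 on forbidden pairs, updates unchanged in form."""
    K = jnp.exp(-C / eps) * (1.0 - forbidden)   # forbidden is a 0/1 matrix; zero out those entries
    u, v = jnp.ones_like(a), jnp.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]          # the coupling inherits the zero pattern from K
```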

High-Dimensional and Structured Fast Solvers

In high dimensions, or where $K$ is a Gaussian or translation-invariant kernel, per-iteration complexity can be reduced to $O(n \log n)$ using the non-equispaced fast Fourier transform (NFFT) or related convolutional accelerations (Lakshmanan et al., 2022). For measures supported on generic grids, the NFFT approximates the matrix-vector products required in SK at nearly linear cost.
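
As a simplified illustration of the idea: on an equispaced 1-D grid the Gaussian kernel matrix is symmetric Toeplitz, so its matrix-vector product reduces to a zero-padded convolution computable with the ordinary FFT in $O(n \log n)$; the cited work uses the NFFT to handle non-equispaced supports. The routine below is a sketch under that equispaced assumption, with illustrative names:

```python
import jax.numpy as jnp

def toeplitz_kernel_matvec_fft(g, v):
    """Compute K @ v where K[i, j] = g[|i - j|] (symmetric kernel on an equispaced grid).

    g: kernel samples, e.g. g[k] = exp(-(k * dx) ** 2 / eps) for k = 0..n-1.
    """
    n = v.shape[0]
    L = 2 * n                                          # any length >= 2n - 1 avoids wrap-around
    g_full = jnp.zeros(L).at[:n].set(g).at[L - n + 1:].set(g[1:][::-1])
    v_pad = jnp.zeros(L).at[:n].set(v)
    conv = jnp.fft.irfft(jnp.fft.rfft(g_full) * jnp.fft.rfft(v_pad), n=L)
    return conv[:n]                                    # first n entries give the Toeplitz matvec
```

Inside the Sinkhorn loop, `K @ v` and `K.T @ u` are then replaced by this matvec (the kernel is symmetric, so the same routine serves both), without ever forming the dense $n \times n$ kernel.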

In applications with extreme sparsity, such as Word Mover's Distance for massive vocabularies, SK is efficiently parallelized using sparse kernel fusion (sparse-dense-dense matrix multiplication compositions) on shared-memory architectures, with demonstrated $O(V^2 t)$ scaling and strong parallel efficiency (Tithi et al., 2020).

6. Analytical Connections and Generalizations

  • Geometric and PDE interpretation: In the continuous setting, and under entropic regularization, the discrete SK iterates approximate the solution to a fully nonlinear parabolic Monge–Ampère PDE in the large sample limit (Berman, 2017). This establishes a deep connection to geometric flows such as (twisted) Ricci flows and modern transport theory.
  • Energy-minimization viewpoint: Both SK and related iterative scaling approaches are recovered as block coordinate descent algorithms for a convex logarithmic-barrier, or energy, functional in the "potential" variables (log-scalings) (Mazzilli et al., 2022).
  • Mirror descent and Bregman geometry: SK is a concrete instance of mirror descent with the negative entropy regularizer and, more generally, represents alternating Bregman projections in the space of row- and column-sum constraints (Mishchenko, 2019).
  • Equivalence to Fitness–Complexity: In economic complexity, the Fitness–Complexity algorithm is algebraically equivalent to SK with uniform marginals; both admit the same convergence and scale invariance properties (Mazzilli et al., 2022).

7. Derivative Convergence, Sensitivity, and Automatic Differentiation

It is established that not only the primal iterates of the SK algorithm but also their derivatives (w.r.t. all problem data: marginals, costs, regularization) converge linearly to the derivatives of the entropic optimal transport solution (Pauwels et al., 2022). This justifies automatic differentiation and backpropagation through truncated SK iterations in machine learning pipelines, guaranteeing accurate sensitivity and gradient information once a sufficient iteration depth is reached.
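
A hedged JAX sketch of this practice: unroll a fixed number of Sinkhorn updates and differentiate the resulting (truncated) entropic transport cost with respect to the cost matrix. The loss, iteration count, and regularization value are illustrative:

```python
import jax
import jax.numpy as jnp

def truncated_sinkhorn_cost(C, a, b, eps=0.05, n_iter=200):
    """Entropic transport cost <C, X> computed from a truncated, unrolled Sinkhorn loop."""
    K = jnp.exp(-C / eps)
    v = jnp.ones_like(b)
    for _ in range(n_iter):                    # fixed unrolled loop: differentiable end to end
        u = a / (K @ v)
        v = b / (K.T @ u)
    X = u[:, None] * K * v[None, :]
    return jnp.sum(C * X)

# Gradient w.r.t. the cost matrix; by the cited result it approaches the true
# sensitivity of the entropic OT solution linearly as n_iter grows.
grad_cost = jax.grad(truncated_sinkhorn_cost)
```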


The SK algorithm is thus a central, robust, and theoretically well-understood primitive for matrix scaling and entropic transport, adaptable to a wide variety of constraints, domains, and large-scale computational settings.
