Non-negative Alternate Scaling (NASA)

Updated 21 April 2026
  • Non-negative Alternate Scaling (NASA) is an iterative method that transforms non-negative matrices into doubly stochastic matrices via alternating row and column normalizations.
  • The algorithm generalizes classical matrix scaling to incorporate Bregman divergences and non-negativity constraints, enabling applications in regularized optimal transport and statistical estimation.
  • Supported by the Sinkhorn–Knopp theorem, NASA exhibits geometric convergence with O(n²) operations per iteration, making it efficient for high-dimensional and structured data problems.

The Non-negative Alternate Scaling Algorithm (NASA), also known as the Sinkhorn algorithm or the Sinkhorn–Knopp iteration, is a foundational class of iterative methods for transforming a non-negative (or strictly positive) matrix into a doubly stochastic matrix via alternating normalizations. Generalizations of NASA play a central role in scaling problems, optimal transport, matrix analysis, and algorithmic applications, including polynomial-time permanent approximation, design-matrix analysis, statistical estimation, and regularized optimal transport. NASA encompasses both classical matrix scaling and modern extensions to general Bregman divergences and to non-negativity constraints that are not automatically satisfied.

1. Problem Formulation and Definitions

Given a non-negative matrix $A \in \mathbb{R}^{n \times n}_{\ge 0}$, double stochasticity requires that all row and column sums equal 1:

$$\sum_{j=1}^n A_{ij} = 1 \quad \forall i, \qquad \sum_{i=1}^n A_{ij} = 1 \quad \forall j$$

A key functional to measure deviation from a doubly stochastic matrix is

$$\operatorname{ds}(A) = \sum_{i=1}^n (r_i - 1)^2 + \sum_{j=1}^n (c_j - 1)^2$$

where $r_i$ and $c_j$ denote the $i$-th row sum and $j$-th column sum, respectively.

The $\varepsilon$-scaling problem asks: for given $\varepsilon > 0$, find positive diagonal matrices $B, C \succ 0$ such that $\operatorname{ds}(BAC) \le \varepsilon$ (where possible). Scalability—existence of such a scaling for every $\varepsilon > 0$—is equivalent to positivity of the matrix permanent and to the bipartite support graph having a perfect matching (Garg et al., 2018).
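Both quantities are straightforward to compute; the following is a minimal Python sketch (assuming numpy and scipy are available; function and variable names are illustrative). The matching test implements the scalability criterion above on the support pattern of $A$:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_bipartite_matching

def ds(A):
    """Deviation from double stochasticity: squared errors of row/column sums."""
    r, c = A.sum(axis=1), A.sum(axis=0)
    return np.sum((r - 1.0) ** 2) + np.sum((c - 1.0) ** 2)

def is_scalable(A):
    """Scalability test: the bipartite support graph of A must have a
    perfect matching (equivalently, per(A) > 0)."""
    support = csr_matrix((A > 0).astype(int))
    matching = maximum_bipartite_matching(support, perm_type='column')
    return bool(np.all(matching >= 0))

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
print(ds(A))           # positive: A is not yet doubly stochastic
print(is_scalable(A))  # True: strictly positive support
```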

For regularized optimal transport (ROT), NASA is applied to solve problems of the form

$$\min_{\pi \in \Pi(p, q)} \;\langle \pi, C \rangle + \lambda\, \phi(\pi)$$

where $\Pi(p, q)$ is the set of couplings (joint distributions) with prescribed marginals $p$ and $q$, $C$ is a cost matrix, and $\phi$ is a smooth convex regularizer whose domain is not necessarily restricted to the non-negative orthant (Dessein et al., 2016).
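A standard special case makes the connection to classical matrix scaling explicit. Assuming the objective convention written above, taking $\phi$ to be the (negative) entropy forces the optimal coupling into diagonally scaled form by first-order optimality:

$$\phi(\pi) = \sum_{i,j} \pi_{ij}(\log \pi_{ij} - 1) \;\Longrightarrow\; \pi^{*} = \operatorname{diag}(u)\, K \operatorname{diag}(v), \qquad K_{ij} = e^{-C_{ij}/\lambda},$$

so the alternating marginal projections reduce to Sinkhorn's diagonal updates, and non-negativity holds automatically.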

2. Iterative Algorithmic Structure

The classical NASA iterates between affine projections that alternately normalize rows and columns:

  • Row-scaling: For a matrix $M$ with row sums $r_i = \sum_{j} M_{ij} > 0$, define

$$\mathcal{R}(M)_{ij} = \frac{M_{ij}}{r_i}$$

  • Column-scaling: Likewise, with column sums $c_j = \sum_{i} M_{ij} > 0$,

$$\mathcal{C}(M)_{ij} = \frac{M_{ij}}{c_j}$$

The update sequence is then $A^{(0)} = A$, $A^{(2k+1)} = \mathcal{R}(A^{(2k)})$, $A^{(2k+2)} = \mathcal{C}(A^{(2k+1)})$. For strictly positive $A$, this sequence converges to a unique doubly stochastic scaling $S = BAC$, where $B$ and $C$ are positive diagonal matrices (Nathanson, 2019, Nathanson, 2019).
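A minimal numpy sketch of the two operators and the alternating iteration (illustrative, not the exact implementation from the cited papers):

```python
import numpy as np

def row_scale(M):
    """R(M): divide each row by its sum, making all row sums equal 1."""
    return M / M.sum(axis=1, keepdims=True)

def col_scale(M):
    """C(M): divide each column by its sum, making all column sums equal 1."""
    return M / M.sum(axis=0, keepdims=True)

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
X = A
for _ in range(100):                 # alternate row and column normalizations
    X = col_scale(row_scale(X))
print(X.sum(axis=1), X.sum(axis=0))  # both approach [1, 1]
```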

For convex regularized transport, NASA generalizes to compute Bregman projections in dual coordinates. When the regularizer’s domain does not guarantee non-negativity, explicit projections onto the non-negative orthant are required, and Dykstra-style correction variables are maintained to ensure convergence when projecting cyclically onto the constraints:

  • $\mathcal{C}_0 = \{\pi : \pi \ge 0\}$ (non-negativity)
  • $\mathcal{C}_1 = \{\pi : \pi \mathbf{1} = p\}$ (row marginals)
  • $\mathcal{C}_2 = \{\pi : \pi^{\top} \mathbf{1} = q\}$ (column marginals)

NASA cycles through $\mathcal{C}_0$, $\mathcal{C}_1$, $\mathcal{C}_2$ in turn, with Newton-type inner steps for the affine projections in the case of separable $\phi$ (Dessein et al., 2016); a minimal Euclidean sketch follows.
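To illustrate the cyclic projection scheme concretely, here is a minimal sketch for the special case of a quadratic regularizer $\phi(\pi) = \tfrac{1}{2}\|\pi\|^2$, where Bregman projections reduce to Euclidean ones and the ROT optimum is the Euclidean projection of $-C/\lambda$ onto the intersection of the three sets; names and parameters are illustrative:

```python
import numpy as np

def project_nonneg(X):
    """Euclidean projection onto the non-negative orthant C0."""
    return np.maximum(X, 0.0)

def project_rows(X, p):
    """Euclidean projection onto the affine set C1 = {X : X @ 1 = p}."""
    return X + (p - X.sum(axis=1))[:, None] / X.shape[1]

def project_cols(X, q):
    """Euclidean projection onto the affine set C2 = {X : 1 @ X = q}."""
    return X + (q - X.sum(axis=0))[None, :] / X.shape[0]

def dykstra_rot(C, p, q, lam=1.0, n_cycles=500):
    """Dykstra's algorithm for quadratically regularized OT: project
    -C/lam cyclically onto C0, C1, C2, carrying a correction per set."""
    X = -C / lam
    projs = [project_nonneg,
             lambda Z: project_rows(Z, p),
             lambda Z: project_cols(Z, q)]
    corrections = [np.zeros_like(X) for _ in projs]
    for _ in range(n_cycles):
        for i, proj in enumerate(projs):
            Y = proj(X + corrections[i])             # project shifted iterate
            corrections[i] = X + corrections[i] - Y  # update this set's correction
            X = Y
    return X

C = np.array([[0.0, 1.0], [1.0, 0.0]])
p = q = np.array([0.5, 0.5])
pi = dykstra_rot(C, p, q, lam=0.5)
print(pi, pi.sum(axis=1), pi.sum(axis=0))  # marginals match p and q
```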

3. Convergence Theory and Complexity

The classical Sinkhorn–Knopp theorem asserts that for every strictly positive $A$, NASA produces positive diagonal matrices $B, C$ such that $BAC$ is doubly stochastic, with the scaling unique. The iterative map is non-expansive in the Hilbert metric on the positive cone, and the process exhibits geometric (linear) convergence (Nathanson, 2019, Nathanson, 2019).

In the entropy-regularized setting, convergence can be viewed as monotonic minimization of the Kullback–Leibler divergence over diagonal scalings $BAC$ of $A$, and each iteration strictly decreases a potential. The number of NASA iterations needed to achieve $\operatorname{ds}(A) \le \varepsilon$ is polynomial in $n$, $1/\varepsilon$, and the bit complexity $L$ of the matrix entries. Each iteration incurs $O(n^2)$ arithmetic operations (Garg et al., 2018).

For general Bregman divergences, under mild technical conditions (Legendre type, cofiniteness, full domain of the convex conjugate, etc.), Dykstra’s algorithm converges to the unique Bregman projection at a linear rate (Theorem 2.1 of (Dessein et al., 2016)). For separable $\phi$, each outer iteration has $O(n^2)$ complexity, dominated by vectorized Newton updates. For non-separable regularizers (e.g., Mahalanobis), the per-iteration cost grows beyond quadratic, since the Newton steps no longer decouple elementwise. Empirical studies confirm practical quadratic scaling for moderate problem sizes (Dessein et al., 2016).
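The geometric rate is easy to observe numerically. The following illustrative check runs the alternating normalizations on a random strictly positive matrix and prints successive ratios of $\operatorname{ds}$, which settle near a constant below 1:

```python
import numpy as np

def ds(A):
    r, c = A.sum(axis=1), A.sum(axis=0)
    return np.sum((r - 1.0) ** 2) + np.sum((c - 1.0) ** 2)

rng = np.random.default_rng(0)
A = rng.uniform(0.5, 2.0, size=(50, 50))   # strictly positive, hence scalable
errors = []
for _ in range(20):
    A = A / A.sum(axis=1, keepdims=True)   # row normalization
    A = A / A.sum(axis=0, keepdims=True)   # column normalization
    errors.append(ds(A))
print([errors[t + 1] / errors[t] for t in range(2, 7)])  # roughly constant ratio
```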

4. Algorithmic Pseudocode and Procedures

The concrete NASA scheme for matrix scaling (Sinkhorn’s algorithm) is as follows (Garg et al., 2018, Nathanson, 2019); a runnable sketch follows the list:

  1. If any row or column of $A$ is all-zero, terminate and report “not scalable.”
  2. Initialize $A^{(0)} \leftarrow A$, $t \leftarrow 0$.
  3. For $t = 0, 1, \ldots, T-1$:
    • Compute the row sums $r_i$ and column sums $c_j$ of $A^{(t)}$.
    • If $\operatorname{ds}(A^{(t)}) \le \varepsilon$, return $A^{(t)}$.
    • If $t$ is even, set $A^{(t+1)} \leftarrow \mathcal{R}(A^{(t)})$; else $A^{(t+1)} \leftarrow \mathcal{C}(A^{(t)})$.
  4. Output “not scalable” if no feasible scaling is found within the iteration budget.
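A runnable Python rendering of this scheme, a sketch under the conventions above (the tolerance and iteration budget are illustrative):

```python
import numpy as np

def nasa_scale(A, eps=1e-9, max_iters=10_000):
    """Alternate row/column normalization until ds(A) <= eps; return None
    ('not scalable') if a zero row/column exists or the budget runs out."""
    X = np.asarray(A, dtype=float).copy()
    if np.any(X.sum(axis=1) == 0) or np.any(X.sum(axis=0) == 0):
        return None                       # step 1: all-zero row or column
    for t in range(max_iters):
        r, c = X.sum(axis=1), X.sum(axis=0)
        if np.sum((r - 1) ** 2) + np.sum((c - 1) ** 2) <= eps:
            return X                      # eps-scaling reached
        if t % 2 == 0:
            X = X / r[:, None]            # even step: row normalization
        else:
            X = X / c[None, :]            # odd step: column normalization
    return None                           # no feasible scaling found

print(nasa_scale([[2.0, 1.0], [1.0, 3.0]]))
```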

For regularized optimal transport:

  • NASA operates in the dual parameterization induced by the regularizer’s convex conjugate $\phi^{*}$, maintaining correction variables for each constraint.
  • Each cycle consists of projections (generally using Newton–Raphson for the affine marginal constraints), interleaved with non-negativity projections applied elementwise (Dessein et al., 2016); the entropic special case sketched below admits these dual updates in closed form.
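In the entropic special case, each dual block update is available in closed form and no Dykstra corrections are needed (non-negativity is automatic). A log-domain sketch with illustrative parameter names, where the coupling is $\pi_{ij} = \exp((f_i + g_j - C_{ij})/\lambda)$:

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_log(C, p, q, lam=0.1, n_iters=500):
    """Entropic NASA in dual (log-domain) coordinates: each block update
    is the exact Bregman projection onto one marginal constraint."""
    f, g = np.zeros(len(p)), np.zeros(len(q))
    for _ in range(n_iters):
        # solve sum_j pi_ij = p_i for f (row-marginal projection)
        f = lam * (np.log(p) - logsumexp((g[None, :] - C) / lam, axis=1))
        # solve sum_i pi_ij = q_j for g (column-marginal projection)
        g = lam * (np.log(q) - logsumexp((f[:, None] - C) / lam, axis=0))
    return np.exp((f[:, None] + g[None, :] - C) / lam)

C = np.array([[0.0, 1.0], [1.0, 0.0]])
pi = sinkhorn_log(C, np.array([0.5, 0.5]), np.array([0.5, 0.5]))
print(pi.sum(axis=1), pi.sum(axis=0))  # both approach [0.5, 0.5]
```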

Pseudocode details for both cases are explicitly provided in (Garg et al., 2018) and (Dessein et al., 2016).

5. Applications and Illustrative Use Cases

NASA has a spectrum of applications across mathematics, statistics, and machine learning:

  • Permanent Approximation: Any non-negative $A$ with $\operatorname{per}(A) > 0$ can be nearly doubly-stochastically scaled. Since the permanent of a doubly stochastic matrix $S$ satisfies $n!/n^n \le \operatorname{per}(S) \le 1$, NASA delivers a deterministic $e^n$-approximation to $\operatorname{per}(A)$ in polynomial time (Garg et al., 2018); see the sketch after this list.
  • Rank Bounds in Design Matrices: Scaling a design matrix to (near) doubly stochasticity provides lower bounds on matrix rank, crucial for quantitative Sylvester–Gallai theorems and improved bounds in combinatorial geometry (Garg et al., 2018).
  • Maximum-Entropy Estimation: The scaling problem is equivalent to entropy-regularized convex optimization for contingency tables and joint distribution estimation, with NASA iterations realized as block coordinate minimizations (Garg et al., 2018, Dessein et al., 2016).
  • Regularized Optimal Transport (ROT): In problems where regularization is not entropic, NASA enables efficient solution via explicit alternated projections, supporting sparsity, additional constraints, and richer regularization (Dessein et al., 2016). Experiments include synthetic 1D distributions (Gaussian and bimodal), runtime comparisons for various regularizers, and real-world audio scene classification with improved classification rates using the NASA-optimized kernel (Dessein et al., 2016).
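The permanent-approximation recipe in the first bullet can be sketched directly: scale $A$ toward a doubly stochastic $S = BAC$ with $B = \operatorname{diag}(b)$, $C = \operatorname{diag}(c)$, and output $1/(\prod_i b_i \prod_j c_j)$; since $\operatorname{per}(A) = \operatorname{per}(S)/(\prod_i b_i \prod_j c_j)$ and $e^{-n} \le n!/n^n \le \operatorname{per}(S) \le 1$, the output lies within a factor $e^n$ of $\operatorname{per}(A)$. An illustrative check for tiny $n$, with a brute-force permanent for comparison:

```python
import numpy as np
from itertools import permutations

def exact_permanent(A):
    """Brute-force permanent; only viable for tiny n."""
    n = A.shape[0]
    return sum(np.prod(A[np.arange(n), s]) for s in permutations(range(n)))

def permanent_estimate(A, n_iters=2000):
    """Scale A toward doubly stochastic S = diag(b) A diag(c); then
    1 / (prod(b) * prod(c)) approximates per(A) within a factor e^n."""
    n = A.shape[0]
    b, c = np.ones(n), np.ones(n)
    X = np.array(A, dtype=float)
    for _ in range(n_iters):
        r = X.sum(axis=1); b /= r; X /= r[:, None]   # row normalization
        s = X.sum(axis=0); c /= s; X /= s[None, :]   # column normalization
    return 1.0 / (np.prod(b) * np.prod(c))

rng = np.random.default_rng(1)
A = rng.uniform(0.1, 1.0, size=(5, 5))
est, true = permanent_estimate(A), exact_permanent(A)
print(true <= est <= np.e ** 5 * true)  # estimate lies in the e^n band
```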

6. Special Cases, Finite-Termination Phenomena, and Limitations

Classically, NASA achieves convergence only asymptotically. Recent work constructs explicit families of $n \times n$ positive matrices that terminate after finitely many steps: for each $n$, the family of non-invertible, row-stochastic but not column-stochastic matrices constructed by Nathanson becomes doubly stochastic after a single column scaling (yet no earlier) (Nathanson, 2019). All such examples have determinant zero, and it is open whether a strictly positive, invertible matrix can exhibit finite NASA termination.
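To make the finite-termination phenomenon concrete, here is a minimal $2 \times 2$ instance in the spirit of (though not taken from) these constructions: the matrix is row-stochastic but not column-stochastic, has determinant zero, and becomes doubly stochastic after exactly one column normalization:

```python
import numpy as np

A = np.array([[0.3, 0.7],
              [0.3, 0.7]])
print(np.linalg.det(A))              # 0.0: non-invertible, as in the known examples
print(A.sum(axis=1), A.sum(axis=0))  # rows [1, 1]; columns [0.6, 1.4]

B = A / A.sum(axis=0, keepdims=True) # a single column normalization
print(B)                             # [[0.5, 0.5], [0.5, 0.5]]
print(B.sum(axis=1), B.sum(axis=0))  # doubly stochastic: exact termination
```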

Further, explicit Sinkhorn limits have been computed for all symmetric matrices whose entries take exactly two values, providing links to Diophantine approximation and to sequences of rational numbers converging to algebraic irrationals via NASA iterates (Nathanson, 2019).

7. Practical Implementations and Advanced Considerations

NASA is well-suited for high-dimensional, structured, or sparse data because it operates via vectorized row/column-wise updates, which are GPU- and parallel-friendly (Dessein et al., 2016). For problems with infinite cost entries, NASA naturally restricts computation to the active submatrix. The choice of regularization controls solution smoothness and sparsity, with a larger regularization weight $\lambda$ promoting faster convergence and increased smoothing. Termination criteria are application-dependent, based either on the deviation of the marginals from their targets or on the change in the dual variables across iterations.

In the separable case, practical performance matches the theoretical $O(n^2)$ per-iteration scaling, outperforming classical combinatorial Earth Mover’s Distance solvers (which scale super-cubically). For non-separable $\phi$, NASA remains applicable but is computationally more intensive.
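The effect of the regularization weight on convergence speed is easy to observe in the entropic case. The following illustrative experiment (random cost, uniform marginals; names and parameters are assumptions, not from the cited papers) counts dual log-domain sweeps until the row marginals match to a fixed tolerance, for a larger and a smaller $\lambda$:

```python
import numpy as np
from scipy.special import logsumexp

def sweeps_to_tolerance(C, p, q, lam, tol=1e-8, max_sweeps=100_000):
    """Count log-domain Sinkhorn sweeps until the row marginals match p."""
    f, g = np.zeros(len(p)), np.zeros(len(q))
    for t in range(max_sweeps):
        f = lam * (np.log(p) - logsumexp((g[None, :] - C) / lam, axis=1))
        g = lam * (np.log(q) - logsumexp((f[:, None] - C) / lam, axis=0))
        pi = np.exp((f[:, None] + g[None, :] - C) / lam)
        if np.max(np.abs(pi.sum(axis=1) - p)) < tol:
            return t + 1
    return max_sweeps

rng = np.random.default_rng(2)
n = 20
C = rng.uniform(0.0, 1.0, size=(n, n))
p = q = np.full(n, 1.0 / n)
for lam in (0.5, 0.05):
    print(lam, sweeps_to_tolerance(C, p, q, lam))
# the larger lam (stronger smoothing) needs markedly fewer sweeps
```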

NASA provides a unifying approach to matrix scaling and regularized transport, extending the reach of classical iterative scaling to general convex frameworks and broad application domains (Garg et al., 2018, Dessein et al., 2016, Nathanson, 2019, Nathanson, 2019).
