Non-negative Alternate Scaling (NASA)
- Non-negative Alternate Scaling (NASA) is an iterative method that transforms non-negative matrices into doubly stochastic matrices via alternating row and column normalizations.
- The algorithm generalizes classical matrix scaling to incorporate Bregman divergences and non-negativity constraints, enabling applications in regularized optimal transport and statistical estimation.
- Supported by the Sinkhorn–Knopp theorem, NASA exhibits geometric convergence with O(n²) operations per iteration, making it efficient for high-dimensional and structured data problems.
The Non-negative Alternate Scaling Algorithm (NASA), also known as the Sinkhorn algorithm or the Sinkhorn–Knopp iteration, is a foundational class of iterative methods for transforming a non-negative (or strictly positive) matrix into a doubly stochastic matrix via alternating normalizations. Generalizations of NASA play a central role in scaling problems, optimal transport, matrix analysis, and algorithmic applications, including polynomial-time permanent approximation, design matrix analysis, statistical estimation, and regularized optimal transport. NASA encompasses both classical matrix scaling and modern extensions to general Bregman divergences and to settings where non-negativity is not automatic and must be enforced explicitly.
1. Problem Formulation and Definitions
Given a non-negative matrix $A \in \mathbb{R}_{\geq 0}^{n \times n}$, double stochasticity requires that all row and column sums equal 1: $r_i(A) = \sum_j A_{ij} = 1$ and $c_j(A) = \sum_i A_{ij} = 1$ for all $i, j$. A key functional to measure deviation from a doubly stochastic matrix is
$$\mathrm{ds}(A) = \sum_{i=1}^{n} \big(r_i(A) - 1\big)^2 + \sum_{j=1}^{n} \big(c_j(A) - 1\big)^2,$$
where $r_i(A)$ and $c_j(A)$ denote the $i$-th row sum and $j$-th column sum, respectively.
The scaling problem asks: for a given $A$, find positive diagonal matrices $X, Y$ such that $XAY$ is doubly stochastic, or $\epsilon$-close to doubly stochastic (where possible). Scalability—the existence of such an $\epsilon$-scaling for every $\epsilon > 0$—is equivalent to positivity of the matrix permanent and to the bipartite support graph having a perfect matching (Garg et al., 2018).
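The deviation functional can be computed directly from the row and column sums; a minimal sketch (the name `ds` mirrors the functional above and is illustrative, not from a specific library):

```python
import numpy as np

def ds(A):
    """Sum of squared deviations of the row and column sums from 1."""
    r = A.sum(axis=1)   # row sums r_i(A)
    c = A.sum(axis=0)   # column sums c_j(A)
    return float(((r - 1.0) ** 2).sum() + ((c - 1.0) ** 2).sum())

A = np.array([[0.50, 0.50],
              [0.25, 0.75]])
# Rows already sum to 1; columns sum to 0.75 and 1.25,
# so ds(A) = 0.25**2 + 0.25**2 = 0.125.
```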
For regularized optimal transport (ROT), NASA is applied to solve problems of the form
$$\min_{\pi \in \Pi(\mu, \nu)} \; \langle \pi, C \rangle + \frac{1}{\lambda}\, \phi(\pi),$$
where $\Pi(\mu, \nu)$ is the set of couplings (joint distributions) with prescribed marginals $\mu$ and $\nu$, $C$ is a cost matrix, and $\phi$ is a smooth convex regularizer whose domain is possibly not restricted to the non-negative orthant (Dessein et al., 2016).
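In the entropic special case $\phi(\pi) = \sum_{ij} \pi_{ij}(\log \pi_{ij} - 1)$, non-negativity is automatic and the solution factors as $\pi = \mathrm{diag}(u)\, K\, \mathrm{diag}(v)$ with $K = e^{-\lambda C}$, recovering the classical Sinkhorn updates. A sketch (function name illustrative):

```python
import numpy as np

def entropic_rot(C, mu, nu, lam, iters=200):
    # Sinkhorn updates for the entropic special case (illustrative)
    K = np.exp(-lam * C)       # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)     # enforce column marginals
        u = mu / (K @ v)       # enforce row marginals
    return u[:, None] * K * v[None, :]

C = np.array([[0.0, 1.0],
              [1.0, 0.0]])
mu = np.array([0.5, 0.5])
nu = np.array([0.3, 0.7])
pi = entropic_rot(C, mu, nu, lam=1.0)
# pi has (approximately) the prescribed marginals mu and nu.
```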
2. Iterative Algorithmic Structure
The classical NASA iterates between affine projections that alternately normalize rows and columns:
- Row-scaling: For a matrix $A$ with positive row sums, define
$$\mathcal{R}(A)_{ij} = \frac{A_{ij}}{r_i(A)}.$$
- Column-scaling: Likewise,
$$\mathcal{C}(A)_{ij} = \frac{A_{ij}}{c_j(A)}.$$
The update sequence is then $A^{(0)} = A$, $A^{(2t+1)} = \mathcal{R}(A^{(2t)})$, $A^{(2t+2)} = \mathcal{C}(A^{(2t+1)})$. For strictly positive $A$, this sequence converges to a unique doubly stochastic scaling $S = XAY$, where $X, Y$ are positive diagonal matrices (Nathanson, 2019, Nathanson, 2019).
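The two alternating operators can be written directly; a minimal sketch assuming NumPy:

```python
import numpy as np

def R(A):
    # row-scaling: divide each row by its sum
    return A / A.sum(axis=1, keepdims=True)

def C(A):
    # column-scaling: divide each column by its sum
    return A / A.sum(axis=0, keepdims=True)

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
for _ in range(50):
    A = C(R(A))    # one full alternation of row and column normalization
# A is now (numerically) doubly stochastic.
```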
For convex regularized transport, NASA generalizes to compute Bregman projections in dual coordinates. When the regularizer’s domain does not guarantee non-negativity, explicit projections onto the non-negative orthant are required. Correction terms, as in Dykstra’s algorithm, are maintained to ensure convergence when projecting cyclically onto the constraints:
- $\mathcal{C}_0 = \{\pi : \pi \geq 0\}$ (non-negativity)
- $\mathcal{C}_1 = \{\pi : \pi \mathbf{1} = \mu\}$ (row marginals)
- $\mathcal{C}_2 = \{\pi : \pi^{\top} \mathbf{1} = \nu\}$ (column marginals)

NASA cycles through these constraint sets in turn, with Newton-type inner steps for the affine projections in the case of separable $\phi$ (Dessein et al., 2016).
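When the Bregman divergence is the squared Euclidean distance (a quadratic regularizer, for which non-negativity is not automatic), all three cyclic projections have closed forms. A minimal sketch of Dykstra's scheme under that assumption (function names illustrative, not from the cited papers):

```python
import numpy as np

def proj_rows(P, mu):
    # Euclidean projection onto the affine set {P : row sums = mu}
    return P + (mu - P.sum(axis=1))[:, None] / P.shape[1]

def proj_cols(P, nu):
    # Euclidean projection onto {P : column sums = nu}
    return P + (nu - P.sum(axis=0))[None, :] / P.shape[0]

def proj_nonneg(P):
    # Euclidean projection onto the non-negative orthant
    return np.maximum(P, 0.0)

def dykstra(P, mu, nu, cycles=500):
    projs = [proj_nonneg,
             lambda Q: proj_rows(Q, mu),
             lambda Q: proj_cols(Q, nu)]
    e = [np.zeros_like(P) for _ in projs]   # Dykstra correction terms
    for _ in range(cycles):
        for k, proj in enumerate(projs):
            Y = proj(P + e[k])
            e[k] = P + e[k] - Y
            P = Y
    return P

mu = nu = np.array([0.5, 0.5])
P = dykstra(np.array([[1.0, -0.5], [-0.5, 1.0]]), mu, nu)
# P is (numerically) the nearest non-negative matrix with the given marginals.
```

Unlike plain alternating projection, the corrections `e[k]` are what guarantee convergence to the projection of the starting point, not merely to some feasible point.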
3. Convergence Theory and Complexity
The classical Sinkhorn–Knopp theorem asserts that for every strictly positive matrix $A$, NASA produces positive diagonal matrices $X, Y$ such that $XAY$ is doubly stochastic, and this scaling is unique. The iterative map is non-expansive in the Hilbert metric on the positive cone, and the process exhibits geometric (linear) convergence (Nathanson, 2019, Nathanson, 2019).
In the entropy-regularized setting, convergence is seen as monotonic minimization of a Kullback–Leibler divergence over the diagonal scalings of $A$, and each iteration strictly decreases a potential. The number of NASA iterations needed to achieve $\mathrm{ds}(A) \leq \epsilon$ is polynomial in $1/\epsilon$ and in the bit complexity $L$ of the matrix entries. Each iteration incurs $O(n^2)$ arithmetic operations (Garg et al., 2018).
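The potential argument can be observed numerically: after the first normalization, every iterate is row- or column-stochastic, so its permanent is at most 1, and an AM-GM argument shows the permanent never decreases from one step to the next. A sketch with a brute-force permanent (illustrative, not the analysis of Garg et al.):

```python
import itertools
import numpy as np

def perm(A):
    # brute-force permanent (fine for small n)
    n = A.shape[0]
    return sum(np.prod([A[i, s[i]] for i in range(n)])
               for s in itertools.permutations(range(n)))

A = np.array([[1.0, 2.0, 1.0],
              [2.0, 1.0, 3.0],
              [1.0, 1.0, 1.0]])
vals = []
for t in range(20):
    if t % 2 == 0:
        A = A / A.sum(axis=1, keepdims=True)   # row normalization
    else:
        A = A / A.sum(axis=0, keepdims=True)   # column normalization
    vals.append(perm(A))
# vals is non-decreasing and bounded above by 1.
```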
For general Bregman divergences, under mild technical conditions (Legendre type, cofiniteness, full domain for the conjugate, etc.), Dykstra’s algorithm converges to the unique Bregman projection at a linear rate (Theorem 2.1 of Dessein et al., 2016). For separable $\phi$, each outer iteration has $O(n^2)$ complexity, dominated by vectorized Newton updates. For non-separable regularizers (e.g., Mahalanobis), the per-iteration cost is substantially higher. Empirical studies confirm practical quadratic scaling for moderate problem sizes (Dessein et al., 2016).
4. Algorithmic Pseudocode and Procedures
The concrete NASA scheme for matrix scaling (Sinkhorn’s algorithm) is as follows (Garg et al., 2018, Nathanson, 2019):
- If any row or column of $A$ is all-zero, terminate and report “not scalable.”
- Initialize $A^{(0)} = A$ and $t = 0$.
- For $t = 1, \dots, T$:
  - Compute the row sums $r(A^{(t-1)})$ and column sums $c(A^{(t-1)})$.
  - If the deviation from double stochasticity is at most $\epsilon$, return $A^{(t-1)}$.
  - If $t$ is odd, set $A^{(t)} = \mathcal{R}(A^{(t-1)})$; else set $A^{(t)} = \mathcal{C}(A^{(t-1)})$.
- Output “not scalable” if no $\epsilon$-feasible scaling is found within $T$ iterations.
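The scheme above can be sketched directly (function name and the concrete $\epsilon$-test are illustrative):

```python
import numpy as np

def nasa_scale(A, eps=1e-9, max_iter=10_000):
    # illustrative implementation of the alternating-normalization scheme
    A = np.asarray(A, dtype=float)
    if (A.sum(axis=1) == 0).any() or (A.sum(axis=0) == 0).any():
        return None                      # all-zero row/column: not scalable
    for t in range(max_iter):
        r, c = A.sum(axis=1), A.sum(axis=0)
        if ((r - 1) ** 2).sum() + ((c - 1) ** 2).sum() <= eps:
            return A                     # eps-close to doubly stochastic
        A = A / r[:, None] if t % 2 == 0 else A / c[None, :]
    return None                          # no eps-scaling found in the budget

B = nasa_scale(np.random.default_rng(0).random((5, 5)) + 0.1)
# B is eps-close to doubly stochastic.
```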
For regularized optimal transport:
- NASA operates in a dual parameterization of the transport plan, maintaining correction variables for each constraint.
- Each cycle consists of projections (generally using Newton–Raphson for affine marginals), interleaved with non-negativity projections applied elementwise (Dessein et al., 2016).
Pseudocode details for both cases are explicitly provided in (Garg et al., 2018) and (Dessein et al., 2016).
5. Applications and Illustrative Use Cases
NASA has a broad spectrum of applications across mathematics, statistics, and machine learning:
- Permanent Approximation: Any non-negative $A$ with $\mathrm{perm}(A) > 0$ can be nearly doubly-stochastically scaled. Since the permanent of a doubly stochastic matrix $B$ satisfies $n!/n^n \leq \mathrm{perm}(B) \leq 1$, NASA delivers a deterministic $e^n$-approximation to $\mathrm{perm}(A)$ in polynomial time (Garg et al., 2018).
- Rank Bounds in Design Matrices: Scaling a design matrix to (near) doubly stochasticity provides lower bounds on matrix rank, crucial for quantitative Sylvester–Gallai theorems and improved bounds in combinatorial geometry (Garg et al., 2018).
- Maximum-Entropy Estimation: The scaling problem is equivalent to entropy-regularized convex optimization for contingency tables and joint distribution estimation, with NASA iterations realized as block coordinate minimizations (Garg et al., 2018, Dessein et al., 2016).
- Regularized Optimal Transport (ROT): In problems where regularization is not entropic, NASA enables efficient solution via explicit alternated projections, supporting sparsity, additional constraints, and richer regularization (Dessein et al., 2016). Experiments include synthetic 1D distributions (Gaussian and bimodal), runtime comparisons for various regularizers, and real-world audio scene classification with improved classification rates using the NASA-optimized kernel (Dessein et al., 2016).
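The permanent-approximation argument can be illustrated numerically: tracking the diagonal scalings $X = \mathrm{diag}(x)$, $Y = \mathrm{diag}(y)$ gives $\mathrm{perm}(A) = \mathrm{perm}(XAY) / (\prod_i x_i \prod_j y_j)$, so $1/(\prod_i x_i \prod_j y_j)$ estimates $\mathrm{perm}(A)$ up to the factor $n^n/n! \leq e^n$. A sketch with a brute-force permanent for verification (function names illustrative):

```python
import itertools
import numpy as np

def perm(A):
    # brute-force permanent (fine for small n)
    n = A.shape[0]
    return sum(np.prod([A[i, s[i]] for i in range(n)])
               for s in itertools.permutations(range(n)))

def scale_with_diagonals(A, iters=2000):
    # track x, y so that B = diag(x) A diag(y) throughout the iteration
    x, y = np.ones(A.shape[0]), np.ones(A.shape[1])
    for t in range(iters):
        B = x[:, None] * A * y[None, :]
        if t % 2 == 0:
            x = x / B.sum(axis=1)   # row normalization
        else:
            y = y / B.sum(axis=0)   # column normalization
    return x, y

n = 4
A = np.random.default_rng(1).random((n, n)) + 0.1
x, y = scale_with_diagonals(A)
B = x[:, None] * A * y[None, :]          # nearly doubly stochastic
est = 1.0 / (np.prod(x) * np.prod(y))    # deterministic estimate of perm(A)
exact = perm(A)
# n!/n^n <= perm(B) <= 1, hence 1 <= est/exact <= n^n/n! <= e^n.
```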
6. Special Cases, Finite-Termination Phenomena, and Limitations
Classically, NASA achieves double stochasticity only asymptotically. Recent work constructs explicit families of positive matrices for which the iteration terminates after finitely many steps: for each $n$, Nathanson constructs non-invertible matrices that are row-stochastic but not column-stochastic and become doubly stochastic after a single column-scaling (and no earlier) (Nathanson, 2019). All such examples have determinant zero, and it is open whether a strictly positive, invertible matrix can have finite NASA termination.
Further, explicit Sinkhorn limits have been calculated for all symmetric matrices whose entries take exactly two values, providing links to Diophantine approximation and to sequences of rational numbers converging to algebraic irrationals via NASA iterates (Nathanson, 2019).
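The rational-to-algebraic-irrational phenomenon is easy to see in the $2 \times 2$ case, where the Sinkhorn limit of a positive matrix $\begin{pmatrix} a & b \\ c & d \end{pmatrix}$ is known in closed form: the doubly stochastic matrix with diagonal entries $\sqrt{ad}/(\sqrt{ad} + \sqrt{bc})$. A quick numerical check on a symmetric matrix with two values:

```python
import numpy as np

a, b, c, d = 1.0, 1.0, 1.0, 2.0       # rational entries, two values
A = np.array([[a, b], [c, d]])
for _ in range(200):
    A = A / A.sum(axis=1, keepdims=True)   # row normalization
    A = A / A.sum(axis=0, keepdims=True)   # column normalization

# Closed-form limit: diagonal entries sqrt(ad)/(sqrt(ad)+sqrt(bc)),
# here sqrt(2)/(sqrt(2)+1), an algebraic irrational, even though every
# iterate has rational entries.
s = np.sqrt(a * d) / (np.sqrt(a * d) + np.sqrt(b * c))
```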
7. Practical Implementations and Advanced Considerations
NASA is well-suited for high-dimensional, structured, or sparse data because it operates via vectorized row/column-wise updates, which are GPU- and parallel-friendly (Dessein et al., 2016). For problems with infinite cost entries, NASA naturally restricts computation to the active submatrices. The choice of regularization controls solution smoothness and sparsity, with stronger regularization promoting faster convergence and increased smoothing. Termination criteria are application-dependent, based either on the deviation of the marginals or on the change in the dual variables across iterations.
In the separable case, practical performance matches the theoretical $O(n^2)$ per-iteration scaling, outperforming classical combinatorial Earth Mover’s Distance solvers (which are super-cubic). For non-separable $\phi$, NASA remains applicable but computationally more intensive.
NASA provides a unifying approach to matrix scaling and regularized transport, extending the reach of classical iterative scaling to general convex frameworks and broad application domains (Garg et al., 2018, Dessein et al., 2016, Nathanson, 2019, Nathanson, 2019).