Per-Tensor Scaling: Methods & Applications

Updated 17 March 2026

Per-tensor scaling is the generalization of matrix scaling to multi-way arrays, enforcing prescribed marginal or subtensor constraints through multiplicative updates.
It unifies approaches in optimal transport, quantum information, and machine learning by solving nonlinear scaling equations, with geometric convergence guaranteed under positivity and compatibility conditions.
The method achieves numerical stability and computational tractability via block-coordinate descent, low-rank approximations, and log-domain transformations.

Per-tensor scaling refers to the systematic rescaling of tensors so as to enforce prescribed marginal, slice-sum, or overall statistical constraints, typically by multiplicative (diagonal) transformations along the modes of a multi-way array. The methodology generalizes classical matrix scaling (diagonal row/column rescalings) to multidimensional arrays, enabling applications across optimal transport, quantum information, high-dimensional statistics, and tensor-based machine learning. While variations exist depending on context (e.g., marginal scaling, slice product scaling, per-tensor quantization), all such approaches share one objective: finding scaling factors (often one vector per mode or one per subtensor) such that the resulting tensor satisfies prescribed aggregate conditions along its modes or subtensors.

1. Fundamental Problem Formulations

The per-tensor scaling paradigm encompasses several important models:

Multi-marginal scaling: Given a nonnegative $d$-way tensor $T\in\mathbb{R}_+^{n\times \cdots\times n}$ and prescribed “one-mode” marginals $p^{(k)}\in\Delta^n$ for each mode $k=1,\ldots,d$, the task is to find positive scaling vectors $u^{(k)}$ such that $T' = T\odot (\otimes_{k=1}^d u^{(k)})$ satisfies the marginal constraints for all $k$, i.e., summing $T'$ over all indices except the $k$th yields $p^{(k)}$ (Friedland, 2020, Friedland, 2019).
Subtensor-scaled constraints: For a tensor $A$ with target products or sums across $k$-dimensional subtensors (organized into families), find positive scaling multipliers $M_s^\alpha$ that, when applied to the relevant subtensors, enforce the product (or sum) constraint within each family (Nguyen et al., 2020).
Quantum marginal scaling: In quantum information, the problem is to scale a complex tensor via local invertible linear transformations such that the reduced density matrices (marginals) match specified spectra, with feasibility governed by membership of the target spectra in the moment polytope (Bürgisser et al., 2018).

In all cases, the per-tensor scaling problem can be encoded as a system of (typically nonlinear) equations, which is most commonly solved by multiplicative updates on the tensor entries or by closed-form projections in a transformed (usually logarithmic) space.
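To make the multi-marginal formulation concrete, the following minimal sketch builds the scaled tensor $T' = T\odot(\otimes_k u^{(k)})$ and evaluates its one-mode marginals with NumPy. The helper names `scale_tensor` and `mode_marginal` are illustrative assumptions, not taken from the cited works, and the example uses unequal mode sizes only to make the axis handling visible.

```python
import numpy as np

def scale_tensor(T, us):
    """Return T' = T ⊙ (u^(1) ⊗ ... ⊗ u^(d)), one positive vector per mode."""
    scaled = np.asarray(T, dtype=float)
    for k, u in enumerate(us):
        shape = [1] * scaled.ndim
        shape[k] = -1                      # align u with mode k for broadcasting
        scaled = scaled * np.reshape(u, shape)
    return scaled

def mode_marginal(T, k):
    """Sum over every index except the k-th (the mode-k marginal)."""
    other_axes = tuple(a for a in range(T.ndim) if a != k)
    return T.sum(axis=other_axes)

# Hypothetical data: a positive 3-way tensor and arbitrary positive scalings.
rng = np.random.default_rng(0)
T = rng.random((3, 4, 5)) + 0.1
us = [rng.random(n) + 0.5 for n in T.shape]

T_scaled = scale_tensor(T, us)
marginals = [mode_marginal(T_scaled, k) for k in range(T.ndim)]
```

The scaling problem is then to choose `us` so that each entry of `marginals` equals the prescribed $p^{(k)}$.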

2. Optimization Frameworks and Algorithms

Key theoretical advances have established that per-tensor scaling problems can be posed as block-coordinate minimization of strictly convex objectives, in which one block of variables is minimized exactly while the others are held fixed:

In the entropic-regularized multi-marginal optimal transport framework (Friedland, 2020, Friedland, 2019), define

$F(y) = \sum_{i_1,\ldots,i_d} A_{i_1\ldots i_d} \exp\left(y_{i_1}^{(1)} + \cdots + y_{i_d}^{(d)}\right) - \sum_{k=1}^d \langle p^{(k)}, y^{(k)} \rangle$

for $A_{i_1\ldots i_d} = \exp(-C_{i_1\ldots i_d}/\epsilon)$, where $C$ is the cost tensor and $\epsilon > 0$ is the regularization parameter. The function $F$ is strictly convex on the affine subspace where $\sum_{i} y_i^{(k)} = 0$ for all $k$. Alternating minimization over each block $y^{(k)}$ yields multiplicative update rules (a higher-order variant of Sinkhorn scaling) with guaranteed linear convergence.
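The multiplicative form of the block update can be read off from the stationarity condition of $F$ in a single block. As a sketch (setting aside the normalization $\sum_i y_i^{(k)} = 0$, which only fixes a shift of the potentials), differentiating $F$ with respect to $y_j^{(k)}$ and setting the result to zero gives

$\exp\left(y_j^{(k)}\right) \sum_{i_\ell,\ \ell\neq k} A_{i_1\ldots j\ldots i_d} \exp\left(\sum_{\ell\neq k} y_{i_\ell}^{(\ell)}\right) = p_j^{(k)}$

where the index $j$ occupies the $k$th position of $A$. The left-hand side is the mode-$k$ marginal of the scaled tensor, so block stationarity is exactly the $k$th marginal constraint. Writing $u^{(\ell)} = \exp(y^{(\ell)})$ entrywise, the block minimizer is

$u_j^{(k)} = p_j^{(k)} \Big/ \sum_{i_\ell,\ \ell\neq k} A_{i_1\ldots j\ldots i_d} \prod_{\ell\neq k} u_{i_\ell}^{(\ell)}$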

The “canonical tensor scaling” problem (Nguyen et al., 2020) is cast as the quadratic program

$\min_{x}\; \frac{1}{2}\sum_{a\in \Omega(A)} (x(a) - a(a))^2$

subject to linear equality constraints on sums over the supports of subtensors, and the solution is obtained by repeated projections (block-coordinate descent) onto each constraint set.
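The block-projection idea can be sketched as follows. This is an illustration under stated assumptions, not the procedure of the cited work: it assumes the constraint that the product of the supported entries in every mode slice equals one, which becomes a zero-sum constraint per slice after taking logarithms. Each block step is then the Euclidean projection that subtracts the slice mean over the support. The function name `project_slices_to_zero_sum` and the mask-based support are hypothetical.

```python
import numpy as np

def project_slices_to_zero_sum(X, mask, n_sweeps=500, tol=1e-12):
    """Cyclic projections: make every mode slice of X sum to zero on the support.

    X    : log-entries of the tensor (values off the support are ignored)
    mask : boolean array, True on the support Omega(A)
    """
    X = np.where(mask, X, 0.0)
    d = X.ndim
    for _ in range(n_sweeps):
        worst = 0.0
        for k in range(d):
            axes = tuple(a for a in range(d) if a != k)
            sums = X.sum(axis=axes, keepdims=True)         # one sum per mode-k slice
            counts = mask.sum(axis=axes, keepdims=True)    # supported entries per slice
            worst = max(worst, float(np.abs(sums).max()))
            # Euclidean projection onto {slice sum = 0}: subtract the slice mean.
            X = X - mask * (sums / np.maximum(counts, 1))
        if worst < tol:
            break
    return X

# Hypothetical example: a positive 4x4x4 tensor with three unobserved entries.
rng = np.random.default_rng(1)
A = rng.random((4, 4, 4)) + 0.5
mask = np.ones(A.shape, dtype=bool)
mask[0, 0, 0] = mask[1, 2, 3] = mask[2, 2, 0] = False

X = project_slices_to_zero_sum(np.log(A), mask)
A_scaled = np.where(mask, np.exp(X), 0.0)   # slice products on the support are ~1
```

Because the constraint sets are linear subspaces, cyclic projections converge to the Euclidean projection of the starting point onto their intersection, which is the minimizer of the quadratic objective in this simplified setting.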

In the quantum marginal context (Bürgisser et al., 2018), alternating application of upper-triangular (Borel) group actions on each mode, along with projective and Cholesky-based updates, ensures the marginals approach the prescribed spectra, with analysis based on highest-weight vector potential functions.

In all settings above, direct multiplicative updates in the original tensor space (e.g., $u^{(k)} \gets p^{(k)}/(A\times_{\ell\neq k} u^{(\ell)})$) or additive updates in the log-domain converge geometrically to the scaling solution, provided natural positivity and compatibility conditions are met. The scaled tensor is unique; the scaling vectors themselves are unique only up to rescalings that leave their outer product unchanged.
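The multiplicative update above can be sketched in a few lines of NumPy. This is a minimal illustration for a strictly positive tensor and marginals of equal total mass; the function names `contract_except` and `multimarginal_sinkhorn`, the sweep limit, and the tolerance are assumptions made for the example.

```python
import numpy as np

def contract_except(A, us, k):
    """Contract A with every u^(l), l != k, leaving a vector along mode k."""
    d = A.ndim
    M = A
    for l in range(d):
        if l != k:
            shape = [1] * d
            shape[l] = -1
            M = M * us[l].reshape(shape)
    return M.sum(axis=tuple(a for a in range(d) if a != k))

def multimarginal_sinkhorn(A, ps, n_sweeps=500, tol=1e-12):
    """Cyclic updates u^(k) <- p^(k) / (A contracted with the other u^(l))."""
    d = A.ndim
    us = [np.ones(n) for n in A.shape]
    for _ in range(n_sweeps):
        for k in range(d):
            us[k] = ps[k] / contract_except(A, us, k)
        # Mode-k marginal of the scaled tensor is u^(k) * contraction.
        err = max(np.abs(us[k] * contract_except(A, us, k) - ps[k]).max()
                  for k in range(d))
        if err < tol:
            break
    return us

# Hypothetical example: positive 4x4x4 tensor, random probability marginals.
rng = np.random.default_rng(0)
A = rng.random((4, 4, 4)) + 0.1
ps = []
for n in A.shape:
    p = rng.random(n) + 0.1
    ps.append(p / p.sum())

us = multimarginal_sinkhorn(A, ps)
errors = [np.abs(us[k] * contract_except(A, us, k) - ps[k]).max() for k in range(3)]
```

Each sweep touches every entry of `A` once per mode, which matches the $O(d n^d)$ per-sweep cost of a dense implementation.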

3. Algorithmic Complexity, Convergence, and Practical Considerations

The per-tensor scaling methods are computationally tractable for moderate tensor orders/dimensions but raise complexity challenges in the high-dimensional regime:

Complexity per sweep: For $d$-mode, $n$-dimensional tensors, a full update cycle costs $O(d n^d)$ arithmetic operations, with the primary cost incurred by marginalization (contraction) over $d-1$ modes (Friedland, 2020, Friedland, 2019). For canonical scaling over all $k$-dimensional subtensors, the per-sweep cost is $O(\binom{d}{k} |\Omega(A)|)$, where $|\Omega(A)|$ is the number of nonzero entries (Nguyen et al., 2020).
Convergence guarantees: Provided the kernel tensor $A$ and marginals $p^{(k)}$ are strictly positive, the coordinate-descent algorithm achieves geometric (linear) convergence, with error reduced by a fixed fraction in each sweep. Explicit bounds depend on uniform spectral bounds of the Hessian of the convex objective and degrade with increasing $d$ (i.e., contraction per cycle is $1 - O(1/(d\kappa))$ for conditioning parameter $\kappa$) (Friedland, 2020, Friedland, 2019).
Numerical stability: To avoid numerical underflow, especially for small regularization parameter $\epsilon$, computations are performed in the log domain using LogSumExp routines. Careful selection of $\epsilon$ is required to trade off between approximation quality (small $\epsilon$ better approximates unregularized OT) and numerical stability (large $\epsilon$ improves conditioning) (Friedland, 2020).
Sparsity and low-rank structure: High-dimensional or sparse tensors can be made tractable via the exploitation of underlying tensor train or low-rank factorizations, allowing efficient computation of marginal spectra and updates (Bürgisser et al., 2018).
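The log-domain variant mentioned under numerical stability can be sketched as below. It never forms $\exp(-C/\epsilon)$ explicitly, whose smallest entries underflow to zero in double precision once $C/\epsilon$ is large enough (for instance, $\exp(-1000)$ evaluates to 0). The helper names and the hand-written `logsumexp` are assumptions for this example; a library routine such as SciPy's `logsumexp` could be used instead.

```python
import numpy as np

def logsumexp(x, axes):
    """Stable log(sum(exp(x))) over the given axes, using the max-shift trick."""
    m = x.max(axis=axes, keepdims=True)
    s = np.log(np.exp(x - m).sum(axis=axes, keepdims=True))
    return np.squeeze(m + s, axis=axes)

def log_plan(C, ys, eps, skip=None):
    """Log of the scaled tensor: -C/eps + sum of potentials (optionally omit one mode)."""
    d = C.ndim
    S = -C / eps
    for l in range(d):
        if l != skip:
            shape = [1] * d
            shape[l] = -1
            S = S + ys[l].reshape(shape)
    return S

def log_domain_scaling(C, ps, eps, n_sweeps=300):
    """Additive updates y^(k) <- log p^(k) - LSE over the other modes."""
    d = C.ndim
    ys = [np.zeros(n) for n in C.shape]
    for _ in range(n_sweeps):
        for k in range(d):
            axes = tuple(a for a in range(d) if a != k)
            ys[k] = np.log(ps[k]) - logsumexp(log_plan(C, ys, eps, skip=k), axes)
    return ys

def marginal(C, ys, eps, k):
    """Mode-k marginal of the scaled tensor, computed from its logarithm."""
    axes = tuple(a for a in range(C.ndim) if a != k)
    return np.exp(logsumexp(log_plan(C, ys, eps), axes))

# Hypothetical example: random costs in [0, 1), uniform marginals.
rng = np.random.default_rng(0)
C = rng.random((4, 4, 4))
ps = [np.full(4, 0.25) for _ in range(3)]

ys = log_domain_scaling(C, ps, eps=0.5)                       # well conditioned
ys_small = log_domain_scaling(C, ps, eps=1e-3, n_sweeps=50)   # stays finite
```

With the small $\epsilon$ the potentials remain finite, but many more sweeps may be needed before all marginals match, which reflects the trade-off described above.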

4. Specializations: From Matrix Scaling to General Tensor Scaling

The per-tensor scaling framework generalizes classical matrix scaling (Sinkhorn–Knopp algorithm) to arbitrary tensors:

Case $d=2$: The contraction reduces to alternating row and column scaling, and the method coincides with classical matrix diagonal scaling: find positive diagonal matrices $D_1, D_2$ such that $D_1 A D_2$ has prescribed row and column sums.
Higher-order tensors ($d\geq 3$): The primary innovations are in (i) the necessity to contract over multiple modes for marginal constraints, (ii) the increased dimensionality of the block variables, and (iii) more stringent spectral requirements for geometric convergence due to degraded Hessian conditioning (Friedland, 2020, Friedland, 2019).
Partial subtensor scaling: In canonical tensor scaling, constraints may apply only to certain subtensor families, e.g., all $k$-slices, rather than all mode-marginals. The scaling factors are then associated to subtensors rather than modes, and the solution is structurally unique up to residual invariances (Nguyen et al., 2020).
Quantum marginals and moment polytopes: The scaling paradigm also subsumes quantum marginal and moment polytope problems, where feasibility is determined by geometric invariant theory, namely by whether the target spectra lie in the associated moment polytope (Bürgisser et al., 2018).
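For the case $d=2$, the alternating row and column scaling can be sketched as follows. This is a minimal Sinkhorn–Knopp illustration for a strictly positive matrix; the function name `sinkhorn_knopp`, the iteration cap, and the tolerance are choices made for this example.

```python
import numpy as np

def sinkhorn_knopp(A, r, c, n_iter=500, tol=1e-12):
    """Find u, v > 0 with diag(u) @ A @ diag(v) having row sums r and column sums c."""
    u = np.ones(A.shape[0])
    v = np.ones(A.shape[1])
    for _ in range(n_iter):
        u = r / (A @ v)          # fix the row sums
        v = c / (A.T @ u)        # fix the column sums
        if np.abs(u * (A @ v) - r).max() < tol:
            break
    return u, v

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
u, v = sinkhorn_knopp(A, r=np.ones(2), c=np.ones(2))
B = np.diag(u) @ A @ np.diag(v)   # approximately doubly stochastic
```

Here `np.diag(u)` and `np.diag(v)` play the roles of $D_1$ and $D_2$; the targets `r` and `c` must have the same total sum for a solution to exist.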

5. Practical Applications and Extensions

Per-tensor scaling is a foundational tool across several domains:

Optimal transport: Multi-marginal OT with entropic regularization is computable via per-tensor scaling, providing distances between sets of measures and sampling joint distributions matching marginals (Friedland, 2020).
Quantum information and representation theory: Weak membership for moment polytopes, the quantum marginal problem, and entanglement polytopes are efficiently solvable using per-tensor scaling algorithms, with provable polynomial complexity (Bürgisser et al., 2018).
Statistical tensor models: Scaling strategies are essential in extensions of matrix completion to sparse tensor completion and recommender systems involving multi-way data (user/item attributes etc.) (Nguyen et al., 2020).
Tensor network and renormalization group methods: In numerical RG, per-tensor scaling encodes renormalization prescriptions, enables extraction of scaling dimensions, and systematically removes local gauge and corner-doubled-line structures (Lyu et al., 2021).
Neural network quantization: In per-tensor quantization for large models, a single scale factor is shared by all entries of a tensor, permitting fast, low-precision arithmetic on hardware accelerators (Zhang et al., 2024). Here the method includes additional flattening procedures to mitigate outlier-induced scaling inefficiencies.
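The quantization use of the term can be illustrated with a generic symmetric scheme in which one scale is shared by the whole tensor. This sketch is not the FlattenQuant procedure of the cited work; the function names and the choice of `max(|x|)` as the calibration statistic are assumptions. It also shows why a single outlier hurts: the shared scale grows, so every other entry is resolved more coarsely.

```python
import numpy as np

def quantize_per_tensor(x, n_bits=8):
    """Symmetric quantization with one scale for the whole tensor (n_bits <= 8)."""
    qmax = 2 ** (n_bits - 1) - 1              # 127 for INT8, 7 for INT4
    scale = float(np.abs(x).max()) / qmax
    if scale == 0.0:
        scale = 1.0                           # all-zero tensor: any scale works
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to real values."""
    return q.astype(np.float64) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

q, scale = quantize_per_tensor(x)
err = np.abs(dequantize(q, scale) - x)        # bounded by scale / 2

# Same data with one large outlier: the shared scale is dominated by it.
x_out = x.copy()
x_out[0] = 100.0
q_out, scale_out = quantize_per_tensor(x_out)
err_out = np.abs(dequantize(q_out, scale_out) - x_out)
```

Comparing `err.mean()` with `err_out.mean()` shows the loss of resolution caused by the outlier, which is the inefficiency that flattening procedures are meant to mitigate.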

6. Technical Challenges, Tradeoffs, and Theoretical Guarantees

While per-tensor scaling affords robust theoretical guarantees (uniqueness, geometric convergence, feasibility conditions), several subtleties arise:

Trade-off between structure and computation: As $d$ or the rank of subtensor constraints increases, computational cost per iteration and number of constraints grow combinatorially; this motivates the need for scalable structure-aware implementations (Nguyen et al., 2020).
Feasibility and uniqueness: Existence of the scaling solution requires that marginal or block-product constraints are compatible; in the quantum marginal context, this amounts to membership of the target spectra in the moment polytope, which can be tested via capacity-theoretic or highest-weight criteria (Bürgisser et al., 2018).
Objective function selection: Both quadratic (Euclidean) and Kullback–Leibler (relative-entropy) objectives have been successfully used, yielding different algorithmic perspectives (e.g., block-projection vs. Bregman projection) (Nguyen et al., 2020).
Convergence rate: Linear convergence constants depend on the spectral gap in the Hessian of the convex objective; as $d$ increases, the rate diminishes, but geometric convergence persists with suitable initialization and step-size management (Friedland, 2020, Friedland, 2019).
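The effect of the rate on the number of sweeps can be made explicit with a short calculation. Assume, for illustration only, that the error after $t$ sweeps satisfies $e_t \leq \rho^t e_0$ with contraction factor $\rho = 1 - c/(d\kappa)$, where $c > 0$ is a hypothetical constant standing in for the $O(\cdot)$ in the bound $1 - O(1/(d\kappa))$. Reaching a relative accuracy $\delta$ requires

$t \geq \frac{\log(1/\delta)}{\log(1/\rho)}$

and since $\log(1/\rho) \geq 1 - \rho = c/(d\kappa)$, it suffices to take

$t \geq \frac{d\kappa}{c} \log\frac{1}{\delta}$

For example, with the made-up values $d = 3$, $\kappa = 10$, $c = 1$, and $\delta = 10^{-6}$, this gives $t \geq 30 \times 13.8 \approx 415$ sweeps. The sweep count thus grows linearly in $d$ and $\kappa$ but only logarithmically in the target accuracy.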

7. Recent Algorithmic Innovations and High-Performance Implementations

Contemporary advancements include:

Polynomial-time guarantees: For general tensor-scaling with prescribed marginals, polynomial-time convergence in the bit-size and accuracy parameter is established via potential function analysis, notably by introducing highest-weight vector polynomials from representation theory as algorithmic progress measures (Bürgisser et al., 2018).
Hardware-oriented quantization: In per-tensor quantization for LLMs, methods such as FlattenQuant implement channel-wise flattening and per-tensor scaling to permit efficient INT4/INT8 arithmetic, achieving near-optimal speedup and memory reduction with accuracy comparable to full-precision baselines (Zhang et al., 2024).
RG-based scaling dimensions: In tensor network RG, linearization of the per-tensor RG map and the subsequent eigen-decomposition enables nonperturbative calculation of critical exponents and scaling dimensions, providing a canonical route from coarse-graining rules to universal physics (Lyu et al., 2021).

In summary, per-tensor scaling provides a comprehensive, theoretically grounded, and computationally efficient methodology for enforcing aggregate constraints on tensor-valued data across a variety of mathematical, physical, and engineering disciplines. Its centrality arises from the canonical role of tensor reweighting in optimal transport, quantum marginal inference, structured tensor completion, RG flow analysis, and contemporary deep learning quantization (Friedland, 2020, Friedland, 2019, Nguyen et al., 2020, Bürgisser et al., 2018, Lyu et al., 2021, Zhang et al., 2024).