
Extend orthogonalization results of the isotropic curvature model to the case m < n

Prove rigorous extensions of the orthogonalization results for the isotropic curvature model optimization program min_Q [-Tr(Q G^T) + E_{ζ uniform on the unit sphere} H(∥Q ζ∥)] to the rectangular case with fewer rows than columns (m < n). Specifically: (i) establish that, under Assumption 3 on the curvature function H (a kink at radius r̃, with small left-derivative A and large right-derivative B) and a full-rank gradient G, there exists an optimal solution of the form Q* = c U V^T, i.e., a scalar multiple of the unitary factor from the polar decomposition of G; and (ii) show the converse necessity: if such an orthogonalized solution is optimal for a G that is not scaled-orthonormal, then H must have a kink. Provide complete, detailed proofs and precise statements that account for the fact that, when m < n, Q preserves norms only on an m-dimensional subspace rather than in all directions.


Background

The paper introduces the isotropic curvature model to analyze matrix-gradient updates via the convex program min_Q [-Tr(Q G^T) + E H(∥Q ζ∥)], where ζ is uniform on the unit sphere and H encodes curvature growth. Under a strong growth condition for H with a kink (Assumption 3), and assuming m ≥ n with full-rank G, Theorem 4 proves that the optimal update is an orthogonalized gradient Q* = c U V^T (the unitary factor from G's polar decomposition), and Proposition 6 shows the necessity of the kink if such an orthogonal solution is optimal for a non-scaled-orthonormal G.
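The orthogonalized update Q* = c U V^T in Theorem 4 is the unitary polar factor of G, computable from its thin SVD. A minimal numerical sketch (assumed illustration, not code from the paper) for the m ≥ n case:

```python
import numpy as np

# Hypothetical illustration of the orthogonalized update Q* = c * U @ V^T,
# where G = U diag(s) V^T is the thin SVD of a full-rank gradient G (m >= n).
# U @ V^T is the unitary factor of the polar decomposition G = (U V^T)(V diag(s) V^T).
rng = np.random.default_rng(0)
m, n = 5, 3
G = rng.standard_normal((m, n))  # full rank almost surely

U, s, Vt = np.linalg.svd(G, full_matrices=False)
polar_factor = U @ Vt  # m x n, with orthonormal columns since m >= n

# Orthonormal columns: (U V^T)^T (U V^T) = I_n, so Q* preserves the norm
# of every unit vector zeta in R^n when m >= n.
print(np.allclose(polar_factor.T @ polar_factor, np.eye(n)))  # True
```

This norm preservation for all directions ζ is exactly the property that fails when m < n, which is what the open problem asks to handle.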

The authors note that both Theorem 4 (orthogonalization optimality) and Proposition 6 (necessity of a kink) are proved under m ≥ n because a Q with scaled orthonormal columns then preserves the norm of every unit vector. They argue the results should extend to m < n, where Q preserves norms only on an m-dimensional subspace and where concentration of measure could be invoked, but they do not provide the detailed proofs, leaving them for future work.

References

Both Theorem~\ref{thm:orth} and Proposition~\ref{prop:orth_converse} assume $m \ge n$. The two results can be extended to the case $m < n$, but the statements might involve approximations and might not be as precise. Roughly speaking, the proof idea should continue to work by recognizing that $Q$ preserves the norm in a subspace of dimension $m$. Moreover, for sufficiently large $m$, concentration of measure ensures that the first $m$ components of $\zeta$ are approximately sampled from a sphere in $\mathbb{R}^m$. We leave the detailed proofs for this extension to future work.

Isotropic Curvature Model for Understanding Deep Learning Optimization: Is Gradient Orthogonalization Optimal? (2511.00674 - Su, 1 Nov 2025) in Remark, Section 4.2 (Proofs for Section 3.3 Orthogonalization)