Optimal Projection-Free Adaptive SGD for Matrix Optimization

Published 2 Apr 2026 in math.OC and cs.LG | (2604.02505v1)

Abstract: Recently, Jiang et al. [2026] developed Leon, a practical variant of One-sided Shampoo [Xie et al., 2025a, An et al., 2025] algorithm for online convex optimization, which does not require computing a costly quadratic projection at each iteration. Unfortunately, according to the existing analysis, Leon requires tuning an additional hyperparameter in its preconditioner and cannot achieve dimension-independent convergence guarantees for convex optimization problems beyond the bounded gradients assumption. In this paper, we resolve this issue by proving certain stability properties of Leon's preconditioner. Using our improved analysis, we show that tuning the extra hyperparameter can be avoided and, more importantly, develop the first practical variant of One-sided Shampoo with Nesterov acceleration, which does not require computing projections at each iteration. As a side contribution, we obtain improved dimension-independent rates in the non-smooth non-convex setting and develop a unified analysis of the proposed algorithm, which yields accelerated projection-free adaptive SGD with (block-)diagonal preconditioners.

Authors (1)

Dmitry Kovalev

Summary

The paper eliminates hyperparameter tuning by leveraging a new gradient stability property, enabling projection-free adaptive SGD without dimension-dependent penalties.
The paper introduces the first accelerated projection-free One-sided Shampoo variant that achieves optimal iteration complexity for $\nu$-Hölder smooth convex objectives.
The paper unifies the analysis for various preconditioning schemes, broadening the applicability of matrix-adaptive methods in high-dimensional deep learning settings.

Optimal Projection-Free Adaptive SGD for Matrix Optimization: An Expert Summary

Overview and Motivation

This paper addresses fundamental limitations of matrix-preconditioned adaptive stochastic optimization methods, particularly One-sided Shampoo and its recent projection-free variant, Leon, for online convex and non-convex optimization. The work delivers a dimension-independent, hyperparameter-free, projection-free analysis of adaptive SGD with matrix preconditioning. It further introduces the first Nesterov-accelerated, projection-free variant of One-sided Shampoo, with optimal iteration complexity for convex $\nu$-Hölder smooth objectives.

Background: Structured Adaptive Methods and Projection-Free Variants

Most practical deep learning optimization pipelines rely on coordinatewise adaptive methods such as AdaGrad and Adam; these adjust per-parameter learning rates using diagonal preconditioners. More recent structured preconditioners, as in One-sided Shampoo, capture richer curvature, allowing adaptation to low-rank or block-diagonal structures prevalent in neural networks, leveraging the spectral norm geometry for matrix-parameter spaces (Xie et al., 13 Mar 2025, An et al., 26 Mar 2025). However, such methods typically require constraining iterates within a norm ball via expensive quadratic projections; in One-sided Shampoo, the spectral norm projection is a computational bottleneck.
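
The contrast between a diagonal and a one-sided matrix preconditioner can be sketched as follows. This is a minimal illustration, assuming the commonly described form in which One-sided Shampoo accumulates $S \leftarrow S + G G^\top$ and steps along $S^{-1/2} G$. The function names, the step size `lr`, and the damping `eps` are hypothetical choices, not taken from the paper, and the projection step that the original method requires is deliberately left out.

```python
import numpy as np


def inv_sqrt_psd(S, eps=1e-12):
    """Inverse square root of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    w = np.maximum(w, 0.0)  # clip tiny negative eigenvalues from round-off
    # V @ diag((w + eps)^(-1/2)) @ V.T
    return (V / np.sqrt(w + eps)) @ V.T


def adagrad_step(X, G, acc, lr=0.1, eps=1e-12):
    """Diagonal preconditioning: one accumulator entry per parameter."""
    acc = acc + G * G
    return X - lr * G / (np.sqrt(acc) + eps), acc


def one_sided_shampoo_step(X, G, S, lr=0.1, eps=1e-12):
    """Left (one-sided) matrix preconditioning for an m x n parameter.

    Assumed form: S accumulates G @ G.T and the step is S^(-1/2) @ G.
    No projection is applied in this sketch.
    """
    S = S + G @ G.T
    return X - lr * inv_sqrt_psd(S, eps) @ G, S


# Illustrative usage with a rank-one gradient.
rng = np.random.default_rng(0)
G = rng.standard_normal((4, 1)) @ rng.standard_normal((1, 3))
X0 = np.zeros((4, 3))
X_mat, S = one_sided_shampoo_step(X0, G, np.zeros((4, 4)), lr=1.0)
X_diag, acc = adagrad_step(X0, G, np.zeros((4, 3)), lr=1.0)
```

On the first step with a rank-one gradient, the matrix-preconditioned update is again rank one with unit spectral norm, whereas the diagonal update reduces to the entrywise sign of the gradient. This is one way to see that the two preconditioners adapt to different geometries.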

Leon, recently proposed as a projection-free analog of One-sided Shampoo via an FTRL update, circumvents this projection (Jiang et al., 9 Feb 2026). However, its theoretical guarantees require tuning an extra regularization hyperparameter and only yield dimension-independent rates under strong gradient boundedness assumptions—a severe restriction in stochastic or non-smooth contexts.

Main Contributions

The paper addresses these deficiencies in three major aspects:

Elimination of Hyperparameter Tuning in Projection-Free Leon: The authors establish a new gradient stability property for Leon’s matrix preconditioner and leverage it to show that the regularizer parameter can be sent to zero, removing the need for tuning. This yields dimension-independent convergence guarantees for Leon in both online and non-smooth non-convex optimization regimes, even without strict gradient norm constraints.
First Accelerated Projection-Free One-sided Shampoo: By extending UniXGrad-style gradient difference accumulation, the authors construct an accelerated projection-free variant of FTRL-Leon, achieving the optimal accelerated complexity $\mathcal{O}(\epsilon^{-2/(1+3\nu)})$ for $\nu$-Hölder smooth convex objectives, paralleling the complexity of Nesterov-accelerated methods in non-Euclidean geometries but now in a projection-free, preconditioned context.
Unified Analysis Supporting General Preconditioning: The analysis framework is generic, supporting block-diagonal, diagonal, and even scalar step-size preconditioners. This greatly broadens the applicability of the theoretical results to a variety of adaptive methods beyond just Shampoo variants.

Technical Approach

The paper builds from FTRL with matrix regularizers. Crucially, the dual regularizer $\Psi_k^*(m)$ is parameterized via the squared accumulated preconditioned gradients within a suitable self-adjoint operator subspace, generalizing diagonal and matrix preconditioning. The projection-free update is defined as $x_{k+1} = -\nabla \Psi_k^*(m_k)$, avoiding explicit projections onto spectral norm balls.
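
To make the shape of the update $x_{k+1} = -\nabla \Psi_k^*(m_k)$ concrete, here is a minimal sketch under a simplifying assumption: the regularizer is taken to be the quadratic $\Psi_k(x) = \frac{1}{2\eta}\langle x, H_k x\rangle$ with $H_k = (\delta I + S_k)^{1/2}$ and $S_k = \sum_i g_i g_i^\top$, so that $\nabla \Psi_k^*(m) = \eta H_k^{-1} m$. This quadratic is a stand-in chosen for illustration and is not the regularizer analyzed in the paper. Treating $m_k$ as the running sum of gradients is also an assumption, and `ftrl_matrix_step`, `eta`, and `delta` are hypothetical names.

```python
import numpy as np


def ftrl_matrix_step(m, S, G, eta=1.0, delta=1e-8):
    """One FTRL (dual-averaging) step with a left matrix preconditioner.

    Assumed stand-in regularizer: Psi_k(x) = <x, H_k x> / (2 * eta) with
    H_k = (delta * I + S_k)^(1/2), which gives
    x_{k+1} = -grad Psi_k^*(m_k) = -eta * H_k^(-1) @ m_k.
    The iterate comes directly from the dual state; nothing is projected.
    """
    m = m + G        # accumulated gradients (dual state)
    S = S + G @ G.T  # accumulated left second-moment matrix
    w, V = np.linalg.eigh(S)
    H_inv = (V / np.sqrt(np.maximum(w, 0.0) + delta)) @ V.T
    X_next = -eta * H_inv @ m
    return X_next, m, S


# Illustrative usage: feed the same rank-one gradient four times.
rng = np.random.default_rng(1)
G = rng.standard_normal((5, 1)) @ rng.standard_normal((1, 2))
m, S = np.zeros((5, 2)), np.zeros((5, 5))
iterates = []
for _ in range(4):
    X, m, S = ftrl_matrix_step(m, S, G, eta=0.5)
    iterates.append(X)
```

The design point this illustrates is that the new iterate is a closed-form function of the accumulated state `(m, S)`, so no constrained subproblem has to be solved per step.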

The key innovation is a lemma proving gradient stability for Leon’s preconditioner—bounding the change in the regularization gradient under incremental updates. This property enables sharper regret analyses, revealing that the additional regularization parameter $\delta$ can be sent to $0^+$ without incurring dimension-dependent regret or complexity terms, and without relying on uniform gradient bounds.

For Nesterov acceleration, the work designs an accelerated update for FTRL-Leon, using gradient differences analogous to UniXGrad (Kavis et al., 2019, Bernstein et al., 2024) and a smoothness-aware accumulation sequence. The analysis employs refined bounding of Bregman divergence terms and careful handling of stochastic noise and geometry, leading to optimal rates.

Main Results and Theoretical Guarantees

Projection-Free Leon (Without Acceleration):

Regret bound for online convex problems:

$\mathrm{Reg}_K \leq \mathcal{O}\bigl(\delta \mathcal{R} \dim(\mathcal{X}) + \mathcal{R} \lVert g_0 \rVert + \mathcal{R} \lVert \sqrt{S_K} \rVert\bigr)$

(with the key insight that $\delta > 0$ can be arbitrarily small, eliminating dimension dependence).

In stochastic non-smooth non-convex settings, a $(\gamma, \epsilon)$-stationary point is reached with dimension-independent iteration complexity, with no need for bounded gradient norms.

Accelerated Projection-Free Leon:

For convex, $\nu$-Hölder smooth objectives (possibly stochastic), reaching $\epsilon$ suboptimality requires

$\mathcal{O}(\epsilon^{-2/(1+3\nu)})$

iterations, with dimension-independent rates when $\delta$ is set sufficiently small.
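
As a quick check on the exponent in $\mathcal{O}(\epsilon^{-2/(1+3\nu)})$, the short sketch below evaluates it at a few values of $\nu$. It assumes the usual convention that $\nu \in [0, 1]$, with $\nu = 0$ the non-smooth end and $\nu = 1$ the Lipschitz-gradient end; the function name is hypothetical.

```python
from fractions import Fraction


def accelerated_exponent(nu):
    """Return p such that the iteration count scales as eps**(-p)."""
    nu = Fraction(nu)
    if not 0 <= nu <= 1:
        raise ValueError("nu is assumed to lie in [0, 1]")
    return Fraction(2) / (1 + 3 * nu)


for nu in (0, Fraction(1, 2), 1):
    print(f"nu = {nu}: iterations scale as eps^(-{accelerated_exponent(nu)})")
```

At $\nu = 1$ the exponent is $1/2$, the classical accelerated rate for smooth convex problems, and at $\nu = 0$ it is $2$, the familiar rate for non-smooth convex problems. Intermediate values of $\nu$ interpolate between the two.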

Numerical and Methodological Highlights

Theoretical guarantees are attained without costly quadratic projections—a practical advance, particularly for large-scale or deep learning settings where each projection is prohibitive.
The methods are robust to hyperparameter choice: the regularization parameter no longer needs tuning, which removes a practical obstacle in the existing analysis of Leon.
The analysis demonstrates that matrix-adaptive methods (e.g., One-sided Shampoo, projection-free variants) can provably exploit non-Euclidean geometries (e.g., spectral or block norms) in situations with low-rank gradient structure and high-rank solutions, as often observed in deep neural networks.
The unified analysis provides a plug-and-play framework for verifying convergence of projection-free adaptive methods using different preconditioning geometries.

Implications and Future Directions

The results remove critical barriers for the deployment of scalable, matrix-adaptive, projection-free SGD methods in high-dimensional deep learning contexts. The projection-free Leon and its accelerated analog offer theoretically sound, computationally efficient optimizers for large models and non-Euclidean geometries without resorting to post-step projections.

Theoretically, this establishes dimension-independent adaptivity for structured preconditioners in online and stochastic regimes, including non-smooth and non-convex problems. The unification with Nesterov-style acceleration for non-Euclidean geometries paves the way for further advances in fast, robust large-scale optimization.

Practically, the elimination of projections and extra hyperparameter tuning increases the accessibility and reliability of such optimizers in real-world training loops. The approach is particularly relevant for neural network architectures inducing low-rank or block-diagonal Hessian structures, such as multi-head attention or factorized weight layers.

Future work may explore:

Extension to distributed and federated optimization scenarios where projection-free updates further reduce communication or synchronization costs,
Application to implicit layers or bilevel optimization where projections are even less tractable,
Adaptation to non-monotone, composite, or constrained problems beyond spectral and infinity-ball geometries,
Improved empirical understanding and tuning guidelines in practical deep learning benchmarks.

Conclusion

This paper resolves key theoretical and practical challenges in projection-free adaptive matrix optimization. By providing dimension-free, projection-free regret and complexity bounds and supporting acceleration, the work places matrix-adaptive, preconditioned SGD variants as robust candidates for large-scale, non-Euclidean, and deep learning applications. The contribution significantly tightens the link between theoretical optimality and practical tractability in structured stochastic optimization (2604.02505).