- The paper eliminates tuning of Leon's regularization hyperparameter by leveraging a new gradient stability property, enabling projection-free adaptive SGD without dimension-dependent penalties.
- The paper introduces the first accelerated projection-free One-sided Shampoo variant that achieves optimal iteration complexity for ν-Hölder smooth convex objectives.
- The paper unifies the analysis for various preconditioning schemes, broadening the applicability of matrix-adaptive methods in high-dimensional deep learning settings.
Optimal Projection-Free Adaptive SGD for Matrix Optimization: An Expert Summary
Overview and Motivation
This paper addresses fundamental limitations of matrix-preconditioned adaptive stochastic optimization methods—particularly One-sided Shampoo and its recent projection-free variant, Leon—for online convex and non-convex optimization. The work delivers a dimension-independent, hyperparameter-free, projection-free analysis of adaptive SGD with matrix preconditioning. It further introduces the first Nesterov-accelerated, projection-free variant of One-sided Shampoo, with optimal iteration complexity for convex ν-Hölder smooth objectives.
Background: Structured Adaptive Methods and Projection-Free Variants
Most practical deep learning optimization pipelines rely on coordinatewise adaptive methods such as AdaGrad and Adam; these adjust per-parameter learning rates using diagonal preconditioners. More recent structured preconditioners, as in One-sided Shampoo, capture richer curvature, allowing adaptation to low-rank or block-diagonal structures prevalent in neural networks, leveraging the spectral norm geometry for matrix-parameter spaces (Xie et al., 13 Mar 2025, An et al., 26 Mar 2025). However, such methods typically require constraining iterates within a norm ball via expensive quadratic projections; in One-sided Shampoo, the spectral norm projection is a computational bottleneck.
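To make the contrast concrete, here is a minimal NumPy sketch of a coordinatewise AdaGrad step next to a left-preconditioned (One-sided Shampoo style) step for a matrix parameter. The function names, the `eps` damping, and the learning rate are illustrative assumptions, not taken from the cited papers. The projection shown is the plain Euclidean projection onto a spectral-norm ball, used here only as a simpler stand-in for the preconditioned quadratic projection described above.

```python
import numpy as np

def adagrad_step(X, G, acc, lr=0.1, eps=1e-8):
    """Diagonal (coordinatewise) AdaGrad step for a matrix parameter X."""
    acc = acc + G * G                          # per-entry sum of squared gradients
    X_new = X - lr * G / (np.sqrt(acc) + eps)  # per-entry learning rates
    return X_new, acc

def inv_sqrt_psd(S, eps=1e-8):
    """Damped inverse square root of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    w = np.maximum(w, 0.0)                     # clip tiny negative eigenvalues
    return (V / (np.sqrt(w) + eps)) @ V.T      # V diag(1 / (sqrt(w) + eps)) V^T

def one_sided_shampoo_step(X, G, S, lr=0.1, eps=1e-8):
    """Left-preconditioned step: X <- X - lr * (sum_k G_k G_k^T)^{-1/2} G."""
    S = S + G @ G.T                            # m x m accumulated second moment
    X_new = X - lr * inv_sqrt_psd(S, eps) @ G
    return X_new, S

def project_spectral_ball(X, radius):
    """Euclidean projection onto {X : ||X||_2 <= radius} by clipping singular values.

    Illustrative stand-in only: the projection in One-sided Shampoo is taken
    in the preconditioner-induced norm, which is more expensive than this.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.minimum(s, radius)) @ Vt

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G = rng.standard_normal((3, 5))
    X, S = one_sided_shampoo_step(np.zeros((3, 5)), G, np.zeros((3, 3)), lr=1.0)
    print("singular values of first Shampoo step:", np.linalg.svd(X, compute_uv=False))
    print("spectral norm after projection:", np.linalg.norm(project_spectral_ball(G, 0.5), 2))
```

On the very first step the left-preconditioned update equalizes the singular values of the gradient, which is one way to see the link to spectral-norm geometry. The SVD inside the projection is the kind of per-step cost that projection-free variants aim to avoid.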
Leon, recently proposed as a projection-free analog of One-sided Shampoo via an FTRL update, circumvents this projection (Jiang et al., 9 Feb 2026). However, its theoretical guarantees require tuning an extra regularization hyperparameter and only yield dimension-independent rates under strong gradient boundedness assumptions—a severe restriction in stochastic or non-smooth contexts.
Main Contributions
The paper addresses these deficiencies in three major aspects:
- Elimination of Hyperparameter Tuning in Projection-Free Leon: The authors establish a new gradient stability property for Leon’s matrix preconditioner and leverage it to show that the regularizer parameter can be sent to zero, removing the need for tuning. This yields dimension-independent convergence guarantees for Leon in both online and non-smooth non-convex optimization regimes, even without strict gradient norm constraints.
- First Accelerated Projection-Free One-sided Shampoo: By extending UniXGrad-style gradient difference accumulation, the authors construct an accelerated projection-free variant of FTRL-Leon, achieving the optimal accelerated complexity $\mathcal{O}(\epsilon^{-2/(1+3\nu)})$ for ν-Hölder smooth convex objectives. This parallels the complexity of Nesterov-accelerated methods in non-Euclidean geometries, now in a projection-free, preconditioned context.
- Unified Analysis Supporting General Preconditioning: The analysis framework is generic, supporting block-diagonal, diagonal, and even scalar step-size preconditioners. This greatly broadens the applicability of the theoretical results to a variety of adaptive methods beyond just Shampoo variants.
Technical Approach
The paper builds on FTRL with matrix regularizers. Crucially, the dual regularizer $\Psi_k^*(m)$ is parameterized via the squared accumulated preconditioned gradients within a suitable self-adjoint operator subspace, generalizing diagonal and matrix preconditioning. The projection-free update is defined as $x_{k+1} = -\nabla \Psi_k^*(m_k)$, avoiding explicit projections onto spectral norm balls.
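To illustrate why an update of this form needs no projection, consider the simplest special case: a purely quadratic regularizer. This is an assumption made here for illustration, not the regularizer used for Leon, and $H_k$ is a hypothetical positive definite self-adjoint preconditioner. The convex conjugate and its gradient are then available in closed form:

$$
\Psi_k(x) = \tfrac{1}{2}\langle x, H_k x\rangle
\;\Longrightarrow\;
\Psi_k^*(m) = \sup_x \big\{ \langle m, x\rangle - \tfrac{1}{2}\langle x, H_k x\rangle \big\} = \tfrac{1}{2}\langle m, H_k^{-1} m\rangle,
\qquad
\nabla \Psi_k^*(m) = H_k^{-1} m .
$$

$$
x_{k+1} = -\nabla \Psi_k^*(m_k) = -H_k^{-1} m_k .
$$

If $m_k$ is read as an accumulated gradient (again an assumption), this is an unconstrained preconditioned dual-averaging step: the iterate is obtained by one linear solve against $H_k$, with no constraint set and hence no projection.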
The key innovation is a lemma proving gradient stability for Leon’s preconditioner, which bounds the change in the regularization gradient under incremental updates. This property enables sharper regret analyses, revealing that the additional regularization parameter δ can be sent to 0+ without incurring dimension-dependent regret or complexity terms, and without relying on uniform gradient bounds.
For Nesterov acceleration, the work designs an accelerated update for FTRL-Leon, using gradient differences analogous to UniXGrad (Friedman, 2019, Bernstein et al., 2024) and a smoothness-aware accumulation sequence. The analysis employs refined bounding of Bregman divergence terms and careful handling of stochastic noise and geometry, leading to optimal rates.
Main Results and Theoretical Guarantees
Projection-Free Leon (Without Acceleration):
- Regret bound for online convex problems:
$\mathrm{Reg}_K \leq \mathcal{O}\big(\delta \mathcal{R} \dim(\mathcal{X}) + \mathcal{R} \lVert g_0 \rVert + \mathcal{R} \lVert \sqrt{S_K} \rVert\big)$
(with the key insight that δ>0 can be arbitrarily small, eliminating dimension dependence).
- In stochastic non-smooth non-convex settings: a (γ,ϵ)-stationary point is reached with a dimension-independent iteration bound, with no need for gradient norm bounding.
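As a hedged aside on how a regret bound of the kind above is usually turned into a convergence rate, the standard online-to-batch argument can be written out. This is a textbook step, not quoted from the paper. It assumes a fixed convex objective $f$, unbiased stochastic gradients $g_k$ evaluated at the iterates $x_k$, a comparator $x^\star$, and the averaged iterate $\bar{x}_K = \frac{1}{K}\sum_{k=1}^{K} x_k$. These symbols and the indexing are notation introduced here for illustration.

$$
\mathbb{E}\big[f(\bar{x}_K)\big] - f(x^\star)
\;\le\; \frac{1}{K}\sum_{k=1}^{K} \mathbb{E}\big[f(x_k) - f(x^\star)\big]
\;\le\; \frac{1}{K}\,\mathbb{E}\Big[\sum_{k=1}^{K} \langle g_k, x_k - x^\star\rangle\Big]
\;\le\; \frac{\mathbb{E}[\mathrm{Reg}_K]}{K}.
$$

The first inequality is Jensen's inequality, the second uses convexity together with unbiasedness of $g_k$, and the third holds when $\mathrm{Reg}_K$ upper-bounds the linearized regret against $x^\star$. Any regret bound that grows sublinearly in $K$ therefore gives a vanishing suboptimality gap for the averaged iterate.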
Accelerated Projection-Free Leon:
- For convex, ν-Hölder smooth objectives (possibly stochastic), reaching ϵ suboptimality requires
$\mathcal{O}(\epsilon^{-2/(1+3\nu)})$
iterations, with dimension-independent rates when δ is set sufficiently small.
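The accelerated rate $\mathcal{O}(\epsilon^{-2/(1+3\nu)})$ interpolates between two familiar regimes, which a short calculation makes explicit. The helper names below are hypothetical, and the leading constant is set to 1 purely for illustration, since the summary does not state the constants.

```python
def accelerated_complexity_exponent(nu: float) -> float:
    """Exponent p in the iteration complexity O(eps^{-p}), with p = 2 / (1 + 3*nu)."""
    if not 0.0 <= nu <= 1.0:
        raise ValueError("Hölder exponent nu must lie in [0, 1]")
    return 2.0 / (1.0 + 3.0 * nu)

def iterations_to_reach(eps: float, nu: float, constant: float = 1.0) -> float:
    """Order-of-magnitude iteration count; `constant` is an illustrative placeholder."""
    if eps <= 0.0:
        raise ValueError("eps must be positive")
    return constant * eps ** (-accelerated_complexity_exponent(nu))

if __name__ == "__main__":
    for nu in (0.0, 0.5, 1.0):
        p = accelerated_complexity_exponent(nu)
        print(f"nu={nu}: exponent={p:.3f}, iterations for eps=1e-4 ~ {iterations_to_reach(1e-4, nu):.3g}")
```

At ν = 1 (Lipschitz-continuous gradients) the exponent is 1/2, matching the classical Nesterov rate for smooth convex problems. At ν = 0 (bounded gradient variation, the non-smooth case) it is 2, matching the subgradient-method rate.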
Numerical and Methodological Highlights
- Theoretical guarantees are attained without costly quadratic projections—a practical advance, particularly for large-scale or deep learning settings where each projection is prohibitive.
- The methods are hyperparameter-robust: elimination of tuning for the regularization parameter removes a long-standing practical obstacle.
- The analysis demonstrates that matrix-adaptive methods (e.g., One-sided Shampoo and its projection-free variants) can provably exploit non-Euclidean geometries (e.g., spectral or block norms) in situations with low-rank gradient structure and high-rank solutions, as often observed in deep neural networks.
- The unified analysis provides a plug-and-play framework for verifying convergence of projection-free adaptive methods using different preconditioning geometries.
Implications and Future Directions
The results remove critical barriers to deploying scalable, matrix-adaptive, projection-free SGD methods in high-dimensional deep learning contexts. The projection-free Leon and its accelerated analog offer theoretically sound, computationally efficient optimizers for large models and non-Euclidean geometries without resorting to post-step projections.
Theoretically, this establishes dimension-independent adaptivity for structured preconditioners in online and stochastic regimes, including non-smooth and non-convex problems. The unification with Nesterov-style acceleration for non-Euclidean geometries paves the way for further advances in fast, robust large-scale optimization.
Practically, the elimination of projections and extra hyperparameter tuning increases the accessibility and reliability of such optimizers in real-world training loops. The approach is particularly relevant for neural network architectures inducing low-rank or block-diagonal Hessian structures, such as multi-head attention or factorized weight layers.
Future work may explore:
- Extension to distributed and federated optimization scenarios where projection-free updates further reduce communication or synchronization costs,
- Application to implicit layers or bilevel optimization where projections are even less tractable,
- Adaptation to non-monotone, composite, or constrained problems beyond spectral and infinity-ball geometries,
- Improved empirical understanding and tuning guidelines in practical deep learning benchmarks.
Conclusion
This paper resolves key theoretical and practical challenges in projection-free adaptive matrix optimization. By providing dimension-free, projection-free regret and complexity bounds and supporting acceleration, the work positions matrix-adaptive, preconditioned SGD variants as robust candidates for large-scale, non-Euclidean, and deep learning applications. The contribution significantly tightens the link between theoretical optimality and practical tractability in structured stochastic optimization (2604.02505).