SF-AdamW: Scale-Free Optimization

Updated 16 July 2025

SF-AdamW is a family of scale-invariant optimizers that extend AdamW to reduce learning rate sensitivity and improve training robustness.
It generalizes standard moment tracking by incorporating tunable exponents for gradient powers and update scaling, offering flexible adaptation to loss landscapes.
Empirical results on toy benchmarks and deep learning tasks demonstrate that SF-AdamW achieves faster convergence and higher accuracy compared to traditional AdamW.

SF-AdamW refers to "Scale-Free AdamW," a family of theoretical and practical extensions to the AdamW optimizer designed to address its sensitivity to learning rate scaling and improve its stability and generalization across differing problem structures, especially in deep learning contexts. SF-AdamW encompasses innovations in moment tracking, update magnitude modulation, and norm-invariant optimization, and it is most rigorously formulated and empirically tested in the Aida optimizer, which generalizes AdamW’s per-coordinate adaptivity through parameterized exponents. The term also denotes the broader principle of scale-freeness: optimizer updates that remain invariant under per-coordinate rescaling of the gradients, a property that has been shown to confer practical benefits in deep network training.

1. Motivation and Foundational Limitations of AdamW

AdamW is a widely used adaptive gradient method known for its decoupling of weight decay from the adaptive update step. Local convergence theory shows, however, that AdamW’s stability near optima depends critically on a sufficiently small learning rate. This places an unnaturally strict requirement on hyperparameter tuning, especially in the presence of strongly varying curvature or scaling across coordinates, which is common in deep neural networks. The need for improved learning rate tolerance and robustness to large-scale differences in gradient magnitudes motivated the development of SF-AdamW and its principal instantiation, Aida (Zhang et al., 2021).

2. Mathematical Structure and Generalization of AdamW

Aida extends AdamW by introducing two pivotal generalizations:

Generalized Second Moment Tracking: Instead of tracking the exponential moving average of squared gradients (the second moment, $v_t$ ), Aida tracks $r_t$ , the moving average of the $p$ -th power of gradient magnitudes:

$r_{t+1} = \beta_2 r_t + (1 - \beta_2) |g(x_t)|^p$

Here, $p \geq 1$ is a tunable hyperparameter; $p = 2$ recovers standard AdamW behavior.

Flexible Magnitude Scaling in Updates: The update direction modulates not only the first moment but its $q$ -th power, and scales this by $(r_{t+1} + \varepsilon)^{q/p}$ :

$x_{t+1} = (1-\mu) x_t - \eta \cdot \frac{|m_{t+1}|^q \cdot \mathrm{sign}(m_{t+1})}{(r_{t+1} + \varepsilon)^{q/p}}$

Setting $(p, q) = (2, 1)$ recovers AdamW, but Aida allows $q > 1$ and $p > 1$ , providing additional flexibility to shape adaptive per-coordinate learning rates.

This generalization enables SF-AdamW/Aida to modulate individual coordinate updates more aggressively or conservatively, depending on the curvature and dynamics of the loss surface.

3. Theoretical Analysis: Local Stability and Learning Rate Constraints

Local convergence of AdamW, as analyzed in a discrete dynamical systems perspective, requires that the learning rate $\eta$ satisfies:

$\frac{\eta \cdot \max_k \gamma_k}{\sqrt{\varepsilon}(1 - \beta_1)} < 2\beta_1 + 2(1-\mu)$

where $\gamma_k$ are eigenvalues of the Hessian and $\mu$ is the weight decay. This severely restricts the learning rate when $\gamma_k$ is large.

In contrast, Aida with $q > 1, p > 1$ fundamentally alters the Jacobian dynamics at the optimum. The key eigenvalues become functionally decoupled from $\eta$ and the Hessian; local stability is instead ensured solely by nonzero weight decay ( $\mu > 0$ ) with eigenvalues $\lambda_1 = \beta_1$ , $\lambda_2 = \beta_2$ , $\lambda_3 = 1-\mu$ . This relaxes constraints on the learning rate, enabling reliable use of larger step sizes while retaining local stability, provided proper weight decay is enforced.

4. Empirical Performance on Synthetic and Deep Learning Tasks

Extensive experiments were conducted on both toy optimization problems and modern deep learning applications:

Toy Problems: Across ten benchmark functions, Aida with $(p, q) \neq (2, 1)$ often outperformed standard AdamW, showing both faster convergence and lower residual gradients in many cases. The configuration $(1, 2)$ , in particular, demonstrated aggressive convergence on difficult loss landscapes.
Deep Learning Tasks: On WMT16 multimodal translation with Transformers, Aida $(1, 2)$ achieved a ≈3% gain in validation accuracy over standard AdamW. On Swin-Transformer for CIFAR10, similar accuracy improvements were observed. These results suggest that the generalized scaling and second-moment adaptation confer meaningful performance boosts in practical, large-scale training settings.

5. Scale-Freeness: Invariance to Gradient Rescalings

“Scale-freeness” denotes the property that the optimizer’s updates are invariant to coordinate-wise gradient rescaling. In AdamW (and thus SF-AdamW/Aida with $(p, q) = (2, 1)$ ), if any coordinate of the gradient is multiplied by a positive scalar, the structure of the first and second moments ensures this rescaling cancels in the update:

$x_t = (1 - \lambda\eta_t)x_{t-1} - \eta_t \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}$

This means the optimizer is robust to imbalance in gradient scales—a common situation in deep networks without normalization layers (Zhuang et al., 2022).

Empirical studies confirm that AdamW’s performance and hyperparameter sensitivity remain unchanged when the overall loss scale is modified, whereas standard Adam- $\ell_2$ quickly degrades under rescaling. Additionally, AdamW maintains concentrated update distributions, while Adam- $\ell_2$ produces updates that are widely spread due to scaling sensitivity.

6. Practical Implications and Tuning Guidelines

SF-AdamW/Aida’s flexibility introduces new hyperparameter dimensions ( $p$ , $q$ ), which can be tuned to exploit specific loss landscape geometries. Optimal performance often requires:

Using $q > 1$ and $p > 1$ for improved stability at larger learning rates, with nonzero weight decay.
Tailoring $(p, q)$ to the application: for deep learning tasks, $(1, 2)$ has been shown to yield measurable gains over standard settings.
Careful adjustment of weight decay, as it becomes central to guaranteeing stability when $q > 1$ .

In practical deployment, this means that model- and task-specific sweeps over $(p, q)$ and $\mu$ can lead to performance improvements and more robust convergence, especially for architectures where gradient magnitudes or curvature are highly non-uniform.

7. Broader Significance and Future Directions

The SF-AdamW paradigm broadens the set of theoretically principled, practical optimizer choices for deep learning. By relaxing the strong dependence of stability on small learning rates—and by leveraging norm-invariant adaptive mechanics—it enables more aggressive, robust, and scale-agnostic learning. Future work is directed at:

Extending these mechanisms to distributed and parallel optimization frameworks.
Investigating the interplay of $(p, q)$ and advanced weight decay schemes in even larger architectures.
Integrating these ideas with emerging norm-aware or memory-efficient optimizer families, and exploring their implications for generalization theory.

The scale-free property and parameterized adaptivity found in SF-AdamW/Aida represent a substantive advance in adaptive optimization for high-dimensional, nonlinear models, with both theoretical rigor and empirical validation supporting their continued development and adoption.

PDF Markdown Chat (Pro)

References (2)

Extending AdamW by Leveraging Its Second Moment and Magnitude (2021)

Understanding AdamW through Proximal Methods and Scale-Freeness (2022)

Follow Topic

Get notified by email when new papers are published related to SF-AdamW.