On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm (2505.11840v1)
Abstract: As the default optimizer for training LLMs, AdamW has achieved remarkable success in deep learning. However, its convergence behavior is not theoretically well-understood. This paper establishes the convergence rate $\frac{1}{K}\sum_{k=1}^{K}E\left[\|\nabla f(x_k)\|_1\right]\leq O(\frac{\sqrt{d}C}{K^{1/4}})$ for AdamW measured by the $\ell_1$ norm, where $K$ represents the iteration number, $d$ denotes the model dimension, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $E\left[\|\nabla f(x)\|_1\right]\geq\sqrt{\frac{2d}{\pi}}\,E\left[\|\nabla f(x)\|_2\right]$ when each element of $\nabla f(x)$ is generated from the Gaussian distribution $\mathcal N(0,1)$. Empirically, our experimental results on real-world deep learning tasks reveal $\|\nabla f(x)\|_1=\varTheta(\sqrt{d})\,\|\nabla f(x)\|_2$. Both support that our convergence rate can be considered analogous to the optimal $\frac{1}{K}\sum_{k=1}^{K}E\left[\|\nabla f(x_k)\|_2\right]\leq O(\frac{C}{K^{1/4}})$ convergence rate of SGD.
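The Gaussian bound in the abstract follows from $E[|g_i|]=\sqrt{2/\pi}$ for $g_i\sim\mathcal N(0,1)$, so $E[\|g\|_1]=d\sqrt{2/\pi}$ while $E[\|g\|_2]\approx\sqrt{d}$, giving a ratio of roughly $\sqrt{2d/\pi}$. A minimal numerical check of this relation (the dimension `d` and trial count are arbitrary choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000      # hypothetical "model dimension"
trials = 200    # number of random gradient draws

# Each row is one synthetic gradient with i.i.d. N(0, 1) entries.
g = rng.standard_normal((trials, d))

mean_l1 = np.abs(g).sum(axis=1).mean()          # estimate of E[||g||_1]
mean_l2 = np.linalg.norm(g, axis=1).mean()      # estimate of E[||g||_2]

ratio = mean_l1 / mean_l2
predicted = np.sqrt(2 * d / np.pi)              # sqrt(2d/pi) from the abstract

print(f"empirical ratio = {ratio:.2f}, predicted sqrt(2d/pi) = {predicted:.2f}")
```

For large $d$ the empirical ratio should sit essentially on top of $\sqrt{2d/\pi}$ ($\approx 79.8$ here), which is the sense in which the $\ell_1$-norm rate with its $\sqrt{d}$ factor is analogous to the $\ell_2$-norm rate of SGD.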