Old Optimizer, New Norm: An Anthology (2409.20325v2)

Published 30 Sep 2024 in cs.LG and math.OC

Abstract: Deep learning optimizers are often motivated through a mix of convex and approximate second-order theory. We select three such methods -- Adam, Shampoo and Prodigy -- and argue that each method can instead be understood as a squarely first-order method without convexity assumptions. In fact, after switching off exponential moving averages, each method is equivalent to steepest descent under a particular norm. By generalizing this observation, we chart a new design space for training algorithms. Different operator norms should be assigned to different tensors based on the role that the tensor plays within the network. For example, while linear and embedding layers may have the same weight space of $\mathbb{R}^{m\times n}$, these layers play different roles and should be assigned different norms. We hope that this idea of carefully metrizing the neural architecture might lead to more stable, scalable and indeed faster training.

Authors (2)
  1. Jeremy Bernstein (25 papers)
  2. Laker Newhouse (4 papers)
Citations (3)

Summary

  • The paper shows that Adam, Shampoo, and Prodigy, without exponential moving averages, function as steepest descent optimizers under distinct norm settings.
  • It systematically links sign descent, matrix projections, and operator norms to unifying principles in first-order deep learning optimization.
  • The authors propose that adopting modular and induced norms can enhance learning rate transfer and improve training robustness across neural network layers.

The paper "Old Optimizer, New Norm: An Anthology" posits that several deep learning optimizers, traditionally understood through the lens of convex or approximate second-order theory, can be re-interpreted as first-order methods that perform steepest descent under specific norms.

The authors focus on three optimizers: Adam, Shampoo, and Prodigy. The paper argues that, once their exponential moving averages (EMA) are switched off, each of these methods is equivalent to steepest descent under a particular norm. The paper suggests that EMA serves to smooth the algorithm and enhance its robustness against mini-batch noise.

The authors rely on the following steepest descent proposition:

\argmin_{\Delta \in R^n} \left[ g^\top \Delta + \frac{\lambda}{2} ||\Delta||^2 \right] = -\frac{||g||^\dagger}{\lambda} \cdot \argmax_{||d||=1} g^\top d

  • $\Delta \in R^n$: Weight update vector
  • $g \in R^n$: Gradient vector
  • $\lambda \geq 0$: Sharpness parameter
  • $||\cdot||$: Norm
  • $||\cdot||^\dagger$: Dual norm

The paper emphasizes that the art of steepest descent lies in selecting a norm $||\cdot||$ and a sharpness parameter $\lambda$ appropriate to the optimization problem. The paper claims that existing methods make implicit decisions about norms, but often in an unsystematic manner: these methods implicitly assign different induced matrix norms to the network layers.
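
As a concrete illustration (a sketch not taken from the paper), the following NumPy snippet checks this proposition numerically for two familiar norms: under the Euclidean norm the minimizer is $-g/\lambda$, recovering vanilla gradient descent, while under the $\ell_\infty$ norm it is $-(||g||_1/\lambda)\cdot\text{sign}(g)$, recovering sign descent.

```python
# Sketch (not from the paper): sanity-check the steepest descent proposition
# for two norms by comparing the closed-form minimizer against random candidates.
import numpy as np

rng = np.random.default_rng(0)
n, lam = 50, 3.0
g = rng.normal(size=n)

def objective(delta, norm):
    return g @ delta + 0.5 * lam * norm(delta) ** 2

# Euclidean norm: dual norm is Euclidean, argmax_{||d||_2=1} g.d = g/||g||_2,
# so the minimizer is -g / lam (vanilla gradient descent).
delta_l2 = -g / lam
# l_inf norm: dual norm is l_1, argmax_{||d||_inf=1} g.d = sign(g),
# so the minimizer is -(||g||_1 / lam) * sign(g) (sign descent).
delta_linf = -(np.abs(g).sum() / lam) * np.sign(g)

for name, norm, delta_star in [
    ("l2", np.linalg.norm, delta_l2),
    ("linf", lambda d: np.abs(d).max(), delta_linf),
]:
    best = objective(delta_star, norm)
    # Random candidates should never beat the closed-form minimizer.
    candidates = [objective(rng.normal(size=n), norm) for _ in range(10_000)]
    print(name, best <= min(candidates))  # expected: True, True
```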

The induced operator norm is defined as:

||A||_{\alpha\to\beta} = \max_{x \in R^{d_\text{in}}, \, x \neq 0} \frac{||Ax||_\beta}{||x||_\alpha}

  • $A \in R^{d_\text{out} \times d_\text{in}}$: A matrix
  • $(R^{d_\text{in}}, ||\cdot||_\alpha)$: Normed vector space of inputs
  • $(R^{d_\text{out}}, ||\cdot||_\beta)$: Normed vector space of outputs

The paper argues that by varying the choice of vector norms $||\cdot||_\alpha$ and $||\cdot||_\beta$, one can induce a family of matrix norms, which in turn implies a family of steepest descent optimizers.
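
To make the definition concrete, here is a small sketch (not from the paper) that lower-bounds $||A||_{\alpha\to\beta}$ by sampling random inputs and compares the $\ell_2\to\ell_2$ case against the exact spectral norm obtained from an SVD.

```python
# Sketch: estimate an induced operator norm ||A||_{alpha->beta} by sampling
# random inputs, and compare the l2->l2 case with the exact spectral norm.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))

def induced_norm_estimate(A, in_norm, out_norm, trials=100_000):
    best = 0.0
    for _ in range(trials):
        x = rng.normal(size=A.shape[1])
        best = max(best, out_norm(A @ x) / in_norm(x))
    return best  # a lower bound that tightens with more samples

l2 = np.linalg.norm
estimate = induced_norm_estimate(A, l2, l2)
exact = np.linalg.svd(A, compute_uv=False)[0]  # largest singular value
print(f"sampled lower bound {estimate:.4f} <= spectral norm {exact:.4f}")
```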

The paper connects Adam to sign gradient descent. With EMA switched off ($\beta_1 = \beta_2 = 0$), Adam updates reduce to:

\theta_{t+1} = \theta_t - \eta \cdot \text{sign}(\nabla_t),

where $\nabla_t$ denotes the gradient at step $t$.

The paper notes that sign descent solves the problem of steepest descent under the vector $\ell_\infty$ norm, $||\theta||_\infty = \max_i |\theta_i|$.
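
This reduction is easy to verify numerically. The sketch below (not from the paper, and omitting Adam's bias correction, which is inert when $\beta_1 = \beta_2 = 0$) shows that the Adam update $m_t/(\sqrt{v_t} + \epsilon)$ collapses to $g_t/(|g_t| + \epsilon)$, which approaches $\text{sign}(g_t)$ as $\epsilon \to 0$.

```python
# Sketch: with beta1 = beta2 = 0, Adam's per-coordinate update
# m / (sqrt(v) + eps) = g / (|g| + eps) approaches sign(g) as eps -> 0.
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=1000)
eps = 1e-12

m = g          # first moment with beta1 = 0 is just the raw gradient
v = g ** 2     # second moment with beta2 = 0 is the squared gradient
adam_update = m / (np.sqrt(v) + eps)

print(np.allclose(adam_update, np.sign(g), atol=1e-6))  # expected: True
```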

The paper connects the vector $\ell_\infty$ norm to neural network training. For a neural network with a list of $L$ weight matrices $W_1, \dots, W_L$, let $\text{row}_r(W_l)$ denote the $r$th row of the $l$th weight matrix, and let $W = \text{flatten}(W_1, \dots, W_L) \in R^n$ denote the full flattened weight vector. Then:

||W||_\infty = \max_{l} \max_r ||\text{row}_r(W_l)||_\infty = \max_{l} ||W_l||_{\ell_1\to\ell_\infty}

The paper refers to the quantity $\max_{l} ||W_l||_{\ell_1\to\ell_\infty}$ as the "max-of-max norm" and notes that sign descent emerges as steepest descent under this norm.
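
As a quick sanity check (not from the paper), the identity can be confirmed on random weight matrices: the $\ell_\infty$ norm of the flattened weights equals the largest absolute entry across layers, which is exactly the largest $\ell_1\to\ell_\infty$ operator norm.

```python
# Sketch: check that the l_inf norm of the flattened weights equals
# the max over layers of the l1->l_inf operator norm (max absolute entry).
import numpy as np

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(64, 32)), rng.normal(size=(32, 32)), rng.normal(size=(10, 32))]

flat_inf = np.abs(np.concatenate([W.ravel() for W in Ws])).max()
max_of_max = max(np.abs(W).max() for W in Ws)  # ||W_l||_{l1->l_inf} = max_ij |W_l[i, j]|

print(np.isclose(flat_inf, max_of_max))  # expected: True
```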

For a list of gradient matrices $G_1, \dots, G_L$ and any sharpness $\lambda > 0$, consider the problem:

\argmin_{\Delta W_1, \dots, \Delta W_L} \left[ \sum_{l=1}^L \langle G_l, \Delta W_l \rangle + \frac{\lambda}{2} \max_{l=1}^L ||\Delta W_l||_{\ell_1\to\ell_\infty}^2 \right],

where $\langle\cdot, \cdot\rangle$ denotes the Frobenius inner product, and $\Delta W_l$ has the same shape as $G_l$. For step size $\eta = \frac{1}{\lambda}\sum_{l=1}^L ||G_l||_{\ell_1\to\ell_\infty}^\dagger$, where $\dagger$ denotes the dual norm, the problem is solved by $\Delta W_l = -\eta \cdot \text{sign}(G_l)$ for each layer $l = 1, \dots, L$.
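
A minimal sketch of this solution (not from the paper): the dual of the $\ell_1\to\ell_\infty$ norm (the largest absolute entry) is the entrywise $\ell_1$ norm, so the global step size sums the absolute entries of every gradient matrix, and each layer then takes a scaled sign step.

```python
# Sketch: steepest descent under the "max-of-max" norm.
# The dual of the l1->l_inf norm (max |entry|) is the sum of |entries|,
# so eta aggregates the entrywise l1 norms of all gradient matrices.
import numpy as np

rng = np.random.default_rng(0)
grads = [rng.normal(size=(64, 32)), rng.normal(size=(32, 16))]
lam = 10.0

eta = sum(np.abs(G).sum() for G in grads) / lam   # (1/lambda) * sum_l ||G_l||^dagger
updates = [-eta * np.sign(G) for G in grads]      # Delta W_l = -eta * sign(G_l)

for G, dW in zip(grads, updates):
    print(dW.shape, np.abs(dW).max())  # every layer moves by eta in max-entry norm
```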

The authors note that this observation (that sign descent implicitly performs per-matrix gradient normalization) may be a major reason why Adam, sign descent, and Lion outperform vanilla gradient descent in LLM training.

The paper shows that Shampoo updates, without accumulation, are semi-orthogonal matrices. At time step $t$ and for each layer, Shampoo collects the gradient matrix $G_t$ and updates the weight matrix $W_t$ as follows:

L_t = L_{t-1} + G_t G_t^\top

R_t = R_{t-1} + G_t^\top G_t

W_{t+1} = W_t - \eta \cdot L_t^{-\frac{1}{4}} G_t R_t^{-\frac{1}{4}}

The accumulators $L_t$ and $R_t$ are referred to as the left and right preconditioners. If accumulation is disabled, by setting $L_t = G_t G_t^\top$ and $R_t = G_t^\top G_t$, Shampoo reduces to:

W_{t+1} = W_t - \eta \cdot (G_t G_t^\top)^{-\frac{1}{4}} \, G_t \, (G_t^\top G_t)^{-\frac{1}{4}}

Shampoo without accumulation projects the gradient matrix onto the closest semi-orthogonal matrix in Frobenius norm. For the set of semi-orthogonal matrices $\mathcal{O}_{m \times n} = \{A \in R^{m \times n} : A^\top A = I_n \text{ or } A A^\top = I_m\}$ and the Frobenius norm $||\cdot||_F$, any matrix $A \in R^{m \times n}$ with reduced SVD $A = U \Sigma V^\top$ satisfies:

\argmin_{X \in \mathcal{O}_{m \times n}} ||A - X||_F = U V^\top,

where the minimizer $U V^\top$ is unique if and only if the matrix $\Sigma$ has full rank.
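
A short numerical check (not from the paper) ties the two statements together: the accumulation-free Shampoo step $(G G^\top)^{-1/4} G \,(G^\top G)^{-1/4}$ coincides with the semi-orthogonal projection $U V^\top$ computed from the SVD of $G$.

```python
# Sketch: without accumulation, the Shampoo step (G G^T)^{-1/4} G (G^T G)^{-1/4}
# equals the projection U V^T of G onto the semi-orthogonal matrices.
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(5, 5))  # square and (almost surely) full rank, so the
                             # inverse fourth roots below are well defined

def inv_fourth_root(M):
    # inverse fourth root of a symmetric positive-definite matrix
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag(w ** -0.25) @ Q.T

U, s, Vt = np.linalg.svd(G)          # reduced SVD: G = U diag(s) Vt
projection = U @ Vt                  # closest semi-orthogonal matrix to G

shampoo_step = inv_fourth_root(G @ G.T) @ G @ inv_fourth_root(G.T @ G)

print(np.allclose(shampoo_step, projection))  # expected: True
```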

The paper claims that Shampoo is steepest descent under the maximum spectral norm $||A||_{\ell_2 \to \ell_2}$ over all the matrices in the network.

For gradient matrices $G_1, \dots, G_L$ and sharpness $\lambda > 0$:

\argmin_{\Delta W_1, \dots, \Delta W_L} \left[ \sum_{l=1}^L \langle G_l, \Delta W_l \rangle + \frac{\lambda}{2} \max_{l=1}^L ||\Delta W_l||_{\ell_2\to\ell_2}^2 \right],

where $\langle\cdot, \cdot\rangle$ denotes the Frobenius inner product and $\Delta W_l$ has the same shape as $G_l$. If $G_l$ has reduced SVD $G_l = U_l \Sigma_l V_l^\top$ for each $l = 1, \dots, L$, then the problem is solved with step size $\eta = \frac{1}{\lambda}\sum_{l=1}^L \text{tr}\,\Sigma_l$ and update $\Delta W_l = -\eta \cdot U_l V_l^\top$ for each $l = 1, \dots, L$.
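
A minimal sketch of this update (not from the paper): the dual of the spectral norm is the nuclear norm $\text{tr}\,\Sigma_l$, so the global step size sums the singular values across layers, and each layer moves in the direction $-U_l V_l^\top$.

```python
# Sketch: steepest descent under the max spectral norm across layers.
# The dual of the spectral norm is the nuclear norm (sum of singular values).
import numpy as np

rng = np.random.default_rng(0)
grads = [rng.normal(size=(64, 32)), rng.normal(size=(32, 16))]
lam = 10.0

svds = [np.linalg.svd(G, full_matrices=False) for G in grads]
eta = sum(s.sum() for _, s, _ in svds) / lam       # (1/lambda) * sum_l tr(Sigma_l)
updates = [-eta * (U @ Vt) for U, _, Vt in svds]   # Delta W_l = -eta * U_l V_l^T

for dW in updates:
    # every singular value of U V^T is 1, so each update has spectral norm eta
    print(dW.shape, np.linalg.svd(dW, compute_uv=False)[0])
```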

The paper connects the spectral norm to the square loss. For a matrix $W \in R^{d_\text{out}\times d_\text{in}}$ (a linear predictor mapping an input $x \in R^{d_\text{in}}$ to an output $y = Wx \in R^{d_\text{out}}$) and a dataset of $n$ samples $\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\}$, where each input is normalized so that $||x_i||_2 = \sqrt{d_\text{in}}$, the square loss is:

L(W) = \frac{1}{2n}\sum_{i=1}^n \frac{1}{d_\text{out}} ||y_i - W x_i||_2^2.

Then, for any weight update $\Delta W \in R^{d_\text{out}\times d_\text{in}}$:

L(W + \Delta W) \leq L(W) + \langle \nabla L(W), \Delta W \rangle + \frac{1}{2} \cdot \frac{d_\text{in}}{d_\text{out}} \cdot ||\Delta W||_{\ell_2\to\ell_2}^2,

where $\langle\cdot,\cdot\rangle$ is the Frobenius inner product.

The square loss of a linear predictor admits an upper bound that is quadratic in the spectral norm of the weight perturbation. Choosing the weight perturbation to minimize this upper bound is steepest descent under the spectral norm.
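
The sketch below (not from the paper) checks this bound numerically, with the $\frac{1}{2}\cdot\frac{d_\text{in}}{d_\text{out}}$ coefficient, for inputs normalized to $||x_i||_2 = \sqrt{d_\text{in}}$ and a random perturbation $\Delta W$.

```python
# Sketch: verify the quadratic upper bound on the square loss of a linear
# predictor, L(W + dW) <= L(W) + <grad L(W), dW> + (1/2)(d_in/d_out)||dW||_spec^2.
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 200, 32, 8

X = rng.normal(size=(n, d_in))
X *= np.sqrt(d_in) / np.linalg.norm(X, axis=1, keepdims=True)  # ||x_i||_2 = sqrt(d_in)
Y = rng.normal(size=(n, d_out))
W = rng.normal(size=(d_out, d_in))
dW = 0.1 * rng.normal(size=(d_out, d_in))

def loss(W):
    return 0.5 / (n * d_out) * np.sum((Y - X @ W.T) ** 2)

grad = (W @ X.T - Y.T) @ X / (n * d_out)           # gradient of the loss in W
spec = np.linalg.svd(dW, compute_uv=False)[0]      # spectral norm of dW
bound = loss(W) + np.sum(grad * dW) + 0.5 * (d_in / d_out) * spec ** 2

print(loss(W + dW) <= bound + 1e-12)  # expected: True
```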

The paper claims that Prodigy (without EMA) is another example of steepest descent, where instead of using the step size $\eta = ||g||^\dagger / \lambda$ from the steepest descent proposition, Prodigy uses a heuristic to automatically warm up to a good step size.

With EMA switched off ($\beta_1 = \beta_2 = 0$), the Prodigy updates simplify dramatically to sign gradient descent with a dynamical step size:

\eta_{t+1} = \max\left(\eta_t, \frac{G_t^\top (W_0 - W_t)}{||G_t||_1}\right)

W_{t+1} = W_t - \eta_t \cdot \text{sign}(G_t).
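
As a toy illustration (not from the paper), the sketch below runs these two updates on a simple quadratic loss, starting from a deliberately tiny step size; the printed step sizes grow rapidly from the initial value, showing the automatic warm-up.

```python
# Sketch: Prodigy without EMA on a toy quadratic loss 0.5 * ||W - W_star||^2,
# starting from a deliberately tiny step size to show the automatic warm-up.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
W_star = rng.normal(size=n)
W0 = np.zeros(n)

W, eta = W0.copy(), 1e-6
for t in range(30):
    G = W - W_star                                        # gradient of the quadratic
    new_eta = max(eta, G @ (W0 - W) / np.abs(G).sum())    # eta_{t+1} from the rule above
    W = W - eta * np.sign(G)                              # sign descent step uses eta_t
    eta = new_eta
    if t % 5 == 0:
        print(f"step {t:2d}  eta = {eta:.3e}")
```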

The paper claims that Prodigy without EMA is steepest descent with a dynamically chosen step size $\eta_t$, and that the dynamical rule approximates a heuristic algorithm for achieving "escape velocity": choose a very small initial step size $\eta_0$; at each step, check whether the weights $W_t$ have escaped the linearization of the loss around the initial weights $W_0$; and if not, double the step size, $\eta_{t+1} = 2 \eta_t$.

The paper then massages Prodigy's step size update as follows:

\eta_{t+1} = \max\left(\eta_t, \frac{G_t^\top (W_0 - W_t)}{||G_t||_1}\right) = \max\left(\eta_t, \frac{||G_t||_2}{||G_t||_1} \times ||W_t - W_0||_2 \times \cos\theta\right),

where $\theta$ denotes the angle between the gradient $G_t$ and the difference in weights $W_0 - W_t$. The paper then makes two assumptions: the gradient is a "dense" vector in $R^n$, meaning that $||G_t||_2 / ||G_t||_1 \approx 1/\sqrt{n}$, and $W_t$ is still close enough to the initialization $W_0$ that $\cos\theta \approx 1$. Under these assumptions, the update becomes simply $\eta_{t+1} \approx \max(\eta_t, ||W_t - W_0||_{\text{RMS}})$, where the root mean square (RMS) norm is defined via $||\cdot||_{\text{RMS}} = \frac{1}{\sqrt{n}} ||\cdot||_2$. Combined with $W_{t+1} = W_t - \eta_t \cdot \text{sign}(G_t)$, this lets the authors estimate the size of the weight change at step $t+1$:

||W_{t+2} - W_{t+1}||_{\text{RMS}} = \eta_{t+1} \cdot ||\text{sign}(G_{t+1})||_{\text{RMS}} \approx \max(\eta_t, ||W_t - W_0||_{\text{RMS}}) \geq ||W_t - W_0||_{\text{RMS}}.

The paper then introduces the modular norm and its corresponding steepest descent algorithm. Given scalar coefficients $s_1, \dots, s_L > 0$ and norms $||\cdot||_1, \dots, ||\cdot||_L$, the modular norm is defined as the mapping:

W_1, \dots, W_L \mapsto \max\{s_1 ||W_1||_1, \dots, s_L ||W_L||_L\}.

The corresponding steepest descent problem is:

\argmin_{\Delta W_1, \dots, \Delta W_L} \left[ \sum_{l=1}^L \langle G_l, \Delta W_l \rangle + \frac{\lambda}{2} \max_{l=1}^L s_l^2 ||\Delta W_l||_l^2 \right],

where $\langle\cdot, \cdot\rangle$ denotes the Frobenius inner product, and for each $l = 1, \dots, L$ the matrices $\Delta W_l$ and $G_l$ have the same shape. With global step size $\eta = \frac{1}{\lambda}\sum_{k=1}^L \frac{1}{s_k} ||G_k||_k^\dagger$, the solution is:

\Delta W_l = -\frac{\eta}{s_l} \cdot \argmax_{||W_l||_l = 1} \langle G_l, W_l \rangle \quad \text{for each layer } l = 1, \dots, L.
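
A minimal sketch of this recipe (not from the paper), assuming a two-layer example in which one layer is assigned the $\ell_1\to\ell_\infty$ norm (its $\argmax$ is $\text{sign}(G_l)$ and its dual norm is the entrywise sum of absolute values) and the other the spectral norm (its $\argmax$ is $U_l V_l^\top$ and its dual norm is the nuclear norm):

```python
# Sketch: steepest descent under a modular norm that mixes per-layer norms.
# Layer "embed" uses the l1->l_inf norm; layer "linear" uses the l2->l2 norm.
import numpy as np

rng = np.random.default_rng(0)
layers = {
    "embed": {"G": rng.normal(size=(100, 32)), "s": 1.0, "norm": "sign"},
    "linear": {"G": rng.normal(size=(32, 32)), "s": 2.0, "norm": "spectral"},
}
lam = 10.0

def dual_norm_and_argmax(G, kind):
    if kind == "sign":                  # dual of max-|entry| norm is sum of |entries|
        return np.abs(G).sum(), np.sign(G)
    U, svals, Vt = np.linalg.svd(G, full_matrices=False)
    return svals.sum(), U @ Vt          # dual of spectral norm is the nuclear norm

# global step size: eta = (1/lambda) * sum_k (1/s_k) * ||G_k||_k^dagger
eta = sum(dual_norm_and_argmax(l["G"], l["norm"])[0] / l["s"]
          for l in layers.values()) / lam

for name, l in layers.items():
    _, direction = dual_norm_and_argmax(l["G"], l["norm"])
    dW = -(eta / l["s"]) * direction    # Delta W_l = -(eta / s_l) * argmax <G_l, .>
    print(name, dW.shape)
```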

The paper notes that the $\ell_1\to\ell_p$ operator norm is the largest $\ell_p$ norm of the columns, while the $\ell_p\to\ell_\infty$ operator norm is the largest dual-$\ell_p$ norm of the rows. For a matrix $A \in R^{m\times n}$ with $m$ rows $\{\text{row}_i(A)\}_{i=1}^m$ and $n$ columns $\{\text{col}_j(A)\}_{j=1}^n$, and $1 \leq p \leq \infty$:

||A||_{\ell_1\to\ell_p} = \max_j ||\text{col}_j(A)||_p

||A||_{\ell_p\to\ell_\infty} = \max_i ||\text{row}_i(A)||_{\frac{p}{p-1}}.
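
These closed forms are easy to check numerically. The sketch below (not from the paper) implements both formulas, confirms that they agree on the $\ell_1\to\ell_\infty$ norm (the largest absolute entry), and verifies that random inputs never exceed the closed-form value.

```python
# Sketch: closed forms for two families of induced operator norms,
# checked against a random-input lower bound.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 9))

def norm_l1_to_lp(A, p):
    return max(np.linalg.norm(A[:, j], p) for j in range(A.shape[1]))   # max column lp norm

def norm_lp_to_linf(A, p):
    q = np.inf if p == 1 else p / (p - 1)                               # dual exponent
    return max(np.linalg.norm(A[i, :], q) for i in range(A.shape[0]))   # max row lq norm

# both formulas recover the l1 -> l_inf norm (the largest absolute entry)
assert np.isclose(norm_l1_to_lp(A, np.inf), np.abs(A).max())
assert np.isclose(norm_lp_to_linf(A, 1), np.abs(A).max())

# random inputs never exceed the closed-form value of ||A||_{l1 -> l2}
value = norm_l1_to_lp(A, 2)
ratios = []
for _ in range(10_000):
    x = rng.normal(size=A.shape[1])
    ratios.append(np.linalg.norm(A @ x, 2) / np.linalg.norm(x, 1))
print(max(ratios) <= value + 1e-12)  # expected: True
```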

The paper concludes that equipping neural network layers with well-chosen norms could lead to learning rate transfer across scale.