Old Optimizer, New Norm: An Anthology (2409.20325v2)

Published 30 Sep 2024 in cs.LG and math.OC

Abstract: Deep learning optimizers are often motivated through a mix of convex and approximate second-order theory. We select three such methods -- Adam, Shampoo and Prodigy -- and argue that each method can instead be understood as a squarely first-order method without convexity assumptions. In fact, after switching off exponential moving averages, each method is equivalent to steepest descent under a particular norm. By generalizing this observation, we chart a new design space for training algorithms. Different operator norms should be assigned to different tensors based on the role that the tensor plays within the network. For example, while linear and embedding layers may have the same weight space of $\mathbb{R}^{m\times n}$, these layers play different roles and should be assigned different norms. We hope that this idea of carefully metrizing the neural architecture might lead to more stable, scalable and indeed faster training.

Authors (2)
  1. Jeremy Bernstein (25 papers)
  2. Laker Newhouse (4 papers)
Citations (3)

Summary

  • The paper shows that Adam, Shampoo, and Prodigy, without exponential moving averages, function as steepest descent optimizers under distinct norm settings.
  • It systematically links sign descent, matrix projections, and operator norms to unifying principles in first-order deep learning optimization.
  • The authors propose that adopting modular and induced norms can enhance learning rate transfer and improve training robustness across neural network layers.

The paper "Old Optimizer, New Norm: An Anthology" posits that several deep learning optimizers, traditionally understood through the lens of convex or approximate second-order theory, can be re-interpreted as first-order methods that perform steepest descent under specific norms.

The authors focus on three optimizers: Adam, Shampoo, and Prodigy. The paper argues that, once their exponential moving averages (EMA) are switched off, each of these methods is equivalent to steepest descent under a particular norm. The paper suggests that EMA serves to smooth the algorithm and enhance its robustness against mini-batch noise.

The authors rely on the following steepest descent proposition:

\argmin_{\Delta \in R^n} \left[ g^\top \Delta + \frac{\lambda}{2} ||\Delta||^2 \right] = -\frac{||g||^\dagger}{\lambda} \cdot \argmax_{||d||=1} g^\top d

  • $\Delta \in R^n$: Weight update vector
  • $g \in R^n$: Gradient vector
  • $\lambda \geq 0$: Sharpness parameter
  • $||\cdot||$: Norm
  • $||\cdot||^\dagger$: Dual norm

The paper emphasizes that the art of steepest descent lies in selecting a norm $||\cdot||$ and a sharpness parameter $\lambda$ appropriate to the optimization problem. The paper claims that existing methods make implicit decisions about norms, but often in an unsystematic manner: these methods implicitly assign different induced matrix norms to the network layers.
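
As a concrete illustration (a sketch not taken from the paper), the following NumPy snippet checks this proposition numerically for two familiar norms: under the Euclidean norm the minimizer is $-g/\lambda$, recovering vanilla gradient descent, while under the $\ell_\infty$ norm it is $-(||g||_1/\lambda)\cdot\text{sign}(g)$, recovering sign descent.

```python
# Sketch (not from the paper): sanity-check the steepest descent proposition
# for two norms by comparing the closed-form minimizer against random candidates.
import numpy as np

rng = np.random.default_rng(0)
n, lam = 50, 3.0
g = rng.normal(size=n)

def objective(delta, norm):
    return g @ delta + 0.5 * lam * norm(delta) ** 2

# Euclidean norm: dual norm is Euclidean, argmax_{||d||_2=1} g.d = g/||g||_2,
# so the minimizer is -g / lam (vanilla gradient descent).
delta_l2 = -g / lam
# l_inf norm: dual norm is l_1, argmax_{||d||_inf=1} g.d = sign(g),
# so the minimizer is -(||g||_1 / lam) * sign(g) (sign descent).
delta_linf = -(np.abs(g).sum() / lam) * np.sign(g)

for name, norm, delta_star in [
    ("l2", np.linalg.norm, delta_l2),
    ("linf", lambda d: np.abs(d).max(), delta_linf),
]:
    best = objective(delta_star, norm)
    # Random candidates should never beat the closed-form minimizer.
    candidates = [objective(rng.normal(size=n), norm) for _ in range(10_000)]
    print(name, best <= min(candidates))  # expected: True, True
```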

The induced operator norm is defined as:

||A||_{\alpha\to\beta} = \max_{x \in R^{d_\text{in}}, \, x \neq 0} \frac{||Ax||_\beta}{||x||_\alpha}

  • $A \in R^{d_\text{out} \times d_\text{in}}$: A matrix
  • $(R^{d_\text{in}}, ||\cdot||_\alpha)$: Normed vector space of inputs
  • $(R^{d_\text{out}}, ||\cdot||_\beta)$: Normed vector space of outputs

The paper argues that by varying the choice of vector norms $||\cdot||_\alpha$ and $||\cdot||_\beta$, one can induce a family of matrix norms, which in turn implies a family of steepest descent optimizers.
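
To make the definition concrete, here is a small sketch (not from the paper) that lower-bounds $||A||_{\alpha\to\beta}$ by sampling random inputs and compares the $\ell_2\to\ell_2$ case against the exact spectral norm obtained from an SVD.

```python
# Sketch: estimate an induced operator norm ||A||_{alpha->beta} by sampling
# random inputs, and compare the l2->l2 case with the exact spectral norm.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))

def induced_norm_estimate(A, in_norm, out_norm, trials=100_000):
    best = 0.0
    for _ in range(trials):
        x = rng.normal(size=A.shape[1])
        best = max(best, out_norm(A @ x) / in_norm(x))
    return best  # a lower bound that tightens with more samples

l2 = np.linalg.norm
estimate = induced_norm_estimate(A, l2, l2)
exact = np.linalg.svd(A, compute_uv=False)[0]  # largest singular value
print(f"sampled lower bound {estimate:.4f} <= spectral norm {exact:.4f}")
```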

The paper connects Adam to sign gradient descent. With EMA switched off ($\beta_1 = \beta_2 = 0$), Adam updates reduce to:

\theta_{t+1} = \theta_t - \eta \cdot \text{sign}(\nabla_t),

where $\nabla_t$ denotes the gradient at step $t$.

The paper notes that sign descent solves the problem of steepest descent under the vector $\ell_\infty$ norm, $||\theta||_\infty = \max_i |\theta_i|$.
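
This reduction is easy to verify numerically. The sketch below (not from the paper, and omitting Adam's bias correction, which is inert when $\beta_1 = \beta_2 = 0$) shows that the Adam update $m_t/(\sqrt{v_t} + \epsilon)$ collapses to $g_t/(|g_t| + \epsilon)$, which approaches $\text{sign}(g_t)$ as $\epsilon \to 0$.

```python
# Sketch: with beta1 = beta2 = 0, Adam's per-coordinate update
# m / (sqrt(v) + eps) = g / (|g| + eps) approaches sign(g) as eps -> 0.
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=1000)
eps = 1e-12

m = g          # first moment with beta1 = 0 is just the raw gradient
v = g ** 2     # second moment with beta2 = 0 is the squared gradient
adam_update = m / (np.sqrt(v) + eps)

print(np.allclose(adam_update, np.sign(g), atol=1e-6))  # expected: True
```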

The paper connects the vector $\ell_\infty$ norm to neural network training. For a neural network with a list of $L$ weight matrices $W_1, \dots, W_L$, let $\text{row}_r(W_l)$ denote the $r$th row of the $l$th weight matrix, and let $W = \text{flatten}(W_1, \dots, W_L) \in R^n$ denote the full flattened weight vector. Then:

||W||_\infty = \max_{l} \max_r ||\text{row}_r(W_l)||_\infty = \max_{l} ||W_l||_{\ell_1\to\ell_\infty}

The paper refers to the quantity $\max_{l} ||W_l||_{\ell_1\to\ell_\infty}$ as the "max-of-max norm" and notes that sign descent emerges as steepest descent under this norm.
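
As a quick sanity check (not from the paper), the identity can be confirmed on random weight matrices: the $\ell_\infty$ norm of the flattened weights equals the largest absolute entry across layers, which is exactly the largest $\ell_1\to\ell_\infty$ operator norm.

```python
# Sketch: check that the l_inf norm of the flattened weights equals
# the max over layers of the l1->l_inf operator norm (max absolute entry).
import numpy as np

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(64, 32)), rng.normal(size=(32, 32)), rng.normal(size=(10, 32))]

flat_inf = np.abs(np.concatenate([W.ravel() for W in Ws])).max()
max_of_max = max(np.abs(W).max() for W in Ws)  # ||W_l||_{l1->l_inf} = max_ij |W_l[i, j]|

print(np.isclose(flat_inf, max_of_max))  # expected: True
```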

For a list of gradient matrices $G_1, \dots, G_L$ and any sharpness $\lambda > 0$, consider the problem:

\argmin_{\Delta W_1, \dots, \Delta W_L} \left[ \sum_{l=1}^L \langle G_l, \Delta W_l \rangle + \frac{\lambda}{2} \max_{l=1}^L ||\Delta W_l||_{\ell_1\to\ell_\infty}^2 \right],

where $\langle\cdot, \cdot\rangle$ denotes the Frobenius inner product, and $\Delta W_l$ has the same shape as $G_l$. For step size $\eta = \frac{1}{\lambda}\sum_{l=1}^L ||G_l||_{\ell_1\to\ell_\infty}^\dagger$, where $\dagger$ denotes the dual norm, the problem is solved by $\Delta W_l = -\eta \cdot \text{sign}(G_l)$ for each layer $l = 1, \dots, L$.
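
A minimal sketch of this solution (not from the paper): the dual of the $\ell_1\to\ell_\infty$ norm (the largest absolute entry) is the entrywise $\ell_1$ norm, so the global step size sums the absolute entries of every gradient matrix, and each layer then takes a scaled sign step.

```python
# Sketch: steepest descent under the "max-of-max" norm.
# The dual of the l1->l_inf norm (max |entry|) is the sum of |entries|,
# so eta aggregates the entrywise l1 norms of all gradient matrices.
import numpy as np

rng = np.random.default_rng(0)
grads = [rng.normal(size=(64, 32)), rng.normal(size=(32, 16))]
lam = 10.0

eta = sum(np.abs(G).sum() for G in grads) / lam   # (1/lambda) * sum_l ||G_l||^dagger
updates = [-eta * np.sign(G) for G in grads]      # Delta W_l = -eta * sign(G_l)

for G, dW in zip(grads, updates):
    print(dW.shape, np.abs(dW).max())  # every layer moves by eta in max-entry norm
```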

The authors note that this observation (that sign descent implicitly performs per-matrix gradient normalization) may be a major reason why Adam, sign descent, and Lion outperform vanilla gradient descent in LLM training.

The paper shows that Shampoo updates, without accumulation, are semi-orthogonal matrices. At time step $t$ and for each layer, Shampoo collects the gradient matrix $G_t$ and updates the weight matrix $W_t$ as follows:

L_t = L_{t-1} + G_t G_t^\top

R_t = R_{t-1} + G_t^\top G_t

W_{t+1} = W_t - \eta \cdot L_t^{-\frac{1}{4}} G_t R_t^{-\frac{1}{4}}

The accumulators $L_t$ and $R_t$ are referred to as the left and right preconditioners. If accumulation is disabled, by setting $L_t = G_t G_t^\top$ and $R_t = G_t^\top G_t$, Shampoo reduces to:

W_{t+1} = W_t - \eta \cdot (G_t G_t^\top)^{-\frac{1}{4}} \, G_t \, (G_t^\top G_t)^{-\frac{1}{4}}

Shampoo without accumulation projects the gradient matrix onto the closest semi-orthogonal matrix in Frobenius norm. For the set of semi-orthogonal matrices $\mathcal{O}_{m \times n} = \{A \in R^{m \times n} : A^\top A = I_n \text{ or } A A^\top = I_m\}$ and the Frobenius norm $||\cdot||_F$, any matrix $A \in R^{m \times n}$ with reduced SVD $A = U \Sigma V^\top$ satisfies:

\argmin_{X \in \mathcal{O}_{m \times n}} ||A - X||_F = U V^\top,

where the minimizer $U V^\top$ is unique if and only if the matrix $\Sigma$ has full rank.
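
A short numerical check (not from the paper) ties the two statements together: the accumulation-free Shampoo step $(G G^\top)^{-1/4} G \,(G^\top G)^{-1/4}$ coincides with the semi-orthogonal projection $U V^\top$ computed from the SVD of $G$.

```python
# Sketch: without accumulation, the Shampoo step (G G^T)^{-1/4} G (G^T G)^{-1/4}
# equals the projection U V^T of G onto the semi-orthogonal matrices.
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(5, 5))  # square and (almost surely) full rank, so the
                             # inverse fourth roots below are well defined

def inv_fourth_root(M):
    # inverse fourth root of a symmetric positive-definite matrix
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag(w ** -0.25) @ Q.T

U, s, Vt = np.linalg.svd(G)          # reduced SVD: G = U diag(s) Vt
projection = U @ Vt                  # closest semi-orthogonal matrix to G

shampoo_step = inv_fourth_root(G @ G.T) @ G @ inv_fourth_root(G.T @ G)

print(np.allclose(shampoo_step, projection))  # expected: True
```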

The paper claims that Shampoo is steepest descent under the maximum spectral norm $||A||_{\ell_2 \to \ell_2}$ over all the matrices in the network.

For gradient matrices $G_1, \dots, G_L$ and sharpness $\lambda > 0$:

\argmin_{\Delta W_1, \dots, \Delta W_L} \left[ \sum_{l=1}^L \langle G_l, \Delta W_l \rangle + \frac{\lambda}{2} \max_{l=1}^L ||\Delta W_l||_{\ell_2\to\ell_2}^2 \right],

where $\langle\cdot, \cdot\rangle$ denotes the Frobenius inner product and $\Delta W_l$ has the same shape as $G_l$. If $G_l$ has reduced SVD $G_l = U_l \Sigma_l V_l^\top$ for each $l = 1, \dots, L$, then the problem is solved with step size $\eta = \frac{1}{\lambda}\sum_{l=1}^L \text{tr}\,\Sigma_l$ and update $\Delta W_l = -\eta \cdot U_l V_l^\top$ for each $l = 1, \dots, L$.
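
A minimal sketch of this update (not from the paper): the dual of the spectral norm is the nuclear norm $\text{tr}\,\Sigma_l$, so the global step size sums the singular values across layers, and each layer moves in the direction $-U_l V_l^\top$.

```python
# Sketch: steepest descent under the max spectral norm across layers.
# The dual of the spectral norm is the nuclear norm (sum of singular values).
import numpy as np

rng = np.random.default_rng(0)
grads = [rng.normal(size=(64, 32)), rng.normal(size=(32, 16))]
lam = 10.0

svds = [np.linalg.svd(G, full_matrices=False) for G in grads]
eta = sum(s.sum() for _, s, _ in svds) / lam       # (1/lambda) * sum_l tr(Sigma_l)
updates = [-eta * (U @ Vt) for U, _, Vt in svds]   # Delta W_l = -eta * U_l V_l^T

for dW in updates:
    # every singular value of U V^T is 1, so each update has spectral norm eta
    print(dW.shape, np.linalg.svd(dW, compute_uv=False)[0])
```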

The paper connects the spectral norm to the square loss. For a matrix $W \in R^{d_\text{out}\times d_\text{in}}$ (a linear predictor mapping an input $x \in R^{d_\text{in}}$ to an output $y = Wx \in R^{d_\text{out}}$) and a dataset of $n$ samples $\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\}$, where each input is normalized so that $||x_i||_2 = \sqrt{d_\text{in}}$, the square loss is:

L(W) = \frac{1}{2n}\sum_{i=1}^n \frac{1}{d_\text{out}} ||y_i - W x_i||_2^2.

Then, for any weight update $\Delta W \in R^{d_\text{out}\times d_\text{in}}$:

L(W + \Delta W) \leq L(W) + \langle \nabla L(W), \Delta W \rangle + \frac{1}{2} \cdot \frac{d_\text{in}}{d_\text{out}} \cdot ||\Delta W||_{\ell_2\to\ell_2}^2,

where $\langle\cdot,\cdot\rangle$ is the Frobenius inner product.

The square loss of a linear predictor admits an upper bound that is quadratic in the spectral norm of the weight perturbation. Choosing the weight perturbation to minimize this upper bound is steepest descent under the spectral norm.
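
The sketch below (not from the paper) checks this bound numerically, with the $\frac{1}{2}\cdot\frac{d_\text{in}}{d_\text{out}}$ coefficient, for inputs normalized to $||x_i||_2 = \sqrt{d_\text{in}}$ and a random perturbation $\Delta W$.

```python
# Sketch: verify the quadratic upper bound on the square loss of a linear
# predictor, L(W + dW) <= L(W) + <grad L(W), dW> + (1/2)(d_in/d_out)||dW||_spec^2.
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 200, 32, 8

X = rng.normal(size=(n, d_in))
X *= np.sqrt(d_in) / np.linalg.norm(X, axis=1, keepdims=True)  # ||x_i||_2 = sqrt(d_in)
Y = rng.normal(size=(n, d_out))
W = rng.normal(size=(d_out, d_in))
dW = 0.1 * rng.normal(size=(d_out, d_in))

def loss(W):
    return 0.5 / (n * d_out) * np.sum((Y - X @ W.T) ** 2)

grad = (W @ X.T - Y.T) @ X / (n * d_out)           # gradient of the loss in W
spec = np.linalg.svd(dW, compute_uv=False)[0]      # spectral norm of dW
bound = loss(W) + np.sum(grad * dW) + 0.5 * (d_in / d_out) * spec ** 2

print(loss(W + dW) <= bound + 1e-12)  # expected: True
```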

The paper claims that Prodigy (without EMA) is another example of steepest descent, where instead of using the step size $\eta = ||g||^\dagger / \lambda$ from the steepest descent proposition, Prodigy uses a heuristic to automatically warm up to a good step size.

With EMA switched off ($\beta_1 = \beta_2 = 0$), the Prodigy updates simplify dramatically to sign gradient descent with a dynamical step size:

\eta_{t+1} = \max\left(\eta_t, \frac{G_t^\top (W_0 - W_t)}{||G_t||_1}\right)

W_{t+1} = W_t - \eta_t \cdot \text{sign}(G_t).
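
As a toy illustration (not from the paper), the sketch below runs these two updates on a simple quadratic loss, starting from a deliberately tiny step size; the printed step sizes grow rapidly from the initial value, showing the automatic warm-up.

```python
# Sketch: Prodigy without EMA on a toy quadratic loss 0.5 * ||W - W_star||^2,
# starting from a deliberately tiny step size to show the automatic warm-up.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
W_star = rng.normal(size=n)
W0 = np.zeros(n)

W, eta = W0.copy(), 1e-6
for t in range(30):
    G = W - W_star                                        # gradient of the quadratic
    new_eta = max(eta, G @ (W0 - W) / np.abs(G).sum())    # eta_{t+1} from the rule above
    W = W - eta * np.sign(G)                              # sign descent step uses eta_t
    eta = new_eta
    if t % 5 == 0:
        print(f"step {t:2d}  eta = {eta:.3e}")
```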

The paper claims that Prodigy without EMA is steepest descent with a dynamically chosen step size $\eta_t$, and that the dynamical rule approximates a heuristic algorithm for achieving "escape velocity": choose a very small initial step size $\eta_0$; at each step, check whether the weights $W_t$ have escaped the linearization of the loss around the initial weights $W_0$; and if not, double the step size, $\eta_{t+1} = 2 \eta_t$.

The paper then massages Prodigy's step size update as follows:

\eta_{t+1} = \max\left(\eta_t, \frac{G_t^\top (W_0 - W_t)}{||G_t||_1}\right) = \max\left(\eta_t, \frac{||G_t||_2}{||G_t||_1} \times ||W_t - W_0||_2 \times \cos\theta\right),

where $\theta$ denotes the angle between the gradient $G_t$ and the difference in weights $W_0 - W_t$. The paper then makes two assumptions: the gradient is a "dense" vector in $R^n$, meaning that $||G_t||_2 / ||G_t||_1 \approx 1/\sqrt{n}$, and $W_t$ is still close enough to the initialization $W_0$ that $\cos\theta \approx 1$. Under these assumptions, the update becomes simply $\eta_{t+1} \approx \max(\eta_t, ||W_t - W_0||_{\text{RMS}})$, where the root mean square (RMS) norm is defined via $||\cdot||_{\text{RMS}} = \frac{1}{\sqrt{n}} ||\cdot||_2$. Combined with $W_{t+1} = W_t - \eta_t \cdot \text{sign}(G_t)$, this lets the authors estimate the size of the weight change at step $t+1$:

||W_{t+2} - W_{t+1}||_{\text{RMS}} = \eta_{t+1} \cdot ||\text{sign}(G_{t+1})||_{\text{RMS}} \approx \max(\eta_t, ||W_t - W_0||_{\text{RMS}}) \geq ||W_t - W_0||_{\text{RMS}}.

The paper then introduces the modular norm and its corresponding steepest descent algorithm. Given scalar coefficients $s_1, \dots, s_L > 0$ and norms $||\cdot||_1, \dots, ||\cdot||_L$, the modular norm is defined as the mapping:

W_1, \dots, W_L \mapsto \max\{s_1 ||W_1||_1, \dots, s_L ||W_L||_L\}.

The corresponding steepest descent problem is:

\argmin_{\Delta W_1, \dots, \Delta W_L} \left[ \sum_{l=1}^L \langle G_l, \Delta W_l \rangle + \frac{\lambda}{2} \max_{l=1}^L s_l^2 ||\Delta W_l||_l^2 \right],

where $\langle\cdot, \cdot\rangle$ denotes the Frobenius inner product, and for each $l = 1, \dots, L$ the matrices $\Delta W_l$ and $G_l$ have the same shape. With global step size $\eta = \frac{1}{\lambda}\sum_{k=1}^L \frac{1}{s_k} ||G_k||_k^\dagger$, the solution is:

\Delta W_l = -\frac{\eta}{s_l} \cdot \argmax_{||W_l||_l = 1} \langle G_l, W_l \rangle \quad \text{for each layer } l = 1, \dots, L.
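
A minimal sketch of this recipe (not from the paper), assuming a two-layer example in which one layer is assigned the $\ell_1\to\ell_\infty$ norm (its $\argmax$ is $\text{sign}(G_l)$ and its dual norm is the entrywise sum of absolute values) and the other the spectral norm (its $\argmax$ is $U_l V_l^\top$ and its dual norm is the nuclear norm):

```python
# Sketch: steepest descent under a modular norm that mixes per-layer norms.
# Layer "embed" uses the l1->l_inf norm; layer "linear" uses the l2->l2 norm.
import numpy as np

rng = np.random.default_rng(0)
layers = {
    "embed": {"G": rng.normal(size=(100, 32)), "s": 1.0, "norm": "sign"},
    "linear": {"G": rng.normal(size=(32, 32)), "s": 2.0, "norm": "spectral"},
}
lam = 10.0

def dual_norm_and_argmax(G, kind):
    if kind == "sign":                  # dual of max-|entry| norm is sum of |entries|
        return np.abs(G).sum(), np.sign(G)
    U, svals, Vt = np.linalg.svd(G, full_matrices=False)
    return svals.sum(), U @ Vt          # dual of spectral norm is the nuclear norm

# global step size: eta = (1/lambda) * sum_k (1/s_k) * ||G_k||_k^dagger
eta = sum(dual_norm_and_argmax(l["G"], l["norm"])[0] / l["s"]
          for l in layers.values()) / lam

for name, l in layers.items():
    _, direction = dual_norm_and_argmax(l["G"], l["norm"])
    dW = -(eta / l["s"]) * direction    # Delta W_l = -(eta / s_l) * argmax <G_l, .>
    print(name, dW.shape)
```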

The paper notes that the $\ell_1\to\ell_p$ operator norm is the largest $\ell_p$ norm of the columns, while the $\ell_p\to\ell_\infty$ operator norm is the largest dual-$\ell_p$ norm of the rows. For a matrix $A \in R^{m\times n}$ with $m$ rows $\{\text{row}_i(A)\}_{i=1}^m$ and $n$ columns $\{\text{col}_j(A)\}_{j=1}^n$, and $1 \leq p \leq \infty$:

||A||_{\ell_1\to\ell_p} = \max_j ||\text{col}_j(A)||_p

||A||_{\ell_p\to\ell_\infty} = \max_i ||\text{row}_i(A)||_{\frac{p}{p-1}}.
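
These closed forms are easy to check numerically. The sketch below (not from the paper) implements both formulas, confirms that they agree on the $\ell_1\to\ell_\infty$ norm (the largest absolute entry), and verifies that random inputs never exceed the closed-form value.

```python
# Sketch: closed forms for two families of induced operator norms,
# checked against a random-input lower bound.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 9))

def norm_l1_to_lp(A, p):
    return max(np.linalg.norm(A[:, j], p) for j in range(A.shape[1]))   # max column lp norm

def norm_lp_to_linf(A, p):
    q = np.inf if p == 1 else p / (p - 1)                               # dual exponent
    return max(np.linalg.norm(A[i, :], q) for i in range(A.shape[0]))   # max row lq norm

# both formulas recover the l1 -> l_inf norm (the largest absolute entry)
assert np.isclose(norm_l1_to_lp(A, np.inf), np.abs(A).max())
assert np.isclose(norm_lp_to_linf(A, 1), np.abs(A).max())

# random inputs never exceed the closed-form value of ||A||_{l1 -> l2}
value = norm_l1_to_lp(A, 2)
ratios = []
for _ in range(10_000):
    x = rng.normal(size=A.shape[1])
    ratios.append(np.linalg.norm(A @ x, 2) / np.linalg.norm(x, 1))
print(max(ratios) <= value + 1e-12)  # expected: True
```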

The paper concludes that equipping neural network layers with well-chosen norms could lead to learning rate transfer across scale.