Old Optimizer, New Norm: An Anthology
(2409.20325v2)
Published 30 Sep 2024 in cs.LG and math.OC
Abstract: Deep learning optimizers are often motivated through a mix of convex and approximate second-order theory. We select three such methods -- Adam, Shampoo and Prodigy -- and argue that each method can instead be understood as a squarely first-order method without convexity assumptions. In fact, after switching off exponential moving averages, each method is equivalent to steepest descent under a particular norm. By generalizing this observation, we chart a new design space for training algorithms. Different operator norms should be assigned to different tensors based on the role that the tensor plays within the network. For example, while linear and embedding layers may have the same weight space of $\mathbb{R}^{m\times n}$, these layers play different roles and should be assigned different norms. We hope that this idea of carefully metrizing the neural architecture might lead to more stable, scalable and indeed faster training.
The paper shows that Adam, Shampoo, and Prodigy, without exponential moving averages, function as steepest descent optimizers under distinct norm settings.
It systematically links sign descent, matrix projections, and operator norms to unifying principles in first-order deep learning optimization.
The authors propose that adopting modular and induced norms can enhance learning rate transfer and improve training robustness across neural network layers.
The paper "Old Optimizer, New Norm: An Anthology" posits that several deep learning optimizers, traditionally understood through the lens of convex or approximate second-order theory, can be re-interpreted as first-order methods that perform steepest descent under specific norms.
The authors focus on three optimizers: Adam, Shampoo, and Prodigy. After disabling their exponential moving averages (EMA), the paper argues that each of these methods is equivalent to steepest descent under a particular norm. The paper suggests that EMA serves to smooth the algorithm and enhance its robustness against mini-batch noise.
The authors rely on the following proposition characterizing steepest descent under an arbitrary norm:
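(Reconstructed here rather than quoted, based on how the proposition is invoked later in this summary, in particular the dual-norm step size $\eta = \|g\|^\dagger/\lambda$.) For a gradient $g \in \mathbb{R}^n$, a sharpness $\lambda > 0$, and any norm $\|\cdot\|$ on $\mathbb{R}^n$ with dual norm $\|\cdot\|^\dagger$, steepest descent takes the form:
$$\arg\min_{\Delta w \in \mathbb{R}^n}\left[\langle g, \Delta w\rangle + \frac{\lambda}{2}\,\|\Delta w\|^2\right] \;=\; -\frac{\|g\|^\dagger}{\lambda}\cdot\arg\max_{\|t\| = 1}\langle g, t\rangle.$$
In words: the step direction depends only on the choice of norm, while the step size is the dual norm of the gradient divided by the sharpness.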
The paper emphasizes that the art of steepest descent lies in selecting a norm $\|\cdot\|$ and a sharpness parameter $\lambda$ appropriate to the optimization problem. It claims that existing methods already make implicit decisions about norms, but often in an unsystematic manner: in effect, these methods implicitly assign different induced matrix norms to the network layers.
The induced operator norm is defined as:
$$\|A\|_{\alpha\to\beta} = \max_{x \in \mathbb{R}^{d_{\mathrm{in}}},\, x \neq 0} \frac{\|Ax\|_\beta}{\|x\|_\alpha},$$
where $A \in \mathbb{R}^{d_{\mathrm{out}}\times d_{\mathrm{in}}}$ is a matrix, $(\mathbb{R}^{d_{\mathrm{in}}}, \|\cdot\|_\alpha)$ is the normed vector space of inputs, and $(\mathbb{R}^{d_{\mathrm{out}}}, \|\cdot\|_\beta)$ is the normed vector space of outputs.
The paper argues that by varying the choice of vector norms $\|\cdot\|_\alpha$ and $\|\cdot\|_\beta$, one can induce a family of matrix norms, which in turn implies a family of steepest descent optimizers.
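As a concrete illustration (a minimal numpy sketch, not code from the paper), here are several induced operator norms of the same random matrix obtained from different choices of $\alpha$ and $\beta$; each choice corresponds to a different steepest descent update:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))  # d_out = 4, d_in = 6

# l1 -> linf: largest absolute entry (the norm behind sign descent)
l1_to_linf = np.abs(A).max()

# l1 -> l2: largest Euclidean norm over the columns
l1_to_l2 = np.linalg.norm(A, axis=0).max()

# l2 -> l2: spectral norm, i.e. the largest singular value (the norm behind Shampoo)
l2_to_l2 = np.linalg.svd(A, compute_uv=False)[0]

# l2 -> linf: largest Euclidean norm over the rows
l2_to_linf = np.linalg.norm(A, axis=1).max()

print(l1_to_linf, l1_to_l2, l2_to_l2, l2_to_linf)
```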
The paper connects Adam to sign gradient descent. With EMA switched off ($\beta_1 = \beta_2 = 0$), the Adam update reduces to:
$$\theta_{t+1} = \theta_t - \eta \cdot \mathrm{sign}(\nabla_t).$$
The paper notes that sign descent solves the problem of steepest descent under the vector $\ell_\infty$ norm, $\|\theta\|_\infty = \max_i |\theta_i|$.
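A quick numerical illustration of this reduction (a sketch, with bias correction omitted since the correction factors equal one when $\beta_1 = \beta_2 = 0$):

```python
import numpy as np

def adam_step_no_ema(g, lr=0.1, eps=1e-12):
    """One Adam step with beta1 = beta2 = 0: the moments collapse to g and g**2."""
    m = g              # first moment with beta1 = 0
    v = g ** 2         # second moment with beta2 = 0
    return -lr * m / (np.sqrt(v) + eps)   # = -lr * g / (|g| + eps) ≈ -lr * sign(g)

g = np.array([0.3, -2.0, 1e-4, -7.5])
print(adam_step_no_ema(g))     # ≈ [-0.1,  0.1, -0.1,  0.1]
print(-0.1 * np.sign(g))       # matches up to the effect of eps
```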
The paper connects the vector $\ell_\infty$ norm to neural network training. For a neural network with a list of $L$ weight matrices $W_1,\dots,W_L$, let $\mathrm{row}_r(W_l)$ denote the $r$th row of the $l$th weight matrix, and let $w = \mathrm{flatten}(W_1,\dots,W_L) \in \mathbb{R}^n$ denote the full flattened weight vector. Then:
$$\|w\|_\infty = \max_{l=1,\dots,L}\,\max_{r}\,\|\mathrm{row}_r(W_l)\|_\infty = \max_{l=1,\dots,L}\,\|W_l\|_{\ell_1\to\ell_\infty}.$$
The paper refers to the object $\max_l \|W_l\|_{\ell_1\to\ell_\infty}$ as the "max-of-max norm," and notes that sign descent emerges as steepest descent under this norm.
For a list of gradient matrices $G_1,\dots,G_L$ and any sharpness $\lambda > 0$, consider the problem:
$$\arg\min_{\Delta W_1,\dots,\Delta W_L}\left[\sum_{l=1}^{L}\langle G_l, \Delta W_l\rangle + \frac{\lambda}{2}\max_{l=1,\dots,L}\|\Delta W_l\|_{\ell_1\to\ell_\infty}^2\right],$$
where $\langle\cdot,\cdot\rangle$ denotes the Frobenius inner product, and $\Delta W_l$ has the same shape as $G_l$. For step size $\eta = \frac{1}{\lambda}\sum_{l=1}^{L}\|G_l\|_{\ell_1\to\ell_\infty}^{\dagger}$, where $\dagger$ denotes the dual norm, the problem is solved by:
$$\Delta W_l = -\eta \cdot \mathrm{sign}(G_l) \quad \text{for each layer } l = 1,\dots,L.$$
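A minimal numpy sketch of this update (not the authors' implementation), using the fact that the dual of the max-absolute-entry norm $\|\cdot\|_{\ell_1\to\ell_\infty}$ is the sum of absolute entries, together with a crude numerical check that the closed-form solution is never beaten by random candidates:

```python
import numpy as np

def max_of_max_sign_step(grads, lam):
    """Steepest descent step under the max-of-max norm max_l ||W_l||_{l1->linf}."""
    # Dual norm of the max-absolute-entry norm is the sum of absolute entries.
    eta = sum(np.abs(G).sum() for G in grads) / lam
    return eta, [-eta * np.sign(G) for G in grads]

def objective(deltas, grads, lam):
    """Per-layer linearization plus the squared max-of-max penalty."""
    inner = sum(np.vdot(G, D) for G, D in zip(grads, deltas))
    penalty = max(np.abs(D).max() for D in deltas) ** 2
    return inner + 0.5 * lam * penalty

rng = np.random.default_rng(0)
grads = [rng.standard_normal((4, 3)), rng.standard_normal((6, 4))]
eta, deltas = max_of_max_sign_step(grads, lam=5.0)
best = objective(deltas, grads, lam=5.0)

# Random candidates of comparable scale never achieve a lower objective value.
assert all(best <= objective([eta * rng.uniform(-1, 1, G.shape) for G in grads],
                             grads, lam=5.0)
           for _ in range(1000))
```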
The authors note that this observation (that sign descent updates are implicitly doing per-matrix gradient normalization) may be a major reason that Adam, sign descent, and Lion outperform vanilla gradient descent in LLM training.
The paper shows that the Shampoo update, without accumulation, is a semi-orthogonal matrix. At time step $t$ and for each layer, Shampoo collects the gradient matrix $G_t$ and makes the following update to the weight matrix $W_t$:
$$L_t = L_{t-1} + G_t G_t^\top,$$
$$R_t = R_{t-1} + G_t^\top G_t,$$
$$W_{t+1} = W_t - \eta \cdot L_t^{-1/4}\, G_t\, R_t^{-1/4}.$$
The accumulators $L_t$ and $R_t$ are referred to as the "left and right preconditioners." If accumulation is disabled, so that $L_t = G_t G_t^\top$ and $R_t = G_t^\top G_t$, Shampoo reduces to:
$$W_{t+1} = W_t - \eta \cdot (G_t G_t^\top)^{-1/4}\, G_t\, (G_t^\top G_t)^{-1/4}.$$
Shampoo without accumulation projects the gradient matrix to the closest semi-orthogonal matrix in Frobenius norm. For the set of semi-orthogonal matrices $\mathcal{O}_{m\times n} = \{A \in \mathbb{R}^{m\times n} : A^\top A = I_n \text{ or } A A^\top = I_m\}$ and the Frobenius norm $\|\cdot\|_F$, any matrix $A \in \mathbb{R}^{m\times n}$ with reduced SVD $A = U\Sigma V^\top$ satisfies:
$$\arg\min_{X \in \mathcal{O}_{m\times n}} \|A - X\|_F = U V^\top,$$
where the minimizer $UV^\top$ is unique if and only if $\Sigma$ has full rank.
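A small numerical check of this claim (a sketch, assuming a generic full-rank square gradient so that the inverse fourth roots exist): the accumulation-free Shampoo direction $(G G^\top)^{-1/4} G\,(G^\top G)^{-1/4}$ coincides with the projection $UV^\top$.

```python
import numpy as np

def inv_fourth_root(M):
    """M^{-1/4} for a symmetric positive-definite matrix, via eigendecomposition."""
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag(w ** -0.25) @ Q.T

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 5))                 # generic, hence full rank

shampoo_dir = inv_fourth_root(G @ G.T) @ G @ inv_fourth_root(G.T @ G)
U, s, Vt = np.linalg.svd(G, full_matrices=False)

print(np.allclose(shampoo_dir, U @ Vt, atol=1e-6))                     # True
print(np.allclose(shampoo_dir.T @ shampoo_dir, np.eye(5), atol=1e-6))  # semi-orthogonal
```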
The paper claims that Shampoo is steepest descent under the maximum spectral norm over all the matrices in the network, $\max_l \|W_l\|_{\ell_2\to\ell_2}$.
For gradient matrices $G_1,\dots,G_L$ and sharpness $\lambda > 0$:
$$\arg\min_{\Delta W_1,\dots,\Delta W_L}\left[\sum_{l=1}^{L}\langle G_l, \Delta W_l\rangle + \frac{\lambda}{2}\max_{l=1,\dots,L}\|\Delta W_l\|_{\ell_2\to\ell_2}^2\right],$$
where $\langle\cdot,\cdot\rangle$ denotes the Frobenius inner product and $\Delta W_l$ has the same shape as $G_l$. If $G_l$ has reduced SVD $G_l = U_l \Sigma_l V_l^\top$ for each $l = 1,\dots,L$, then the problem is solved with step size $\eta = \frac{1}{\lambda}\sum_{l=1}^{L}\operatorname{tr}\Sigma_l$ and update:
$$\Delta W_l = -\eta \cdot U_l V_l^\top \quad \text{for each } l = 1,\dots,L.$$
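A minimal numpy sketch of this per-layer update (not the authors' code), computing the semi-orthogonal factors and the prescribed step size:

```python
import numpy as np

def spectral_steepest_descent_step(grads, lam):
    """Steepest descent step under the maximum spectral norm max_l ||W_l||_{l2->l2}.

    Returns Delta W_l = -eta * U_l V_l^T with eta = (1/lam) * sum_l tr(Sigma_l).
    """
    svds = [np.linalg.svd(G, full_matrices=False) for G in grads]
    eta = sum(s.sum() for _, s, _ in svds) / lam   # sum of nuclear norms
    return [-eta * U @ Vt for U, s, Vt in svds]

rng = np.random.default_rng(0)
updates = spectral_steepest_descent_step(
    [rng.standard_normal((8, 4)), rng.standard_normal((16, 8))], lam=10.0)
# Each update is a scaled semi-orthogonal matrix: (dW / eta)^T (dW / eta) ≈ I.
```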
The paper connects the spectral norm to the square loss. For a matrix $W \in \mathbb{R}^{d_{\mathrm{out}}\times d_{\mathrm{in}}}$ (a linear predictor mapping an input $x \in \mathbb{R}^{d_{\mathrm{in}}}$ to an output $y = Wx \in \mathbb{R}^{d_{\mathrm{out}}}$), and a dataset of $n$ samples $D = \{(x_1, y_1),\dots,(x_n, y_n)\}$ in which the $i$th input is normalized so that $\|x_i\|_2 = \sqrt{d_{\mathrm{in}}}$, the square loss is:
$$\mathcal{L}(W) = \frac{1}{2n}\sum_{i=1}^{n}\frac{1}{d_{\mathrm{out}}}\|y_i - W x_i\|_2^2.$$
Then, for any matrix $\Delta W \in \mathbb{R}^{d_{\mathrm{out}}\times d_{\mathrm{in}}}$ (a weight update):
$$\mathcal{L}(W + \Delta W) \le \mathcal{L}(W) + \langle\nabla\mathcal{L}(W), \Delta W\rangle + \frac{1}{2}\cdot\frac{d_{\mathrm{in}}}{d_{\mathrm{out}}}\cdot\|\Delta W\|_{\ell_2\to\ell_2}^2,$$
where $\langle\cdot,\cdot\rangle$ is the Frobenius inner product.
The square loss of a linear predictor admits an upper bound that is quadratic in the spectral norm of the weight perturbation. Choosing the weight perturbation to minimize this upper bound is steepest descent under the spectral norm.
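A short derivation sketch (filling in the step behind the stated bound, under the normalization $\|x_i\|_2 = \sqrt{d_{\mathrm{in}}}$): since the loss is exactly quadratic in $W$, expanding and applying $\|\Delta W x_i\|_2 \le \|\Delta W\|_{\ell_2\to\ell_2}\,\|x_i\|_2$ gives
$$\mathcal{L}(W + \Delta W) = \mathcal{L}(W) + \langle\nabla\mathcal{L}(W), \Delta W\rangle + \frac{1}{2n\,d_{\mathrm{out}}}\sum_{i=1}^{n}\|\Delta W x_i\|_2^2 \;\le\; \mathcal{L}(W) + \langle\nabla\mathcal{L}(W), \Delta W\rangle + \frac{1}{2}\cdot\frac{d_{\mathrm{in}}}{d_{\mathrm{out}}}\cdot\|\Delta W\|_{\ell_2\to\ell_2}^2,$$
which is where the $\tfrac{d_{\mathrm{in}}}{d_{\mathrm{out}}}$ factor comes from.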
The paper claims that Prodigy (without EMA) is another example of steepest descent, except that instead of using the step size $\eta = \|g\|^\dagger/\lambda$ from the steepest descent proposition, Prodigy uses a heuristic to automatically warm up to a good step size.
With EMA switched off ($\beta_1 = \beta_2 = 0$), the Prodigy updates simplify dramatically to sign gradient descent with a dynamical step size:
$$\eta_{t+1} = \max\!\left(\eta_t,\ \frac{G_t^\top (W_0 - W_t)}{\|G_t\|_1}\right),$$
$$W_{t+1} = W_t - \eta_t \cdot \mathrm{sign}(G_t).$$
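A minimal sketch of this EMA-free update rule in numpy (the small `eps` guarding the division is an addition for numerical safety, not part of the stated rule):

```python
import numpy as np

def prodigy_no_ema_step(W, W0, G, eta, eps=1e-12):
    """One Prodigy step with EMA switched off.

    W, W0: current weights and weights at initialization (flattened vectors)
    G:     current gradient, eta: current dynamical step size
    Returns the new weights and the step size for the next iteration.
    """
    W_next = W - eta * np.sign(G)
    eta_next = max(eta, float(G @ (W0 - W)) / (np.abs(G).sum() + eps))
    return W_next, eta_next
```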
The paper claims that Prodigy without EMA is steepest descent, although with a dynamically chosen step size $\eta_t$, and that the dynamical rule approximates a heuristic algorithm for achieving "escape velocity": choose a very small initial step size $\eta_0$, check at each step whether the weights $W_t$ have escaped the linearization of the loss around the initial weights $W_0$, and if not, double the step size according to $\eta_{t+1} = 2\eta_t$.
The paper then massages Prodigy's step size update as follows:
$$\eta_{t+1} = \max\!\left(\eta_t,\ \frac{G_t^\top(W_0 - W_t)}{\|G_t\|_1}\right) = \max\!\left(\eta_t,\ \frac{\|G_t\|_2 \times \|W_t - W_0\|_2 \times \cos\theta}{\|G_t\|_1}\right),$$
where $\theta$ denotes the angle between the gradient $G_t$ and the difference in weights $W_0 - W_t$. The paper then makes two assumptions: the gradient is a "dense" vector in $\mathbb{R}^n$, meaning that $\|G_t\|_2 / \|G_t\|_1 \approx 1/\sqrt{n}$, and $W_t$ is still close enough to the initialization $W_0$ that $\cos\theta \approx 1$. Under these assumptions, the update becomes simply $\eta_{t+1} \approx \max(\eta_t, \|W_t - W_0\|_{\mathrm{RMS}})$, where the root mean square (RMS) norm is defined via $\|\cdot\|_{\mathrm{RMS}} = \frac{1}{\sqrt{n}}\|\cdot\|_2$. Combined with $W_{t+1} = W_t - \eta_t \cdot \mathrm{sign}(G_t)$, this allows the authors to estimate the size of the weight change at step $t+1$:
$$\|W_{t+2} - W_{t+1}\|_{\mathrm{RMS}} = \eta_{t+1} \cdot \|\mathrm{sign}(G_{t+1})\|_{\mathrm{RMS}} \approx \max(\eta_t, \|W_t - W_0\|_{\mathrm{RMS}}) \ge \|W_t - W_0\|_{\mathrm{RMS}},$$
so each new step is at least as large as the distance already travelled from initialization, mirroring the doubling behaviour of the escape-velocity heuristic.
The paper then introduces the modular norm and its corresponding steepest descent algorithm. Given scalar coefficients $s_1,\dots,s_L > 0$ and norms $\|\cdot\|_1,\dots,\|\cdot\|_L$, the modular norm is defined as the mapping:
$$(W_1,\dots,W_L) \mapsto \max\{s_1\|W_1\|_1,\ \dots,\ s_L\|W_L\|_L\}.$$
The corresponding steepest descent problem is given by:
$$\arg\min_{\Delta W_1,\dots,\Delta W_L}\left[\sum_{l=1}^{L}\langle G_l, \Delta W_l\rangle + \frac{\lambda}{2}\max_{l=1,\dots,L} s_l^2\,\|\Delta W_l\|_l^2\right],$$
where $\langle\cdot,\cdot\rangle$ denotes the Frobenius inner product, and for each $l = 1,\dots,L$ the two matrices $\Delta W_l$ and $G_l$ have the same shape. With the global step size $\eta = \frac{1}{\lambda}\sum_{k=1}^{L}\frac{1}{s_k}\|G_k\|_k^\dagger$, the solution is given by:
$$\Delta W_l = -\frac{\eta}{s_l}\cdot \arg\max_{\|T\|_l = 1}\langle G_l, T\rangle \quad \text{for each layer } l = 1,\dots,L,$$
where $T$ ranges over matrices of the same shape as $W_l$.
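A sketch of how this modular update could look in numpy (illustrative, not the authors' implementation): each layer supplies its dual norm $\|G_l\|_l^\dagger$ and the unit-norm maximizer of $\langle G_l, T\rangle$, shown here for the two norms discussed above (sign geometry and spectral geometry):

```python
import numpy as np

def sign_lmo(G):
    """argmax over ||T||_{l1->linf} = 1 of <G, T> is sign(G)."""
    return np.sign(G)

def spectral_lmo(G):
    """argmax over ||T||_{l2->l2} = 1 of <G, T> is U V^T from the reduced SVD of G."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def modular_steepest_descent_step(grads, norms, scales, lam):
    """Steepest descent under the modular norm max_l s_l ||W_l||_l.

    norms[l] = (dual_norm_fn, maximizer_fn) for the norm chosen for layer l.
    Step size: eta = (1/lam) * sum_l (1/s_l) * ||G_l||_l^dagger.
    """
    eta = sum(dual(G) / s for G, (dual, _), s in zip(grads, norms, scales)) / lam
    return [-(eta / s) * lmo(G) for G, (_, lmo), s in zip(grads, norms, scales)]

# Example: sign geometry for one layer, spectral geometry for another.
rng = np.random.default_rng(0)
grads = [rng.standard_normal((8, 4)), rng.standard_normal((4, 16))]
norms = [(lambda G: np.abs(G).sum(), sign_lmo),                               # l1->linf
         (lambda G: np.linalg.svd(G, compute_uv=False).sum(), spectral_lmo)]  # l2->l2
updates = modular_steepest_descent_step(grads, norms, scales=[1.0, 1.0], lam=10.0)
```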
The paper notes that the $\ell_1\to\ell_p$ operator norm is the largest $\ell_p$ norm over the columns, and that the $\ell_p\to\ell_\infty$ operator norm is the largest dual-$\ell_p$ norm over the rows. For a matrix $A \in \mathbb{R}^{m\times n}$ with $m$ rows $\{\mathrm{row}_i(A)\}_{i=1}^m$ and $n$ columns $\{\mathrm{col}_j(A)\}_{j=1}^n$, and $1 \le p \le \infty$:
$$\|A\|_{\ell_1\to\ell_p} = \max_{j}\,\|\mathrm{col}_j(A)\|_p,$$
$$\|A\|_{\ell_p\to\ell_\infty} = \max_{i}\,\|\mathrm{row}_i(A)\|_{\frac{p}{p-1}}.$$
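A quick numerical sanity check of the two formulas for $p = 2$ (illustrative only): for random inputs, the closed-form values act as upper bounds, by the triangle and Hölder inequalities respectively.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 7))
p = 2

col_formula = np.linalg.norm(A, ord=p, axis=0).max()            # l1 -> lp
row_formula = np.linalg.norm(A, ord=p / (p - 1), axis=1).max()  # lp -> linf

xs = rng.standard_normal((1000, 7))

# ||A x||_p <= (max_j ||col_j||_p) * ||x||_1 for every x.
assert np.all(np.linalg.norm(A @ xs.T, ord=p, axis=0)
              <= col_formula * np.abs(xs).sum(axis=1) + 1e-12)

# ||A x||_inf <= (max_i ||row_i||_{p/(p-1)}) * ||x||_p for every x.
assert np.all(np.abs(A @ xs.T).max(axis=0)
              <= row_formula * np.linalg.norm(xs, ord=p, axis=1) + 1e-12)
```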
The paper concludes that equipping neural network layers with well-chosen norms could enable learning rate transfer across scale and, more broadly, more stable and faster training.