Lion: Evolved Sign Momentum Optimizer

Updated 2 March 2026
  • Lion is a sign-based stochastic optimization algorithm that combines momentum updates with sign normalization to train large-scale neural networks efficiently.
  • It leverages decoupled weight decay and two momentum sequences, achieving provably optimal convergence rates under heavy-tailed noise in both centralized and distributed settings.
  • Empirically, Lion outperforms adaptive optimizers such as AdamW with reduced memory overhead on vision, language, and diffusion tasks.

Lion (Evolved Sign Momentum) is a sign-based stochastic optimization algorithm developed by symbolic program search, specifically designed for efficient and scalable training of large neural networks. Distinguished by its use of sign normalization with momentum, decoupled weight decay, and extremely low memory overhead, Lion has demonstrated empirical superiority over adaptive optimizers such as AdamW on a diverse array of machine learning tasks, including vision, language, and diffusion models. It achieves provably optimal convergence rates under standard and heavy-tailed noise models, and operates robustly in both centralized and distributed environments.

1. Algorithmic Structure and Update Rules

Lion maintains two momentum sequences and leverages a coordinate-wise sign-based update, incorporating decoupled weight decay to solve unconstrained and \ell_\infty-constrained problems. The canonical form is as follows:

Let x_t \in \mathbb{R}^d be the parameter vector at iteration t, and let g_t be the (mini-batch) stochastic gradient.

The update rule is:

\begin{aligned}
v_t &= \beta_1 m_{t-1} + (1-\beta_1) g_t \\
m_t &= \beta_2 m_{t-1} + (1-\beta_2) g_t \\
x_{t+1} &= x_t - \eta \left[\, \mathrm{sign}(v_t) + \lambda x_t \,\right]
\end{aligned}

where:

  • \eta is the learning rate,
  • \lambda is the decoupled weight decay coefficient,
  • \beta_1, \beta_2 \in (0, 1) are the first and second momentum coefficients.
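The update above can be sketched in a few lines of NumPy; this is an illustrative implementation (variable names and defaults are ours, not taken from a reference codebase):

```python
import numpy as np

def lion_step(x, m, g, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.1):
    """One Lion update. x: parameters, m: stored momentum EMA, g: stochastic gradient."""
    v = beta1 * m + (1 - beta1) * g          # interpolation used only for the sign
    x_new = x - lr * (np.sign(v) + wd * x)   # sign update plus decoupled weight decay
    m_new = beta2 * m + (1 - beta2) * g      # slower EMA carried to the next iteration
    return x_new, m_new
```

Note that only m persists between steps; v is recomputed from m and the fresh gradient at every iteration.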

Simplified "single-momentum" variants use only

\begin{aligned}
v_1 &= g_1; \quad v_t = (1-\beta) v_{t-1} + \beta g_t \\
x_{t+1} &= x_t - \eta\, \mathrm{sign}(v_t)
\end{aligned}

This asymmetric use of two EMAs is central to Lion's performance for large-scale models (Chen et al., 2023, Yu et al., 7 Feb 2026).

The algorithm applies a coordinate-wise sign (i.e., each update coordinate is \pm 1) to the momentum estimate, producing updates of constant magnitude per parameter (and hence constant \ell_2 norm), independent of the underlying gradients (Chen et al., 2023).

2. Theoretical Foundations and Optimization Geometry

Lion can be interpreted as a principled solver for constrained and composite optimization problems. Formally, with decoupled weight decay \lambda > 0, it can be viewed as solving

\min_{x \in \mathbb{R}^d} f(x) \quad \text{subject to} \quad \|x\|_\infty \leq 1/\lambda

More generally, Lion fits into the family of schemes minimizing f(x) + \kappa^*(\lambda x), where \kappa^* is the convex conjugate of a regularization function \kappa (Chen et al., 2023).
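The implied \ell_\infty constraint can be checked numerically. In the toy experiment below (our construction: random \pm 1 directions stand in for the signed momentum, and the constants are arbitrary), iterates started far outside the ball contract until \|x\|_\infty \leq 1/\lambda:

```python
import numpy as np

rng = np.random.default_rng(0)
lr, wd = 0.01, 0.5                        # predicted bound: ||x||_inf <= 1/wd = 2.0
x = rng.normal(scale=10.0, size=1000)     # start far outside the ball
for _ in range(5000):
    sign = np.sign(rng.normal(size=x.shape))   # worst-case +/-1 update directions
    x = x - lr * (sign + wd * x)               # Lion-style step with decoupled decay
print(np.max(np.abs(x)))                       # settles at or below 1/wd = 2.0
```

The fixed point of |x| <= (1 - \eta\lambda)|x| + \eta is exactly 1/\lambda, matching the constrained-optimization reading above.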

In continuous time, Lion corresponds to flow dynamics governed by a Lyapunov function H(x, m):

H(x, m) = \alpha f(x) + \frac{\gamma}{\lambda} \phi^*(\lambda x) + c \left[ \phi^*(\lambda x) + \phi(m) - \lambda m^\top x \right]

where \phi(x) = \|x\|_1 and \phi^*(y) is its convex conjugate (the indicator of \|y\|_\infty \leq 1). Weight decay enforces contraction of the iterates onto the \ell_\infty-ball; monotonicity arises from the Lyapunov descent (Chen et al., 2023).

The role of the sign operator is to effect updates in the \ell_\infty geometry, which exhibits increased robustness to heavy gradient noise and ensures non-vanishing update magnitudes even when gradients are sparse or bursty.

3. Convergence Analysis and Rates

Lion exhibits optimal convergence rates in the stochastic nonconvex setting. Under L-smoothness and bounded-variance stochastic gradients, for T iterations the main results are:

  • Centralized Setting:

\frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla f(x_t)\|_1 = O(d^{1/2} T^{-1/4})

where d is the parameter count (Jiang et al., 17 Aug 2025, Jiang et al., 16 Jul 2025, Sun et al., 2023). This matches the lower bound for the \ell_2-norm convergence rate up to the \ell_1/\ell_2 norm gap of \sqrt{d} (Jiang et al., 17 Aug 2025).

  • Variance Reduced Variant (Lion-VR):

\frac{1}{T} \sum_{t=1}^T \mathbb{E}\|\nabla f(x_t)\|_1 = O(d^{1/2} T^{-1/3})

via STORM-style gradient difference correction (Jiang et al., 17 Aug 2025).

  • Distributed Setting (n nodes):

\frac{1}{T}\sum_t \mathbb{E}\|\nabla f(x_t)\|_1 = O(d^{1/2} (nT)^{-1/4}) \quad \text{(classic Lion)}

and O(d^{1/2}(nT)^{-1/3}) for Lion-VR (Jiang et al., 17 Aug 2025).

  • Communication-Efficient 1-Bit Compression:

Rates degrade gently, e.g., to O(\max\{d^{1/4} T^{-1/4}, d^{1/10} (nT)^{-1/5}\}) when unbiased sign compression is used in both communication directions (Jiang et al., 17 Aug 2025, Jiang et al., 16 Jul 2025).

Importantly, these rates hold under much weaker regularity assumptions than classical methods—only requiring local (weak) first- and second-order smoothness rather than global Lipschitz (Sun et al., 2023).

Under a generalized heavy-tailed noise model (with moment exponent p < 2), Lion attains minimax-optimal stationarity rates and is robust to the fat-tailed noise observed in LLM training (Yu et al., 7 Feb 2026).

4. Practical Properties and Implementation Recommendations

Memory efficiency is a primary strength: Lion stores only a single momentum vector between steps (the sign interpolation is recomputed on the fly), compared to AdamW's two moment estimates. Its per-step computation is simple (no division or square root), yielding a marked reduction in memory footprint and improved throughput (Chen et al., 2023).

Key hyperparameter regimes are:

  • Learning rate: \eta \sim d^{-1/2} T^{-3/4} (standard) or \eta \sim d^{-1/2} T^{-2/3} (variance-reduced) (Jiang et al., 17 Aug 2025, Jiang et al., 16 Jul 2025)
  • Momentum coefficients: default \beta_1 = 0.9, \beta_2 = 0.99, with \beta_1 = 0.95, \beta_2 = 0.98 recommended for improved stability in large-scale LMs (Chen et al., 2023)
  • Weight decay: set \lambda so that the product \eta\lambda matches the intended effective regularization; typically \lambda is 3–10\times larger than in AdamW (Chen et al., 2023).

Lion's larger per-step update norm (due to sign normalization) requires learning rates 3–10\times smaller than AdamW's to match the effective step magnitude. Warm-up and cosine decay scheduling are standard (Chen et al., 2023).
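Existing AdamW settings can thus be translated mechanically into Lion starting points. The sketch below assumes the mid-range of the 3–10\times guidance; the ratio and function name are our own, not from the cited papers:

```python
def adamw_to_lion(adamw_lr, adamw_wd, ratio=5.0):
    """Heuristic translation of AdamW hyperparameters into Lion starting points."""
    lion_lr = adamw_lr / ratio   # sign updates are larger, so shrink the learning rate
    lion_wd = adamw_wd * ratio   # keep lr * wd (effective regularization) roughly fixed
    return lion_lr, lion_wd
```

For example, AdamW at lr = 1e-3, wd = 0.1 maps to Lion lr = 2e-4, wd = 0.5; treat such values as tuning starting points, not final settings.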

5. Empirical Performance and Benchmarks

Lion surpasses or matches AdamW and Adafactor across domains:

  • ImageNet Classification (ViT and CoAtNet): up to +2\% top-1 accuracy and a 3–5\times reduction in compute or training steps (Chen et al., 2023).
  • Vision-Language Contrastive Learning: +2% in zero-shot ImageNet, with strong gains in transfer datasets (Chen et al., 2023).
  • Diffusion Models: reaches FID 4.7 with 2.3\times fewer steps on ImageNet-256 (Chen et al., 2023).
  • Language Modeling: reduces perplexity and training step count by 1.5–2\times; exhibits best-in-class performance on nanoGPT pretraining under heavy-tailed gradient noise (Yu et al., 7 Feb 2026).
  • Production Deployment: Used in Google Search Ads CTR model, leveraging bfloat16 momentum for further memory efficiency (Chen et al., 2023).

Large-batch performance is a distinguishing feature: Lion's accuracy gains scale with increasing batch size, exceeding AdamW's especially at batch sizes \geq 4096 (Chen et al., 2023).

In distributed experiments using unbiased-sign compression, loss and accuracy remain competitive while greatly reducing communication bandwidth (Jiang et al., 17 Aug 2025, Jiang et al., 16 Jul 2025).

6. Theoretical Insights and Generalizations

Lion's update rule emerges naturally from a discrete Lyapunov framework, enforcing descent on a surrogate energy function encoding both loss and regularization constraints (Chen et al., 2023). This interpretation enables principled extensions:

  • Replacement of the sign operator by a general subgradient \nabla\kappa to yield Lion-\kappa schemes targeting composite objectives (e.g., group-norm, sparsity, entropy, or Huber-type regularization) (Chen et al., 2023).
  • Stability and convergence without any need for global gradient or Hessian Lipschitz constants; i.e., it remains effective even under unbounded curvature far from optima (Sun et al., 2023).
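The Lion-\kappa generalization amounts to swapping the sign for another subgradient map. In this sketch (our naming), \kappa = \|\cdot\|_1 recovers standard Lion, while \kappa = \|\cdot\|_2 gives a normalized update:

```python
import numpy as np

def lion_kappa_step(x, m, g, grad_kappa, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.1):
    """Lion-kappa: replace sign(v) with a subgradient of a chosen regularizer kappa."""
    v = beta1 * m + (1 - beta1) * g
    x_new = x - lr * (grad_kappa(v) + wd * x)
    m_new = beta2 * m + (1 - beta2) * g
    return x_new, m_new

sign_map = np.sign                                  # kappa = l1 norm: standard Lion
l2_map = lambda v: v / (np.linalg.norm(v) + 1e-12)  # kappa = l2 norm: normalized step
```

Passing sign_map reproduces the canonical update from Section 1; other subgradient maps slot in without changing the surrounding loop.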

The sign-based, \ell_\infty-geometry update is adaptively robust to both heavy-tailed and outlier-prone noise, with non-Euclidean concentration effects yielding tighter \ell_1-norm stationarity (Yu et al., 7 Feb 2026). Lion's performance is thus particularly pronounced when gradient statistics deviate strongly from Gaussianity.

7. Limitations and Deployment Considerations

Lion’s improvements are marginal or statistically insignificant in some regimes:

  • Convolutional networks (e.g., ResNet-50) show little gain relative to AdamW (Chen et al., 2023).
  • At very small batch sizes (< 64), or when data quality is extremely high, the relative benefits diminish.
  • Stronger augmentation or intrinsic robustness of the loss landscape may also dampen the effect.

Grid search over \beta \in \{0.9, 0.5, 0.1, 0.01\} and \eta \in [1\times10^{-4}, 5\times10^{-3}] is advised for empirical tuning (Jiang et al., 16 Jul 2025).
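A minimal grid-search loop over those values might look as follows; the scoring function here is a hypothetical stand-in for a short training run with each setting:

```python
import itertools

def short_run_val_loss(beta, lr):
    """Hypothetical stand-in: in practice, train briefly with (beta, lr) and
    return validation loss. This dummy surface is for illustration only."""
    return abs(beta - 0.9) + abs(lr - 1e-3)

betas = [0.9, 0.5, 0.1, 0.01]
lrs = [5e-3, 1e-3, 5e-4, 1e-4]    # spans the advised [1e-4, 5e-3] range
best_beta, best_lr = min(itertools.product(betas, lrs),
                         key=lambda p: short_run_val_loss(*p))
```

Replacing the dummy surface with real short training runs turns this into a practical, if coarse, tuning pass.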

A plausible implication is that practitioners should scale Lion’s learning rate and weight decay differently than AdamW, and verify stability particularly in new architectures or under extreme data or batch size regimes.

