Lion: Evolved Sign Momentum Optimizer

Updated 2 March 2026

Lion is a sign-based stochastic optimization algorithm defined by momentum updates and sign normalization to efficiently train large-scale neural networks.
It leverages decoupled weight decay and two momentum sequences to achieve provably optimal convergence under heavy-tailed and centralized/distributed settings.
Empirical results reveal Lion outperforms adaptive optimizers like AdamW with reduced memory overhead in vision, language, and diffusion tasks.

Lion (Evolved Sign Momentum) is a sign-based stochastic optimization algorithm developed by symbolic program search, specifically designed for efficient and scalable training of large neural networks. Distinguished by its use of sign normalization with momentum, decoupled weight decay, and extremely low memory overhead, Lion has demonstrated empirical superiority over adaptive optimizers such as AdamW on a diverse array of machine learning tasks, including vision, language, and diffusion models. It achieves provably optimal convergence rates under standard and heavy-tailed noise models, and operates robustly in both centralized and distributed environments.

1. Algorithmic Structure and Update Rules

Lion maintains two momentum sequences and leverages a coordinate-wise sign-based update, incorporating decoupled weight decay to solve unconstrained and $\ell_\infty$ -constrained problems. The canonical form is as follows:

Let $x_t \in \mathbb{R}^d$ be the parameter vector at iteration $t$ ; $g_t$ is the (mini-batch) stochastic gradient.

The update rule is: $\begin{aligned} v_t &= \beta_1 m_{t-1} + (1-\beta_1) g_t \ m_t &= \beta_2 m_{t-1} + (1-\beta_2) g_t \ x_{t+1} &= x_t - \eta [\, \mathrm{sign}(v_t) + \lambda x_t \,] \end{aligned}$ where:

$\eta$ is the learning rate,
$\lambda$ is the decoupled weight decay,
$\beta_1, \beta_2 \in (0, 1)$ are the first and second momentum coefficients.

Simplified "single-momentum" variants use only

$\begin{aligned} v_1 &= g_1; \quad v_t = (1-\beta) v_{t-1} + \beta g_t \ x_{t+1} &= x_t - \eta\, \mathrm{sign}(v_t) \end{aligned}$

This asymmetric use of two EMAs is central to Lion's performance for large-scale models (Chen et al., 2023, Yu et al., 7 Feb 2026).

The algorithm applies a coordinate-wise sign (i.e., $\{\pm1\}$ ) to the momentum estimate, producing updates with constant $\ell_2$ norm per parameter, independent of the underlying gradients (Chen et al., 2023).

2. Theoretical Foundations and Optimization Geometry

Lion can be interpreted as a principled solver for constrained and composite optimization problems. Formally, with decoupled weight decay $\lambda > 0$ , it can be viewed as solving: $\min_{x \in \mathbb{R}^d} f(x) \quad \text{subject to} \quad \|x\|_\infty \leq 1/\lambda$ More generally, Lion fits into the family of schemes minimizing $f(x) + \kappa^*(\lambda x)$ , where $\kappa^*$ is the convex conjugate of a regularization function $\kappa$ (Chen et al., 2023).

In continuous time, Lion corresponds to flow dynamics governed by a Lyapunov function $H(x, m)$ : $H(x, m) = \alpha f(x) + \frac{\gamma}{\lambda} \phi^*(\lambda x) + c[\phi^*(\lambda x) + \phi(m) - \lambda m^\top x]$ where $\phi(x) = \|x\|_1$ and $\phi^*(y)$ is its convex conjugate (indicator for $\|y\|_\infty \leq 1$ ). Weight decay enforces contraction of iterates to the $\ell_\infty$ -ball; monotonicity arises from the Lyapunov descent (Chen et al., 2023).

The role of the sign operator is to effect updates in the $\ell_\infty$ geometry, which exhibits increased robustness to heavy gradient noise and ensures non-vanishing update magnitudes even when gradients are sparse or bursty.

3. Convergence Analysis and Rates

Lion exhibits optimal convergence rates in the stochastic nonconvex setting. Under $L$ -smoothness and bounded-variance stochastic gradients, and for $T$ iterations, the main results are:

Centralized Setting:

$\frac{1}{T}\sum_{t=1}^T \mathbb{E}\|\nabla f(x_t)\|_1 = O(d^{1/2} T^{-1/4})$

where $d$ is the parameter count (Jiang et al., 17 Aug 2025, Jiang et al., 16 Jul 2025, Sun et al., 2023). This matches the lower bound for $\ell_2$ -norm convergence rate up to the $\ell_1/\ell_2$ norm gap ( $\sqrt{d}$ ) (Jiang et al., 17 Aug 2025).

Variance Reduced Variant (Lion-VR):

$\frac{1}{T} \sum_{t=1}^T \mathbb{E} \| \nabla f(x_t) \|_1 = O(d^{1/2} T^{-1/3})$

via STORM-style gradient difference correction (Jiang et al., 17 Aug 2025).

Distributed Setting ( $n$ nodes):

$\frac{1}{T}\sum_t \mathbb{E}\|\nabla f(x_t)\|_1 = O(d^{1/2} (nT)^{-1/4}) \quad \text{(classic Lion)}$

and $O(d^{1/2}(nT)^{-1/3})$ for Lion-VR (Jiang et al., 17 Aug 2025).

Communication-Efficient 1-Bit Compression:

Rates degrade gently, e.g., to $O(\max\{d^{1/4} T^{-1/4}, d^{1/10} (nT)^{-1/5}\})$ when unbiased sign compression is used in both communication directions (Jiang et al., 17 Aug 2025, Jiang et al., 16 Jul 2025).

Importantly, these rates hold under much weaker regularity assumptions than classical methods—only requiring local (weak) first- and second-order smoothness rather than global Lipschitz (Sun et al., 2023).

Under a generalized heavy-tailed noise model (with moment exponent $p<2$ ), Lion attains minimax-optimal stationarity rates and is robust to fat-tailed noise observed in LLMs (Yu et al., 7 Feb 2026).

4. Practical Properties and Implementation Recommendations

Memory efficiency is a primary strength: Lion maintains only two (or even one, in simplified scenarios) momentum vectors, compared to the two-moment structure of AdamW. Its computation per step is simple (no division or square root), yielding marked reductions in memory footprint and improved throughput (Chen et al., 2023).

Key hyperparameter regimes are:

$\eta \sim d^{-1/2} T^{-3/4}$ (standard) or $d^{-1/2} T^{-2/3}$ (variance-reduced) (Jiang et al., 17 Aug 2025, Jiang et al., 16 Jul 2025)
Momentum coefficients: default $\beta_1 = 0.9, \beta_2 = 0.99$ , with tuning toward higher values (e.g., $\beta_1=0.95, \beta_2=0.98$ ) for improved stability in large-scale LMs (Chen et al., 2023)
Weight decay $\lambda$ set so that $\eta \lambda$ aligns with effective regularization; typically $\lambda \sim 3$ -- $10\times$ larger than in AdamW (Chen et al., 2023).

Lion's larger per-step update norm (due to sign normalization) requires learning rates $3$– $10\times$ smaller than AdamW to match effective step magnitude. Warm-up and cosine decay scheduling are standard (Chen et al., 2023).

5. Empirical Performance and Benchmarks

Lion surpasses or matches AdamW and Adafactor across domains:

ImageNet Classification (ViT and CoAtNet): up to $+2\%$ top-1 accuracy and $3$– $5\times$ reduction in compute or steps (Chen et al., 2023).
Vision-Language Contrastive Learning: +2% in zero-shot ImageNet, with strong gains in transfer datasets (Chen et al., 2023).
Diffusion Models: Reaches FID 4.7 with $2.3\times$ fewer steps on ImageNet-256 (Chen et al., 2023).
Language Modeling: Reduces perplexity and training step count by $1.5$– $2\times$ ; exhibits best-in-class performance on nanoGPT pretraining under heavy-tailed gradient noise (Yu et al., 7 Feb 2026).
Production Deployment: Used in Google Search Ads CTR model, leveraging bfloat16 momentum for further memory efficiency (Chen et al., 2023).

Large-batch performance is a distinguishing feature; Lion's accuracy gains scale with increasing batch size, exceeding AdamW especially at batch sizes $\geq 4096$ (Chen et al., 2023).

In distributed experiments using unbiased-sign compression, loss and accuracy remain competitive while greatly reducing communication bandwidth (Jiang et al., 17 Aug 2025, Jiang et al., 16 Jul 2025).

6. Theoretical Insights and Generalizations

Lion's update rule emerges naturally from a discrete Lyapunov framework, enforcing descent on a surrogate energy function encoding both loss and regularization constraints (Chen et al., 2023). This interpretation enables principled extensions:

Replacement of the sign operator by a general subgradient $\nabla \kappa$ to yield Lion- $\kappa$ schemes targeting composite objectives (e.g., group norm, sparsity, entropy, Huber-type regularization) (Chen et al., 2023).
Stability and convergence without any need for global gradient or Hessian Lipschitz constants; i.e., it remains effective even under unbounded curvature far from optima (Sun et al., 2023).

The sign-based, $\ell_\infty$ -geometry update is adaptively robust to both heavy-tailed and outlier-prone noise, with non-Euclidean concentration effects yielding tighter $\ell_1$ -norm stationarity (Yu et al., 7 Feb 2026). Lion's performance is thus particularly pronounced when gradient statistics deviate strongly from Gaussianity.

7. Limitations and Deployment Considerations

Lion’s improvements are marginal or statistically insignificant in some regimes:

Convolutional networks (e.g., ResNet-50) show little gain relative to AdamW (Chen et al., 2023).
In extremely small batch size ( $<64$ ), or when data quality is extremely high, the relative benefits diminish.
Stronger augmentation or intrinsic robustness of the loss landscape may also dampen the effect.

Grid-search over $\beta\in\{0.9,0.5,0.1,0.01\}$ and $\eta$ in $[5\times10^{-3}, 1\times10^{-4}]$ is advised for empirical tuning (Jiang et al., 16 Jul 2025).

A plausible implication is that practitioners should scale Lion’s learning rate and weight decay differently than AdamW, and verify stability particularly in new architectures or under extreme data or batch size regimes.

Key References:

Symbolic Discovery and Benchmarking: (Chen et al., 2023)
Theoretical Lyapunov and Composite View: (Chen et al., 2023)
Weak Smoothness and Acceleration: (Sun et al., 2023)
Centralized/Distributed Analysis: (Jiang et al., 17 Aug 2025, Jiang et al., 16 Jul 2025)
Heavy-Tailed Noise Theory: (Yu et al., 7 Feb 2026)