LAMB Optimizer and LAMBC Clipping

Updated 3 February 2026
  • LAMB optimizer is an adaptive method that scales layer-wise updates using a trust ratio to effectively handle large-batch training.
  • LAMBC extends LAMB by explicitly clipping trust ratios, which stabilizes training and prevents oversized updates, especially in deep networks.
  • Empirical results on CIFAR-10 and downsampled ImageNet show that optimal clipping (μ=1) yields improved convergence and accuracy in large-scale models.

The LAMB (Layer-wise Adaptive Moments optimizer for Batch training) optimizer is an adaptive stochastic optimization method specifically designed to address the challenges of large-batch neural network training. LAMB extends the Adam algorithm with a layer-wise rescaling strategy known as the “trust ratio,” computed as the norm of a layer’s weights divided by the norm of its Adam-style adaptive update. This normalization enables robust convergence and accuracy preservation at large batch sizes, a setting where conventional optimizers frequently degrade in performance. A key variant, LAMBC, incorporates explicit clipping of the trust ratio to further stabilize training by preventing pathological scaling.

1. Mathematical Formulation and Update Rules

Let the network weights at iteration $t$ be $w_t \in \mathbb{R}^d$, with stochastic gradient $g_t = \nabla L(w_t)$, hyper-parameters $\beta_1, \beta_2 \in (0,1)$, step size $\eta_t$, weight decay $\lambda$, and small constant $\epsilon > 0$. LAMB's per-step updates proceed as follows:

  1. First and second moment estimates (Adam-style):
    • $m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$
    • $v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$
  2. Bias corrections:
    • $\hat{m}_t = \dfrac{m_t}{1-\beta_1^t}$
    • $\hat{v}_t = \dfrac{v_t}{1-\beta_2^t}$
  3. Adaptive direction:
    • $r_t = \dfrac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$
  4. Layer-wise partitioning: partition $w_t$ and $r_t$ by layers $i = 1, \dots, h$.
  5. Trust ratio for each layer:
    • $\gamma^{(i)}_t = \dfrac{\phi(\|w^{(i)}_t\|)}{\|r^{(i)}_t\|}$, where the scaling function $\phi$ is typically the identity, $\phi(x) = x$.
  6. Weight update (with optional weight decay):
    • $w^{(i)}_{t+1} = w^{(i)}_t - \eta_t\, \gamma^{(i)}_t \bigl(r^{(i)}_t + \lambda w^{(i)}_t\bigr)$

The trust ratio ensures that the adaptation is layer-wise, addressing variations in the scale of weights and gradients across heterogeneous network architectures.
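
As a concrete illustration, the six steps above can be sketched as a single-layer update in NumPy. This is a minimal sketch under stated assumptions, not a reference implementation: the function name, default hyper-parameters, and the fallback to a trust ratio of 1 when either norm is zero are choices of this example, and $\phi$ is taken as the identity.

```python
import numpy as np

def lamb_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.0):
    """One LAMB update for a single layer (illustrative sketch)."""
    # 1. Adam-style first and second moment estimates
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    # 2. Bias corrections
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # 3. Adaptive direction
    r = m_hat / (np.sqrt(v_hat) + eps)
    # 5. Trust ratio for this layer (phi = identity); fall back to 1
    #    if either norm vanishes -- a convention assumed here
    w_norm = np.linalg.norm(w)
    r_norm = np.linalg.norm(r)
    trust = w_norm / r_norm if w_norm > 0 and r_norm > 0 else 1.0
    # 6. Weight update with optional weight decay
    w = w - lr * trust * (r + weight_decay * w)
    return w, m, v
```

In a full optimizer this function would be applied independently to each layer's parameter tensor, which is exactly what makes the adaptation layer-wise.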

2. Trust Ratio Clipping in LAMBC

LAMBC introduces explicit bounding of $\gamma^{(i)}_t$ by parameters $\tau$ (lower bound) and $\mu$ (upper bound), yielding a clipped trust ratio:

$\tilde\gamma^{(i)}_t = \min\bigl(\max(\gamma^{(i)}_t, \tau), \mu\bigr)$

Empirically, $\tau = 0$ and $\mu = 1$ deliver strong performance, with the upper bound $\mu$ being the principal factor in restraining outsized updates. This modification prevents instability due to extreme trust-ratio values, notably when $\|w_t\| \gg \|r_t\|$, which would otherwise yield excessively large steps and potential divergence (Fong et al., 2020).
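
The clipping itself is a two-sided bound; a minimal sketch (the function name is hypothetical):

```python
def clip_trust_ratio(gamma, tau=0.0, mu=1.0):
    """Clip the trust ratio gamma into [tau, mu], as in LAMBC."""
    return min(max(gamma, tau), mu)
```

For example, a trust ratio of 3.7 would be clipped down to 1.0 at the default bounds, while a ratio of 0.5 passes through unchanged.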

3. Algorithmic Workflow

LAMB and LAMBC maintain the same core workflow, differentiated by the use of trust ratio clipping in LAMBC. The following summarizes the key steps:

| Step | LAMB / LAMBC Operation | Notable Differences |
|------|------------------------|---------------------|
| 1 | Compute per-layer gradients | Same for both |
| 2 | Update moments, apply bias corrections | Same |
| 3 | Compute $r^{(i)}_t$ | Same |
| 4 | Compute trust ratio $\gamma^{(i)}_t$ | Same |
| 5 | Clip $\gamma^{(i)}_t$ to $[0, \mu]$ | LAMBC only |
| 6 | Update $w^{(i)}_{t+1}$ | Same |

The distinction arises in Step 5, where LAMBC applies $\gamma^{(i)}_t \leftarrow \min(\max(\gamma^{(i)}_t, 0), \mu)$, directly impacting stability and convergence.
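
A toy numerical example of Step 5's effect: when a layer's weight norm greatly exceeds the norm of its adaptive direction, the unclipped trust ratio inflates the step by orders of magnitude, whereas clipping to $\mu = 1$ keeps the step proportional to $\|r\|$. The numbers below are hypothetical and chosen only to make the contrast visible.

```python
import numpy as np

# Hypothetical layer state: large weight norm, tiny adaptive update
w = np.full(1000, 10.0)    # layer weights
r = np.full(1000, 1e-3)    # Adam-style adaptive direction
lr = 0.01

gamma = np.linalg.norm(w) / np.linalg.norm(r)          # trust ratio = 10000
step_lamb = lr * gamma * np.linalg.norm(r)             # unclipped LAMB step size
step_lambc = lr * min(gamma, 1.0) * np.linalg.norm(r)  # LAMBC step with mu = 1
```

Here the unclipped step is four orders of magnitude larger than the clipped one, which is precisely the kind of outsized update LAMBC is designed to prevent.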

4. Empirical Performance and Hyper-Parameter Effects

Experiments on CIFAR-10 (ResNet-18, 80 epochs, learning rate $10^{-2}$) demonstrate LAMBC's advantage over baseline LAMB across various batch sizes:

  • Batch 1000: LAMBC @ $\mu{=}1$ achieves 87.71% vs. LAMB's 85.68% (+2.03%)
  • Batch 2000: 87.30% vs. 86.61% (+0.69%)
  • Batch 3000: 86.29% vs. 85.41% (+0.88%)

On downsampled ImageNet (64×64, batch 400), LAMBC consistently yields superior generalization over the training epochs. The upper bound $\mu = 1$ produces the best results; increasing $\mu$ to 3 or 5 still beats no clipping, but the advantage shrinks as $\mu$ grows further. This trend suggests that excessive trust-ratio flexibility undermines stability and accuracy (Fong et al., 2020).

5. Theoretical and Practical Considerations

  • Stability: Unclipped trust ratios in LAMB can occasionally become arbitrarily large, particularly late in training or under large batch conditions. This destabilizes optimization, leading to overshoot or divergence. LAMBC mitigates this by enforcing explicit bounds, thereby facilitating smoother convergence behavior.
  • Convergence: LAMBC consistently demonstrates more stable convergence, particularly in regimes of large batch training or when weight norms begin to dominate the adaptive update norm.
  • Recommended hyper-parameters: Setting $\mu \approx 1$ is effective on vision tasks. The lower bound $\tau$ is left at zero; setting $\tau > 0$ may help prevent vanishing updates but is not strictly necessary for typical use cases.
  • Scheduling: Allowing a larger $\mu$ during very early epochs and decaying it toward 1 as training proceeds can sometimes be beneficial. Hyper-parameter tuning is best performed via a short "clip-grid" search, ideally run on a cheap proxy task (Fong et al., 2020).
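
Such a schedule might look like the following sketch. The linear decay, the warm-up length, and the starting bound are illustrative assumptions of this example, not prescriptions from the paper:

```python
def mu_schedule(epoch, warmup_epochs=5, mu_start=5.0, mu_final=1.0):
    """Linearly decay the clipping bound mu toward mu_final over the
    first warmup_epochs epochs (illustrative choice of schedule)."""
    if epoch >= warmup_epochs:
        return mu_final
    frac = epoch / warmup_epochs
    return mu_start + frac * (mu_final - mu_start)
```

The returned value would be passed as the upper bound when clipping trust ratios at each step of the given epoch.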

6. Applicability and Implementation Guidance

LAMB and LAMBC are best suited for large-batch, large-model training regimes, such as BERT or ResNet-50 with batch sizes in the 8K–32K range. The use of explicit trust ratio clipping in LAMBC is recommended to guard against excessive update magnitudes and to promote stable, efficient optimization. Future refinements could involve per-layer or adaptive schedule bounds, but the global scalar clipping provides robust gains with minimal overhead relative to unmodified LAMB. The minimal intervention required to implement LAMBC renders it a practical choice for large-scale deep learning workflows (Fong et al., 2020).

