LAMB Optimizer and LAMBC Clipping
- LAMB optimizer is an adaptive method that scales layer-wise updates using a trust ratio to effectively handle large-batch training.
- LAMBC extends LAMB by explicitly clipping trust ratios, which stabilizes training and prevents oversized updates, especially in deep networks.
- Empirical results on CIFAR-10 and downsampled ImageNet show that optimal clipping (μ=1) yields improved convergence and accuracy in large-scale models.
The LAMB (Layer-wise Adaptive Moments optimizer for Batch training) optimizer is an adaptive stochastic optimization method specifically designed to address the challenges of large-batch neural network training. LAMB extends the Adam algorithm with a layer-wise rescaling strategy known as the “trust ratio,” computed as the norm of a layer’s weights divided by the norm of its Adam-style adaptive update. This normalization enables robust convergence and accuracy preservation at large batch sizes, a setting where conventional optimizers frequently degrade in performance. A key variant, LAMBC, incorporates explicit clipping of the trust ratio to further stabilize training by preventing pathological scaling.
1. Mathematical Formulation and Update Rules
Let the network weights at iteration $t$ be $w_t$, with stochastic gradient $g_t$, hyper-parameters $\beta_1, \beta_2 \in [0, 1)$, step size $\eta_t$, weight decay $\lambda$, and small constant $\epsilon$. LAMB's per-step updates proceed as follows:
- First and second moment estimates (Adam-style): $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$, $\quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
- Bias corrections: $\hat{m}_t = m_t / (1 - \beta_1^t)$, $\quad \hat{v}_t = v_t / (1 - \beta_2^t)$
- Adaptive direction: $r_t = \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$
- Layer-wise partitioning: Partition $w_t$ and $r_t$ by layers $i = 1, \dots, h$, giving $w_t^{(i)}$ and $r_t^{(i)}$.
- Trust ratio for each layer:
- $\tau_t^{(i)} = \dfrac{\phi(\lVert w_t^{(i)} \rVert)}{\lVert r_t^{(i)} + \lambda w_t^{(i)} \rVert}$, where $\phi$ is a scaling function (typically the identity).
- Weight update (with optional weight decay): $w_{t+1}^{(i)} = w_t^{(i)} - \eta_t \, \tau_t^{(i)} \left( r_t^{(i)} + \lambda w_t^{(i)} \right)$
The trust ratio ensures that the adaptation is layer-wise, addressing variations in the scale of weights and gradients across heterogeneous network architectures.
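The per-layer update above can be sketched in NumPy. This is a minimal illustrative implementation for a single layer's weight tensor, assuming $\phi$ is the identity; the function name and hyper-parameter defaults are assumptions for the example, not prescribed by the source.

```python
import numpy as np

def lamb_step(w, g, m, v, t, lr=1e-2, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """One LAMB update for a single layer's weights w (illustrative sketch)."""
    # Adam-style first and second moment estimates
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    # Bias corrections
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Adaptive direction
    r = m_hat / (np.sqrt(v_hat) + eps)
    # Update direction with decoupled weight decay
    update = r + weight_decay * w
    # Trust ratio: layer weight norm over update norm (phi = identity);
    # fall back to 1.0 when either norm is zero
    w_norm = np.linalg.norm(w)
    u_norm = np.linalg.norm(update)
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    w = w - lr * trust * update
    return w, m, v
```

In a full optimizer this function would be applied independently to each layer's parameter block, which is exactly what makes the adaptation layer-wise.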
2. Trust Ratio Clipping in LAMBC
LAMBC introduces explicit bounding of the trust ratio $\tau_t^{(i)}$ by parameters $\mu_l$ (lower bound) and $\mu_u$ (upper bound), yielding a clipped trust ratio: $\tilde{\tau}_t^{(i)} = \min\!\left(\max\!\left(\tau_t^{(i)}, \mu_l\right), \mu_u\right)$.
Empirically, $\mu_l = 0$ and $\mu_u = 1$ deliver strong performance, with the upper bound being the principal factor in restraining outsized updates. This modification prevents instability due to extreme trust ratio values, notably when $\lVert w_t^{(i)} \rVert \gg \lVert r_t^{(i)} + \lambda w_t^{(i)} \rVert$, which would otherwise yield excessively large steps and potential divergence (Fong et al., 2020).
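The clipping operation itself is a one-liner; a minimal sketch, with default bounds set to the empirically favored $\mu_l = 0$, $\mu_u = 1$ (the function name is an assumption for the example):

```python
def clip_trust_ratio(tau, mu_l=0.0, mu_u=1.0):
    """LAMBC clipping: bound the trust ratio to [mu_l, mu_u]."""
    return min(max(tau, mu_l), mu_u)

# An extreme ratio that would cause an oversized step is bounded:
clip_trust_ratio(40.0)   # -> 1.0
clip_trust_ratio(0.3)    # -> 0.3 (within bounds, unchanged)
```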
3. Algorithmic Workflow
LAMB and LAMBC maintain the same core workflow, differentiated by the use of trust ratio clipping in LAMBC. The following summarizes the key steps:
| Step | LAMB / LAMBC Operation | Notable Differences |
|---|---|---|
| 1 | Compute per-layer gradients | Same for both |
| 2 | Update moments, bias correction | Same |
| 3 | Compute adaptive direction $r_t$ | Same |
| 4 | Compute trust ratio $\tau_t^{(i)}$ | Same |
| 5 | LAMBC only: clip $\tau_t^{(i)}$ to $[\mu_l, \mu_u]$ | Clipping in LAMBC |
| 6 | Update $w_t^{(i)}$ | Same |
The distinction arises in Step 5, where LAMBC applies $\tilde{\tau}_t^{(i)} = \min\!\left(\max\!\left(\tau_t^{(i)}, \mu_l\right), \mu_u\right)$, directly impacting stability and convergence.
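The full six-step workflow for LAMBC can be sketched as a loop over named layers. This is an illustrative NumPy sketch, not a reference implementation: the dict-based parameter layout, function name, and hyper-parameter defaults are assumptions for the example.

```python
import numpy as np

def lambc_step(params, grads, state, t, lr=1e-2, beta1=0.9, beta2=0.999,
               eps=1e-6, wd=0.0, mu_l=0.0, mu_u=1.0):
    """One LAMBC step over a dict of per-layer arrays (illustrative sketch).

    `state` maps layer name -> {"m": ..., "v": ...}. Setting mu_u = float("inf")
    recovers plain LAMB. Updates params and state in place and returns them.
    """
    for name, w in params.items():
        g = grads[name]
        s = state[name]
        s["m"] = beta1 * s["m"] + (1 - beta1) * g        # step 2: moments
        s["v"] = beta2 * s["v"] + (1 - beta2) * g**2
        m_hat = s["m"] / (1 - beta1**t)                  # step 2: bias correction
        v_hat = s["v"] / (1 - beta2**t)
        r = m_hat / (np.sqrt(v_hat) + eps)               # step 3: adaptive direction
        update = r + wd * w
        w_norm = np.linalg.norm(w)
        u_norm = np.linalg.norm(update)
        trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0  # step 4
        trust = min(max(trust, mu_l), mu_u)              # step 5: LAMBC clipping
        params[name] = w - lr * trust * update           # step 6: weight update
    return params, state
```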
4. Empirical Performance and Hyper-Parameter Effects
Experiments on CIFAR-10 (ResNet-18, 80 epochs, learning rate $10^{-2}$) demonstrate LAMBC's advantage over baseline LAMB across various batch sizes: at batch sizes of 1000, 2000, and 3000, LAMBC with $\mu_u = 1$ achieves higher test accuracy than unclipped LAMB.
On downsampled ImageNet ($64 \times 64$, batch 400), LAMBC consistently yields superior generalization performance over the training epochs. The trust ratio upper bound $\mu_u = 1$ produces optimal results; increasing $\mu_u$ to $3$ or $5$ still maintains an advantage over no clipping, but performance degrades as $\mu_u$ grows further. This trend suggests that excessive trust-ratio flexibility undermines stability and accuracy (Fong et al., 2020).
5. Theoretical and Practical Considerations
- Stability: Unclipped trust ratios in LAMB can occasionally become arbitrarily large, particularly late in training or under large batch conditions. This destabilizes optimization, leading to overshoot or divergence. LAMBC mitigates this by enforcing explicit bounds, thereby facilitating smoother convergence behavior.
- Convergence: LAMBC consistently demonstrates more stable convergence, particularly in regimes of large batch training or when weight norms begin to dominate the adaptive update norm.
- Recommended hyper-parameters: Setting $\mu_u = 1$ is effective on vision tasks. The lower bound $\mu_l$ is left at zero; setting $\mu_l > 0$ may help prevent vanishing updates but is not strictly necessary for typical use cases.
- Scheduling: Allowing a larger $\mu_u$ during very early epochs and decaying it toward $1$ as training proceeds can sometimes be beneficial. Hyper-parameter tuning is best performed via a short "clip-grid" search, potentially on a proxy task (Fong et al., 2020).
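Such a decaying bound can be sketched as a simple schedule. The specific values here (a starting bound of $3$ decayed linearly to $1$ over the first 5 epochs) are illustrative assumptions, not settings from the source:

```python
def mu_u_schedule(epoch, warmup_epochs=5, mu_start=3.0, mu_final=1.0):
    """Hypothetical schedule: decay the upper clipping bound mu_u
    linearly from mu_start to mu_final over the first warmup_epochs,
    then hold it at mu_final for the rest of training."""
    if epoch >= warmup_epochs:
        return mu_final
    frac = epoch / warmup_epochs
    return mu_start + frac * (mu_final - mu_start)
```

The returned value would be passed as the `mu_u` argument of the optimizer step at each epoch.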
6. Applicability and Implementation Guidance
LAMB and LAMBC are best suited for large-batch, large-model training regimes, such as BERT or ResNet-50 trained with batch sizes in the thousands or beyond. The use of explicit trust ratio clipping in LAMBC is recommended to guard against excessive update magnitudes and to promote stable, efficient optimization. Future refinements could involve per-layer or adaptively scheduled bounds, but global scalar clipping already provides robust gains with minimal overhead relative to unmodified LAMB. The minimal intervention required to implement LAMBC makes it a practical choice for large-scale deep learning workflows (Fong et al., 2020).