Origin of the U-shaped performance vs. batch size in LLM pretraining

Determine why evaluation loss exhibits a U-shaped dependence on batch size when training decoder-only Transformer language models to Chinchilla-optimality on large text corpora (e.g., C4), and identify the trade-offs that govern the choice of optimal batch size for such large-scale pretraining.

Background

In generalization-centric settings (e.g., ImageNet classification), small batch sizes often improve test performance through implicit regularization, even though they yield worse training loss. The paper investigates whether this conventional wisdom transfers to LLM pretraining: after a learning-rate search at each batch size for 19M- and 151M-parameter decoder-only Transformers trained to Chinchilla-optimality on C4, the best evaluation loss as a function of batch size follows a U-shaped curve. Both excessively small and excessively large batch sizes lead to suboptimal performance.
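As a rough illustration of this protocol (not the authors' code), the Python sketch below runs a batch-size by learning-rate grid sweep and keeps the best evaluation loss per batch size; plotting that dictionary against batch size is what exposes the U-shape. The function `train_and_eval` is a hypothetical hook standing in for a full pretraining-plus-evaluation run on C4, and the token budget assumes the commonly cited Chinchilla rule of roughly 20 tokens per parameter.

```python
# Sketch of the sweep protocol: for each batch size, search over learning
# rates and keep the best evaluation loss. This is a hypothetical harness,
# NOT the paper's actual training code.

from typing import Dict


def train_and_eval(batch_size: int, learning_rate: float,
                   n_params: int, n_tokens: int) -> float:
    """Placeholder for a full pretraining + evaluation run.

    A real implementation would train a decoder-only Transformer with
    `n_params` parameters on `n_tokens` tokens of C4 using the given
    batch size and learning rate, then return evaluation loss.
    """
    raise NotImplementedError("plug in an actual pretraining run here")


def sweep(n_params: int = 19_000_000) -> Dict[int, float]:
    # Chinchilla-style token budget: roughly 20 tokens per parameter.
    n_tokens = 20 * n_params

    batch_sizes = [16, 32, 64, 128, 256, 512, 1024]       # in sequences (illustrative grid)
    learning_rates = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2]       # illustrative LR grid

    best_loss_per_batch: Dict[int, float] = {}
    for bs in batch_sizes:
        # Tune the learning rate separately for every batch size, so the
        # comparison across batch sizes is not confounded by a poor LR choice.
        losses = [train_and_eval(bs, lr, n_params, n_tokens)
                  for lr in learning_rates]
        best_loss_per_batch[bs] = min(losses)
    return best_loss_per_batch
```

The key design choice mirrored here is that the learning rate is re-tuned for each batch size before comparing, so the reported U-shape reflects the best achievable loss per batch size rather than an artifact of a single fixed learning rate.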

This observation challenges the straightforward application of small-batch-favoring heuristics to LLM pretraining and motivates a deeper understanding of optimization noise and efficiency trade-offs that determine the optimal batch size at scale.

References

The observed U-shaped relationship between loss and batch size raises interesting questions. Why does this U-shape occur, and what trade-offs determine the optimal batch size? These questions remain open for further investigation and could provide valuable insights into improving the efficiency of large-scale LLM training.

Rethinking Conventional Wisdom in Machine Learning: From Generalization to Scaling (2409.15156 - Xiao, 23 Sep 2024) in Subsection "Does Small Batch Size Perform Better?" (Section "Is Regularization Needed?")