Origin of the U-shaped performance-vs.-batch-size curve in LLM pretraining
Determine why evaluation loss exhibits a U-shaped dependence on batch size when training decoder-only Transformer language models to Chinchilla-optimality on large text corpora (e.g., C4), and identify the trade-offs that govern the choice of optimal batch size for such large-scale pretraining.
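To make the experimental setup concrete, the sketch below plans a fixed-budget batch-size sweep: with the token budget held at a Chinchilla-optimal value, each increase in batch size proportionally reduces the number of optimizer steps, so small batches trade step count against gradient noise and large batches do the reverse. The 20-tokens-per-parameter rule, sequence length, model size, and batch sizes are illustrative assumptions, not settings taken from the referenced paper.

```python
"""Minimal sketch of a fixed-budget batch-size sweep, assuming the
~20-tokens-per-parameter Chinchilla heuristic; the model size, sequence
length, and batch sizes are illustrative, not values from the paper."""

TOKENS_PER_PARAM = 20   # Chinchilla-style compute-optimal heuristic (assumption)
SEQ_LEN = 2048          # tokens per training sequence (assumption)


def plan_sweep(num_params: float, batch_sizes: list[int]) -> None:
    """For each batch size (in sequences), report how many optimizer steps
    the fixed Chinchilla-optimal token budget allows. The eval-loss curve
    itself must come from actually training each configuration."""
    token_budget = TOKENS_PER_PARAM * num_params
    print(f"model: {num_params / 1e6:.0f}M params, "
          f"budget: {token_budget / 1e9:.2f}B tokens")
    for bs in batch_sizes:
        tokens_per_step = bs * SEQ_LEN
        steps = token_budget / tokens_per_step
        print(f"  batch {bs:>5} seqs ({tokens_per_step / 1e6:.2f}M tok/step)"
              f" -> {steps:>10,.0f} steps")


if __name__ == "__main__":
    # Small batches buy many noisy updates; large batches buy few,
    # low-noise updates -- the two ends of the trade-off behind the U-shape.
    plan_sweep(num_params=124e6, batch_sizes=[16, 64, 256, 1024, 4096])
```

Training each resulting configuration and plotting final evaluation loss against batch size is what exposes the U-shape whose origin the problem asks about.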
References
The observed U-shaped relationship between loss and batch size raises interesting questions. Why does this U-shape occur, and what trade-offs determine the optimal batch size? These questions remain open for further investigation and could provide valuable insights into improving the efficiency of large-scale LLM training.
                — Rethinking Conventional Wisdom in Machine Learning: From Generalization to Scaling
                
                (arXiv:2409.15156 - Xiao, 23 Sep 2024), in the subsection "Does Small Batch Size Perform Better?" of the section "Is Regularization Needed?"