- The paper demonstrates through experiments and theoretical analysis that critical batch size in large language model pre-training scales primarily with data size, not model size.
- Experiments on models up to 1.2B parameters using the C4 dataset measured critical batch size by observing steps to reach target validation loss across varied batch sizes and hyperparameters.
- The study presents scaling laws showing critical batch size increases with data size, enabling practitioners to optimize pre-training efficiency by adjusting data parallelism based on available data.
Analysis of Critical Batch Size Scaling in Pre-training LLMs
The paper “How Does Critical Batch Size Scale in Pre-training?” addresses a crucial aspect of optimizing large-scale neural network training: the critical batch size (CBS). The authors conduct a thorough exploration of CBS in the context of pre-training autoregressive transformer-based LLMs in the lineage of Vaswani et al. (2017) and Radford et al. (2018).
Overview and Objectives
The primary goal of this research is to understand how CBS scales during neural network pre-training. The CBS is the batch size beyond which further increases yield diminishing returns: below it, enlarging the batch reduces the number of optimization steps to a target loss roughly in proportion; above it, additional data parallelism saves few steps while consuming more tokens per step. Understanding this threshold enables more effective data-parallelism strategies, which are essential for using modern hardware accelerators efficiently, and it requires practitioners to weigh the speedups from larger batches against the diminishing returns past the CBS.
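To make this trade-off concrete, the sketch below assumes the hyperbolic relationship between batch size and steps-to-target commonly used in gradient-noise-scale analyses; the constants S_min and B_crit are hypothetical, not values from the paper.

```python
# Assumed functional form (common in gradient-noise-scale analyses, not the
# paper's exact fit): steps to reach a target loss, S(B) = S_min * (1 + B_crit / B).
# Below B_crit, doubling the batch roughly halves the step count; above it,
# steps flatten out while the total examples consumed per run keep growing.
S_min, B_crit = 10_000, 512            # hypothetical constants

def steps_to_target(batch_size: float) -> float:
    """Predicted optimizer steps needed to hit the target validation loss."""
    return S_min * (1.0 + B_crit / batch_size)

for B in [64, 128, 256, 512, 1024, 2048]:
    S = steps_to_target(B)
    examples = S * B                   # total examples consumed by the run
    print(f"B={B:5d}  steps={S:9.0f}  examples={examples:12.0f}")
```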
Experimental Approach
In seeking to characterize CBS, the authors pre-train a series of LLMs ranging from 85 million to 1.2 billion parameters on the C4 dataset. Through extensive hyper-parameter sweeps over batch size, learning rate, and momentum, they aim to disentangle the effects of model size and data size on the growth of CBS. The setup uses Chinchilla-style compute-optimal token budgets together with fixed target validation losses: optimization efficiency and CBS are measured from the number of steps each batch size requires to reach those predefined targets.
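A minimal sketch of how CBS might be estimated from such measurements, assuming the same hyperbolic form and an arbitrary 20% step-overhead threshold; the measured points, fitting form, and threshold are all illustrative assumptions, not the authors' procedure.

```python
import numpy as np

# Hypothetical (batch size, steps-to-target-loss) measurements for one model and loss target.
batch_sizes = np.array([64, 128, 256, 512, 1024, 2048], dtype=float)
steps       = np.array([73_000, 41_000, 26_000, 18_000, 14_000, 12_000], dtype=float)

# Fit the assumed form S(B) = S_min * (1 + B_noise / B), which is linear in 1/B:
# S = a + b / B with S_min = a and B_noise = b / a.
X = np.column_stack([np.ones_like(batch_sizes), 1.0 / batch_sizes])
(a, b), *_ = np.linalg.lstsq(X, steps, rcond=None)
S_min, B_noise = a, b / a
print(f"S_min ~ {S_min:.0f} steps, B_noise ~ {B_noise:.0f}")

# One possible operationalization of CBS (threshold assumed, not the paper's):
# the batch size at which the fitted curve needs 20% more steps than S_min.
overhead = 0.2
print(f"estimated CBS at {overhead:.0%} step overhead ~ {B_noise / overhead:.0f}")
```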
Key Findings
The authors report several key findings:
- Dependency on Data Size: The results demonstrate that CBS scales primarily with data size rather than model size. Theoretical analysis supports this observation, attributing the approximate invariance of CBS to model size to infinite-width limits of neural networks and to an analysis within a least-squares regression framework.
- Scaling Laws: The paper presents scaling laws describing how CBS relates to model size and data size. Increasing model size alone does not significantly change CBS, whereas scaling up data size consistently increases it. This suggests that training duration, measured in processed tokens, is a more substantial driver of CBS than parameter count alone (a simple fitting sketch follows this list).
- CBS and Model Configuration: Under compute-optimal configurations, increasing depth or width affects CBS similarly, so the practical implication is that efficiency gains do not depend on deeper architectures in particular and can instead be achieved by scaling data size.
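The scaling-law fit referenced in the second bullet can be illustrated with a simple log-log regression; the power-law form CBS(D) ≈ c · D^α and every data point below are assumptions for illustration, not coefficients reported in the paper.

```python
import numpy as np

# Hypothetical (training tokens, measured CBS) pairs; CBS here is in tokens per batch.
tokens = np.array([4e9, 8e9, 16e9, 32e9, 64e9])
cbs    = np.array([0.5e6, 0.7e6, 1.0e6, 1.4e6, 2.0e6])

# Fit log CBS = log c + alpha * log D by ordinary least squares.
A = np.column_stack([np.ones_like(tokens), np.log(tokens)])
(log_c, alpha), *_ = np.linalg.lstsq(A, np.log(cbs), rcond=None)
print(f"alpha ~ {alpha:.2f}, c ~ {np.exp(log_c):.3g}")

# Extrapolate to a larger data budget (purely illustrative).
D_new = 128e9
print(f"predicted CBS at {D_new:.0e} tokens: {np.exp(log_c) * D_new ** alpha:.3g}")
```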
Theoretical Insights
The theoretical examination shows that, for a fixed model size, CBS increases as more data is introduced, a finding supported by an analysis of mini-batch SGD in high-dimensional linear regression (a toy simulation in this spirit appears below). Together with the empirical scaling laws, this supports using CBS estimates to set the degree of data parallelism and thereby improve computational efficiency during pre-training.
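The simulation below is a toy version of that setting, not the paper's analysis: mini-batch SGD on a synthetic high-dimensional least-squares problem, with a capped linear learning-rate scaling rule standing in for the per-batch-size hyper-parameter tuning described earlier. Step counts drop roughly in proportion to batch size at first and then saturate.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, noise = 200, 20_000, 0.1                    # dimension, samples, label noise
w_true = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ w_true + noise * rng.normal(size=n)

def steps_to_loss(batch_size, target=0.05, max_steps=5_000):
    """Steps of mini-batch SGD until the full-data MSE drops below `target`."""
    lr = min(0.002 * batch_size, 0.2)             # capped linear LR scaling (assumed rule)
    w = np.zeros(d)
    for step in range(1, max_steps + 1):
        idx = rng.integers(0, n, size=batch_size)  # sample a mini-batch with replacement
        Xb, yb = X[idx], y[idx]
        w -= lr * Xb.T @ (Xb @ w - yb) / batch_size
        if np.mean((X @ w - y) ** 2) < target:
            return step
    return max_steps

for B in [1, 4, 16, 64, 256, 1024]:
    print(f"batch={B:5d}  steps={steps_to_loss(B):5d}")
```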
Implications and Future Directions
These findings have both practical and theoretical value. By clarifying how CBS scales with data size, the research aids in optimizing large-scale pre-training: given limited compute, practitioners can set the degree of data parallelism to maximize training efficiency, which matters in real-world settings where data availability and compute resources vary. The theoretical contribution lies in a formalization of CBS and in insights into the behavior of overparameterized models.
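As a back-of-the-envelope example of that adjustment (all numbers hypothetical), an estimated CBS puts a ceiling on how much data parallelism still buys meaningful step reductions:

```python
# Hypothetical sizing exercise, not guidance from the paper.
cbs_tokens       = 2_000_000   # estimated critical batch size, in tokens
seq_len          = 2048        # tokens per training sequence
micro_batch_seqs = 8           # sequences per device per step

tokens_per_device = seq_len * micro_batch_seqs
max_useful_replicas = cbs_tokens // tokens_per_device
print(f"beyond ~{max_useful_replicas} data-parallel replicas "
      f"(global batch ~{max_useful_replicas * tokens_per_device:,} tokens), "
      f"additional replicas save few optimization steps")
```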
Potential future work could explore more diverse data sources and configurations where model size/data size ratios exceed those used in Chinchilla-optimal settings. Additionally, understanding CBS in heterogeneous datasets would further enhance strategies for efficient large-model pre-training.
This paper advances our understanding of the practicalities behind model pre-training and optimization, offering insights that are highly relevant given today's expansive use of large autoregressive LLMs.