- The paper demonstrates through experiments and theoretical analysis that critical batch size in large language model pre-training scales primarily with data size, not model size.
- Experiments on models up to 1.2B parameters using the C4 dataset measured critical batch size by observing steps to reach target validation loss across varied batch sizes and hyperparameters.
- The study presents scaling laws showing critical batch size increases with data size, enabling practitioners to optimize pre-training efficiency by adjusting data parallelism based on available data.
Analysis of Critical Batch Size Scaling in Pre-training LLMs
The paper “How Does Critical Batch Size Scale in Pre-training?” addresses a crucial aspect of optimizing large-scale neural network training: the critical batch size (CBS). The authors conduct a thorough exploration of CBS in the context of pre-training autoregressive transformer-based LLMs in the lineage of Vaswani et al. (2017) and Radford et al. (2018).
Overview and Objectives
The primary goal of this research is to understand how CBS scales during neural network pre-training. The CBS is the batch size beyond which further increases yield diminishing returns: below it, enlarging the batch reduces the number of optimization steps to a target loss roughly in proportion; above it, additional data parallelism saves few steps while consuming more tokens per step. Understanding this threshold enables more effective data-parallelism strategies, which are essential for using modern hardware accelerators efficiently, and it requires practitioners to weigh the speedups from larger batches against the diminishing returns past the CBS.
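To make this trade-off concrete, the sketch below assumes the hyperbolic relationship between batch size and steps-to-target commonly used in gradient-noise-scale analyses; the constants S_min and B_crit are hypothetical, not values from the paper.

```python
# Assumed functional form (common in gradient-noise-scale analyses, not the
# paper's exact fit): steps to reach a target loss, S(B) = S_min * (1 + B_crit / B).
# Below B_crit, doubling the batch roughly halves the step count; above it,
# steps flatten out while the total examples consumed per run keep growing.
S_min, B_crit = 10_000, 512            # hypothetical constants

def steps_to_target(batch_size: float) -> float:
    """Predicted optimizer steps needed to hit the target validation loss."""
    return S_min * (1.0 + B_crit / batch_size)

for B in [64, 128, 256, 512, 1024, 2048]:
    S = steps_to_target(B)
    examples = S * B                   # total examples consumed by the run
    print(f"B={B:5d}  steps={S:9.0f}  examples={examples:12.0f}")
```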
Experimental Approach
In seeking to characterize CBS, the authors pre-train a series of LLMs ranging from 85 million to 1.2 billion parameters on the C4 dataset. Through extensive hyper-parameter sweeps over batch size, learning rate, and momentum, they aim to disentangle the effects of model size and data size on the growth of CBS. The setup uses Chinchilla-style compute-optimal token budgets together with fixed target validation losses: optimization efficiency and CBS are measured from the number of steps each batch size requires to reach those predefined targets.
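A minimal sketch of how CBS might be estimated from such measurements, assuming the same hyperbolic form and an arbitrary 20% step-overhead threshold; the measured points, fitting form, and threshold are all illustrative assumptions, not the authors' procedure.

```python
import numpy as np

# Hypothetical (batch size, steps-to-target-loss) measurements for one model and loss target.
batch_sizes = np.array([64, 128, 256, 512, 1024, 2048], dtype=float)
steps       = np.array([73_000, 41_000, 26_000, 18_000, 14_000, 12_000], dtype=float)

# Fit the assumed form S(B) = S_min * (1 + B_noise / B), which is linear in 1/B:
# S = a + b / B with S_min = a and B_noise = b / a.
X = np.column_stack([np.ones_like(batch_sizes), 1.0 / batch_sizes])
(a, b), *_ = np.linalg.lstsq(X, steps, rcond=None)
S_min, B_noise = a, b / a
print(f"S_min ~ {S_min:.0f} steps, B_noise ~ {B_noise:.0f}")

# One possible operationalization of CBS (threshold assumed, not the paper's):
# the batch size at which the fitted curve needs 20% more steps than S_min.
overhead = 0.2
print(f"estimated CBS at {overhead:.0%} step overhead ~ {B_noise / overhead:.0f}")
```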
Key Findings
The authors report several key findings:
- Dependency on Data Size: The results demonstrate that CBS scales primarily with data size rather than model size. Theoretical analysis supports this observation, attributing the approximate invariance of CBS to model size to infinite-width limits of neural networks and to an analysis within a least-squares regression framework.
- Scaling Laws: The paper presents scaling laws describing how CBS relates to model size and data size. Increasing model size alone does not significantly change CBS, whereas scaling up data size consistently increases it. This suggests that training duration, measured in processed tokens, is a more substantial driver of CBS than parameter count alone (a simple fitting sketch follows this list).
- CBS and Model Configuration: Under compute-optimal configurations, increasing depth or width affects CBS similarly, so the practical implication is that efficiency gains do not depend on deeper architectures in particular and can instead be achieved by scaling data size.
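The scaling-law fit referenced in the second bullet can be illustrated with a simple log-log regression; the power-law form CBS(D) ≈ c · D^α and every data point below are assumptions for illustration, not coefficients reported in the paper.

```python
import numpy as np

# Hypothetical (training tokens, measured CBS) pairs; CBS here is in tokens per batch.
tokens = np.array([4e9, 8e9, 16e9, 32e9, 64e9])
cbs    = np.array([0.5e6, 0.7e6, 1.0e6, 1.4e6, 2.0e6])

# Fit log CBS = log c + alpha * log D by ordinary least squares.
A = np.column_stack([np.ones_like(tokens), np.log(tokens)])
(log_c, alpha), *_ = np.linalg.lstsq(A, np.log(cbs), rcond=None)
print(f"alpha ~ {alpha:.2f}, c ~ {np.exp(log_c):.3g}")

# Extrapolate to a larger data budget (purely illustrative).
D_new = 128e9
print(f"predicted CBS at {D_new:.0e} tokens: {np.exp(log_c) * D_new ** alpha:.3g}")
```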
Theoretical Insights
The theoretical examination shows that, for a fixed model size, CBS increases as more data is introduced, a finding supported by an analysis of mini-batch SGD in high-dimensional linear regression (a toy simulation in this spirit appears below). Together with the empirical scaling laws, this supports using CBS estimates to set the degree of data parallelism and thereby improve computational efficiency during pre-training.
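The simulation below is a toy version of that setting, not the paper's analysis: mini-batch SGD on a synthetic high-dimensional least-squares problem, with a capped linear learning-rate scaling rule standing in for the per-batch-size hyper-parameter tuning described earlier. Step counts drop roughly in proportion to batch size at first and then saturate.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, noise = 200, 20_000, 0.1                    # dimension, samples, label noise
w_true = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ w_true + noise * rng.normal(size=n)

def steps_to_loss(batch_size, target=0.05, max_steps=5_000):
    """Steps of mini-batch SGD until the full-data MSE drops below `target`."""
    lr = min(0.002 * batch_size, 0.2)             # capped linear LR scaling (assumed rule)
    w = np.zeros(d)
    for step in range(1, max_steps + 1):
        idx = rng.integers(0, n, size=batch_size)  # sample a mini-batch with replacement
        Xb, yb = X[idx], y[idx]
        w -= lr * Xb.T @ (Xb @ w - yb) / batch_size
        if np.mean((X @ w - y) ** 2) < target:
            return step
    return max_steps

for B in [1, 4, 16, 64, 256, 1024]:
    print(f"batch={B:5d}  steps={steps_to_loss(B):5d}")
```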
Implications and Future Directions
These findings have both practical and theoretical value. By clarifying how CBS scales with data size, the research aids in optimizing large-scale pre-training: given limited compute, practitioners can set the degree of data parallelism to maximize training efficiency, which matters in real-world settings where data availability and compute resources vary. The theoretical contribution lies in a formalization of CBS and in insights into the behavior of overparameterized models.
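As a back-of-the-envelope example of that adjustment (all numbers hypothetical), an estimated CBS puts a ceiling on how much data parallelism still buys meaningful step reductions:

```python
# Hypothetical sizing exercise, not guidance from the paper.
cbs_tokens       = 2_000_000   # estimated critical batch size, in tokens
seq_len          = 2048        # tokens per training sequence
micro_batch_seqs = 8           # sequences per device per step

tokens_per_device = seq_len * micro_batch_seqs
max_useful_replicas = cbs_tokens // tokens_per_device
print(f"beyond ~{max_useful_replicas} data-parallel replicas "
      f"(global batch ~{max_useful_replicas * tokens_per_device:,} tokens), "
      f"additional replicas save few optimization steps")
```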
Potential future work could explore more diverse data sources and configurations where model size/data size ratios exceed those used in Chinchilla-optimal settings. Additionally, understanding CBS in heterogeneous datasets would further enhance strategies for efficient large-model pre-training.
This paper advances our understanding of the practicalities behind model pre-training and optimization, offering insights that are highly relevant given today's expansive use of large autoregressive LLMs.