
Deep Learning Scaling is Predictable, Empirically (1712.00409v1)

Published 1 Dec 2017 in cs.LG and stat.ML

Abstract: Deep learning (DL) creates impactful advances following a virtuous recipe: model architecture search, creating large training data sets, and scaling computation. It is widely believed that growing training sets and models should improve accuracy and result in better products. As DL application domains grow, we would like a deeper understanding of the relationships between training set size, computational scale, and model accuracy improvements to advance the state-of-the-art. This paper presents a large scale empirical characterization of generalization error and model size growth as training sets grow. We introduce a methodology for this measurement and test four machine learning domains: machine translation, language modeling, image processing, and speech recognition. Our empirical results show power-law generalization error scaling across a breadth of factors, resulting in power-law exponents---the "steepness" of the learning curve---yet to be explained by theoretical work. Further, model improvements only shift the error but do not appear to affect the power-law exponent. We also show that model size scales sublinearly with data size. These scaling relationships have significant implications on deep learning research, practice, and systems. They can assist model debugging, setting accuracy targets, and decisions about data set growth. They can also guide computing system design and underscore the importance of continued computational scaling.

Predictable Scaling in Deep Learning

This paper conducts a comprehensive empirical analysis to establish the predictability of generalization error and model size scaling in deep learning (DL) across several prominent domains: machine translation, language modeling, image classification, and speech recognition. Its central result is the discovery of power-law relationships governing these scaling phenomena, offering quantitative insight into how training set size and computational scale influence model performance.

Key Empirical Findings

  1. Power-Law Generalization Error: Generalization error decreases as a power law of training set size across multiple application domains. The measured power-law exponents, however, diverge from theoretical predictions, commonly ranging in magnitude from 0.07 to 0.35, whereas prior theoretical work typically suggests exponents of 0.5 or 1 (see the scaling forms sketched after this list).
  2. Model Size Scaling: The number of model parameters required grows sublinearly with training data volume: larger datasets do call for larger models, but model size grows more slowly than the data itself.
  3. Domain-Specific Exponents: Although power-law behavior appears in every domain tested, the exponents differ by domain. For instance, language modeling shows especially small exponents ($\beta_g \approx -0.07$), implying that large amounts of additional data buy only modest accuracy improvements.
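
For reference, the two scaling forms described above can be written compactly, with $m$ the training set size, $\varepsilon(m)$ the generalization error, and $s(m)$ the required model size; $\alpha$, $\beta_g$, and $\beta_p$ are constants fit per domain, and the details of the paper's fitting procedure (e.g., any irreducible-error term) are omitted here:

```latex
% Sketch of the empirical scaling forms (constants are fit per task and model family)
\begin{align}
  \varepsilon(m) &\approx \alpha \, m^{\beta_g},
    && \beta_g \in [-0.35,\,-0.07] \text{ across the tested domains,} \\
  s(m) &\propto m^{\beta_p},
    && 0 < \beta_p < 1 \text{ (sublinear model-size growth).}
\end{align}
```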

Methodology

The authors train state-of-the-art (SOTA) models on progressively larger, reduced-size shards of each training set and record how validation error and best-fit model size change, repeating this across the four domains. The study aggregates approximately 50 years of GPU time, underscoring the computational effort involved.
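
As a minimal sketch of this measurement loop (not the authors' code; the shard sizes and error values below are placeholders that a real study would replace with actual training runs), one can train on geometrically spaced shards of the training set, record validation error, and fit $\varepsilon(m) \approx \alpha\, m^{\beta_g}$ by least squares in log-log space:

```python
import numpy as np

def fit_power_law(train_sizes, val_errors):
    """Fit val_error ~ alpha * train_size**beta_g by least squares in log-log space."""
    log_m = np.log(np.asarray(train_sizes, dtype=float))
    log_e = np.log(np.asarray(val_errors, dtype=float))
    beta_g, log_alpha = np.polyfit(log_m, log_e, deg=1)  # slope and intercept of the line
    return float(np.exp(log_alpha)), float(beta_g)

# Hypothetical measurements at 2x-spaced training-set sizes (illustrative numbers only;
# in practice each error would come from training a tuned model on that shard).
sizes = [10_000 * 2**k for k in range(6)]             # 10k ... 320k examples
errors = [0.52, 0.44, 0.38, 0.33, 0.285, 0.245]
alpha, beta_g = fit_power_law(sizes, errors)
print(f"alpha = {alpha:.3f}, beta_g = {beta_g:.3f}")  # beta_g comes out negative
```

On log-log axes such a fit is a straight line, which is the signature used to identify power-law regions of the learning curve.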

Implications

The findings bear considerable implications:

  • Model Debugging: Predictable learning curves give practitioners a baseline: a model that falls well off the expected curve points to deficiencies in the model or its hyperparameter settings, offering a template for diagnosing and iterating on DL models.
  • Data and Computational Planning: Understanding scaling behavior enables better planning of dataset growth and computational resource allocation, helping anticipate the infrastructure needed to reach a given accuracy target (see the sketch after this list).
  • Theoretical Exploration: The empirical divergence from theoretical predictions invites further theoretical inquiry to comprehend these unaccounted scaling exponents and their underpinning factors.
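
As a concrete, purely hypothetical planning example, a fitted learning curve can be inverted to estimate how much data a target error requires; the constants below are illustrative, not values reported in the paper:

```python
def required_training_size(alpha: float, beta_g: float, target_error: float) -> float:
    """Solve target_error = alpha * m**beta_g for the training-set size m."""
    return (target_error / alpha) ** (1.0 / beta_g)

# Illustrative constants only (not fitted values from the paper):
m_needed = required_training_size(alpha=2.0, beta_g=-0.3, target_error=0.25)
print(f"~{m_needed:,.0f} training examples needed")  # ~1,024 with these constants
```

Because exponents small in magnitude (e.g., $\beta_g \approx -0.07$ for language modeling) make the required data grow very quickly, this calculation also illustrates why accuracy gains are so costly in such domains.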

Speculations on Future Developments

The paper raises pertinent questions about whether these power-law exponents are fundamental and whether model innovations might eventually alter the scaling relations. It points to research directions aimed at improving data efficiency and model learning, for example through new architectures or training techniques that extract more from a given amount of data.

Conclusion

By providing an empirical framework for scaling laws in deep learning, this paper makes a significant contribution to both practice and theory. Although the results are empirical, they motivate theoretical work to explain the observed exponents, and they lay a foundation for future research aimed at closing the gap between measured scaling behavior and its theoretical underpinnings.

Authors (9)
  1. Joel Hestness (23 papers)
  2. Sharan Narang (31 papers)
  3. Newsha Ardalani (17 papers)
  4. Gregory Diamos (11 papers)
  5. Heewoo Jun (14 papers)
  6. Hassan Kianinejad (3 papers)
  7. Md. Mostofa Ali Patwary (10 papers)
  8. Yang Yang (884 papers)
  9. Yanqi Zhou (30 papers)
Citations (637)