Beyond neural scaling laws: beating power law scaling via data pruning (2206.14486v6)

Published 29 Jun 2022 in cs.LG, cs.AI, cs.CV, and stat.ML

Abstract: Widely observed neural scaling laws, in which error falls off as a power of the training set size, model size, or both, have driven substantial performance improvements in deep learning. However, these improvements through scaling alone require considerable costs in compute and energy. Here we focus on the scaling of error with dataset size and show how in theory we can break beyond power law scaling and potentially even reduce it to exponential scaling instead if we have access to a high-quality data pruning metric that ranks the order in which training examples should be discarded to achieve any pruned dataset size. We then test this improved scaling prediction with pruned dataset size empirically, and indeed observe better than power law scaling in practice on ResNets trained on CIFAR-10, SVHN, and ImageNet. Next, given the importance of finding high-quality pruning metrics, we perform the first large-scale benchmarking study of ten different data pruning metrics on ImageNet. We find most existing high performing metrics scale poorly to ImageNet, while the best are computationally intensive and require labels for every image. We therefore developed a new simple, cheap and scalable self-supervised pruning metric that demonstrates comparable performance to the best supervised metrics. Overall, our work suggests that the discovery of good data-pruning metrics may provide a viable path forward to substantially improved neural scaling laws, thereby reducing the resource costs of modern deep learning.

Breaking Beyond Neural Scaling Laws: Enhanced Learning via Data Pruning

This paper explores how to overcome the limitations imposed by widely observed neural scaling laws in deep learning, focusing on the scaling of error with dataset size. Under these laws, test error falls off as a power of the training set size, so each further increment of improvement demands disproportionately more data, compute, and energy. The paper proposes data pruning as a way to break past this regime: by intelligently discarding part of the training data according to a quality metric, error can scale better than a power law with the pruned dataset size, potentially even exponentially.
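
For reference, a generic form of such a dataset-size scaling law is written below. The notation is illustrative rather than the paper's exact formulation; the constants and the exponent vary by task and architecture.

```latex
% Generic empirical scaling law for test error E as a function of
% training set size N (constants a, E_\infty and exponent \nu are task-dependent):
E(N) \;\approx\; a\,N^{-\nu} + E_{\infty}, \qquad \nu > 0
% Halving the reducible error requires scaling N by roughly 2^{1/\nu},
% which is why improvement through data scaling alone is so expensive.
```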

High-Quality Data Pruning Metric

A key proposition is that, given an excellent data pruning metric that ranks training examples by importance, it is feasible to maintain or even improve model performance while training on a smaller dataset. In theory, this can turn the power-law scaling of error with pruned dataset size into exponential scaling. Empirically, the paper observes better-than-power-law scaling with pruned datasets on standard benchmarks such as CIFAR-10, SVHN, and ImageNet, using architectures including ResNets and Vision Transformers.
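
To make the procedure concrete, the sketch below shows the generic recipe: score every training example with a pruning metric, then keep either the easiest or the hardest fraction depending on the regime. The scores and the keep fraction here are placeholders rather than the paper's specific settings.

```python
import numpy as np

def prune_dataset(difficulty: np.ndarray, keep_fraction: float, keep_hard: bool = True) -> np.ndarray:
    """Return indices of the examples to retain after pruning.

    difficulty   : one score per training example from some pruning metric
                   (e.g. margin, forgetting count, or distance to a prototype).
    keep_fraction: fraction of the original dataset to keep, in (0, 1].
    keep_hard    : keep the hardest examples (large-data regime) if True,
                   the easiest ones (small-data regime) if False.
    """
    n_keep = max(1, int(round(keep_fraction * len(difficulty))))
    order = np.argsort(difficulty)            # easiest (lowest score) first
    if keep_hard:
        order = order[::-1]                   # hardest first instead
    return np.sort(order[:n_keep])            # sorted indices of retained examples

# Example: keep the hardest 70% of a toy dataset with random difficulty scores.
rng = np.random.default_rng(0)
scores = rng.random(10_000)
kept = prune_dataset(scores, keep_fraction=0.7, keep_hard=True)
print(len(kept))  # 7000
```

The retained subset is then used for ordinary training; only the scoring metric changes between pruning methods.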

Benchmarking and New Pruning Metrics

Given the central role of high-quality metrics, a large-scale benchmarking study was conducted to assess ten distinct data pruning metrics on ImageNet. The study found that most existing high-performing metrics scale poorly to ImageNet, and that the best of them are computationally intensive and require labels for every image. In response, this research introduces a simple, cheap, and scalable self-supervised pruning metric. Remarkably, it performs on par with the leading supervised metrics despite relying on unsupervised learning principles and bypassing the need for label information.
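
The sketch below illustrates a prototype-style metric in the spirit of the paper's self-supervised approach: cluster embeddings from a pretrained self-supervised encoder with k-means and score each example by its distance to the nearest cluster centroid, so that outlying examples count as hard. The choice of encoder, the number of clusters, and the use of Euclidean distance here are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

def prototype_difficulty(embeddings: np.ndarray, n_clusters: int = 100, seed: int = 0) -> np.ndarray:
    """Self-supervised, prototype-style difficulty scores.

    embeddings : (n_examples, dim) features from a self-supervised encoder
                 run over the unlabeled training set; no labels required.
    Returns one score per example: distance to the nearest k-means centroid.
    Examples far from every prototype are treated as 'hard'.
    """
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(embeddings)
    # Distance from each example to its assigned (nearest) centroid.
    return np.linalg.norm(embeddings - km.cluster_centers_[km.labels_], axis=1)

# Toy usage with random "embeddings"; in practice these would come from a
# frozen self-supervised model.
feats = np.random.default_rng(1).normal(size=(5_000, 128)).astype(np.float32)
scores = prototype_difficulty(feats, n_clusters=50)
```

These scores can then feed the ranking-and-pruning step sketched earlier.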

Theoretical Insights and Practical Implications

The paper further leverages statistical mechanics to derive an analytic theory of data pruning in a student-teacher perceptron setting. The theory makes several notable predictions: the optimal pruning strategy depends on how much data is available initially, favoring retention of the easier examples when data is scarce and of the harder examples when data is abundant. Crucially, aggressive pruning from a sufficiently large initial dataset can yield error decay faster than any power law, indicative of exponential scaling. These predictions align well with the empirical results and offer insight into the information gained per example and the trade-offs involved in choosing a pruning strategy.
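
A schematic way to state the contrast the theory draws is given below; the constants and exponents are shown purely for illustration and are not taken verbatim from the paper's derivation.

```latex
% Let \alpha = P/N be training examples per parameter, and
% \alpha_{\mathrm{prune}} = P_{\mathrm{kept}}/N the retained examples per parameter.
% Without pruning, perceptron theory gives a power law in \alpha:
\varepsilon_{\text{no prune}}(\alpha) \;\sim\; c\,\alpha^{-1}
% With a high-quality pruning metric and a keep fraction chosen on the Pareto
% frontier from a large initial pool, the predicted decay in the pruned
% dataset size is exponential rather than power law:
\varepsilon_{\text{pruned}}(\alpha_{\mathrm{prune}}) \;\lesssim\; C\,e^{-k\,\alpha_{\mathrm{prune}}}
```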

Future Trajectories

While this paper establishes a promising foundation, further work is needed to refine pruning metrics, particularly self-supervised ones. Such advances could substantially reduce the resource costs of large-scale deep learning and open a path toward intelligently curated "foundation datasets": pruned datasets computed once and reused to train many models, thereby amortizing the initial pruning effort.

In summary, while challenges remain, the insights and methodologies proposed here hold significant potential for reshaping how datasets are used in deep learning. The ability to do more with less, achieving higher performance with smaller datasets through intelligent pruning, promises both theoretical and practical advances in the field of artificial intelligence.

Authors (5)
  1. Ben Sorscher (2 papers)
  2. Robert Geirhos (28 papers)
  3. Shashank Shekhar (35 papers)
  4. Surya Ganguli (73 papers)
  5. Ari S. Morcos (31 papers)
Citations (353)