Scaling description of generalization with number of parameters in deep learning (1901.01608v5)

Published 6 Jan 2019 in cond-mat.dis-nn and cs.LG

Abstract: Supervised deep learning involves the training of neural networks with a large number $N$ of parameters. For large enough $N$, in the so-called over-parametrized regime, one can essentially fit the training data points. Sparsity-based arguments would suggest that the generalization error increases as $N$ grows past a certain threshold $N^{*}$. Instead, empirical studies have shown that in the over-parametrized regime, generalization error keeps decreasing with $N$. We resolve this paradox through a new framework. We rely on the so-called Neural Tangent Kernel, which connects large neural nets to kernel methods, to show that the initialization causes finite-size random fluctuations $|f_{N}-\bar{f}_{N}|\sim N^{-1/4}$ of the neural net output function $f_{N}$ around its expectation $\bar{f}_{N}$. These affect the generalization error $\epsilon_{N}$ for classification: under natural assumptions, it decays to a plateau value $\epsilon_{\infty}$ in a power-law fashion $\sim N^{-1/2}$. This description breaks down at a so-called jamming transition $N=N^{*}$. At this threshold, we argue that $|f_{N}|$ diverges. This result leads to a plausible explanation for the cusp in test error known to occur at $N^{*}$. Our results are confirmed by extensive empirical observations on the MNIST and CIFAR image datasets. Our analysis finally suggests that, given a computational envelope, the smallest generalization error is obtained using several networks of intermediate sizes, just beyond $N^{*}$, and averaging their outputs.

Citations (186)

Summary

Overview of Scaling Description of Generalization with Number of Parameters in Deep Learning

This paper explores the intriguing phenomenon observed in deep learning, where the generalization error diminishes as the number of parameters in a neural network increases beyond a certain threshold. The authors propose a novel framework that reconciles this behavior with traditional sparsity-based arguments, which would suggest that increasing the number of parameters past a certain point should lead to greater generalization error due to overfitting.

The framework is rooted in the concept of the Neural Tangent Kernel (NTK), which provides a compelling connection between large-scale neural networks and kernel methods. The authors leverage the NTK to demonstrate that random fluctuations in neural network outputs, caused by initialization, decrease with the number of parameters $N$ following a power law $\sim N^{-1/4}$. These fluctuations affect the generalization error for classification tasks such that the error declines to a plateau value $\epsilon_{\infty}$ as $\sim N^{-1/2}$.
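To illustrate how the predicted $\sim N^{-1/2}$ decay could be used in practice, the minimal sketch below fits $\epsilon_{N} = \epsilon_{\infty} + c\,N^{-1/2}$ to test errors measured at several network sizes in order to estimate the plateau value. The error values and sizes are purely illustrative placeholders, not numbers from the paper.

```python
import numpy as np

# Illustrative only: hypothetical test errors at several network sizes N
# (made-up numbers, not taken from the paper). The claimed scaling is
# eps_N ~ eps_inf + c * N**(-1/2) in the over-parametrized regime.
N = np.array([1e4, 3e4, 1e5, 3e5, 1e6])
eps_N = np.array([0.062, 0.055, 0.050, 0.047, 0.045])

# Ordinary least squares in the transformed variable x = N**(-1/2):
# eps_N = eps_inf + c * x, solved for [eps_inf, c].
x = N ** -0.5
A = np.column_stack([np.ones_like(x), x])
(eps_inf, c), *_ = np.linalg.lstsq(A, eps_N, rcond=None)

print(f"estimated plateau eps_inf = {eps_inf:.4f}, prefactor c = {c:.2f}")
```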

At the critical jamming transition, identified as $N=N^{*}$, where a network is just able to fit the training data, the fluctuations diverge and the generalization error spikes, offering a possible explanation for the cusp-shaped behavior of the test error observed at this threshold.

Numerical Findings and Experimental Validation

The authors support their theory through extensive empirical observations using the MNIST and CIFAR image datasets. These experiments confirm the theoretically anticipated reduction in generalization error and reveal additional nuances about the network behavior near the jamming transition, including the divergence of the function norm $|f_N|$ at $N^{*}$.

Furthermore, the research shows that optimal generalization can be attained not by expanding the network far beyond $N^{*}$, but through ensemble averaging of networks at sizes slightly larger than $N^{*}$. This insight is crucial for practical applications where computational resources are a constraint.
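As a rough illustration of the ensembling idea, the sketch below averages the predicted class probabilities of several small, independently initialized networks and compares the result with a single wider network. This uses scikit-learn on synthetic data and is not the paper's setup, architectures, or datasets.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic classification task (illustrative stand-in for MNIST/CIFAR).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# One single large network.
big = MLPClassifier(hidden_layer_sizes=(512,), max_iter=500, random_state=0)
big.fit(X_tr, y_tr)
print("large net accuracy:", big.score(X_te, y_te))

# Ensemble of smaller networks differing only in their random initialization;
# their predicted class probabilities are averaged before taking the argmax.
probs = np.mean(
    [
        MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=s)
        .fit(X_tr, y_tr)
        .predict_proba(X_te)
        for s in range(5)
    ],
    axis=0,
)
print("ensemble accuracy:", (probs.argmax(axis=1) == y_te).mean())
```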

Practical and Theoretical Implications

From a theoretical perspective, this paper offers a robust scaling description of the generalization capabilities of over-parameterized deep networks, providing a new angle for understanding the landscape of neural network training. Practically, the findings advocate for the use of intermediate-sized networks in ensemble configurations as a cost-effective strategy to enhance generalization performance.

These findings have potential implications for the future development of AI, suggesting efficient strategies to deploy deep learning architectures in resource-constrained environments. Additionally, the insights gained from the NTK-based analysis might be applicable to a broader range of machine learning models, paving the way for innovations in model training and performance optimization.

Future Directions

The paper proposes several directions for future research, such as further exploring the nature of the fluctuations at initialization, developing a deeper understanding of the NTK's role during training, and applying the framework to other architectures like convolutional networks, as demonstrated preliminarily in the paper.

In summary, this paper presents a comprehensive scaling theory for generalization in deep learning that is confirmed by numerical evidence, providing valuable insights for the design and deployment of neural networks with an emphasis on computational efficiency and performance optimization.
