
Deep learning generalizes because the parameter-function map is biased towards simple functions (1805.08522v5)

Published 22 May 2018 in stat.ML, cs.AI, cs.LG, and cs.NE

Abstract: Deep neural networks (DNNs) generalize remarkably well without explicit regularization even in the strongly over-parametrized regime where classical learning theory would instead predict that they would severely overfit. While many proposals for some kind of implicit regularization have been made to rationalise this success, there is no consensus for the fundamental reason why DNNs do not strongly overfit. In this paper, we provide a new explanation. By applying a very general probability-complexity bound recently derived from algorithmic information theory (AIT), we argue that the parameter-function map of many DNNs should be exponentially biased towards simple functions. We then provide clear evidence for this strong simplicity bias in a model DNN for Boolean functions, as well as in much larger fully connected and convolutional networks applied to CIFAR10 and MNIST. As the target functions in many real problems are expected to be highly structured, this intrinsic simplicity bias helps explain why deep networks generalize well on real world problems. This picture also facilitates a novel PAC-Bayes approach where the prior is taken over the DNN input-output function space, rather than the more conventional prior over parameter space. If we assume that the training algorithm samples parameters close to uniformly within the zero-error region then the PAC-Bayes theorem can be used to guarantee good expected generalization for target functions producing high-likelihood training sets. By exploiting recently discovered connections between DNNs and Gaussian processes to estimate the marginal likelihood, we produce relatively tight generalization PAC-Bayes error bounds which correlate well with the true error on realistic datasets such as MNIST and CIFAR10 and for architectures including convolutional and fully connected networks.

Authors (3)
  1. Guillermo Valle-Pérez (8 papers)
  2. Chico Q. Camargo (6 papers)
  3. Ard A. Louis (51 papers)
Citations (214)

Summary

  • The paper reveals that the parameter-function map biases DNNs toward simple functions, underpinning effective generalization.
  • It employs an algorithmic information theory framework and PAC-Bayes bounds to establish tighter generalization error limits linked to low complexity.
  • Gaussian process approximations validate these results on CIFAR-10 and MNIST, reframing how architecture shapes generalization and model design.

Examining the Simplicity Bias in Deep Neural Networks' Generalization

This paper addresses one of the intriguing mysteries of deep learning: the ability of deep neural networks (DNNs) to generalize effectively despite significant over-parameterization, which traditional learning theory suggests should lead to overfitting. The authors propose a novel explanation for this phenomenon, rooted in algorithmic information theory (AIT), positing that the mapping from parameters to functions in many DNNs is exponentially biased towards simple functions. This simplicity bias, they argue, is a key factor in DNNs' generalizability across a range of tasks.
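
In schematic form, the AIT result the authors invoke (the simplicity-bias bound of Dingle, Camargo and Louis) states that, for a sufficiently generic input-output map, the probability $P(f)$ of producing a function $f$ under uniformly random inputs (here, random parameters) is controlled by an approximate complexity measure:

$$
P(f) \;\le\; 2^{-a\,\tilde{K}(f) + b},
$$

where $\tilde{K}(f)$ is a computable approximation to the Kolmogorov complexity of $f$ and $a, b$ are constants of the map that do not depend on $f$. This should be read as the schematic shape of the bound; the precise statement and its conditions are given in the cited AIT work.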

The central thesis is that the parameter-function map in DNNs inherently biases the network towards simpler functions. This implies that the network architecture itself serves as a form of implicit regularization, promoting generalization even in highly over-parameterized models. Such findings challenge accounts that attribute generalization primarily to specific optimization algorithms such as SGD or to explicit regularization techniques.

The authors employ a range of methods, including a connection to Gaussian processes, to demonstrate the exponential bias towards simplicity. They focus first on Boolean functions and then extend the analysis to larger fully connected and convolutional networks applied to CIFAR-10 and MNIST. Notably, the work adopts a PAC-Bayes framework with a prior over the space of functions rather than the traditional prior over parameter space. This reformulation enables tighter generalization bounds that align closely with empirical observations.
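
Eliding exact constants, the bound takes the standard realizable PAC-Bayes form: the expected generalization error $\epsilon$ of a classifier sampled uniformly from the zero-training-error region is controlled by the marginal likelihood $P(U)$ that the function-space prior assigns to the observed training labels,

$$
-\ln\bigl(1 - \epsilon\bigr) \;\le\; \frac{\ln\frac{1}{P(U)} + \ln\frac{2m}{\delta}}{m},
$$

where $m$ is the training-set size and the bound holds with probability at least $1 - \delta$ over training sets. A larger $P(U)$, i.e. a prior that concentrates on functions consistent with the data, gives a tighter bound.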

Key Contributions and Empirical Evidence

  1. Parameter-Function Mapping and Simplicity Bias: Through both theoretical arguments and empirical evidence, the authors show that the parameter-function map strongly favors simple functions. By sampling parameters at random for a small DNN on Boolean inputs, and for larger architectures, they demonstrate that the probability of obtaining a given function is heavily biased towards functions of low descriptional complexity (a minimal sketch of this sampling experiment appears after this list).
  2. AIT-Based Generalization Insight: By utilizing results from AIT, they derive bounds illustrating that high-probability functions have low Kolmogorov complexity. This relationship holds across various architectures and datasets, supporting the claim that simplicity bias is a pervasive trait of DNNs.
  3. Gaussian Process Approximation: The paper uses the correspondence between wide DNNs and Gaussian processes to estimate the marginal likelihood of a training set under the function-space prior, and shows that these estimates track the behavior of finite-width networks well, providing a practical tool for analyzing generalization.
  4. PAC-Bayes Generalization Bounds: Their application of PAC-Bayes theory in function space, enabled by assuming that the training algorithm samples roughly uniformly from the zero-error region of parameter space, yields generalization error bounds that track empirical test error across datasets and architectures. These bounds are notably tighter than those typically obtained with parameter-space priors (a numerical sketch of the bound follows the sampling example below).
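
As a concrete illustration of item 1, the sketch below samples random parameters for a small network on all 2^7 Boolean inputs and compares each resulting function's empirical probability with a crude compression-based complexity proxy. The layer widths and the zlib-based complexity measure are illustrative stand-ins, not the paper's exact choices.

```python
# Minimal sketch (assumed setup, not the paper's exact experiment): sample random
# weights for a small fully connected network on all 2^7 Boolean inputs, record
# the induced Boolean function as a bitstring, and use compressed length as a
# rough descriptional-complexity proxy.
import itertools
import zlib
from collections import Counter

import numpy as np

rng = np.random.default_rng(0)
X = np.array(list(itertools.product([0, 1], repeat=7)), dtype=float)  # all 128 inputs

def random_boolean_function(x, widths=(7, 40, 40, 1)):
    """Draw i.i.d. Gaussian weights and return the induced Boolean function
    as a 128-character bitstring (sign of the output on every input)."""
    h = x
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        w = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out))
        b = rng.normal(0.0, 1.0, size=n_out)
        h = h @ w + b
        if n_out != 1:
            h = np.maximum(h, 0.0)  # ReLU on hidden layers
    return "".join("1" if v > 0 else "0" for v in h.ravel())

def complexity_proxy(bits):
    """Crude stand-in for descriptional complexity: zlib-compressed length."""
    return len(zlib.compress(bits.encode()))

# Empirical probability of each function under random parameter sampling.
counts = Counter(random_boolean_function(X) for _ in range(100_000))
for f, c in counts.most_common(5):
    print(f"P(f) ~ {c / 1e5:.4f}   complexity proxy = {complexity_proxy(f)}")
```

Under strong simplicity bias, the functions that appear most often should also be the ones with the smallest compressed length, e.g. constant or highly repetitive output bitstrings.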
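
Items 3 and 4 can likewise be illustrated in a few lines: given an estimate of the log marginal likelihood ln P(U) of the training labels under the function-space prior (in the paper this is estimated with a Gaussian process approximation to the DNN prior), the schematic PAC-Bayes bound above reduces to a one-line formula. The numbers below are hypothetical and purely for illustration.

```python
import numpy as np

def pac_bayes_error_bound(log_marginal_likelihood, m, delta=0.05):
    """Realizable PAC-Bayes bound in the schematic form discussed above
    (assuming near-uniform sampling of the zero-error parameter region):
        -ln(1 - eps) <= ( -ln P(U) + ln(2m / delta) ) / m
    `log_marginal_likelihood` is ln P(U), e.g. a Gaussian-process estimate of
    the log probability of the training labels under the function-space prior.
    Returns the resulting upper bound on the expected generalization error."""
    rhs = (-log_marginal_likelihood + np.log(2 * m / delta)) / m
    return 1.0 - np.exp(-rhs)

# Hypothetical values for illustration only: 10,000 training points and a
# GP-estimated log marginal likelihood of -3,000 nats.
print(pac_bayes_error_bound(-3000.0, m=10_000))  # ~0.26
```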

Implications and Future Directions

The implications of these findings are manifold. The recognition of an intrinsic simplicity bias in DNNs could reshape our understanding of model generalization, with potential consequences for model design and the development of new architectures. In particular, the findings suggest that the architecture's role in shaping the bias over the hypothesis class matters more than the specific optimization procedure employed.

Looking forward, this framework may offer insights into the development of simpler models that maintain generalization capabilities, potentially leading to more computationally efficient architectures. Moreover, further exploration into different types of complexity measures and their relation to real-world regularities could refine our understanding of what drives effective generalization.

Future work could involve exploring the extent to which this simplicity bias holds across more complex models and tasks, including multi-class classification and regression. Additionally, translating these theoretical insights into practical algorithms that leverage simplicity bias for enhanced performance could be a promising direction.

In summary, this paper contributes a compelling perspective on why deep learning models generalize well under over-parameterization. The identification of a simplicity bias in the parameter-function map opens new avenues for both theoretical exploration and practical application in the field of machine learning.
