Exploring Generalization in Deep Learning (1706.08947v2)

Published 27 Jun 2017 in cs.LG

Abstract: With a goal of understanding what drives generalization in deep networks, we consider several recently suggested explanations, including norm-based control, sharpness and robustness. We study how these measures can ensure generalization, highlighting the importance of scale normalization, and making a connection between sharpness and PAC-Bayes theory. We then investigate how well the measures explain different observed phenomena.

Citations (1,197)

Summary

  • The paper demonstrates that combining norm-based measures with sharpness under the PAC-Bayes framework better predicts the generalization of over-parameterized networks.
  • The paper shows that evaluating norms, margins, and Lipschitz continuity helps distinguish models trained on true labels from those with random labels.
  • The study’s theoretical conditions for bounding sharpness provide practical insights for enhancing deep learning model optimization.

Overview of "Exploring Generalization in Deep Learning"

The paper "Exploring Generalization in Deep Learning" by Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro investigates the factors that impact the generalization of deep neural networks, despite their over-parameterized architectures. The research discusses various complexity measures including norm-based measures, sharpness, and robustness, and examines how they contribute to the generalization capabilities of neural networks.

Introduction

The introduction highlights the empirical success of deep networks trained with methods such as SGD: they achieve low training error and generalize well even when the number of parameters far exceeds the number of training examples. The paper raises two pivotal questions: what bias is introduced by the optimization algorithm, and what notion of complexity or capacity control is relevant for neural networks? It also emphasizes that parameter-counting complexity measures are inadequate for over-parameterized models, motivating the investigation of alternative measures.

Complexity Measures Investigated

The paper scrutinizes several complexity measures and their ability to explain generalization in deep learning:

  1. Norms and Margins:
    • Norm-based measures such as the ℓ₂ norm and path norms are studied. These measures are commonly used for capacity control in linear models and have been extended to feedforward networks with ReLU activations.
    • The authors also stress the importance of accounting for scaling when comparing norms, and use margins to adjust for it (a minimal sketch of these measures appears after this list). Empirically, the scale-normalized measures align well with intuitive notions of generalization, particularly when comparing networks trained on true versus random labels.
  2. Lipschitz Continuity and Robustness:
    • The paper explores the capacity control provided by Lipschitz continuity and related robustness measures. The authors argue, however, that such measures are insufficient on their own, because the resulting capacity bounds scale exponentially with the input dimension.
  3. Sharpness:
    • Sharpness, introduced by Keskar et al. (2016), measures the robustness of the training error to perturbations of the parameters.
    • The authors argue that sharpness alone cannot control generalization because it is sensitive to the scale of the parameters. Instead, they combine sharpness with norm-based measures through the PAC-Bayes framework, yielding a more comprehensive notion of complexity that balances the two (the resulting bound is sketched schematically after this list).
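
To make these quantities concrete, below is a minimal sketch of the scale-normalized, norm-based measures discussed above, for a small fully connected ReLU network stored as a list of NumPy weight matrices (each of shape out_dim × in_dim). The margin percentile and all function names are illustrative choices, not the paper's code; the spectral-norm product is included as the standard Lipschitz upper bound referenced in the previous item.

```python
import numpy as np

def l2_norm(weights):
    """Overall l2 norm of all parameters."""
    return np.sqrt(sum(np.sum(W ** 2) for W in weights))

def spectral_norm_product(weights):
    """Product of layer spectral norms: the standard upper bound on the
    Lipschitz constant of a ReLU network."""
    return float(np.prod([np.linalg.svd(W, compute_uv=False)[0] for W in weights]))

def l2_path_norm(weights):
    """l2 path norm: sum over all input-output paths of the product of
    squared weights, computed as one forward pass of an all-ones vector
    through the network with every weight squared."""
    v = np.ones(weights[0].shape[1])
    for W in weights:
        v = (W ** 2) @ v
    return np.sqrt(np.sum(v))

def margin(weights, X, y, percentile=5):
    """Scale normalizer: gap between the true-class score and the best
    other score, taken at a low percentile over the training set so a
    few hard examples do not dominate."""
    acts = X
    for W in weights[:-1]:
        acts = np.maximum(acts @ W.T, 0.0)  # ReLU hidden layers
    scores = acts @ weights[-1].T
    idx = np.arange(len(y))
    true_score = scores[idx, y].copy()
    scores[idx, y] = -np.inf
    return np.percentile(true_score - scores.max(axis=1), percentile)

# A margin-normalized complexity measure, e.g. the path norm:
# complexity = l2_path_norm(weights) / max(margin(weights, X, y), 1e-12)
```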
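
Schematically, the PAC-Bayes combination takes the following form under the common choice of a Gaussian prior N(0, σ²I) and posterior N(w, σ²I), for which the KL term reduces to an ℓ₂ norm; this is a paraphrase of the general shape of the bound, not the paper's exact statement:

```latex
% PAC-Bayes with prior P = N(0, \sigma^2 I) and posterior Q = N(w, \sigma^2 I),
% so that KL(Q \| P) = \|w\|_2^2 / (2\sigma^2); holds with high probability
% over an m-sample training set:
\mathbb{E}_{u \sim \mathcal{N}(0,\sigma^2 I)}\!\left[L(w+u)\right]
  \le \widehat{L}(w)
  + \underbrace{\mathbb{E}_{u}\!\left[\widehat{L}(w+u)\right] - \widehat{L}(w)}_{\text{expected sharpness}}
  + O\!\left(\sqrt{\frac{\|w\|_2^2/(2\sigma^2) + \ln(m/\delta)}{m}}\right)
```

The first correction term is the expected sharpness; the second grows with the parameter norm, which is why the combined measure is sensitive to both.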

Empirical Investigation

The authors conduct various empirical studies to validate the effectiveness of the proposed complexity measures:

  • True vs Random Labels:
    • The gap in complexity between models trained on true and random labels serves as a test of the appropriateness of these measures. The results indicate that measures like the ℓ₂ norm and path norms correctly predict better generalization for networks trained on true labels (a skeleton of this comparison is sketched after this list).
  • Different Global Minima:
    • The paper examines global minima (different parameter settings with zero training error) reached by training on varying fractions of randomly labeled data. Norm-based and PAC-Bayes measures correlate well with generalization performance among these minima.
  • Increasing Network Size:
    • An evaluation of networks with a varying number of hidden units demonstrates that norm-based measures and the combined PAC-Bayes measure explain the observed trend where increasing hidden units improves generalization up to a point.
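
As a rough skeleton of the true-versus-random-label comparison referenced above, reusing the measure helpers sketched earlier; load_data and train_network are hypothetical stand-ins, not functions from the paper:

```python
import numpy as np

# Hypothetical skeleton: load_data and train_network are illustrative
# stand-ins, not functions from the paper's code.
X, y = load_data()
y_random = np.random.permutation(y)      # destroy any label structure

w_true = train_network(X, y)             # reaches zero training error
w_rand = train_network(X, y_random)      # can also reach zero training error

# A useful complexity measure should come out markedly larger for w_rand:
for name, measure in [("l2 norm", l2_norm), ("path norm", l2_path_norm)]:
    print(name, measure(w_true), measure(w_rand))
```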

Theoretical Insights on Sharpness

To further understand sharpness in deep networks, the paper provides theoretical conditions under which sharpness can be bounded. The conditions (C1, C2, C3) address weak interactions between layers, the sensitivity of activations to perturbations, and spiky hidden unit weights. These conditions serve as guidelines for ensuring low sharpness and consequently better generalization.
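
Below is a minimal Monte Carlo sketch of the expected-sharpness quantity that the PAC-Bayes combination relies on, assuming a loss_fn that maps a list of weight matrices to the training loss; the perturbation scale and sample count are illustrative choices:

```python
import numpy as np

def expected_sharpness(weights, loss_fn, sigma=0.01, n_samples=20, seed=0):
    """Monte Carlo estimate of expected sharpness,
    E_u[L_hat(w + u)] - L_hat(w) with u ~ N(0, sigma^2 I).
    `loss_fn` maps a list of weight matrices to the training loss."""
    rng = np.random.default_rng(seed)
    base = loss_fn(weights)
    total = 0.0
    for _ in range(n_samples):
        perturbed = [W + sigma * rng.standard_normal(W.shape) for W in weights]
        total += loss_fn(perturbed)
    return total / n_samples - base
```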

Implications and Future Directions

The research contributes to a deeper theoretical and empirical understanding of generalization in deep learning:

  • Practical Implications:
    • Insights from combining sharpness with norms suggest more holistic metrics for model selection in practical AI applications.
  • Theoretical Implications:
    • The findings prompt further investigation into the implicit regularization effects of optimization algorithms and their connections to the proposed complexity measures.
  • Future Developments in AI:
    • The paper sets the stage for new research directions focused on optimization biases and their impact on the complexity and generalization of neural networks.

Conclusion

This paper provides significant insights into the complexity measures that can explain the generalization behavior of deep neural networks. While no single measure appears to suffice on its own, norm and margin-based measures combined with sharpness under the PAC-Bayes framework offer a promising direction. The proposed conditions for bounding sharpness further our theoretical understanding and offer practical avenues for improving generalization in deep learning models.