Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks (1805.12076v1)

Published 30 May 2018 in cs.LG and stat.ML

Abstract: Despite existing work on ensuring generalization of neural networks in terms of scale sensitive complexity measures, such as norms, margin and sharpness, these complexity measures do not offer an explanation of why neural networks generalize better with over-parametrization. In this work we suggest a novel complexity measure based on unit-wise capacities resulting in a tighter generalization bound for two layer ReLU networks. Our capacity bound correlates with the behavior of test error with increasing network sizes, and could potentially explain the improvement in generalization with over-parametrization. We further present a matching lower bound for the Rademacher complexity that improves over previous capacity lower bounds for neural networks.

Citations (377)

Summary

  • The paper introduces a novel unit-wise complexity measure that provides tighter generalization bounds for two-layer ReLU networks.
  • It employs rigorous theoretical and empirical analysis to reveal why over-parametrized networks generalize better despite increased capacity.
  • It establishes a lower bound on Rademacher complexity and uses a refined covering argument to address challenges in scaling network size.

Understanding the Role of Over-Parametrization in Neural Network Generalization

The success of deep neural networks across a wide range of machine learning tasks has sparked interest in understanding their generalization capabilities, particularly in over-parametrized settings. Increasing model size often reduces test error, contradicting the traditional machine learning intuition that higher-capacity models tend to overfit. This counterintuitive phenomenon raises fundamental questions about which complexity measures can accurately capture the generalization behavior of neural networks.

This paper presents a novel perspective on the complexity of neural networks by introducing a capacity measure based on unit-wise capacities, which provides a tighter bound on the generalization error for two-layer ReLU networks. The authors provide a comprehensive theoretical and empirical analysis that offers insights into why networks generalize better with increased over-parametrization.
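
To make the notion of a unit-wise quantity concrete, the sketch below computes, for each hidden unit of a two-layer ReLU network, the distance of its incoming weights from initialization and the norm of its outgoing weights. This is a minimal illustration of the kind of per-unit statistics the paper builds on, not the paper's exact definitions; the array shapes, the function name, and the toy data are assumptions made for the example.

```python
import numpy as np

def unit_wise_capacities(U, U0, V):
    """Per-hidden-unit capacity proxies for a two-layer ReLU net f(x) = V @ relu(U @ x).

    U, U0 : (h, d) learned and initial input-to-hidden weight matrices
    V     : (c, h) hidden-to-output weight matrix

    Returns, for each of the h hidden units, the distance of its incoming
    weights from initialization and the norm of its outgoing weights.
    """
    in_capacity = np.linalg.norm(U - U0, axis=1)   # ||u_i - u_i^0||_2 per unit
    out_impact = np.linalg.norm(V, axis=0)         # ||v_i||_2 per unit
    return in_capacity, out_impact

# Toy usage with random stand-ins for trained weights; in practice U and U0
# would come from checkpoints taken after and before training.
rng = np.random.default_rng(0)
d, h, c = 32, 256, 10
U0 = rng.normal(scale=1 / np.sqrt(d), size=(h, d))
U = U0 + 0.01 * rng.normal(size=(h, d))
V = rng.normal(scale=1 / np.sqrt(h), size=(c, h))
cap, imp = unit_wise_capacities(U, U0, V)
print(cap.mean(), imp.mean())
```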

Theoretical Contributions:

  1. Novel Complexity Measure: The paper proposes a complexity measure for neural networks based on unit-wise capacities, which emphasizes the capacity of each individual hidden unit. This measure is shown to decrease as the number of hidden units grows, in contrast to traditional complexity measures, such as VC-dimension bounds, which increase with model size.
  2. Improved Generalization Bound: The authors derive a generalization bound for two-layer ReLU networks that correlates with the empirical decrease in test error observed for larger networks; a schematic form of this kind of margin-based bound is sketched after this list. By focusing on the unit level rather than the entire network or layer, the paper captures complexity in a way that reflects the structure of networks actually learned in practice.
  3. Rademacher Complexity Lower Bound: The paper also presents a lower bound for the Rademacher complexity of two-layer ReLU networks, which significantly improves upon existing bounds. This lower bound emphasizes the importance of ReLU activations in understanding the capacity of neural networks, showing a gap between linear models and networks with ReLU.
  4. Covering Argument for Very Large Networks: For extremely wide networks, a refined generalization bound is derived using a different covering argument. This handles additive terms that could otherwise cause the bound to grow with network size, keeping it meaningful as the network is scaled up further.
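
For orientation, the schematic below shows the standard shape of a margin-based generalization bound via Rademacher complexity; the paper's theorems are of this flavor, with the Rademacher complexity term controlled through unit-wise capacities. The constants and terms here are the textbook form, not the paper's exact statement.

```latex
% Schematic margin bound (standard textbook form, not the paper's exact theorem):
% with probability at least 1 - \delta over an i.i.d. sample of size m,
\[
  L_0(f) \;\le\; \widehat{L}_\gamma(f)
         \;+\; \frac{2}{\gamma}\,\mathfrak{R}_m(\mathcal{F})
         \;+\; \sqrt{\frac{\log(1/\delta)}{2m}},
\]
% where $L_0$ is the expected zero-one loss, $\widehat{L}_\gamma$ the empirical
% margin loss at margin $\gamma$, and $\mathfrak{R}_m(\mathcal{F})$ the
% Rademacher complexity of the network class. The paper's contribution is to
% bound $\mathfrak{R}_m(\mathcal{F})$ for two-layer ReLU networks via unit-wise
% capacities, so that the resulting bound can shrink as the width grows.
```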

Implications and Future Directions:

The implications of this work are multi-faceted, influencing both theoretical understanding and practical applications of deep learning models:

  • Better Generalization Understanding: The insights into unit-wise capacity offer a deeper understanding of why over-parametrized neural networks do not necessarily overfit the training data and continue to generalize well. This perspective on complexity reshapes how we view the role of over-parametrization in network training and generalization.
  • Optimization and Initialization: The observation that trained weights stay close to their initialization in over-parametrized settings provides an empirical basis for understanding the optimization landscape of neural networks; a simple way to quantify this closeness is sketched after this list. Further exploration of initialization and its impact on generalization is a promising area of research.
  • Expanding Beyond Two Layers: This paper is limited to two-layer networks. Extending these theoretical findings to deeper networks remains an open question. Understanding if and how these unit-wise measures apply to deeper or more complex architectures will be crucial.
  • Numerical Bounds: Although tighter than previous bounds, the numerical values of the capacity bounds are still loose. Developing methods to tighten them further remains an important area for ongoing research.
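
As an illustration of the closeness-to-initialization observation mentioned above, the sketch below computes the Frobenius distance between trained and initial weights relative to the trained weight norm. It is a minimal diagnostic under the assumption that checkpoints of the initial and final weights are available; the matrices here are random stand-ins, not trained models.

```python
import numpy as np

def relative_distance_to_init(W_trained, W_init):
    """Return ||W - W_0||_F / ||W||_F, a simple proxy for how far training
    moved the weights from their starting point."""
    return np.linalg.norm(W_trained - W_init) / np.linalg.norm(W_trained)

# Toy usage with stand-in matrices; in practice W_init and W_trained would be
# loaded from checkpoints saved before and after training.
rng = np.random.default_rng(1)
W_init = rng.normal(scale=0.05, size=(1024, 784))
W_trained = W_init + 0.005 * rng.normal(size=W_init.shape)
print(relative_distance_to_init(W_trained, W_init))
```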

By advancing our understanding of neural network generalization through these novel theoretical perspectives, this paper sets the stage for further explorations into the capacities and complexities that define over-parametrized models in machine learning.