SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data (1710.10174v1)

Published 27 Oct 2017 in cs.LG

Abstract: Neural networks exhibit good generalization behavior in the over-parameterized regime, where the number of network parameters exceeds the number of observations. Nonetheless, current generalization bounds for neural networks fail to explain this phenomenon. In an attempt to bridge this gap, we study the problem of learning a two-layer over-parameterized neural network, when the data is generated by a linearly separable function. In the case where the network has Leaky ReLU activations, we provide both optimization and generalization guarantees for over-parameterized networks. Specifically, we prove convergence rates of SGD to a global minimum and provide generalization guarantees for this global minimum that are independent of the network size. Therefore, our result clearly shows that the use of SGD for optimization both finds a global minimum, and avoids overfitting despite the high capacity of the model. This is the first theoretical demonstration that SGD can avoid overfitting, when learning over-specified neural network classifiers.

Authors (4)
  1. Alon Brutzkus (10 papers)
  2. Amir Globerson (87 papers)
  3. Eran Malach (37 papers)
  4. Shai Shalev-Shwartz (67 papers)
Citations (272)

Summary

  • The paper shows that SGD converges to a global minimum of the empirical loss in over-parameterized two-layer networks using Leaky ReLU activations on linearly separable data.
  • It employs a compression-based approach to derive generalization bounds that are independent of the network size, even in highly over-parameterized settings.
  • The analysis reveals an inductive bias in SGD towards models that mimic the underlying linear classifier, ensuring effective learning despite non-convexity.

SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data

The paper "SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data" presents a theoretical framework addressing the role of stochastic gradient descent (SGD) in learning over-parameterized neural networks, specifically for linearly separable binary classification problems. The investigation into this setting is motivated by the practical observation that SGD can optimize neural networks with more parameters than training samples without compromising their generalization ability, a phenomenon not yet fully explained by existing theoretical bounds.

In this work, the authors focus on two-layer neural networks with fixed second-layer weights and Leaky ReLU activations, exploring the optimization and generalization behavior of SGD under these conditions. The paper introduces several significant theorems and results to substantiate its claims.
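
To make the setting concrete, the following is a minimal NumPy sketch of this setup: a two-layer network with a trainable first layer, a fixed ±1 second layer, Leaky ReLU activations, and example-by-example SGD on synthetic linearly separable data. The hinge-style loss, the margin threshold, the Leaky ReLU slope, and all hyperparameters are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50  # input dimension

# Linearly separable data: unit-norm target w_star, inputs on the unit sphere,
# and a crude margin enforced by rejection sampling (all illustrative choices).
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)

def make_data(m, margin=0.1):
    X, y = [], []
    while len(X) < m:
        x = rng.standard_normal(d)
        x /= np.linalg.norm(x)
        if abs(w_star @ x) > margin:
            X.append(x)
            y.append(np.sign(w_star @ x))
    return np.array(X), np.array(y)

X_train, y_train = make_data(500)
X_test, y_test = make_data(2000)

def train(width, lr=0.1, alpha=0.2, epochs=30):
    """Two-layer net: trainable first layer W, fixed +/-1 second layer v, Leaky ReLU."""
    k = width // 2
    v = np.concatenate([np.ones(k), -np.ones(k)])   # fixed second-layer weights
    W = 1e-3 * rng.standard_normal((width, d))      # small random initialization

    def forward(x):
        z = W @ x
        return v @ np.where(z > 0, z, alpha * z), z  # Leaky ReLU hidden layer

    for _ in range(epochs):
        for i in rng.permutation(len(X_train)):      # SGD, one example at a time
            out, z = forward(X_train[i])
            if y_train[i] * out < 1:                 # hinge loss is active
                # d/dW of max(0, 1 - y * v . sigma(W x)) = -(y * v * sigma'(z)) x^T
                W += lr * np.outer(y_train[i] * v * np.where(z > 0, 1.0, alpha), X_train[i])

    acc = lambda X, y: np.mean(np.sign([forward(x)[0] for x in X]) == y)
    return acc(X_train, y_train), acc(X_test, y_test)

# Over-parameterization sweep: training and test accuracy for increasing widths.
for width in [4, 16, 64, 256]:
    print(width, train(width))
```

With this loss, an example triggers a weight update only when it violates the margin, which is the kind of bounded-update behavior that a compression-style analysis can exploit.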

Key Contributions

  1. Convergence and Generalization:
    • The authors establish that SGD converges to a global minimum of the empirical loss, even when the network is heavily over-parameterized. This is significant because the loss function in this scenario is non-convex, which typically poses challenges for convergence proofs.
    • The convergence rate bound is independent of the network size, implying that SGD's generalization performance does not inherently degrade with increasing network capacity, provided the input data is linearly separable.
  2. Compression-based Generalization Bounds:
    • Using a compression-based approach, the authors provide explicit generalization bounds. These bounds imply that the number of training samples needed for effective generalization does not grow with the size of the parameter space.
    • Remarkably, the authors show that the compression bound is attained in the limit of large learning rates ($\eta \rightarrow \infty$), the regime in which the theory is sharpest; a generic bound of this flavor is sketched after this list.
  3. Inductive Bias of SGD:
    • The paper reveals an inherent inductive bias in SGD that guides it toward models that replicate the linear classifier responsible for the data's separability, despite the high expressiveness of the networks used.
  4. ReLU Activation Analysis:
    • Extending the analysis to ReLU activations, the authors identify a structural difference in optimization dynamics. Notably, spurious local minima are present, potentially obstructing convergence to a global minimum unless specific conditions are met.
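
For context on the compression-based bounds in item 2 above, the following is a generic sample-compression bound of the classical Littlestone-Warmuth flavor, stated in the realizable case and only up to constants. It is not the paper's exact statement, but it illustrates why a bound driven by the number of retained examples (in the paper's argument, roughly, the examples on which SGD makes a non-zero update) can be independent of the parameter count.

```latex
% Generic realizable sample-compression bound (illustrative; not the paper's exact result).
% If the hypothesis h returned by the learner is determined by at most k of the m training
% examples and classifies the whole training sample correctly, then with probability at
% least 1 - \delta over the sample:
\[
  L_{\mathcal{D}}(h) \;\le\; \frac{k \ln m + \ln(1/\delta)}{m - k}.
\]
% The number of network parameters never enters the bound; only k, the number of examples
% needed to reconstruct h, matters.
```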

Implications and Future Directions

This paper advances our understanding of neural network training dynamics, particularly in the context of over-parameterized models. The theoretical guarantees for SGD's effectiveness suggest that similar methods could be extended to more complex architectures or alternative activation functions, sparking further inquiry into more generalized learning scenarios.

From a practical perspective, the results highlight the importance of SGD's algorithmic design in controlling overfitting, suggesting avenues for designing tailored optimization routines that leverage the data's intrinsic properties.

In terms of theoretical development, future work could explore these concepts under different assumptions or within more generalized frameworks, such as non-linear separability or multilayer architectures. Additional exploration into adaptive learning rates and initialization strategies could refine the presented bounds and lead to broader applicability across different machine learning tasks.

This paper significantly contributes to the foundational understanding of why and how neural networks can generalize well despite being over-parameterized, primarily when SGD is employed as the optimization strategy. It opens up promising directions for developing more robust theoretical frameworks that describe neural network training and generalization across various data domains and configurations.