- The paper demonstrates that under specific architectural conditions, all differentiable local minima in multilayer neural networks yield zero training error.
- It employs smoothed analysis to establish conditions under which both single- and multi-hidden-layer networks avoid high-error differentiable local minima despite the non-convexity of the loss.
- The results offer practical guidelines, emphasizing appropriate layer dimensions and mild over-parameterization so that training can converge to zero error.
Overview of "No Bad Local Minima: Data Independent Training Error Guarantees for Multilayer Neural Networks"
The paper "No Bad Local Minima: Data Independent Training Error Guarantees for Multilayer Neural Networks" by Daniel Soudry and Yair Carmon addresses a critical question in the optimization of multilayer neural networks (MNNs) - specifically, why stochastic gradient descent (SGD) effectively finds low training error configurations despite the non-convexity of the loss landscape. The authors employ smoothed analysis techniques to theoretically demonstrate that under certain conditions, MNNs do not have "bad" local minima with high training errors.
Main Contributions
The key contribution of this paper is the derivation of conditions under which the training error is zero at all differentiable local minima (DLMs) of multilayer neural networks with piecewise linear activation functions and quadratic loss. The authors focus on MNNs with mild over-parameterization and prove their results for both single hidden layer networks and more complex multi-layer architectures:
- Single Hidden Layer Networks: The authors establish that for a network with one hidden layer, the training error is zero at every DLM. This is guaranteed when the number of weights in the first layer is at least the number of training samples, i.e., N ≤ d0·d1, where d0 and d1 are the input and hidden layer dimensions, respectively.
- Multiple Hidden Layers: For networks with multiple hidden layers, the training error becomes zero under perturbations provided the inequality N ≤ dL−2·dL−1 holds, where dL−2 and dL−1 are the dimensions of the second-to-last and last hidden layers, respectively. This suggests that appropriate architectural design can prevent convergence to high-error local minima. A minimal check of both conditions is sketched below.
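To make the two conditions concrete, here is a minimal Python sketch of how they might be checked for a candidate architecture. The function name, the layer-size convention (input dimension followed by hidden-layer widths, output layer omitted), and the example numbers are illustrative choices, not taken from the paper.

```python
def meets_zero_error_conditions(layer_dims, num_samples):
    """Check the over-parameterization conditions summarized above.

    layer_dims: [d0, d1, ..., dL-1] -- input dimension followed by the
    hidden-layer widths (output layer omitted). The indexing convention
    here is illustrative; the paper's own notation may differ slightly.
    """
    if len(layer_dims) == 2:
        # Single hidden layer: first-layer weight count d0 * d1 must be >= N.
        d0, d1 = layer_dims
        return num_samples <= d0 * d1
    # Multiple hidden layers: product of the last two hidden widths must be >= N.
    return num_samples <= layer_dims[-2] * layer_dims[-1]


# Hypothetical example: 784-dimensional inputs, hidden widths 256 and 128, N = 10,000.
print(meets_zero_error_conditions([784, 256, 128], 10_000))  # True: 256 * 128 = 32768 >= 10000
```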
Theoretical Implications
The findings have broad implications:
- Understanding Non-convex Optimization: By demonstrating that certain MNN configurations lack bad local minima, the paper offers insight into why simple first-order methods such as SGD succeed on these non-convex problems.
- Architectural Guidelines: The results imply practical guidelines for neural network architecture design. Ensuring sufficient over-parameterization (sizing the relevant layers relative to the number of training samples) appears crucial to the network's ability to avoid high-error local minima; a small sizing example follows this list.
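As a quick illustration of how the guideline translates into layer sizing, the following sketch computes the smallest hidden width satisfying the single-hidden-layer condition N ≤ d0·d1 for a given dataset. The helper name and the example numbers are hypothetical.

```python
import math

def min_hidden_width(input_dim, num_samples):
    """Smallest hidden width d1 such that d0 * d1 >= N (single-hidden-layer case)."""
    return math.ceil(num_samples / input_dim)

# Hypothetical example: 60,000 training samples with 784-dimensional inputs
# would need a hidden layer of width at least 77 to satisfy N <= d0 * d1.
print(min_hidden_width(784, 60_000))  # 77
```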
Numerical Validation
The authors corroborate their theoretical results with numerical experiments. They show that for networks meeting the dimension criteria outlined above, the training error approaches zero, even without dropout noise. These experiments reinforce the practical validity of the theoretical predictions in scenarios that simulate “typical worst-case” inputs.
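The paper's own experiments are not reproduced here, but the following PyTorch sketch illustrates the kind of check involved: a mildly over-parameterized single-hidden-layer network with a piecewise linear activation and quadratic loss, trained on synthetic data while the training error is monitored as it is driven toward zero. The synthetic Gaussian data, widths, optimizer settings, and the use of full-batch gradient descent are all assumptions made for brevity, not the authors' setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Sizes chosen so the single-hidden-layer condition N <= d0 * d1 holds:
# N = 200 samples, d0 = 20 inputs, d1 = 50 hidden units (20 * 50 = 1000 >= 200).
N, d0, d1 = 200, 20, 50
X = torch.randn(N, d0)   # synthetic Gaussian inputs (illustrative only)
y = torch.randn(N, 1)    # arbitrary real-valued targets

# Piecewise linear activation (leaky ReLU) and quadratic loss, matching the analyzed setting.
model = nn.Sequential(nn.Linear(d0, d1), nn.LeakyReLU(0.1), nn.Linear(d1, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
mse = nn.MSELoss()

# Full-batch gradient descent for simplicity; the paper discusses SGD-trained networks.
for step in range(20_001):
    optimizer.zero_grad()
    loss = mse(model(X), y)
    loss.backward()
    optimizer.step()
    if step % 5_000 == 0:
        print(f"step {step:6d}  training MSE = {loss.item():.3e}")
```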
Future Directions
Several questions and potential avenues for advancement stem from this work:
- Extension to Diverse Architectures: While the paper hints at broader applicability, extending the theoretical guarantees to other architectures (e.g., convolutional networks) and non-quadratic loss functions remains an open challenge.
- Combination with Generalization Guarantees: Another frontier is leveraging the training error guarantees to derive new bounds for generalization error, potentially surpassing traditional methods reliant on uniform convergence.
- Non-Differentiable Critical Points: Further investigation could extend the results to explore the behavior at non-differentiable critical points, where the sub-gradient might still allow for zero-error solutions.
In summary, this paper presents significant theoretical progress in understanding the optimization landscape of MNNs, offering valuable insight that guides the design and training of neural networks in practical applications. Future work can continue to build upon these foundational concepts to refine architectural design principles and optimization strategies further.