- The paper uses computer-assisted proofs to show that spurious local minima exist in two-layer ReLU networks with 6 to 20 neurons.
- Even in a favorable setting, standard Gaussian inputs and a realizable target generated by a network with orthonormal weight vectors, gradient descent converges to these suboptimal minima with non-negligible probability.
- The study finds that mild over-parameterization sharply reduces the chance of landing in such minima, offering a practical strategy for more reliable neural network training.
Analysis of "Spurious Local Minima are Common in Two-Layer ReLU Neural Networks"
The paper under examination investigates the optimization challenges of two-layer ReLU neural networks. It highlights a central issue in non-convex optimization: the presence of spurious local minima. Using a rigorous, computer-assisted proof, the authors demonstrate that with standard Gaussian inputs these spurious minima exist for networks with 6 to 20 neurons. Strikingly, this holds even when the target values are realizable, generated by a network of the same size whose weight vectors are orthonormal; the objective being minimized is sketched below.
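Concretely, and modulo the paper's exact notation, the problem is the population squared loss of a k-neuron ReLU student measured against an n-neuron teacher with orthonormal weight vectors (second-layer weights fixed to one). A sketch of that objective, with [z]_+ = max(z, 0):

```latex
% Population squared loss: k-neuron ReLU student vs. an n-neuron teacher
% whose weight vectors v_1, ..., v_n are orthonormal (e.g., v_j = e_j).
F(w_1,\dots,w_k)
  = \mathbb{E}_{x \sim \mathcal{N}(0, I_d)}
    \Biggl[ \Bigl( \sum_{i=1}^{k} [\langle w_i, x \rangle]_+
                 - \sum_{j=1}^{n} [\langle v_j, x \rangle]_+ \Bigr)^{2} \Biggr]
```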
Summary of Key Findings
The authors frame the problem as squared-loss optimization for two-layer ReLU networks. The central claim is that spurious local minima do exist in specific configurations, notably when the network has between 6 and 20 neurons. The paper supports this with several theoretical ingredients, including a concentration-of-measure argument: in high dimensions, randomly drawn target networks have nearly orthogonal weight vectors, so the spurious minima established for orthonormal targets are not an artifact of a carefully chosen target but affect most targets (a short numerical illustration of this intuition follows).
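As an illustration of that concentration intuition (this check is not from the paper; it is a minimal sketch), independent random unit vectors in high dimension have inner products that shrink at roughly the 1/sqrt(d) scale, i.e., they are nearly orthogonal:

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (10, 100, 1000):
    # Two independent directions drawn uniformly from the unit sphere in R^d.
    u = rng.standard_normal(d)
    v = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    v /= np.linalg.norm(v)
    # The inner product concentrates around 0 at roughly the 1/sqrt(d) scale.
    print(f"d={d:4d}  |<u, v>| = {abs(u @ v):.4f}   1/sqrt(d) = {1 / np.sqrt(d):.4f}")
```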
Experimental Results and Techniques:
- Computer-Assisted Proof: Rather than relying on a fully analytical argument, the authors first run gradient descent from random initializations to locate candidate stationary points, and then certify that these candidates are genuine spurious local minima with a computer-assisted proof whose numerical computations carry rigorous error bounds, so the conclusion does not hinge on ordinary floating-point arithmetic.
- Implications for Network Training: The experiments indicate a substantial probability of converging to these suboptimal local minima, and this probability grows with network size. This underlines a real pitfall in training neural networks with conventional gradient-based methods whenever spurious local minima are present.
- Mitigation through Over-Parameterization: An encouraging outcome is that mild over-parameterization alleviates the issue. Over-parameterized networks converge to spurious minima far less often, suggesting that adding a modest number of neurons beyond what is needed to fit the target can make training markedly more reliable (a minimal illustrative sketch of this kind of experiment follows this list).
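The following numpy sketch is not the authors' code; it is a hypothetical, minimal reconstruction of the experimental recipe described above, assuming an empirical (finite-sample) version of the squared loss, a 20-dimensional Gaussian input, and plain gradient descent with arbitrarily chosen step size and iteration budget. It counts how often training from a random initialization stalls well above zero loss, once with k = n and once with a few extra neurons:

```python
import numpy as np

rng = np.random.default_rng(0)


def teacher(X, n):
    """n-neuron ReLU teacher with orthonormal weights e_1, ..., e_n and unit output weights."""
    return np.maximum(X[:, :n], 0.0).sum(axis=1)


def loss_and_grad(W, X, y):
    """Empirical squared loss of the student sum_i relu(<w_i, x>) and its gradient w.r.t. W."""
    pre = X @ W.T                        # (m, k) pre-activations
    act = np.maximum(pre, 0.0)           # ReLU
    resid = act.sum(axis=1) - y          # (m,) residuals
    loss = 0.5 * np.mean(resid ** 2)
    grad = ((resid[:, None] * (pre > 0)).T @ X) / len(y)   # (k, d)
    return loss, grad


def train_once(k, n, d=20, m=10_000, steps=2_000, lr=0.05):
    """Plain gradient descent from a random initialization; returns the final loss."""
    X = rng.standard_normal((m, d))
    y = teacher(X, n)
    W = rng.standard_normal((k, d)) / np.sqrt(d)
    for _ in range(steps):
        _, g = loss_and_grad(W, X, y)
        W -= lr * g
    return loss_and_grad(W, X, y)[0]


def failure_rate(k, n, trials=10, tol=1e-2):
    """Fraction of runs whose final loss stays well above zero (a suboptimal point)."""
    return float(np.mean([train_once(k, n) > tol for _ in range(trials)]))


if __name__ == "__main__":
    n = 10
    print("exactly parameterized   (k = n):    ", failure_rate(k=n, n=n))
    print("mildly over-parameterized (k = n+4):", failure_rate(k=n + 4, n=n))
```

Because this sketch uses a finite sample, an arbitrary loss threshold, and untuned hyperparameters, the resulting counts are only directional; the paper's probabilities are derived from the exact population objective together with a verified proof.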
Theoretical Implications and Open Problems
The theoretical implications of this research are significant. First, it pushes back against the recent view that the non-convex landscapes arising in neural network training are effectively benign. The paper confirms that non-convexity does cause genuine optimization difficulties, at least for these architectures, reinforcing the need for additional assumptions or interventions such as over-parameterization.
Future research avenues are apparent:
- Further exploration of how the input distribution, target weights, and network width affect the existence and distribution of spurious minima.
- Algorithms specifically designed to navigate around or escape from local minima could be pivotal.
- Extending these findings to deeper or more complex network architectures could reveal if similar issues persist.
Practical Implications
The paper's findings have practical implications for the design and training of neural networks. Knowing that adding neurons can mitigate the spurious-minima problem suggests a concrete strategy for architecture design. This must, however, be balanced against computational cost, since over-parameterization is not free.
In conclusion, this paper provides a nuanced picture of the optimization dynamics of two-layer ReLU networks, demonstrating that spurious local minima are a real phenomenon under natural configurations. It offers a practical path forward through over-parameterization while leaving fertile ground for future work, especially on deeper and more structured networks. Such insights are valuable for building robust neural network optimization in practical applications.