Overview of the Paper "Gradient Descent Provably Optimizes Over-parameterized Neural Networks"
The paper "Gradient Descent Provably Optimizes Over-parameterized Neural Networks" by Du et. al. addresses a fundamental theoretical question in machine learning: why do randomly initialized first-order methods such as gradient descent succeed in minimizing training loss for over-parameterized neural networks, despite the non-convex and non-smooth nature of the objective functions? Using two-layer fully connected ReLU-activated neural networks as the model of paper, the authors provide rigorous proof that gradient descent converges to a global minimum with a linear rate under certain conditions.
Key Contributions
1. Convergence for Two-layer Neural Networks
The paper proves that for a two-layer neural network with ReLU activation and a quadratic loss function, gradient descent converges to a globally optimal solution provided the network is sufficiently over-parameterized. Specifically, the authors show that if the number of hidden nodes m is large enough and no two inputs are parallel, gradient descent achieves zero training loss at a linear convergence rate.
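In condensed form, and up to constants, the discrete-time guarantee reads roughly as follows; here u(k) denotes the vector of network predictions on the n training points after k gradient descent steps, and λ0 is the least eigenvalue of a limiting Gram matrix H∞ defined in the next section:

```latex
% Condensed restatement of the discrete-time result (constants omitted).
\[
  m = \Omega\!\left( \frac{n^6}{\lambda_0^4 \, \delta^3} \right),
  \quad
  \eta = O\!\left( \frac{\lambda_0}{n^2} \right)
  \;\Longrightarrow\;
  \left\| \mathbf{u}(k) - \mathbf{y} \right\|_2^2
  \le \left( 1 - \frac{\eta \lambda_0}{2} \right)^{\!k}
      \left\| \mathbf{u}(0) - \mathbf{y} \right\|_2^2
\]
% with probability at least 1 - delta over the random initialization.
```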
2. Important Insights and Techniques
The authors identify key insights that facilitate their proofs:
- The dynamics of the individual predictions, f(W, a, x_i), are more tractable to analyze than the parameter trajectories themselves, given the non-convex and non-smooth objective.
- They introduce a Gram matrix (written out after this list) whose spectral properties are critical to the convergence proof. This matrix captures a convexity-like behavior that is leveraged to show that gradient descent converges linearly to the global optimum.
- Over-parameterization forces the weight vectors to remain in the vicinity of their initializations throughout training, ensuring the stability of the Gram matrix and enabling the proof of linear convergence.
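Restated in condensed form, the objects driving this argument are the prediction dynamics under gradient flow and the two Gram matrices, where u_i(t) = f(W(t), a, x_i):

```latex
% Prediction dynamics under gradient flow and the Gram matrices that govern them.
\[
  \frac{d\,\mathbf{u}(t)}{dt} = \mathbf{H}(t) \left( \mathbf{y} - \mathbf{u}(t) \right),
  \qquad
  H_{ij}(t) = \frac{\mathbf{x}_i^\top \mathbf{x}_j}{m}
  \sum_{r=1}^{m} \mathbf{1}\!\left\{ \mathbf{w}_r(t)^\top \mathbf{x}_i \ge 0,\;
                                     \mathbf{w}_r(t)^\top \mathbf{x}_j \ge 0 \right\},
\]
\[
  H^{\infty}_{ij} = \mathbb{E}_{\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})}
  \left[ \mathbf{x}_i^\top \mathbf{x}_j \,
         \mathbf{1}\!\left\{ \mathbf{w}^\top \mathbf{x}_i \ge 0,\;
                             \mathbf{w}^\top \mathbf{x}_j \ge 0 \right\} \right],
  \qquad
  \lambda_0 := \lambda_{\min}\!\left( H^{\infty} \right) > 0 .
\]
```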
Numerical Results and Theoretical Claims
The paper does not merely establish convergence; it provides explicit rates and conditions. Gradient descent reaches a desired accuracy ϵ in O(log(1/ϵ)) iterations, under the assumption that the least eigenvalue of the Gram matrix is bounded away from zero. The underlying data assumption, that no two input vectors are parallel, is mild in practice: real-world inputs are rarely perfectly aligned.
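For example, unrolling the per-step contraction of the residual until it falls below ϵ yields the stated iteration count:

```latex
% Iteration count implied by the per-step contraction (constants absorbed).
\[
  \left( 1 - \frac{\eta \lambda_0}{2} \right)^{\!k}
  \left\| \mathbf{u}(0) - \mathbf{y} \right\|_2^2 \le \epsilon
  \quad \Longleftarrow \quad
  k \;\ge\; \frac{2}{\eta \lambda_0}
  \log \frac{\left\| \mathbf{u}(0) - \mathbf{y} \right\|_2^2}{\epsilon}
  \;=\; O\!\left( \log \tfrac{1}{\epsilon} \right)
  \ \text{for fixed } \eta, \lambda_0 .
\]
```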
Analysis of Gradient Flow and Discrete Time Analysis
The authors first analyze the continuous-time analogue of the algorithm, i.e., gradient flow. They then extend the analysis to discrete time to reflect practical implementations of gradient descent. The discrete-time analysis shows that with an appropriately chosen step size η = O(λ0/n²), gradient descent retains the same linear convergence rate, substantiating the robustness of the theoretical findings when applied to the actual gradient descent algorithm.
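As a purely illustrative sanity check (not the authors' code), the following NumPy sketch runs full-batch gradient descent on the first layer of a wide two-layer ReLU network and prints the training loss, which should decay roughly geometrically. The data, width, and iteration budget are arbitrary demo choices, and the step size is set heuristically from the empirical Gram matrix rather than via the paper's conservative η = O(λ0/n²).

```python
# Toy numerical check of the linear convergence claim (a sketch, not the
# authors' code; data, width, and step size are arbitrary demo choices).
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10, 10, 2000                     # samples, input dim, hidden width

# Unit-norm inputs and bounded labels, matching the paper's assumptions.
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.uniform(-1.0, 1.0, size=n)

# Random initialization as in the analysis: w_r ~ N(0, I), a_r in {-1, +1}.
W = rng.normal(size=(m, d))
a = rng.choice([-1.0, 1.0], size=m)

# Empirical Gram matrix at initialization and its least eigenvalue (lambda_0 estimate).
act0 = (X @ W.T > 0).astype(float)
H0 = (X @ X.T) * (act0 @ act0.T) / m
print("estimated lambda_0:", np.linalg.eigvalsh(H0).min())

# Heuristic step size for a quick demo; the paper's conservative choice
# eta = O(lambda_0 / n^2) also converges, just over many more iterations.
eta = 1.0 / np.linalg.eigvalsh(H0).max()

for k in range(201):
    pre = X @ W.T                          # (n, m) pre-activations
    act = (pre > 0).astype(float)
    u = (np.maximum(pre, 0.0) @ a) / np.sqrt(m)   # predictions u(k)
    err = u - y                            # residual u(k) - y
    # Gradient of L(W) = 0.5 * sum_i (u_i - y_i)^2 with respect to W.
    grad = ((act * err[:, None]).T @ X) * (a[:, None] / np.sqrt(m))
    W -= eta * grad
    if k % 50 == 0:
        print(f"iter {k:4d}  loss {0.5 * np.sum(err**2):.3e}")
```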
Extension to Joint Layer Training
Importantly, the paper extends these results to the setting where the weights of the first and second layers are trained simultaneously. Using similar Gram-matrix stability arguments, the authors show that such jointly trained networks also achieve zero training loss under gradient flow.
Practical Implications
These results provide crucial theoretical underpinnings to the empirical success of over-parameterized neural networks. They clarify why deep learning systems with an over-abundance of parameters often perform well despite the complex landscape of their loss functions.
Future Directions
The findings open several promising directions for future research:
- Generalization to deeper, more complex neural network architectures, exploring whether over-parameterization guarantees similar convergence behavior.
- Exploration of advanced concentration inequalities and matrix perturbation techniques to better understand the bounds on the number of hidden nodes required for convergence.
- Investigation of accelerated first-order methods beyond basic gradient descent using potential functions different from simple empirical loss to improve convergence rates further.
Conclusion
The results by Du et al. bridge a significant gap between empirical observations and theoretical guarantees in the training of neural networks. Their rigorous analysis of gradient descent on over-parameterized neural networks provides strong theoretical foundations and paves the way for further research into the efficiency and optimization of deep learning systems.