An Improved Analysis of Training Over-parameterized Deep Neural Networks (1906.04688v1)

Published 11 Jun 2019 in cs.LG, math.OC, and stat.ML

Abstract: A recent line of research has shown that gradient-based algorithms with random initialization can converge to the global minima of the training loss for over-parameterized (i.e., sufficiently wide) deep neural networks. However, the condition on the width of the neural network to ensure the global convergence is very stringent, which is often a high-degree polynomial in the training sample size $n$ (e.g., $O(n^{24})$). In this paper, we provide an improved analysis of the global convergence of (stochastic) gradient descent for training deep neural networks, which only requires a milder over-parameterization condition than previous work in terms of the training sample size and other problem-dependent parameters. The main technical contributions of our analysis include (a) a tighter gradient lower bound that leads to a faster convergence of the algorithm, and (b) a sharper characterization of the trajectory length of the algorithm. By specializing our result to two-layer (i.e., one-hidden-layer) neural networks, it also provides a milder over-parameterization condition than the best-known result in prior work.

Citations (222)

Summary

  • The paper introduces a tighter gradient lower bound, larger than earlier bounds by roughly a factor of the sample size, which yields faster convergence of GD and SGD.
  • It presents a novel analysis of algorithm trajectories, providing precise bounds on the training path length for improved computational efficiency.
  • The study establishes global convergence for deep and two-layer ReLU networks with significantly reduced over-parameterization requirements.

An Improved Analysis of Training Over-parameterized Deep Neural Networks

The paper by Difan Zou and Quanquan Gu presents advancements in understanding the global convergence of (stochastic) gradient descent (GD and SGD) when applied to over-parameterized deep neural networks. This work extends and refines existing theory by establishing a milder over-parameterization condition than previous research, thus narrowing the gap between theory and practice.

Summary of Contributions

The paper articulates several key contributions:

  • Tighter Gradient Lower Bound and Convergence Analysis: The authors introduce a more precise gradient lower bound that enhances convergence speed. This is achieved by leveraging the concept of a "gradient region" that aggregates information from all training data, resulting in a bound that is effectively multiplied by the sample size $n$. This yields a faster convergence rate for GD and SGD than previously established results (a toy numerical sketch of this convergence behavior follows the list).
  • Sharper Characterization of Algorithm Trajectories: A novel analytical technique provides a tighter characterization of the trajectory length that GD and SGD traverse during training, offering insights into their computational efficiency.
  • Global Convergence for Deep and Two-layer ReLU Networks: By extending the convergence results of previous models to deep ReLU networks, the authors illustrate that with Gaussian random initialization, the required neural network width is significantly reduced while maintaining global convergence.
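
To make the convergence claim concrete, here is a minimal numerical sketch (not the paper's construction; the width, step size, and synthetic data below are illustrative choices only) of full-batch gradient descent on an over-parameterized two-layer ReLU network with square loss. With the hidden width much larger than the sample size, the training loss should decrease steadily toward zero, consistent with the global convergence guarantee.

```python
import numpy as np

# Toy illustration only: width m, step size eta, and the synthetic data are
# illustrative choices, not the constants required by the paper's theorems.
rng = np.random.default_rng(0)
n, d, m = 20, 10, 2000                     # samples, input dim, hidden width (m >> n)

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # unit-norm inputs
y = rng.choice([-1.0, 1.0], size=n)                # +/- 1 labels

W = rng.standard_normal((d, m)) / np.sqrt(d)       # Gaussian-initialized first layer
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)   # fixed second layer

def loss(W):
    pred = np.maximum(X @ W, 0.0) @ a              # two-layer ReLU network output
    return 0.5 * np.mean((pred - y) ** 2)

eta = 1.0
for t in range(201):
    H = np.maximum(X @ W, 0.0)
    err = (H @ a - y) / n                          # derivative of the loss w.r.t. outputs
    # Gradient w.r.t. first-layer weights; (X @ W > 0) is the ReLU derivative.
    grad = X.T @ (err[:, None] * (X @ W > 0) * a[None, :])
    W -= eta * grad
    if t % 50 == 0:
        print(f"iter {t:4d}   training loss {loss(W):.3e}")
```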

Numerical Results and Claims

  • For two-layer networks, the paper reduces the over-parameterization condition for GD to $\tilde\Omega(n^8/\phi^4)$, a considerable improvement over earlier conditions such as $\Omega(n^{14}/\phi^4)$ (a back-of-the-envelope comparison of these widths follows this list).
  • For deep networks, the paper advances the state of the art by establishing convergence with network widths much smaller than previously required (e.g., $\tilde\Omega(k n^8 L^{12}/\phi^4)$), a major theoretical improvement over conditions demanding widths of order $\Omega(k n^{24} L^{12}/\phi^8)$.
  • The iteration complexity for reaching $\epsilon$ training loss is also improved, with GD requiring only $O(n^2 L^2 \log(1/\epsilon)/\phi)$ iterations, and correspondingly efficient results hold for SGD.
  • For SGD, the analogous results improve on prior work by a factor of $\tilde\Omega(n^7 B^5)$ in the over-parameterization condition and by $\tilde O(n^2)$ in iteration complexity.
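
To get a feel for the magnitude of the improvement in the two-layer GD condition, the snippet below compares the two quoted width requirements for illustrative values of the sample size $n$ and the data-separation parameter $\phi$ (the constants and logarithmic factors hidden by the $\tilde\Omega$ notation are ignored).

```python
# Back-of-the-envelope comparison of the quoted width requirements.
# Constants and log factors hidden by Omega-tilde are ignored; n and phi are
# illustrative values, not taken from the paper.
def width_improved(n, phi):    # ~ n^8 / phi^4   (this paper, two-layer GD)
    return n ** 8 / phi ** 4

def width_previous(n, phi):    # ~ n^14 / phi^4  (earlier condition)
    return n ** 14 / phi ** 4

n, phi = 1_000, 0.1
print(f"improved condition : {width_improved(n, phi):.2e}")
print(f"previous condition : {width_previous(n, phi):.2e}")
print(f"ratio              : {width_previous(n, phi) / width_improved(n, phi):.2e}")  # = n^6
```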

Practical and Theoretical Implications

These results relax the width requirements needed to guarantee effective learning and convergence, and they give practitioners a more actionable framework for reasoning about model design. In particular, they suggest that when the data are sufficiently well separated and the weights are randomly initialized as prescribed, the large over-parameterized networks used in practice are not at excessive risk of convergence failure due to narrow theoretical margins.

Future Directions

The paper opens several avenues for future exploration:

  • Investigating further refinements in the over-parameterization conditions needed for convergence could increase the applicability and efficiency of neural networks, especially in resource-constrained environments.
  • Exploring the implications of the improved gradient bounds and trajectory analysis on other neural architectures could extend the results beyond fully connected networks to include convolutional and recurrent structures.
  • A deeper investigation into the generalization capabilities of such over-parameterized networks, particularly in adversarial settings, remains an exciting field of inquiry.

In sum, the paper offers a sharpened theoretical framework that relaxes some of the prevailing over-parameterization requirements for large-scale neural networks, providing actionable insights for the continued development and deployment of deep learning models in complex problem domains.