Convergence Analysis of Two-layer Neural Networks with ReLU Activation (1705.09886v2)

Published 28 May 2017 in cs.LG

Abstract: In recent years, stochastic gradient descent (SGD) based techniques have become the standard tools for training neural networks. However, a formal theoretical understanding of why SGD can train neural networks in practice is largely missing. In this paper, we make progress on this question by providing a convergence analysis for SGD on a rich subset of two-layer feedforward networks with ReLU activations. This subset is characterized by a special structure called "identity mapping". We prove that, if the input follows a Gaussian distribution and the weights use the standard $O(1/\sqrt{d})$ initialization, SGD converges to the global minimum in a polynomial number of steps. Unlike plain vanilla networks, the "identity mapping" makes our network asymmetric, so the global minimum is unique. To complement our theory, we also show experimentally that multi-layer networks with this mapping perform better than plain vanilla networks. Our convergence theorem differs from traditional non-convex optimization analyses. We show that SGD converges to the optimum in two phases: in phase I, the gradient points in the wrong direction, yet a potential function $g$ gradually decreases; in phase II, SGD enters a one-point convex region and converges. We also show that the identity mapping is necessary for convergence, as it moves the initial point to a better place for optimization. Experiments verify our claims.
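
As a rough illustration of the setting described in the abstract, the sketch below trains a two-layer ReLU network whose first-layer weights take the form $I + W$ (the "identity mapping") with SGD on Gaussian inputs, using the standard $O(1/\sqrt{d})$ initialization. The planted-teacher target, the summed-ReLU output, and the squared loss are illustrative assumptions for this sketch, not the paper's exact formal model.

```python
import numpy as np

# Illustrative sketch (not the paper's exact model): a two-layer network
# f_W(x) = sum_i ReLU(((I + W) x)_i), trained with SGD on Gaussian inputs
# to match a planted teacher network with weights I + W_star.
d = 20
rng = np.random.default_rng(0)
I = np.eye(d)

W_star = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))  # planted "teacher" weights
W = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))       # standard O(1/sqrt(d)) init

def forward(W, x):
    """Sum of ReLU hidden units with the identity-mapping structure."""
    return np.maximum((I + W) @ x, 0.0).sum()

lr = 1e-3
for _ in range(20000):
    x = rng.normal(size=d)                                # Gaussian input
    y = forward(W_star, x)                                # teacher label
    h = (I + W) @ x
    err = np.maximum(h, 0.0).sum() - y
    grad = err * np.outer((h > 0).astype(float), x)       # gradient of 0.5 * err**2 w.r.t. W
    W -= lr * grad                                        # one SGD step

test_xs = rng.normal(size=(200, d))
mse = np.mean([(forward(W, x) - forward(W_star, x)) ** 2 for x in test_xs])
print(f"held-out squared error after training: {mse:.4f}")
```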

Authors (2)
  1. Yuanzhi Li (119 papers)
  2. Yang Yuan (52 papers)
Citations (630)

Summary

An Overview of "Deep Residual Learning for Image Recognition"

The paper "Deep Residual Learning for Image Recognition" by He et al. introduces a novel framework aimed at addressing the degradation problem that often occurs when the depth of a neural network is substantially increased. The authors propose Residual Networks (ResNets), a milestone in the development of neural architectures, which have shown superior performance in image recognition tasks by leveraging a simple yet effective residual learning approach.

Core Contributions

The primary contribution of this work is the introduction of residual learning, which addresses the difficulty of training very deep neural networks. ResNets are constructed from residual blocks in which the stacked layers learn residual functions with reference to the block inputs, rather than unreferenced mappings. This reformulation eases optimization and counteracts the degradation in accuracy observed when plain networks are made substantially deeper.
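
Concretely, if $\mathcal{H}(x)$ denotes the desired underlying mapping for a block, the stacked layers are set to approximate the residual function $\mathcal{F}(x) = \mathcal{H}(x) - x$, so that a building block computes

$$ y = \mathcal{F}(x, \{W_i\}) + x, $$

where the addition is carried out by the identity shortcut described below. When the identity is close to optimal, the layers only need to drive $\mathcal{F}$ toward zero rather than fit an identity mapping from scratch.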

The architecture is characterized by identity shortcut connections that realize the residual mapping while adding no extra parameters and only negligible computational overhead. These shortcuts skip one or more layers and provide a direct path for signal and gradient flow, helping preserve learned feature representations across layers.
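
A minimal PyTorch sketch of such a block is shown below. It assumes the "basic" two-convolution variant with equal input and output channels and stride 1, so the identity shortcut needs no projection; the dimension-changing blocks in the paper use a strided projection shortcut instead, which is omitted here.

```python
import torch
from torch import nn

class BasicBlock(nn.Module):
    """Basic residual block: the stacked layers learn F(x) and the
    identity shortcut adds x back, so the block outputs ReLU(F(x) + x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                              # shortcut: no extra parameters
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                      # residual addition
        return self.relu(out)

# Example: apply a 64-channel block to a dummy feature map.
block = BasicBlock(64)
y = block(torch.randn(1, 64, 32, 32))
print(y.shape)  # torch.Size([1, 64, 32, 32])
```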

Experimental Results

The authors conduct extensive experiments to evaluate ResNets. Notably, ResNets surpass the benchmarks set by prior architectures on the ILSVRC classification task: a 152-layer ResNet achieves a significant reduction in top-5 error compared to earlier models such as VGG, and an ensemble of residual networks won the ILSVRC 2015 classification competition with a 3.57% top-5 test error.

Moreover, ResNets generalize well beyond ImageNet classification, with strong results on CIFAR-10 and as backbones for object detection on MS COCO, underscoring their versatility and practical applicability. The paper's results show that deeper networks, previously difficult to optimize, can indeed achieve better performance when residual learning is employed.

Implications and Future Directions

The implications of ResNet are profound, influencing a wide array of applications within computer vision and beyond. The framework has catalyzed advancements in the design of neural architectures, proving instrumental in fields such as NLP, speech recognition, and video analysis.

Future research directions may explore optimizing architectures for efficiency without sacrificing depth, seeking further improvements in training dynamics or incorporating novel regularization techniques. Additionally, the flexibility of residual learning presents a promising avenue for integrating modular, scalable architectures tailored to specific tasks or constraints in diverse domains.

In conclusion, the introduction of deep residual networks signifies a crucial advancement in neural network design, enabling the construction of much deeper architectures without incurring degradation in performance. This work provides a foundation for ongoing exploration into the capabilities of deep learning frameworks and their applicability across varied contexts.