On weight initialization in deep neural networks (1704.08863v2)

Published 28 Apr 2017 in cs.LG

Abstract: A proper initialization of the weights in a neural network is critical to its convergence. Current insights into weight initialization come primarily from linear activation functions. In this paper, I develop a theory for weight initializations with non-linear activations. First, I derive a general weight initialization strategy for any neural network using activation functions differentiable at 0. Next, I derive the weight initialization strategy for the Rectified Linear Unit (RELU), and provide theoretical insights into why the Xavier initialization is a poor choice with RELU activations. My analysis provides a clear demonstration of the role of non-linearities in determining the proper weight initializations.

Citations (211)

Summary

  • The paper introduces a generalized weight initialization method based on a variance formula adaptable to various differentiable activation functions.
  • It demonstrates why Xavier initialization fails for RELU activations and validates the He initialization through strong theoretical reasoning.
  • Empirical evaluations on CIFAR-10 show enhanced convergence speed and accuracy, highlighting the method’s practical benefits for deep learning.

An Examination of Weight Initialization in Deep Neural Networks

This paper, authored by Siddharth Krishna Kumar, addresses weight initialization in deep neural networks, a choice that significantly affects the convergence and performance of learning algorithms. The discussion is anchored in three main objectives: proposing a generalized weight initialization strategy for networks equipped with differentiable non-linear activation functions, refining initialization strategies specifically for Rectified Linear Unit (RELU) activations, and evaluating the inadequacies of existing strategies such as the Xavier initialization under certain conditions.

Weight Initialization for Non-linear Activations

First, the paper bridges the theoretical gap concerning weight initialization for networks using non-linear activation functions, extending the foundational works of Glorot and Bengio (2010) and He et al. (2015). Kumar proposes a generalized formula for activation functions differentiable at zero, deriving that the weights should be initialized with a variance of:

$$v^2 = \frac{1}{N\,(g'(0))^2\,\left(1 + g(0)^2\right)}$$

Here, $N$ denotes the number of nodes feeding into the layer, $g$ is the activation function, and $g(0)$ and its derivative $g'(0)$ are evaluated at zero. This formulation accommodates varying activation functions such as the hyperbolic tangent and sigmoid by adjusting the computed variance, thus providing a more adaptable and theoretically grounded initialization method.
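To make the formula concrete, here is a minimal NumPy sketch that plugs the zero-point values of the sigmoid and tanh activations into the variance expression above and draws a Gaussian weight matrix. The helper name `init_variance` and the layer sizes are illustrative assumptions, not from the paper.

```python
import numpy as np

def init_variance(g0, g0_prime, fan_in):
    """Weight variance from the general formula above:
    v^2 = 1 / (N * g'(0)^2 * (1 + g(0)^2))."""
    return 1.0 / (fan_in * g0_prime**2 * (1.0 + g0**2))

fan_in = 256  # illustrative layer width

# Sigmoid: g(0) = 0.5, g'(0) = 0.25  ->  variance ~ 12.8 / N under this formula
var_sigmoid = init_variance(g0=0.5, g0_prime=0.25, fan_in=fan_in)

# Tanh: g(0) = 0, g'(0) = 1  ->  variance = 1 / N (Xavier-style fan-in scaling)
var_tanh = init_variance(g0=0.0, g0_prime=1.0, fan_in=fan_in)

# Draw a Gaussian weight matrix for a layer with `fan_in` inputs and 128 outputs
rng = np.random.default_rng(0)
W = rng.normal(loc=0.0, scale=np.sqrt(var_sigmoid), size=(fan_in, 128))
```

Note how the tanh case collapses to the familiar $1/N$ scaling, while the sigmoid's non-zero mean and small slope at the origin demand a much larger variance.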

Addressing RELU Limitations

A significant portion of the analysis pertains to the RELU activation function, a ubiquitous choice in modern architectures that is not differentiable at zero. The author delineates why the Xavier initialization falters with RELUs and champions the He initialization as a theoretically validated alternative. By deducing that for optimal convergence the weights should be initialized with a variance of approximately $2/N$, the paper substantiates He et al.'s empirical findings with a robust mathematical explanation.
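The halving effect behind the $2/N$ factor can be checked numerically. The sketch below is a NumPy illustration, not the paper's code: it passes standard-normal inputs through a single RELU layer under both schemes and compares second moments; the layer sizes and helper name are assumptions.

```python
import numpy as np

# Xavier (fan-in form) gives Var(W) = 1/N; He gives Var(W) = 2/N,
# the value the paper recovers analytically for RELU.
def weight_variance(scheme, fan_in):
    return {"xavier": 1.0, "he": 2.0}[scheme] / fan_in

fan_in = 512
rng = np.random.default_rng(0)
x = rng.normal(size=(10_000, fan_in))  # standard-normal layer inputs

for scheme in ("xavier", "he"):
    W = rng.normal(scale=np.sqrt(weight_variance(scheme, fan_in)),
                   size=(fan_in, fan_in))
    h = np.maximum(x @ W, 0.0)  # RELU layer
    # RELU zeroes half the mass of the symmetric pre-activation, halving its
    # second moment; the factor of 2 in He initialization restores it.
    print(scheme, (h**2).mean() / (x**2).mean())  # ~0.5 for Xavier, ~1.0 for He
```

With Xavier scaling the signal's second moment shrinks at every RELU layer, which compounds across depth; the He variance keeps it roughly constant.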

Empirical Evaluations

Kumar reports empirical evaluations with a 10-layer network on the CIFAR-10 dataset. The results demonstrate substantial improvements in convergence speed and accuracy when the derived initialization strategies are used in place of traditional approaches. In particular, networks using the sigmoid activation converged more rapidly with the generalized initialization than with the Xavier baseline.
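As a rough illustration of how such an initialization might be wired into a deep sigmoid network, the following PyTorch sketch builds a 10-layer MLP with the generalized variance for sigmoid. The layer widths, flattened CIFAR-10 input size, zero-bias choice, and helper name are illustrative assumptions rather than the paper's exact architecture or training setup.

```python
import torch.nn as nn

def sigmoid_init_std(fan_in):
    # Standard deviation from the general formula with sigmoid's g(0)=0.5, g'(0)=0.25
    return (1.0 / (fan_in * 0.25**2 * (1.0 + 0.5**2))) ** 0.5

# 10 linear layers: flattened 3x32x32 input, nine hidden widths of 512, 10 logits
widths = [3 * 32 * 32] + [512] * 9 + [10]
layers = []
for fan_in, fan_out in zip(widths[:-1], widths[1:]):
    linear = nn.Linear(fan_in, fan_out)
    nn.init.normal_(linear.weight, mean=0.0, std=sigmoid_init_std(fan_in))
    nn.init.zeros_(linear.bias)
    layers += [linear, nn.Sigmoid()]

model = nn.Sequential(*layers[:-1])  # drop the final Sigmoid so the net emits logits
```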

Theoretical and Practical Implications

The theoretical implications are profound; the generalized approach not only solidifies the understanding of variance propagation in non-linear activations but also lays the groundwork for analyses of emerging non-differentiable activations. Practically, these refined initializations hold promise for enhancing training stability and efficiency across diverse deep learning applications.

Future Directions and Speculations

Looking ahead, the generalizability of these strategies invites further exploration into other non-linear activation functions that are gaining traction in research. There is an evident opportunity to optimize initializations tailored to specific architectures, potentially uncovering new paradigms in neural network design and training efficiency.

Through the amalgamation of theoretical rigor and empirical verification, this paper contributes a notable advancement in the comprehension and application of weight initialization strategies within the deep learning community. As neural architectures evolve, these insights will undoubtedly continue to shape the optimization processes foundational to effective deep learning systems.