Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice (1711.04735v1)

Published 13 Nov 2017 in cs.LG and stat.ML

Abstract: It is well known that the initialization of weights in deep neural networks can have a dramatic impact on learning speed. For example, ensuring the mean squared singular value of a network's input-output Jacobian is $O(1)$ is essential for avoiding the exponential vanishing or explosion of gradients. The stronger condition that all singular values of the Jacobian concentrate near $1$ is a property known as dynamical isometry. For deep linear networks, dynamical isometry can be achieved through orthogonal weight initialization and has been shown to dramatically speed up learning; however, it has remained unclear how to extend these results to the nonlinear setting. We address this question by employing powerful tools from free probability theory to compute analytically the entire singular value distribution of a deep network's input-output Jacobian. We explore the dependence of the singular value distribution on the depth of the network, the weight initialization, and the choice of nonlinearity. Intriguingly, we find that ReLU networks are incapable of dynamical isometry. On the other hand, sigmoidal networks can achieve isometry, but only with orthogonal weight initialization. Moreover, we demonstrate empirically that deep nonlinear networks achieving dynamical isometry learn orders of magnitude faster than networks that do not. Indeed, we show that properly-initialized deep sigmoidal networks consistently outperform deep ReLU networks. Overall, our analysis reveals that controlling the entire distribution of Jacobian singular values is an important design consideration in deep learning.

Citations (241)

Summary

  • The paper demonstrates that sigmoidal networks with orthogonal weight initialization achieve dynamical isometry, enabling rapid convergence during training.
  • The paper leverages free probability theory to derive the singular value distribution of the Jacobian, contrasting the performance of ReLU and sigmoid activations.
  • Experimental results reveal that orthogonal initialization in sigmoidal networks results in sublinear growth in training time with network depth, improving efficiency.

Analysis of "Resurrecting the Sigmoid in Deep Learning through Dynamical Isometry: Theory and Practice"

The paper "Resurrecting the Sigmoid in Deep Learning through Dynamical Isometry: Theory and Practice" offers a nuanced exploration of the implications of weight initialization strategies on learning speeds in deep neural networks (DNNs). It explores the concept of dynamical isometry, a condition where all singular values of a network's input-output Jacobian are concentrated near one, ensuring steady gradient norms across layers. This investigation encompasses both theoretical derivations and empirical validations, yielding insights pertinent to network design, initialization, and training protocols.

Theoretical Contributions

The theoretical backbone of this paper leverages free probability theory to analytically derive the singular value distribution of a DNN’s Jacobian. Specifically, it examines how this distribution is influenced by several factors: network depth, weight initialization, and nonlinearity selection. The analysis reveals critical insights:

  1. Singular Value Dynamics: The paper demonstrates a stark contrast in singular value distributions between ReLU and sigmoidal networks. ReLU networks, despite their popularity, are shown to inherently lack dynamical isometry due to their singular value distributions deviating significantly from unity. Conversely, sigmoidal networks can achieve dynamical isometry when initialized with orthogonal weights, offering a potentially superior alternative.
  2. Gaussian vs. Orthogonal Initialization: For deep linear networks, orthogonal weight initialization is established to preserve dynamical isometry, leading to accelerated learning. Extending this to nonlinear networks presents challenges. The paper identifies that while Gaussian initializations generally fail to maintain isometry, orthogonal initializations in sigmoidal networks can be engineered to do so effectively.
  3. Criticality and Learning Dynamics: The paper discusses the role of criticality, the state in which the mean squared singular value of a single layer's Jacobian equals one ($\chi = 1$), in avoiding exploding or vanishing gradients. It further distinguishes between controlling this mean and controlling the entire singular value distribution, highlighting the latter as crucial for rapid convergence (see the sketch after this list).
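
The following sketch illustrates that criticality condition numerically, in the spirit of the mean-field analysis the paper builds on: $\chi$ is estimated as $\sigma_w^2\,\mathbb{E}[\phi'(h)^2]$ at the fixed point of the pre-activation variance. The Monte Carlo estimator, the specific $(\sigma_w, \sigma_b)$ values, and tanh as a stand-in for a sigmoidal nonlinearity are assumptions for illustration, not settings quoted from the paper.

```python
# Hedged sketch: estimate chi = sigma_w^2 * E[phi'(h)^2] with h ~ N(0, q*),
# where q* is the fixed point of the layer-to-layer pre-activation variance
# map q <- sigma_w^2 * E[phi(sqrt(q) z)^2] + sigma_b^2. chi = 1 is the
# criticality condition mentioned above.
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(200_000)       # samples for the Gaussian expectations

def fixed_point_q(sigma_w, sigma_b, phi, iters=200, q0=1.0):
    q = q0
    for _ in range(iters):
        q = sigma_w**2 * np.mean(phi(np.sqrt(q) * z) ** 2) + sigma_b**2
    return q

def chi(sigma_w, sigma_b, phi, dphi):
    q = fixed_point_q(sigma_w, sigma_b, phi)
    return sigma_w**2 * np.mean(dphi(np.sqrt(q) * z) ** 2)

tanh, dtanh = np.tanh, lambda x: 1.0 - np.tanh(x) ** 2
relu, drelu = (lambda x: np.maximum(x, 0.0)), (lambda x: (x > 0).astype(float))

# Both settings sit near chi = 1, yet this pins only the *mean* squared
# singular value: ReLU at criticality still fails dynamical isometry because
# the full spectrum remains spread out.
print("tanh, sigma_w=1:       chi =", chi(1.0, 0.0, tanh, dtanh))
print("relu, sigma_w=sqrt(2): chi =", chi(np.sqrt(2.0), 0.0, relu, drelu))
```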

Empirical Validations

Empirical experiments substantiate the theoretical findings by demonstrating that DNNs initialized to achieve dynamical isometry learn significantly faster than those that do not:

  • Performance Evaluation: Across various optimizers, sigmoidal networks with orthogonal initialization consistently outperformed ReLU networks, and networks achieving dynamical isometry learned orders of magnitude faster than non-isometric ones, confirming the strong influence of the singular value distribution on learning dynamics.
  • Training Efficiency: Experimental results show sublinear growth in training time with respect to network depth for orthogonal sigmoidal networks, as opposed to the linear growth observed in their non-isometric counterparts. This has practical implications for training very deep networks efficiently, with minimal additional computational overhead.

Implications and Future Directions

This research underscores the importance of the entire spectrum of the Jacobian's singular values in shaping effective learning trajectories. Practically, it suggests revisiting sigmoidal activation functions, combined with orthogonal weight initialization, as a promising recipe for constructing rapid-learning deep architectures, as sketched below.
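
A minimal PyTorch sketch of that recipe follows, under stated assumptions: the framework choice, depth, width, and gain are illustrative and are not the paper's experimental settings.

```python
# Hedged sketch of the "orthogonal sigmoidal" recipe discussed above, next to a
# standard Gaussian (He)-initialized ReLU baseline. Depth, width, and gain are
# illustrative assumptions.
import torch.nn as nn

def make_mlp(depth, width, act):
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), act()]
    return nn.Sequential(*layers)

def init_orthogonal(model, gain=1.0):
    # Orthogonal weights and zero biases for every linear layer.
    for m in model.modules():
        if isinstance(m, nn.Linear):
            nn.init.orthogonal_(m.weight, gain=gain)
            nn.init.zeros_(m.bias)

def init_he_gaussian(model):
    # Gaussian (He) initialization for the ReLU baseline.
    for m in model.modules():
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
            nn.init.zeros_(m.bias)

sigmoidal_net = make_mlp(depth=64, width=512, act=nn.Tanh)
init_orthogonal(sigmoidal_net)          # the recipe the paper motivates

relu_net = make_mlp(depth=64, width=512, act=nn.ReLU)
init_he_gaussian(relu_net)              # baseline for a training-speed comparison
```

Training both stacks on the same task with the same optimizer is then a direct way to probe the speed gap the experiments report.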

Theoretically, the paper sets the ground for exploring dynamical isometry in more complex network architectures, including convolutional and recurrent neural networks. Future research could focus on developing initialization schemes or network modifications that cater to the preservation of isometry across diverse neural architectures.

In essence, this paper propels a nuanced discourse on network initialization, challenging conventional practice and providing a robust framework for designing DNNs with enhanced convergence properties.
