- The paper demonstrates that the severity of the exploding and vanishing gradient problem is governed by β, the sum of the reciprocals of hidden layer widths.
- It introduces exact formulas for the joint even moments of the input-output Jacobian, enabling finite-depth and finite-width analysis beyond mean field theory.
- The study advises that evenly distributing neurons across layers minimizes β, offering practical strategies to improve gradient stability during training.
An Analysis of Neural Network Architectures and the Exploding and Vanishing Gradient Problem
Boris Hanin's paper, "Which Neural Net Architectures Give Rise to Exploding and Vanishing Gradients?", investigates the well-known exploding and vanishing gradient problem (EVGP) in fully connected ReLU neural networks. The question matters because the EVGP poses significant obstacles to the efficient training of deep neural networks with gradient-based optimization methods, particularly stochastic gradient descent (SGD).
The primary objective of Hanin's analysis is to identify which architectural parameters of a randomly initialized network affect the scale of the gradients and hence the severity of the EVGP. The paper achieves this by quantifying the empirical variance of the squared gradients (the squared entries of the input-output Jacobian) and linking it to a simple architecture-dependent constant denoted β: the sum of the reciprocals of the hidden layer widths. When β is large, the gradients exhibit large fluctuations, signaling a severe case of the EVGP.
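Concretely, writing $n_1, \dots, n_d$ for the widths of the $d$ hidden layers, this constant is
$$
\beta \;=\; \sum_{j=1}^{d} \frac{1}{n_j},
$$
so β is small when every hidden layer is wide, and it is dominated by the narrowest layers otherwise.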
Key Contributions
- Exact Formulas for Joint Moments: The paper derives exact formulas for the joint even moments of the entries of the input-output Jacobian in ReLU networks with random weights and biases (Theorem \ref{T:path-moments}). These formulas hold at finite depth and width, going beyond earlier results obtained in the infinite-width mean field limit.
- Empirical Variance and Training Dynamics: The analysis shows that the empirical variance of the squared gradient entries grows exponentially in β, the sum of the reciprocals of the hidden layer widths. Consequently, architectures with large β are predicted to train slowly at first, possibly requiring many more epochs to reach comparable performance (as demonstrated in the accompanying Figure \ref{fig:fixed_width}); a numerical sketch of this effect follows the list below.
- Independence from Initialization Distribution: A particularly noteworthy result is that the occurrence of the EVGP is determined by the network architecture rather than by the specific distributions from which the weights and biases are drawn, provided those distributions satisfy the variance scaling at initialization specified in Definition \ref{D:rand-relu}.
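As a rough numerical illustration of the variance claim above (a sketch, not code from the paper), the snippet below samples randomly initialized fully connected ReLU networks, forms the input-output Jacobian via the chain rule, and reports how the fluctuations of a squared Jacobian entry grow as the hidden layers narrow and β increases. The He-style scaling of $2/\text{fan-in}$, zero biases, and the particular widths are assumptions made for the demo.

```python
# Monte Carlo sketch (illustrative, not from the paper): estimate how the
# fluctuations of a squared input-output Jacobian entry depend on
# beta = sum(1 / n_j) over the hidden layers.
import numpy as np

def jacobian_entry_sq(widths, rng):
    """Squared (0, 0) entry of the input-output Jacobian for one random draw."""
    h = rng.standard_normal(widths[0])          # random input
    J = np.eye(widths[0])                       # running Jacobian d h / d x
    for i, n_out in enumerate(widths[1:], start=1):
        W = rng.standard_normal((n_out, len(h))) * np.sqrt(2.0 / len(h))
        pre = W @ h
        if i < len(widths) - 1:                 # hidden layer: apply ReLU
            D = (pre > 0).astype(float)         # ReLU derivative on the diagonal
            J = (W * D[:, None]) @ J            # chain rule: diag(D) @ W @ J
            h = np.maximum(pre, 0.0)
        else:                                   # final layer: linear readout
            J = W @ J
            h = pre
    return float(J[0, 0] ** 2)

def beta(widths):
    """Sum of reciprocals of the hidden layer widths."""
    return sum(1.0 / n for n in widths[1:-1])

rng = np.random.default_rng(0)
for widths in [(10, 100, 100, 100, 1), (10, 20, 20, 20, 1), (10, 5, 5, 5, 1)]:
    samples = np.array([jacobian_entry_sq(widths, rng) for _ in range(2000)])
    print(f"beta = {beta(widths):.2f}   mean(J00^2) = {samples.mean():.3f}   "
          f"var(J00^2) = {samples.var():.3f}")
```

Under these assumptions the empirical mean of the squared entry stays roughly comparable across architectures while its empirical variance inflates sharply as β grows, which is the qualitative behavior the theorem describes.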
Practical Implications and Theoretical Significance
An immediate practical implication concerns architecture selection. Hanin's results encourage practitioners to choose architectures that minimize β, thereby reducing the size of gradient fluctuations and mitigating the EVGP. In particular, for a fixed depth and a fixed total number of parameters, hidden neurons should be distributed as evenly as possible across layers: by the AM-HM inequality, a sum of reciprocals with a fixed budget is minimized when all the widths are equal, as the short comparison after this paragraph illustrates.
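For instance, splitting a hypothetical budget of 300 hidden neurons over three hidden layers (counting neurons rather than parameters, for simplicity; the allocations are illustrative, not from the paper) gives:

```python
# Compare beta = sum(1 / n_j) for different ways of splitting a fixed budget
# of 300 hidden neurons across three hidden layers (illustrative values only).
def beta(hidden_widths):
    return sum(1.0 / n for n in hidden_widths)

for hidden_widths in [(100, 100, 100), (50, 100, 150), (10, 145, 145)]:
    print(hidden_widths, f"beta = {beta(hidden_widths):.4f}")

# (100, 100, 100) -> beta = 0.0300  (even split, smallest beta)
# (50, 100, 150)  -> beta = 0.0367
# (10, 145, 145)  -> beta = 0.1138
```

The uneven splits are penalized by their narrowest layer, which dominates the sum of reciprocals.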
The theoretical significance lies in refining our understanding of finite-width effects in deep networks initialized at the edge of chaos. By rigorously quantifying finite-width corrections that mean field analyses neglect, the paper bridges a gap in the EVGP literature between the infinite-width and finite-width settings.
Reflection and Future Directions
While the paper offers substantial insight into ReLU networks, a limitation is its restriction to fully connected architectures, leaving convolutional, residual, and recurrent networks for future exploration. Furthermore, the paper assumes the weights are independent and zero-centered, which limits its direct applicability to orthogonal initializations or other correlated or biased weight distributions.
Natural future work includes generalizing these findings to other network types and investigating how orthogonal and other structured initializations might be incorporated into this framework. In conclusion, Hanin's work substantially advances our quantitative and theoretical understanding of how and why deep network architectures either succumb to or avoid the gradient pathologies that can hinder training.