- The paper demonstrates that the severity of the exploding and vanishing gradient problem is governed by β, the sum of the reciprocals of hidden layer widths.
- It introduces exact formulas for the joint even moments of the input-output Jacobian, enabling finite-depth and finite-width analysis beyond mean field theory.
- The study advises that evenly distributing neurons across layers minimizes β, offering practical strategies to improve gradient stability during training.
An Analysis of Neural Network Architectures and the Exploding and Vanishing Gradient Problem
Boris Hanin's paper, "Which Neural Net Architectures Give Rise to Exploding and Vanishing Gradients?", investigates the well-known exploding and vanishing gradient problem (EVGP) in fully connected ReLU neural networks. The question matters because the EVGP poses significant obstacles to the efficient training of deep neural networks with gradient-based optimization methods, particularly stochastic gradient descent (SGD).
The primary objective of Hanin's analysis is to identify which architectural parameters of a randomly initialized network affect the scale of the gradients and hence the severity of the EVGP. The paper achieves this by quantifying the empirical variance of the squared gradients (the squared entries of the input-output Jacobian) and linking it to a simple architecture-dependent constant denoted β: the sum of the reciprocals of the hidden layer widths. When β is large, the gradients exhibit large fluctuations, signaling a severe case of the EVGP.
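Concretely, writing $n_1, \dots, n_d$ for the widths of the $d$ hidden layers, this constant is
$$
\beta \;=\; \sum_{j=1}^{d} \frac{1}{n_j},
$$
so β is small when every hidden layer is wide, and it is dominated by the narrowest layers otherwise.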
Key Contributions
- Exact Formulas for Joint Moments: The paper derives exact formulas for the joint even moments of the entries of the input-output Jacobian in ReLU networks with random weights and biases (Theorem \ref{T:path-moments}). These formulas hold at finite depth and width, going beyond earlier results obtained in the infinite-width mean field limit.
- Empirical Variance and Training Dynamics: The analysis shows that the empirical variance of the squared gradient entries grows exponentially in β, the sum of the reciprocals of the hidden layer widths. Consequently, architectures with large β are predicted to train slowly at first, possibly requiring many more epochs to reach comparable performance (as demonstrated in the accompanying Figure \ref{fig:fixed_width}); a numerical sketch of this effect follows the list below.
- Independence from Initialization Distribution: A particularly noteworthy result is that the occurrence of the EVGP is determined by the network architecture rather than by the specific distributions from which the weights and biases are drawn, provided those distributions satisfy the variance scaling at initialization specified in Definition \ref{D:rand-relu}.
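As a rough numerical illustration of the variance claim above (a sketch, not code from the paper), the snippet below samples randomly initialized fully connected ReLU networks, forms the input-output Jacobian via the chain rule, and reports how the fluctuations of a squared Jacobian entry grow as the hidden layers narrow and β increases. The He-style scaling of $2/\text{fan-in}$, zero biases, and the particular widths are assumptions made for the demo.

```python
# Monte Carlo sketch (illustrative, not from the paper): estimate how the
# fluctuations of a squared input-output Jacobian entry depend on
# beta = sum(1 / n_j) over the hidden layers.
import numpy as np

def jacobian_entry_sq(widths, rng):
    """Squared (0, 0) entry of the input-output Jacobian for one random draw."""
    h = rng.standard_normal(widths[0])          # random input
    J = np.eye(widths[0])                       # running Jacobian d h / d x
    for i, n_out in enumerate(widths[1:], start=1):
        W = rng.standard_normal((n_out, len(h))) * np.sqrt(2.0 / len(h))
        pre = W @ h
        if i < len(widths) - 1:                 # hidden layer: apply ReLU
            D = (pre > 0).astype(float)         # ReLU derivative on the diagonal
            J = (W * D[:, None]) @ J            # chain rule: diag(D) @ W @ J
            h = np.maximum(pre, 0.0)
        else:                                   # final layer: linear readout
            J = W @ J
            h = pre
    return float(J[0, 0] ** 2)

def beta(widths):
    """Sum of reciprocals of the hidden layer widths."""
    return sum(1.0 / n for n in widths[1:-1])

rng = np.random.default_rng(0)
for widths in [(10, 100, 100, 100, 1), (10, 20, 20, 20, 1), (10, 5, 5, 5, 1)]:
    samples = np.array([jacobian_entry_sq(widths, rng) for _ in range(2000)])
    print(f"beta = {beta(widths):.2f}   mean(J00^2) = {samples.mean():.3f}   "
          f"var(J00^2) = {samples.var():.3f}")
```

Under these assumptions the empirical mean of the squared entry stays roughly comparable across architectures while its empirical variance inflates sharply as β grows, which is the qualitative behavior the theorem describes.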
Practical Implications and Theoretical Significance
An immediate practical implication concerns architecture selection. Hanin's results encourage practitioners to choose architectures that minimize β, thereby reducing the size of gradient fluctuations and mitigating the EVGP. In particular, for a fixed depth and a fixed total number of parameters, hidden neurons should be distributed as evenly as possible across layers: by the AM-HM inequality, a sum of reciprocals with a fixed budget is minimized when all the widths are equal, as the short comparison after this paragraph illustrates.
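For instance, splitting a hypothetical budget of 300 hidden neurons over three hidden layers (counting neurons rather than parameters, for simplicity; the allocations are illustrative, not from the paper) gives:

```python
# Compare beta = sum(1 / n_j) for different ways of splitting a fixed budget
# of 300 hidden neurons across three hidden layers (illustrative values only).
def beta(hidden_widths):
    return sum(1.0 / n for n in hidden_widths)

for hidden_widths in [(100, 100, 100), (50, 100, 150), (10, 145, 145)]:
    print(hidden_widths, f"beta = {beta(hidden_widths):.4f}")

# (100, 100, 100) -> beta = 0.0300  (even split, smallest beta)
# (50, 100, 150)  -> beta = 0.0367
# (10, 145, 145)  -> beta = 0.1138
```

The uneven splits are penalized by their narrowest layer, which dominates the sum of reciprocals.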
The theoretical significance lies in refining our understanding of finite-width effects in deep networks initialized at the edge of chaos. By rigorously quantifying finite-width corrections that mean field analyses neglect, the paper bridges a gap in the EVGP literature between the infinite-width and finite-width settings.
Reflection and Future Directions
While the paper offers substantial insight into ReLU networks, a limitation is its restriction to fully connected architectures, leaving convolutional, residual, and recurrent networks for future exploration. Furthermore, the paper assumes the weights are independent and zero-centered, which limits its direct applicability to orthogonal initializations or other correlated or biased weight distributions.
Natural future work includes generalizing these findings to other network types and investigating how orthogonal and other structured initializations might be incorporated into this framework. In conclusion, Hanin's work substantially advances our quantitative and theoretical understanding of how and why deep network architectures either succumb to or avoid the gradient pathologies that can hinder training.