A Mean Field Theory of Batch Normalization (1902.08129v2)

Published 21 Feb 2019 in cs.NE, cond-mat.dis-nn, cs.LG, and math.DS

Abstract: We develop a mean field theory for batch normalization in fully-connected feedforward neural networks. In so doing, we provide a precise characterization of signal propagation and gradient backpropagation in wide batch-normalized networks at initialization. Our theory shows that gradient signals grow exponentially in depth and that these exploding gradients cannot be eliminated by tuning the initial weight variances or by adjusting the nonlinear activation function. Indeed, batch normalization itself is the cause of gradient explosion. As a result, vanilla batch-normalized networks without skip connections are not trainable at large depths for common initialization schemes, a prediction that we verify with a variety of empirical simulations. While gradient explosion cannot be eliminated, it can be reduced by tuning the network close to the linear regime, which improves the trainability of deep batch-normalized networks without residual connections. Finally, we investigate the learning dynamics of batch-normalized networks and observe that after a single step of optimization the networks achieve a relatively stable equilibrium in which gradients have dramatically smaller dynamic range. Our theory leverages Laplace, Fourier, and Gegenbauer transforms and we derive new identities that may be of independent interest.

Authors (5)
  1. Greg Yang (35 papers)
  2. Jeffrey Pennington (45 papers)
  3. Vinay Rao (7 papers)
  4. Jascha Sohl-Dickstein (88 papers)
  5. Samuel S. Schoenholz (45 papers)
Citations (175)

Summary

  • The paper demonstrates that batch normalization induces exponential gradient explosion with increased depth, making deep networks without skip connections untrainable.
  • It employs Laplace, Fourier, and Gegenbauer transforms to rigorously characterize the forward and backward dynamics of batch-normalized networks.
  • The study shows that tuning the network toward its linear regime reduces (but does not eliminate) gradient explosion, and that a single optimization step drives the network to a relatively stable equilibrium in which gradients have a much smaller dynamic range.

An Analysis of "A Mean Field Theory of Batch Normalization"

The paper, "A Mean Field Theory of Batch Normalization," develops a mean field theory of batch normalization (batchnorm) in fully-connected deep neural networks. Authored by Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, and Samuel S. Schoenholz, the work gives a precise characterization of signal propagation and gradient backpropagation in wide batch-normalized networks at initialization.

Key Contributions and Insights

  1. Batchnorm-Induced Gradient Explosion: The authors establish that batchnorm causes gradient signals to grow exponentially with network depth, a phenomenon they term "gradient explosion." This occurs irrespective of the initial weight variances or the choice of nonlinear activation function. As a consequence, vanilla fully-connected batchnorm networks without skip connections are untrainable at large depths, a prediction supported by empirical simulations (see the sketch following this list).
  2. Characterization of Forward and Backward Dynamics: The paper describes the forward and backward dynamics of batch-normalized networks using Laplace, Fourier, and Gegenbauer transforms. In the course of the analysis, the authors derive new transform identities that may be of independent interest beyond neural network theory.
  3. Spectrum and Limitations of Activation Functions: Analyzing a range of activation functions, the paper shows that while gradient explosion cannot be eliminated entirely, it can be reduced by tuning the network close to its linear regime. This adjustment improves the trainability of deep batch-normalized networks without residual connections.
  4. Stable Dynamical Equilibrium Post-Optimization: The paper identifies a stable equilibrium reached after a single optimization step, where gradients achieve a considerably smaller dynamic range. This suggests that while initial gradient explosion poses a challenge, deep networks may stabilize through iterative training steps.
  5. Theoretical Tools for Analyzing Batchnorm: The authors extend the existing mean field framework to include batchnorm, providing tools to predict trainable hyperparameter regions before any training is run. These tools build on the transform-based analysis above and allow batchnorm's effects to be studied rigorously.
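
To make the first contribution concrete, the following is a minimal sketch (not the authors' code) that measures how the gradient at the input of a vanilla fully-connected batch-normalized tanh network grows with depth at initialization. The width, batch size, depth schedule, and choice of tanh are illustrative assumptions; the paper's prediction is that the printed norms grow roughly exponentially with depth.

```python
# Empirical probe of batchnorm-induced gradient explosion at initialization.
# Assumptions (not from the paper's code): width 512, batch 128, tanh activation.
import jax
import jax.numpy as jnp

WIDTH, BATCH = 512, 128

def batchnorm(h, eps=1e-5):
    # Normalize each feature over the batch; no learned scale/shift at init.
    mu = h.mean(axis=0, keepdims=True)
    var = h.var(axis=0, keepdims=True)
    return (h - mu) / jnp.sqrt(var + eps)

def init_weights(depth, key):
    # Standard 1/sqrt(width) scaling; the paper argues that no choice of this
    # variance removes the explosion.
    keys = jax.random.split(key, depth)
    return [jax.random.normal(k, (WIDTH, WIDTH)) / jnp.sqrt(WIDTH) for k in keys]

def scalar_readout(weights, x):
    h = x
    for W in weights:
        h = jnp.tanh(batchnorm(h @ W))
    return jnp.sum(h)  # scalar output so we can differentiate

x = jax.random.normal(jax.random.PRNGKey(0), (BATCH, WIDTH))
for depth in (2, 8, 32, 128):
    weights = init_weights(depth, jax.random.PRNGKey(depth))
    g = jax.grad(scalar_readout, argnums=1)(weights, x)
    # Norms should grow rapidly with depth (they may overflow to inf at the
    # largest depths, which is itself the point).
    print(f"depth {depth:4d}: gradient norm at input = {float(jnp.linalg.norm(g)):.3e}")
```

In the spirit of the paper's mitigation, replacing jnp.tanh(batchnorm(h @ W)) with jnp.tanh(0.1 * batchnorm(h @ W)) pushes each layer toward its linear regime and should slow, though not eliminate, the growth of these norms.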

Implications and Future Directions

The theoretical findings have direct implications for the design and training of deep neural networks. Understanding that batchnorm itself drives gradient explosion points to concrete remedies in architecture and initialization: practitioners can tune networks toward the linear regime or introduce structural components such as skip connections to counteract batchnorm's limitations at large depths.

Additionally, the mathematical rigor introduced through novel transform applications offers promising avenues for further theoretical advancements. Researchers are provided with the foundational tools to dissect other architectural innovations and normalization techniques, potentially leading to a unified framework applicable across diverse network topologies.

The exploration of batchnorm dynamics contributes to a deeper understanding of neural network architectures, informing both theoretical investigations and practical implementations. The insights gleaned from this paper encourage continued inquiries into optimization dynamics and architectural experimentation within deep learning frameworks, fostering innovations that align theoretical predictions with empirical outcomes.
