
Mean Field Residual Networks: On the Edge of Chaos (1712.08969v1)

Published 24 Dec 2017 in cs.NE, cond-mat.dis-nn, cs.LG, math.DS, and nlin.CD

Abstract: We study randomly initialized residual networks using mean field theory and the theory of difference equations. Classical feedforward neural networks, such as those with tanh activations, exhibit exponential behavior on the average when propagating inputs forward or gradients backward. The exponential forward dynamics causes rapid collapsing of the input space geometry, while the exponential backward dynamics causes drastic vanishing or exploding gradients. We show, in contrast, that by adding skip connections, the network will, depending on the nonlinearity, adopt subexponential forward and backward dynamics, and in many cases in fact polynomial. The exponents of these polynomials are obtained through analytic methods and proved and verified empirically to be correct. In terms of the "edge of chaos" hypothesis, these subexponential and polynomial laws allow residual networks to "hover over the boundary between stability and chaos," thus preserving the geometry of the input space and the gradient information flow. In our experiments, for each activation function we study here, we initialize residual networks with different hyperparameters and train them on MNIST. Remarkably, our initialization time theory can accurately predict test time performance of these networks, by tracking either the expected amount of gradient explosion or the expected squared distance between the images of two input vectors. Importantly, we show, theoretically as well as empirically, that common initializations such as the Xavier or the He schemes are not optimal for residual networks, because the optimal initialization variances depend on the depth. Finally, we have made mathematical contributions by deriving several new identities for the kernels of powers of ReLU functions by relating them to the zeroth Bessel function of the second kind.

Citations (187)

Summary

  • The paper demonstrates that skip connections induce subexponential, polynomial dynamics which mitigate issues of vanishing or exploding gradients.
  • It shows that residual networks operating on the edge of chaos balance stability and trainability by preserving input space geometry.
  • Experimental analyses reveal that conventional initialization schemes are suboptimal for deep residual architectures, suggesting the need for tailored strategies.

Mean Field Residual Networks: On the Edge of Chaos

In the paper "Mean Field Residual Networks: On the Edge of Chaos," the authors analyze the dynamics of randomly initialized residual networks using mean field theory and the theory of difference equations. They show that these architectures display properties at initialization that sharply distinguish them from classical feedforward neural networks.

Dynamics of Residual Networks

The research primarily contrasts residual networks with traditional feedforward networks that use tanh activations. At random initialization, feedforward networks exhibit exponential dynamics on average: forward propagation rapidly collapses the geometry of the input space, and backward propagation makes gradients vanish or explode. The key insight of the paper is that skip connections change these dynamics to subexponential, and in many cases polynomial, behavior in both the forward and backward passes. The exponents of these polynomial laws are derived analytically and validated empirically.
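To make the forward-dynamics contrast concrete, here is a minimal numerical sketch (not the authors' code; the width, depth, and weight variance are assumed, illustrative values) that propagates two nearby inputs through randomly initialized tanh networks and tracks the squared distance between their images, with and without skip connections.

```python
# Minimal sketch (assumed settings, not the paper's experiments): track how the
# squared distance between the images of two nearby inputs evolves with depth,
# contrasting a plain tanh feedforward stack with a tanh residual stack.
import numpy as np

rng = np.random.default_rng(0)
N, depth, sigma_w = 500, 100, 1.5  # width, depth, weight std scale (illustrative)

def distance_trace(residual: bool):
    x = rng.normal(size=N)
    y = x + 0.1 * rng.normal(size=N)           # a nearby second input
    dists = []
    for _ in range(depth):
        W = rng.normal(0.0, sigma_w / np.sqrt(N), size=(N, N))
        if residual:
            x = x + np.tanh(W @ x)             # skip connection: identity + branch
            y = y + np.tanh(W @ y)
        else:
            x = np.tanh(W @ x)
            y = np.tanh(W @ y)
        dists.append(np.mean((x - y) ** 2))
    return np.array(dists)

ff, res = distance_trace(False), distance_trace(True)
# Feedforward distances change geometrically with depth (the input geometry
# collapses or the map becomes chaotic), while residual distances evolve
# much more slowly, i.e. subexponentially.
print("feedforward:", ff[::20])
print("residual:   ", res[::20])
```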

Edge of Chaos Hypothesis

According to the paper, residual networks are able to "hover over the boundary between stability and chaos." Operating at this edge of chaos means the network preserves the geometry of the input space while letting gradient information flow without drastic vanishing or explosion, a balance that is crucial for both expressivity and trainability in deep networks.
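The backward side of this balance can be illustrated with a similar hedged sketch (again with assumed settings, not taken from the paper): forward-propagate an input, then backpropagate a random error signal and record the gradient norm at each layer for the two architectures.

```python
# Sketch (illustrative settings, not the paper's experiments): backpropagate a
# random error through a tanh feedforward stack vs. a tanh residual stack and
# record the mean squared gradient per layer.
import numpy as np

rng = np.random.default_rng(1)
N, depth, sigma_w = 500, 100, 1.5

def gradient_trace(residual: bool):
    Ws, pre = [], []
    h = rng.normal(size=N)
    for _ in range(depth):                          # forward pass, caching pre-activations
        W = rng.normal(0.0, sigma_w / np.sqrt(N), size=(N, N))
        z = W @ h
        h = h + np.tanh(z) if residual else np.tanh(z)
        Ws.append(W)
        pre.append(z)
    g = rng.normal(size=N)                          # random error at the output
    norms = []
    for W, z in zip(reversed(Ws), reversed(pre)):   # backward pass
        back = W.T @ ((1.0 - np.tanh(z) ** 2) * g)  # gradient through the tanh branch
        g = g + back if residual else back          # skip connection adds identity path
        norms.append(np.mean(g ** 2))
    return np.array(norms)

ff, res = gradient_trace(False), gradient_trace(True)
# Feedforward gradient norms grow or shrink geometrically with depth; the
# residual ones change subexponentially, so gradient information is preserved.
print("feedforward:", ff[::20])
print("residual:   ", res[::20])
```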

Numerical and Theoretical Insights

Through experiments on MNIST, the paper shows that its initialization-time theory predicts test-time performance: tracking either the expected gradient explosion or the expected squared distance between the images of two inputs correlates with how well a given initialization trains. It also shows that common initialization schemes such as Xavier and He are suboptimal for residual networks, because the optimal initialization variances depend on depth. On the mathematical side, the authors derive new identities for the kernels of powers of ReLU functions by relating them to Bessel functions, providing further insight into signal propagation in these networks.
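The paper's actual prescriptions are activation-specific; the snippet below is only a hedged illustration of the general point that a depth-aware variance differs from a fixed He or Xavier variance. The 1/L scaling used here is an assumed stand-in, not the paper's formula.

```python
# Illustrative only: compare fixed He/Xavier standard deviations with a
# hypothetical depth-aware choice that shrinks the residual-branch variance
# as the number of residual blocks L grows (an assumed 1/L rule, not the
# paper's derived scheme).
import numpy as np

def he_std(fan_in):
    return np.sqrt(2.0 / fan_in)                 # standard He initialization

def xavier_std(fan_in, fan_out):
    return np.sqrt(2.0 / (fan_in + fan_out))     # standard Xavier/Glorot initialization

def depth_aware_std(fan_in, num_blocks, c=1.0):
    # Hypothetical depth-aware choice: smaller branch variance for deeper stacks,
    # so the accumulated residual signal stays controlled across depth.
    return np.sqrt(c / (fan_in * num_blocks))

fan_in = fan_out = 512
for L in (10, 100, 1000):
    print(L, he_std(fan_in), xavier_std(fan_in, fan_out), depth_aware_std(fan_in, L))
```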

Implications and Future Directions

From a practical standpoint, the findings point toward initialization strategies tailored to residual networks, which could improve the trainability of very deep models by keeping signal and gradient propagation well behaved across depth. The work also prompts further exploration of more intricate architectures that incorporate components such as batch normalization and convolutional layers.

This paper lays foundational work for a statistical theory of residual networks, indicating that their architecture naturally hovers near the boundary between stability and chaos, which is advantageous for training deep networks. Continued research in this domain could probe these theories in more complex, practical models, potentially yielding gains in the efficiency and effectiveness of neural networks in real-world applications.