- The paper establishes the existence of characteristic depth scales, ξ_q and ξ_c, that constrain signal propagation in random neural networks.
- It employs mean field theory to link ordered and chaotic phases with the behavior of gradients during backpropagation.
- Empirical results on MNIST and CIFAR10 confirm that alignment of network depth with these scales is essential for effective training.
Analyzing Deep Information Propagation in Random Neural Networks
The paper "Deep Information Propagation" provides a comprehensive analysis of signal propagation in untrained neural networks with randomly initialized weights and biases. Through mean field theory, the authors explore the behavior of such networks, focusing on the existence of depth scales that dictate the maximum depth for signal propagation. The paper reveals how these depth scales serve as critical parameters in determining the trainability of neural networks.
Core Contributions
The authors outline several key contributions. First, they show that characteristic depth scales emerge naturally in random networks and constrain signal propagation: these scales govern the maximum depth to which such networks can be trained and are set by the initialization hyperparameters, chiefly the weight and bias variances σ_w^2 and σ_b^2. The paper introduces two primary depth scales, ξ_q and ξ_c, which determine how far the magnitude of a single input and the correlation between pairs of inputs, respectively, can propagate. Notably, ξ_c diverges at the edge of chaos, the boundary between the ordered and chaotic phases, so that correlations, and with them usable signal, can propagate through arbitrarily deep networks initialized near criticality.
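To make these depth scales concrete, here is a minimal numerical sketch of the mean-field recursions for a tanh network (illustrative code with assumed variable names, not the authors' implementation). It iterates the single-input variance map to its fixed point q*, computes the slope χ_1 of the correlation map at c = 1, and reports ξ_c ≈ -1/ln χ_1 in the ordered phase.

```python
import numpy as np

# Gauss-Hermite (probabilists') nodes and weights for expectations over N(0, 1).
z, w = np.polynomial.hermite_e.hermegauss(61)
w = w / np.sqrt(2.0 * np.pi)

def gauss_E(f):
    """E[f(Z)] for Z ~ N(0, 1), by quadrature."""
    return float(np.sum(w * f(z)))

def q_star(sigma_w2, sigma_b2, iters=500):
    """Fixed point of the single-input variance map
    q <- sigma_w^2 * E[tanh(sqrt(q) Z)^2] + sigma_b^2."""
    q = 1.0
    for _ in range(iters):
        q = sigma_w2 * gauss_E(lambda u: np.tanh(np.sqrt(q) * u) ** 2) + sigma_b2
    return q

def chi_1(sigma_w2, sigma_b2):
    """Slope of the correlation map at c = 1:
    chi_1 = sigma_w^2 * E[tanh'(sqrt(q*) Z)^2]."""
    q = q_star(sigma_w2, sigma_b2)
    return sigma_w2 * gauss_E(lambda u: (1.0 - np.tanh(np.sqrt(q) * u) ** 2) ** 2)

# In the ordered phase (chi_1 < 1) correlations approach c* = 1 at rate chi_1 per
# layer, giving a depth scale xi_c ~ -1 / ln(chi_1) that diverges as chi_1 -> 1.
for sw2 in [0.8, 1.0, 1.3, 2.0, 3.0]:
    c1 = chi_1(sw2, sigma_b2=0.05)
    tag = f"xi_c ~ {-1.0 / np.log(c1):6.1f} layers" if c1 < 1.0 else "chaotic (chi_1 >= 1)"
    print(f"sigma_w^2 = {sw2:.2f}   chi_1 = {c1:.3f}   {tag}")
```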
The analysis is then extended to networks with dropout, showing that even a small dropout rate destroys the critical point of the order-to-chaos transition and therefore imposes a finite limit on trainable depth. The authors also develop a mean field theory of backpropagation, linking the ordered and chaotic phases to vanishing and exploding gradients, respectively. Finally, experiments on MNIST and CIFAR10 verify the theoretical bounds, showing that networks are trainable only when their depth is not significantly larger than ξ_c.
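The connection between the two phases and gradient behavior can be checked with a small Monte Carlo experiment. The sketch below (again illustrative, not the paper's experimental code) backpropagates a unit-norm error vector through a deep random tanh network and records its norm layer by layer; in the ordered phase the norm decays, while in the chaotic phase it grows.

```python
import numpy as np

def backprop_norms(sigma_w2, sigma_b2, width=300, depth=50, seed=0):
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(width)               # random input
    cache = []
    # Forward pass through `depth` random tanh layers.
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(sigma_w2 / width)
        b = rng.standard_normal(width) * np.sqrt(sigma_b2)
        z = W @ h + b
        cache.append((W, z))
        h = np.tanh(z)
    # Backward pass: propagate a unit-norm error vector dL/dh from the top layer.
    delta = rng.standard_normal(width)
    delta /= np.linalg.norm(delta)
    norms = []
    for W, z in reversed(cache):
        delta = W.T @ (delta * (1.0 - np.tanh(z) ** 2))   # chain rule through a tanh layer
        norms.append(np.linalg.norm(delta))
    return norms

for sw2, label in [(0.8, "ordered"), (2.5, "chaotic")]:
    n = backprop_norms(sw2, sigma_b2=0.05)
    print(f"{label:8s} sigma_w^2={sw2}: |delta| after 25 layers ~ {n[24]:.2e}, after 50 ~ {n[-1]:.2e}")
```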
Implications and Impact
Practically, these results offer concrete guidance for selecting hyperparameters and designing network architectures. Because the depth scales depend on the initialization hyperparameters rather than on the data, the feasible training regime is delineated by the architecture and initialization themselves, largely irrespective of the dataset; this can inform more principled initialization strategies and design choices, and a sketch of what such a procedure might look like follows below.
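As an illustration of how such hyperparameter selection might proceed, the sketch below uses hypothetical helper names; the initialization convention W_ij ~ N(0, σ_w^2/N) and b_i ~ N(0, σ_b^2) matches the mean-field setup. It bisects on χ_1 = 1 to locate the edge of chaos for a chosen bias variance and then initializes a deep tanh stack with those variances.

```python
import numpy as np

z, w = np.polynomial.hermite_e.hermegauss(61)
w = w / np.sqrt(2.0 * np.pi)                    # quadrature over N(0, 1)

def chi_1(sigma_w2, sigma_b2, iters=300):
    q = 1.0
    for _ in range(iters):                      # single-input variance fixed point q*
        q = sigma_w2 * np.sum(w * np.tanh(np.sqrt(q) * z) ** 2) + sigma_b2
    return sigma_w2 * np.sum(w * (1.0 - np.tanh(np.sqrt(q) * z) ** 2) ** 2)

def critical_sigma_w2(sigma_b2, lo=0.5, hi=4.0):
    # Assumes chi_1 crosses 1 exactly once on [lo, hi], as in the tanh phase diagram.
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if chi_1(mid, sigma_b2) < 1.0 else (lo, mid)
    return 0.5 * (lo + hi)

sigma_b2 = 0.05
sigma_w2 = critical_sigma_w2(sigma_b2)
print(f"edge of chaos at sigma_b^2 = {sigma_b2}: sigma_w^2 ~ {sigma_w2:.3f}")

# Draw weights and biases with the variances the mean-field analysis assumes:
# W_ij ~ N(0, sigma_w^2 / N_in), b_i ~ N(0, sigma_b^2).
rng = np.random.default_rng(0)
widths = [784] + [256] * 64 + [10]
params = [(rng.standard_normal((m, n)) * np.sqrt(sigma_w2 / n),
           rng.standard_normal(m) * np.sqrt(sigma_b2))
          for n, m in zip(widths[:-1], widths[1:])]
```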
Theoretically, the paper deepens our understanding of information dynamics in neural networks. By mapping network behavior onto physical systems that exhibit phase transitions, the authors provide a formalism that captures the propagation of both signals and gradients in deep networks. Such insights could sharpen models of learning and adaptation in highly expressive architectures and offer a foundation for future work on network design and training paradigms.
Future Directions
The findings invite extensions beyond fully connected networks, for example to convolutional architectures, where structured weight matrices may give rise to different propagation dynamics. In settings where criticality cannot be reached, as the dropout analysis demonstrates, alternative methods for training deeper networks become essential. Combining mean field predictions with techniques such as batch normalization or orthogonal initializations (a generic example of the latter is sketched below) may yield new training strategies.
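As a generic example of the second technique, a scaled orthogonal weight matrix can be drawn from the QR decomposition of a Gaussian matrix; this is a standard construction, not a method proposed in the paper, and the gain parameter here is a placeholder to be tuned against mean-field or related analyses.

```python
import numpy as np

def orthogonal(n_out, n_in, gain=1.0, seed=0):
    """Random (n_out, n_in) matrix with orthonormal rows or columns, scaled by `gain`."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((max(n_out, n_in), min(n_out, n_in)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))      # sign fix so the draw is uniform (Haar) over orthogonal frames
    if n_out < n_in:
        q = q.T
    return gain * q

W = orthogonal(512, 784, gain=1.1)   # gain is a placeholder, not a value from the paper
```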
Additionally, exploring the application of these depth scales in architectures with unbounded activations, such as ReLU networks, will be vital in understanding how these principles can be adapted to contemporary model architectures. Given the promising results, applying similar theoretical principles to recurrent networks and understanding their capacity constraints could also be an intriguing research avenue.
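As a small, concrete starting point for the ReLU case, the single-input variance map has a closed form because E[relu(√q Z)^2] = q/2 for standard Gaussian Z, giving q_l = (σ_w^2 / 2) q_{l-1} + σ_b^2; variance is preserved exactly when σ_w^2 = 2 and σ_b^2 = 0, which recovers the familiar He-style initialization scale. The snippet below (a sketch, not an analysis from the paper) simply iterates this map:

```python
def relu_q_trajectory(sigma_w2, sigma_b2, q0=1.0, depth=50):
    """Iterate the closed-form ReLU variance map q <- (sigma_w^2 / 2) * q + sigma_b^2."""
    qs = [q0]
    for _ in range(depth):
        qs.append(0.5 * sigma_w2 * qs[-1] + sigma_b2)
    return qs

for sw2 in [1.5, 2.0, 2.5]:
    print(f"sigma_w^2 = {sw2}: q after 50 layers = {relu_q_trajectory(sw2, 0.0)[-1]:.3g}")
```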
In sum, this paper significantly advances our understanding of how information propagates through randomly initialized neural networks and identifies the quantities that govern the trainability of deep models. Through both theoretical modeling and empirical validation, it sets the stage for more informed architectural design and for future work on just how deep trainable networks can be.