Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation (1902.04760v3)

Published 13 Feb 2019 in cs.NE, cond-mat.dis-nn, cs.LG, math-ph, math.MP, and stat.ML

Abstract: Several recent trends in machine learning theory and practice, from the design of state-of-the-art Gaussian Process to the convergence analysis of deep neural nets (DNNs) under stochastic gradient descent (SGD), have found it fruitful to study wide random neural networks. Central to these approaches are certain scaling limits of such networks. We unify these results by introducing a notion of a straightline \emph{tensor program} that can express most neural network computations, and we characterize its scaling limit when its tensors are large and randomized. From our framework follows (1) the convergence of random neural networks to Gaussian processes for architectures such as recurrent neural networks, convolutional neural networks, residual networks, attention, and any combination thereof, with or without batch normalization; (2) conditions under which the \emph{gradient independence assumption} -- that weights in backpropagation can be assumed to be independent from weights in the forward pass -- leads to correct computation of gradient dynamics, and corrections when it does not; (3) the convergence of the Neural Tangent Kernel, a recently proposed kernel used to predict training dynamics of neural networks under gradient descent, at initialization for all architectures in (1) without batch normalization. Mathematically, our framework is general enough to rederive classical random matrix results such as the semicircle and the Marchenko-Pastur laws, as well as recent results in neural network Jacobian singular values. We hope our work opens a way toward design of even stronger Gaussian Processes, initialization schemes to avoid gradient explosion/vanishing, and deeper understanding of SGD dynamics in modern architectures.

Citations (272)

Summary

  • The paper demonstrates that wide neural networks with weight sharing converge to Gaussian processes under proper initialization, enhancing uncertainty quantification.
  • The paper justifies the Gradient Independence Assumption by identifying conditions (e.g., an independently sampled, zero-mean readout layer) under which it yields correct gradient computations, and supplies corrections when it does not.
  • The paper extends Neural Tangent Kernel theory to diverse architectures by proving its convergence in the infinite-width limit, guiding future model designs.

Insights on "Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation"

The paper "Scaling Limits of Wide Neural Networks with Weight Sharing" by Greg Yang addresses key aspects of the scaling behavior in modern neural networks, providing theoretical insights into phenomena like Gaussian processes correspondence in wide networks, conditions for gradient independence, and convergence properties of the Neural Tangent Kernel (NTK). This essay aims to shed light on the structure and implications of the work, focusing on how it contributes to our understanding of deep learning systems, especially those with weight sharing.

At the heart of this research is the introduction of a framework called a "Tensor Program," which can express most neural network computations. Tensor Programs provide a unifying treatment of scaling limits across broad classes of neural networks, including multilayer perceptrons (MLPs), recurrent neural networks (RNNs), convolutional neural networks (CNNs), and architectures that employ weight sharing, such as tied weights in RNNs or autoencoders.
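To make the idea concrete, the following is a minimal illustrative sketch, not the paper's formal definition or API, of a straight-line tensor program: every variable is a large random Gaussian matrix or vector, and every line is either a matrix multiplication or a coordinatewise nonlinearity. All names and the width/scaling choices here are our own.

```python
import numpy as np

# Minimal sketch of a straight-line "tensor program" (names and scaling are ours):
# each line is a MatMul or a coordinatewise Nonlin applied to large Gaussian tensors.
n = 4096                                        # width; the scaling limit takes n -> infinity
rng = np.random.default_rng(0)

x  = rng.standard_normal(n)                     # input vector
W1 = rng.standard_normal((n, n)) / np.sqrt(n)   # Gaussian weight matrix, entries of variance 1/n
W2 = rng.standard_normal((n, n)) / np.sqrt(n)

h1 = W1 @ x          # line 1: MatMul
z1 = np.tanh(h1)     # line 2: Nonlin (coordinatewise)
h2 = W2 @ z1         # line 3: MatMul (reusing W1 here instead would express weight sharing)
z2 = np.tanh(h2)     # line 4: Nonlin

# As n grows, the coordinates of h1 and h2 behave like Gaussians whose (co)variances
# follow a deterministic recursion; this is the scaling limit the framework characterizes.
print(z2[:5])
```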

Gaussian Process Behavior and Wide Neural Networks

The paper revisits the connection between neural networks and Gaussian processes (GPs), an insight first identified decades ago for certain types of networks. This work extends the correspondence to a much broader class of architectures (e.g., those with batch normalization, residual connections, and attention mechanisms): under proper initialization, wide random neural networks converge to Gaussian processes. This DNN-GP correspondence has clear implications for Bayesian deep learning and neural network uncertainty estimation, suggesting avenues for designing more robust models with theoretically grounded performance guarantees.
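As a concrete illustration of this correspondence in its simplest setting, the sketch below (our own construction, not code from the paper) compares the empirical output covariance of many wide, randomly initialized one-hidden-layer ReLU networks with the arc-cosine kernel that is their known GP limit; the network parameterization and scaling are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, trials = 3, 2000, 5000            # input dim, hidden width, number of random networks
x = rng.standard_normal(d)
y = rng.standard_normal(d)

# f(x) = v . relu(W x) / sqrt(n), with W_ij, v_j ~ N(0, 1)  (assumed parameterization)
outs = np.empty((trials, 2))
for t in range(trials):
    W = rng.standard_normal((n, d))
    v = rng.standard_normal(n)
    outs[t] = [v @ np.maximum(W @ x, 0) / np.sqrt(n),
               v @ np.maximum(W @ y, 0) / np.sqrt(n)]

empirical_cov = np.mean(outs[:, 0] * outs[:, 1])

# GP limit for this network: arc-cosine kernel of degree 1,
# K(x, y) = ||x|| ||y|| (sin(theta) + (pi - theta) cos(theta)) / (2 pi)
cos_t = np.clip(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)), -1.0, 1.0)
theta = np.arccos(cos_t)
gp_cov = (np.linalg.norm(x) * np.linalg.norm(y)
          * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi))

print(f"empirical covariance: {empirical_cov:.4f}   GP kernel value: {gp_cov:.4f}")
```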

Gradient Independence Assumption and Signal Propagation

A critical component of analysis in this field is the Gradient Independence Assumption, a heuristic often used in prior work to simplify computations of gradient covariance in neural networks. The paper formally justifies this assumption in specific settings, identifying when it can be applied without error and, importantly, when it cannot. For instance, the assumption yields correct results when the readout layer's weights are sampled independently with zero mean, a finding that complements work on initialization strategies designed to prevent vanishing/exploding gradients and keep very deep networks trainable.
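The sketch below is a rough Monte Carlo check of this condition, under our own choice of network (a tanh MLP), scaling, and function names: it estimates the squared norm of the gradient with respect to the first hidden layer twice, once backpropagating through the true transposed weights and once through fresh independent copies. With a zero-mean, independently sampled readout, the two averages should agree approximately at large width.

```python
import numpy as np

rng = np.random.default_rng(2)
n, L, trials = 500, 3, 100   # width, number of hidden layers, Monte Carlo trials

def mean_grad_sq_norm(independent_backward):
    """Average squared norm of d(output)/dh^1 for random tanh networks with a
    zero-mean, independently sampled readout. Backpropagation uses either the
    true transposed weights or fresh independent copies (the gradient
    independence heuristic)."""
    total = 0.0
    for _ in range(trials):
        x = rng.standard_normal(n)
        Ws = [rng.standard_normal((n, n)) / np.sqrt(n) for _ in range(L)]
        v = rng.standard_normal(n) / np.sqrt(n)   # readout: zero mean, independent

        # forward pass, storing preactivations pre^1, ..., pre^L
        pres, h = [], x
        for W in Ws:
            pre = W @ h
            pres.append(pre)
            h = np.tanh(pre)

        # backward pass from g = d(v . h^L)/dh^L down to d(v . h^L)/dh^1
        g = v.copy()
        for i in reversed(range(1, L)):
            back_W = (rng.standard_normal((n, n)) / np.sqrt(n)
                      if independent_backward else Ws[i])
            g = back_W.T @ (g * (1.0 - np.tanh(pres[i]) ** 2))
        total += g @ g
    return total / trials

print("exact backprop:       ", mean_grad_sq_norm(False))
print("independent backward: ", mean_grad_sq_norm(True))
```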

Neural Tangent Kernel Convergence

Yang extends NTK theory to a wider class of architectures without batch normalization, proving that the NTK converges at initialization in the infinite-width limit, a result central to understanding the dynamics of neural network training in the lazy-training regime. This convergence directly impacts how we model neural networks in the limit and aligns with recent work analyzing gradient descent on wide networks as a linearization of the model around its initialization, governed by the NTK.
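The following sketch, again our own construction rather than code from the paper, computes the empirical NTK of a one-hidden-layer ReLU network at initialization for increasing widths; the printed values should stabilize as the width grows, illustrating the convergence.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3
x = rng.standard_normal(d)
y = rng.standard_normal(d)

def empirical_ntk(n):
    """Empirical NTK Theta_n(x, y) = <grad_theta f(x), grad_theta f(y)> at initialization
    for f(x) = v . relu(W x) / sqrt(n), with W_ij, v_j ~ N(0, 1)  (assumed parameterization)."""
    W = rng.standard_normal((n, d))
    v = rng.standard_normal(n)
    ax, ay = W @ x, W @ y                                   # preactivations
    phix, phiy = np.maximum(ax, 0), np.maximum(ay, 0)       # relu
    dphix, dphiy = (ax > 0).astype(float), (ay > 0).astype(float)
    # gradient w.r.t. v contributes phi(x).phi(y)/n; gradient w.r.t. W contributes
    # (x . y) * sum_j v_j^2 relu'(a_j(x)) relu'(a_j(y)) / n
    return (phix @ phiy) / n + (x @ y) * np.sum(v**2 * dphix * dphiy) / n

for n in [100, 1_000, 10_000, 100_000]:
    print(f"width {n:>7}: NTK(x, y) = {empirical_ntk(n):.4f}")
```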

Implications and Future Directions

The mathematical results obtained have both theoretical and practical implications. Theoretically, they provide a comprehensive foundation for the analysis of modern neural architectures, unifying existing results on random matrices with novel insights into weight-tied feedforward and recurrent computations. Practically, these insights suggest pathways to design better initialization schemes that avoid adverse learning dynamics and motivate new architectures optimized in the infinite-width paradigm.

Furthermore, this approach invites future exploration across broader datasets and architectures, providing a grounded methodology that might influence how neural architecture searches are conducted. The possibility of automating such scaling-limit computations also points toward integration with mainstream machine learning platforms like PyTorch and TensorFlow.

In conclusion, this paper advances our understanding of the scaling-limit properties of wide neural networks, clarifying the relationship between seemingly different architectures through their shared Gaussian process limits. It outlines a scalable path for future work, both in theoretical expansion and in practical applications, potentially leading toward more robust AI systems in the long term.