Infinite-Width Neural Networks: Feature Learning Versus Kernel Regimes
This paper offers a comprehensive analysis of the behavior of infinite-width neural networks through the lens of abc-parametrizations. It identifies a fundamental limitation in existing theories of infinite-width networks, especially those built on the Neural Tangent Kernel (NTK) framework, and proposes a parametrization that inherently supports feature learning.
Key Insights and Contributions
- Limited Scope of NTK: The paper systematically demonstrates that the commonly used parametrizations (standard and NTK) do not admit feature learning in the infinite-width limit. This limitation matters because feature learning is indispensable for effective pretraining and transfer learning, which are foundational to modern models such as BERT and Word2Vec.
- Introduction of abc-Parametrizations: The authors propose a generalized class of parametrizations, termed abc-parametrizations, which encompasses the standard, NTK, and Mean Field parametrizations as special cases. Within this framework, stable and nontrivial parametrizations fall into one of two mutually exclusive regimes (a formal sketch of the parametrization follows this list):
  - Feature Learning Regime: Networks whose hidden representations (embeddings) evolve during training, even as the width tends to infinity.
  - Kernel Regime: Networks whose infinite-width training dynamics reduce to kernel gradient descent with a fixed kernel, so features remain frozen at initialization.
- Dynamical Dichotomy Theorem: The paper introduces the "Dynamical Dichotomy" theorem, establishing that a stable, nontrivial abc-parametrization either admits feature learning or falls into the kernel regime, but never both. This result rules out intermediate paradigms, such as nontrivial higher-order NTK dynamics, in this setting.
- Maximal Update Parametrization (μP): To address the deficiencies of the standard and NTK parametrizations, the paper proposes the Maximal Update Parametrization (μP). μP admits maximal feature learning, with features evolving in every layer, by choosing appropriate width-dependent scalings for the weight multipliers, initialization variances, and learning rate (one concrete choice of exponents is sketched after this list). The authors derive the infinite-width limits of μP using the Tensor Programs technique, a formalism for establishing the convergence of such wide-network computations and computing their limits.
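For concreteness, the sketch below records the abc-parametrization of a width-n, L-hidden-layer MLP as described in the paper, together with one choice of μP exponents and the kernel-gradient-descent update that characterizes the kernel regime. The exponent convention shown is an assumption from memory and may differ from the paper's tables by its rescaling symmetry.

```latex
% abc-parametrization (sketch): weight multipliers, init variances, learning rate.
\[
W^l = n^{-a_l}\, w^l, \qquad
w^l_{\alpha\beta} \sim \mathcal{N}\bigl(0,\; n^{-2 b_l}\bigr), \qquad
\text{SGD learning rate } \eta\, n^{-c}.
\]
% One choice of exponents realizing muP (assumed convention; equivalent
% choices exist under the paper's rescaling symmetry):
\[
b_l = \tfrac12 \ \text{for all } l, \qquad
a_1 = -\tfrac12, \quad a_l = 0 \ (2 \le l \le L), \quad a_{L+1} = \tfrac12, \qquad
c = 0.
\]
% Kernel regime, by contrast: the infinite-width network function evolves by
% kernel gradient descent with a fixed kernel K (e.g., the NTK):
\[
f_{t+1}(\xi) \;=\; f_t(\xi) \;-\; \eta\, K(\xi, \xi_t)\, \mathcal{L}'\bigl(f_t(\xi_t)\bigr).
\]
```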
Experimental Validation
The paper validates its theoretical claims through detailed experiments on canonical tasks that rely heavily on feature learning: Word2Vec and few-shot learning on Omniglot via MAML.
- Word2Vec Pretraining: The infinite-width μP networks significantly outperform both finite-width networks and the NTK baseline on the word analogy task (sketched after this list). As illustrated in Figure 1, the embeddings trained with μP capture semantic relationships, unlike the essentially random embeddings produced by NTK models.
- Few-Shot Learning with MAML: On the Omniglot dataset, the paper shows that infinite-width μP networks achieve higher meta-test accuracies than both finite-width networks and NTK baselines, underscoring the role of feature learning in tasks that require rapid adaptation.
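As a concrete illustration of the word analogy evaluation mentioned above, the following sketch applies the standard vector-offset (3CosAdd) rule to a hypothetical embedding table; the `embeddings` dictionary, vocabulary, and vector dimensions are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np

def analogy(embeddings, a, b, c):
    """Answer "a is to b as c is to ?" via the vector-offset (3CosAdd) rule."""
    vocab = list(embeddings)
    # Normalize rows so that dot products equal cosine similarities.
    mat = np.stack([embeddings[w] / np.linalg.norm(embeddings[w]) for w in vocab])
    query = mat[vocab.index(b)] - mat[vocab.index(a)] + mat[vocab.index(c)]
    scores = mat @ query
    for w in (a, b, c):  # the query words themselves are conventionally excluded
        scores[vocab.index(w)] = -np.inf
    return vocab[int(np.argmax(scores))]

# Toy usage with random vectors (purely illustrative):
rng = np.random.default_rng(0)
words = ["king", "queen", "man", "woman"]
embeddings = {w: rng.normal(size=8) for w in words}
print(analogy(embeddings, "man", "king", "woman"))  # ideally "queen"
```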
Theoretical and Practical Implications
The implications of this research are multifaceted:
- Theoretical: The introduction of abc-parametrizations and the Dynamical Dichotomy theorem provides a systematic framework for analyzing and understanding the diverse behaviors of neural networks in the infinite-width limit. This groundwork paves the way for future research on the training dynamics and generalization properties of deep learning models.
- Practical: The proposed μP can be readily adopted in pretraining and transfer learning pipelines to obtain better feature representations. The Tensor Programs technique (its Master Theorem is sketched schematically below) further offers a robust tool for deriving and analyzing infinite-width network limits, extending beyond traditional supervised learning to reinforcement learning, self-supervised learning, and generative models.
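Schematically, the Tensor Programs machinery rests on a Master Theorem of roughly the following form. This is a paraphrase from memory, assuming suitably (e.g., polynomially) bounded test functions ψ, not the paper's exact statement.

```latex
% Master Theorem (schematic paraphrase; see the paper for the precise
% regularity conditions on psi and on the program).
\[
\frac{1}{n} \sum_{\alpha=1}^{n}
  \psi\bigl(h^1_\alpha, \ldots, h^k_\alpha\bigr)
\;\xrightarrow[\,n \to \infty\,]{\text{a.s.}}\;
\mathbb{E}\,\psi\bigl(Z^{h^1}, \ldots, Z^{h^k}\bigr),
\]
% where h^1, ..., h^k are the width-n vectors computed by the program and
% each Z^{h^i} is a computable random variable describing the limiting
% coordinate distribution of h^i.
```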
Future Directions
The paper opens several avenues for future research:
- Extension to Non-Standard Architectures: While the paper focuses on MLPs, the Tensor Programs technique is versatile enough to be applied to more complex architectures like ResNets and Transformers.
- Optimization and Scalability: Further exploration into optimizing the computational aspects of evaluating the infinite-width limits using Tensor Programs, particularly for deeper networks and non-polynomial activations, could enhance practical usability.
- Interdisciplinary Applications: Extending the framework to interdisciplinary domains such as molecular biology, neuroscience, and physics, where understanding feature learning can lead to significant breakthroughs.
This research marks a significant step in the broader understanding of neural network behavior in high-dimensional settings. The Maximal Update Parametrization and the Tensor Programs technique collectively address longstanding challenges in the field, bridging the gap between theory and practice, and setting a robust foundation for future explorations in deep learning.