- The paper identifies initialization scale as the key factor triggering the transition between kernel and rich regimes in neural networks.
- The analysis shows that depth accelerates the transition between regimes, while width changes how the scale of individual parameters relates to the scale of the model, both shaping the implicit bias.
- Experiments across varied architectures support a derived implicit-bias function Qα that interpolates between the ℓ1 and ℓ2 norms as the initialization scale varies.
Kernel and Rich Regimes in Overparametrized Models: An Expert Analysis
The paper "Kernel and Rich Regimes in Overparametrized Models" addresses a critical problem in the paper of neural networks: understanding the implicit biases induced by gradient descent in overparameterized neural networks across different training regimes. Once the network is overparameterized, the landscape of possible solutions involves many global minima. The characterization of which minima are selected by gradient descent is central to understanding the ability of neural networks to generalize beyond their training data.
Main Contributions and Findings
The paper characterizes the transition between two regimes of training overparametrized models: the "kernel" regime and the "rich" regime. In the kernel regime, the network behaves like a linear method with a fixed kernel, the tangent kernel at initialization, so gradient descent effectively finds the minimum Reproducing Kernel Hilbert Space (RKHS) norm solution of a kernelized linear problem. This contrasts sharply with the more flexible rich regime, where the network learns features and exhibits implicit biases that cannot be expressed as RKHS norms.
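To make the kernel-regime picture concrete, here is a minimal sketch of the linearization that underlies it, f(x; w) ≈ f(x; w₀) + ⟨∇f(x; w₀), w − w₀⟩, under which gradient descent reduces to a linear method with the tangent kernel K(x, x′) = ⟨∇f(x; w₀), ∇f(x′; w₀)⟩. The claim in the kernel regime is that, at large initialization scale, the parameters stay close enough to w₀ for this approximation to hold throughout training. The toy two-layer ReLU network, its width, and the output scaling `alpha` below are assumptions made for illustration, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer ReLU network f(x; W, a) = alpha * a . relu(W x)
# (architecture, width, and scaling are illustrative choices).
m, d, alpha = 512, 5, 10.0
W0 = rng.standard_normal((m, d)) / np.sqrt(d)
a0 = rng.standard_normal(m) / np.sqrt(m)

def f(x, W, a):
    return alpha * a @ np.maximum(W @ x, 0.0)

def grad_f(x, W, a):
    """Gradient of f(x; W, a) with respect to all parameters, flattened."""
    active = (W @ x > 0).astype(float)
    dW = alpha * np.outer(a * active, x)     # d f / d W
    da = alpha * np.maximum(W @ x, 0.0)      # d f / d a
    return np.concatenate([dW.ravel(), da])

# Tangent kernel at initialization: K(x, x') = <grad f(x; w0), grad f(x'; w0)>.
x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
g1, g2 = grad_f(x1, W0, a0), grad_f(x2, W0, a0)
print("K(x1, x2) =", g1 @ g2)

# In the kernel regime the trained network is well approximated by its
# first-order expansion around initialization; check it for a small step dw.
dw = 1e-3 * rng.standard_normal(g1.shape)
W1 = W0 + dw[: m * d].reshape(m, d)
a1 = a0 + dw[m * d:]
print("f after step:", f(x1, W1, a1), "  linearized:", f(x1, W0, a0) + g1 @ dw)
```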
Key Contributions:
- Initialization Scale as a Transition Mechanism: The authors identify the scale α of the initialization as the crucial factor governing the transition between the two regimes. They show that as α → ∞ the model enters the kernel regime, while as α → 0 it moves into the rich regime (see the gradient-descent sketch after this list).
- Depth and Width Effects: A detailed theoretical analysis of depth-D homogeneous models shows that increasing the depth accelerates this transition. The paper also examines the role of width in matrix factorization, showing that width changes how the scale of individual parameters translates into the scale of the model, and that this relationship plays a significant role in determining which regime governs training.
- Implicit Bias Characterization: The authors characterize the implicit bias of both regimes. In particular, for a simple two-layer model they derive the exact functional form Qα(β) of the implicit bias as a function of the initialization scale α, and show that it interpolates between the ℓ1 norm (as α → 0) and the ℓ2 norm (as α → ∞); the Qα sketch after this list checks both limits numerically.
- Empirical Support: The theory is validated empirically across architectures ranging from simple linear networks to deeper models and standard non-linear networks such as VGG on CIFAR-10. The experiments suggest that models initialized near the transition (around α ≈ 1) often generalize best, combining the ℓ2-like bias of the kernel regime with the feature learning of the rich regime.
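As a concrete illustration of the role of the initialization scale, the following sketch trains the paper's two-layer "diagonal" linear model, in which the predictor β = w₊² − w₋² is built from squared weights initialized at w₊ = w₋ = α·𝟙, with plain gradient descent on an underdetermined regression problem. This is a minimal demonstration, not the paper's code: the problem sizes, sparse ground truth, learning rates, and step counts are arbitrary choices for the example. At large α the learned β should land near the minimum-ℓ2-norm interpolator (kernel regime), while at small α it should move much closer to the sparse, small-ℓ1 interpolator (rich regime).

```python
import numpy as np

rng = np.random.default_rng(1)

# Underdetermined least squares with a sparse ground truth
# (sizes, scales, and step counts are arbitrary demo choices).
n, d, k = 40, 100, 3
X = rng.standard_normal((n, d))
beta_star = np.zeros(d)
beta_star[rng.choice(d, size=k, replace=False)] = 1.0
y = X @ beta_star

def train_diagonal_net(alpha, lr, steps=100_000):
    """Gradient descent on L(w) = 0.5 * ||X (w_plus^2 - w_minus^2) - y||^2,
    initialized at w_plus = w_minus = alpha (so beta = 0 at initialization)."""
    w_plus = np.full(d, alpha)
    w_minus = np.full(d, alpha)
    for _ in range(steps):
        beta = w_plus**2 - w_minus**2
        g = X.T @ (X @ beta - y)            # gradient of the loss w.r.t. beta
        w_plus -= lr * 2.0 * w_plus * g     # chain rule through beta = w_plus^2 - w_minus^2
        w_minus += lr * 2.0 * w_minus * g
    return w_plus**2 - w_minus**2

beta_l2 = np.linalg.pinv(X) @ y  # minimum-l2-norm interpolator: the kernel-regime prediction

for alpha, lr in [(10.0, 5e-6), (0.01, 2e-4)]:
    beta = train_diagonal_net(alpha, lr)
    print(f"alpha={alpha:6}:  dist to min-l2 = {np.linalg.norm(beta - beta_l2):6.3f},  "
          f"dist to sparse beta* = {np.linalg.norm(beta - beta_star):6.3f},  "
          f"l1 norm = {np.abs(beta).sum():6.2f}")
```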
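For the two-layer diagonal linear model, the interpolating bias itself can be written down explicitly. The form reported in the paper (reproduced here from memory, so treat the exact constants as indicative rather than authoritative) is Qα(β) = α² Σᵢ q(βᵢ/α²) with q(z) = 2 − √(4 + z²) + z·arcsinh(z/2). The short check below evaluates the two limits numerically: Qα is approximately proportional to the squared ℓ2 norm when α is large, and Qα divided by log(1/α²) approaches the ℓ1 norm when α is small. The test vector and the specific values of α are arbitrary.

```python
import numpy as np

def q(z):
    """Per-coordinate potential: q(z) = 2 - sqrt(4 + z^2) + z * arcsinh(z / 2)."""
    return 2.0 - np.sqrt(4.0 + z**2) + z * np.arcsinh(z / 2.0)

def Q(beta, alpha):
    """Q_alpha(beta) = alpha^2 * sum_i q(beta_i / alpha^2)."""
    return alpha**2 * np.sum(q(beta / alpha**2))

beta = np.array([1.5, -0.3, 0.0, 2.0])

# Large alpha: Q_alpha(beta) ~ ||beta||_2^2 / (4 alpha^2), an l2-like (kernel) bias.
alpha = 1e2
print(Q(beta, alpha), np.sum(beta**2) / (4 * alpha**2))

# Small alpha: Q_alpha(beta) / log(1 / alpha^2) ~ ||beta||_1, an l1-like (rich) bias.
alpha = 1e-8
print(Q(beta, alpha) / np.log(1 / alpha**2), np.abs(beta).sum())
```

Both prints should show the paired quantities agreeing to within a few percent, with the agreement tightening as α is pushed further toward the corresponding limit.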
Theoretical and Practical Implications
Theoretically, this paper deepens our understanding of how different implicit biases arise purely from the optimization process rather than from explicit regularization. It offers a valuable perspective on the power and limits of kernel methods and highlights the importance of initialization in neural network training, with consequences for both convergence and generalization.
Practically, the observation that typical initializations place networks right at the brink between these regimes explains a wide range of empirical findings, including why purely rich behavior is rarely observed in practice: standard settings effectively balance kernel and rich behavior. Adjusting the initialization scale gives practitioners a direct lever for steering a network toward either regime depending on task-specific needs, with the potential to improve generalization or optimization efficiency.
Future Directions
The paper's framework opens several avenues for further work on implicit regularization in more complex neural architectures and real-world tasks. These directions include:
- Exploring Intermediate Regimes: Characterizing the implicit bias at intermediate scales, where neither limit applies, and identifying the behaviors there that contribute to neural network success in practice.
- Task-Specific Bias Customization: Designing initialization and parameterization choices that induce implicit biases matched to the structure of the data, improving out-of-the-box performance.
- Cross-Disciplinary Adaptations: Applying insights from this work to other fields leveraging neural networks, such as reinforcement learning or unsupervised representation learning.
In conclusion, "Kernel and Rich Regimes in Overparametrized Models" offers critical insights into the dual nature of neural network training and lays the groundwork for further exploration in this expanding field. It serves as a valuable resource for researchers striving to reconcile theoretical models with empirically successful practice.