Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks
(2310.02244v5)
Published 3 Oct 2023 in cs.NE, cond-mat.dis-nn, and math.PR
Abstract: By classifying infinite-width neural networks and identifying the optimal limit, Tensor Programs IV and V demonstrated a universal way, called $\mu$P, for widthwise hyperparameter transfer, i.e., predicting optimal hyperparameters of wide neural networks from narrow ones. Here we investigate the analogous classification for depthwise parametrizations of deep residual networks (resnets). We classify depthwise parametrizations of block multiplier and learning rate by their infinite-width-then-depth limits. In resnets where each block has only one layer, we identify a unique optimal parametrization, called Depth-$\mu$P that extends $\mu$P and show empirically it admits depthwise hyperparameter transfer. We identify feature diversity as a crucial factor in deep networks, and Depth-$\mu$P can be characterized as maximizing both feature learning and feature diversity. Exploiting this, we find that absolute value, among all homogeneous nonlinearities, maximizes feature diversity and indeed empirically leads to significantly better performance. However, if each block is deeper (such as modern transformers), then we find fundamental limitations in all possible infinite-depth limits of such parametrizations, which we illustrate both theoretically and empirically on simple networks as well as Megatron transformer trained on Common Crawl.
The paper introduces Depth-μP, a parametrization that scales block multipliers and per-block learning rates with depth to stabilize training and maximize feature diversity (achieving the maximal feature diversity exponent of 1/2) in deep ResNets.
The paper provides rigorous mathematical analysis and empirical experiments showing that Depth-μP outperforms standard schemes in maintaining training stability and enabling hyperparameter transfer across depths.
The paper offers a practical framework for mitigating issues such as exploding/vanishing gradients, enabling efficient feature learning in infinitely deep neural networks.
Overview of "Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks"
The paper by Yang et al. investigates how to parametrize deep residual networks (ResNets), focusing on the depthwise scaling of hyperparameters in infinitely deep networks. Building on earlier Tensor Programs work on scaling infinite-width networks, the authors propose a parametrization scheme termed Depth-μP, which extends the Maximal Update Parametrization (μP), originally developed for optimal scaling in width, to the setting where depth is the primary variable being scaled.
Depth-μP Parametrization
The core of the paper is the development and analysis of Depth-μP, a prescription for scaling the block multipliers and per-block learning rates with depth so that stability, feature learning, and hyperparameter transfer are preserved as depth grows. For ResNets in which each residual block contains a single layer, Depth-μP yields stable propagation of signals and gradients even as the architecture becomes infinitely deep.
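To make the prescription concrete, below is a minimal PyTorch sketch (our own illustrative code, not the paper's) of a single-layer-block ResNet with the 1/√L block multiplier; the depthwise learning-rate exponent is left configurable because its correct value depends on the optimizer and is part of the paper's classification.

```python
import math
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Single-layer residual block: h <- h + (a / sqrt(L)) * phi(W h)."""

    def __init__(self, width: int, depth: int, branch_mult: float = 1.0):
        super().__init__()
        self.linear = nn.Linear(width, width, bias=False)
        # Depth-muP block multiplier: a * L^{-1/2}
        self.scale = branch_mult / math.sqrt(depth)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # abs nonlinearity: reported in the paper to maximize feature diversity
        # among homogeneous nonlinearities
        return h + self.scale * torch.abs(self.linear(h))


class DeepResNet(nn.Module):
    """Stack of L single-layer residual blocks (input/output maps omitted)."""

    def __init__(self, width: int, depth: int):
        super().__init__()
        self.blocks = nn.ModuleList(ResidualBlock(width, depth) for _ in range(depth))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            h = block(h)
        return h


def block_lr(base_lr: float, depth: int, gamma: float) -> float:
    """Per-block learning rate eta * L^{-gamma}. The right exponent gamma is
    optimizer-dependent and comes from the paper's classification; it is kept
    configurable here rather than asserted."""
    return base_lr * depth ** (-gamma)
```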
Depth-μP achieves two critical objectives: maximizing feature learning and maximizing feature diversity. The authors quantify the latter through a parameter called the feature diversity exponent and show that Depth-μP attains the maximal value of 1/2. Intuitively, this means the network makes full use of its depth: consecutive layers compute genuinely different features rather than near-copies of one another.
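A heuristic reading of that exponent (our notation; the paper's formal definition is more careful): with each block's contribution scaled by $L^{-1/2}$, consecutive features differ by

$$h^{l+1} - h^{l} \;=\; \frac{a}{\sqrt{L}}\, G^{l}\!\left(h^{l}; W^{l}\right) \;=\; \Theta\!\left(L^{-1/2}\right),$$

whereas $1/L$ (ODE-style) scaling forces differences of $\Theta(L^{-1})$, so neighboring layers collapse onto near-copies of one another in the deep limit.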
Mathematical and Empirical Insights
The authors carry out a thorough analysis of the space of depthwise parametrization schemes, classifying them by whether their limits are stable and nontrivial and examining what each implies for feature learning. Showing that feature diversity is maximized precisely by Depth-μP establishes its theoretical advantage over parametrizations that preserve stability but sacrifice diversity of the learned features.
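Concretely, the family under classification can be written (symbols ours, following the abstract's description of block multipliers and learning rates) as

$$h^{l+1} \;=\; h^{l} + a\, L^{-\alpha}\, G^{l}\!\left(h^{l}; W^{l}\right), \qquad \text{learning rate of } W^{l} \;=\; \eta\, L^{-\gamma},$$

with each exponent pair $(\alpha, \gamma)$ assessed for stability, nontriviality, and feature learning in the infinite-width-then-depth limit; Depth-μP is identified as the optimal choice within this family for single-layer blocks.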
Empirically, the paper verifies the predicted behaviors in experiments on simple deep networks (including linear ones), comparing Depth-μP against standard parametrizations and other depthwise scaling strategies such as ODE (1/L) scaling. Crucially, Depth-μP is the only scheme that maintains consistent hyperparameter transfer and strong performance as depth increases.
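The snippet below sketches the kind of depthwise transfer check such experiments perform (the data, training loop, and exponent value are placeholders rather than the paper's exact setup; it reuses DeepResNet and block_lr from the sketch above): sweep the base learning rate at several depths and see whether the best base learning rate stays put.

```python
import torch


def final_loss(depth: int, base_lr: float, gamma: float,
               width: int = 256, steps: int = 100) -> float:
    """Train a toy regression task and return the final loss. gamma is the
    depthwise learning-rate exponent; its correct value is optimizer-dependent
    (see the paper's classification)."""
    torch.manual_seed(0)
    model = DeepResNet(width, depth)  # from the sketch above
    opt = torch.optim.SGD(model.parameters(), lr=block_lr(base_lr, depth, gamma))
    x, y = torch.randn(512, width), torch.randn(512, width)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((model(x) - y) ** 2).mean()
        loss.backward()
        opt.step()
    return loss.item()


# Under a correct depthwise parametrization, the best base learning rate should be
# (approximately) the same at every depth -- that is hyperparameter transfer.
for depth in (8, 32, 128):
    sweep = {lr: final_loss(depth, lr, gamma=0.0) for lr in (1e-3, 1e-2, 1e-1)}
    print(depth, min(sweep, key=sweep.get))  # gamma=0.0 is a placeholder exponent
```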
Implications for Deep Learning
Depth-μP has significant implications for training very deep networks. By stabilizing training dynamics and enabling hyperparameter transfer, it addresses the common issues that arise when scaling depth, such as exploding or vanishing gradients and the need for extensive hyperparameter tuning. In practical terms, models can be trained at greater depths without a proportional increase in the compute spent searching the hyperparameter space, since hyperparameters tuned on shallower proxies carry over.
Future Directions
Future research will likely extend Depth-μP beyond ResNets with single-layer blocks to architectures such as Transformers, whose residual blocks contain multiple layers (block depth greater than one). Given the paper's finding that all infinite-depth limits of such parametrizations face fundamental limitations once block depth exceeds one, identifying good depthwise scaling rules for these architectures remains an open question. Further exploring the relationship between layerwise linearization and feature diversity may also yield deeper theoretical insight.
In conclusion, the paper offers a comprehensive approach to parametrizing infinitely deep networks, combining mathematically rigorous derivations with a solid empirical foundation for its claims about feature diversity and hyperparameter transfer. Depth-μP stands as an important step toward efficiently scaling the depth of neural networks without compromising their ability to learn features.