Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit (2309.16620v2)

Published 28 Sep 2023 in stat.ML, cond-mat.dis-nn, cs.AI, and cs.LG

Abstract: The cost of hyperparameter tuning in deep learning has been rising with model sizes, prompting practitioners to find new tuning methods using a proxy of smaller networks. One such proposal uses $\mu$P parameterized networks, where the optimal hyperparameters for small width networks transfer to networks with arbitrarily large width. However, in this scheme, hyperparameters do not transfer across depths. As a remedy, we study residual networks with a residual branch scale of $1/\sqrt{\text{depth}}$ in combination with the $\mu$P parameterization. We provide experiments demonstrating that residual architectures including convolutional ResNets and Vision Transformers trained with this parameterization exhibit transfer of optimal hyperparameters across width and depth on CIFAR-10 and ImageNet. Furthermore, our empirical findings are supported and motivated by theory. Using recent developments in the dynamical mean field theory (DMFT) description of neural network learning dynamics, we show that this parameterization of ResNets admits a well-defined feature learning joint infinite-width and infinite-depth limit and show convergence of finite-size network dynamics towards this limit.

Authors (5)
  1. Blake Bordelon
  2. Lorenzo Noci
  3. Mufan Bill Li
  4. Boris Hanin
  5. Cengiz Pehlevan
Citations (14)

Summary

Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit

The paper "Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit" addresses the ongoing challenge associated with hyperparameter tuning in deep learning models, particularly as model sizes increase. The authors focus on hyperparameters transferability across varying depths and widths of neural networks, which is a critical consideration given the computational costs associated with hyperparameter optimization in state-of-the-art (SOTA) models with vast numbers of parameters.

Residual networks (ResNets) are central to the paper, largely because tuning their hyperparameters at scale is computationally demanding. Both convolutional ResNets and Vision Transformers are highlighted as architectures whose optimal hyperparameters transfer across configurations of varying width and depth under the proposed parameterization.

Key Contributions

$\mu$P Parameterization and Its Limitations

The paper notes that the $\mu$P parameterization has been effective for transferring hyperparameters from narrower to wider models, but it does not by itself ensure transferability across network depths. This raises the question the paper sets out to answer: can hyperparameters be transferred simultaneously across both width and depth?
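
To make the width side concrete, here is a minimal PyTorch sketch of a $\mu$P / mean-field style readout scaling for a simple network. It is illustrative only: the class and argument names are ours, the example is reduced to a single hidden layer, and the optimizer-dependent learning-rate scalings that $\mu$P prescribes (Yang et al.) are omitted.

```python
import math

import torch
import torch.nn as nn


class MuPMLP(nn.Module):
    """Simplified sketch of muP / mean-field width scaling (illustrative only).

    Hidden weights use standard 1/sqrt(fan_in) initialization, while the
    readout is divided by the width (rather than sqrt(width)) so that
    internal features keep moving as the width grows. Learning-rate
    scaling rules are optimizer-dependent and not shown here.
    """

    def __init__(self, d_in: int, width: int, d_out: int):
        super().__init__()
        self.hidden = nn.Linear(d_in, width)
        self.readout = nn.Linear(width, d_out, bias=False)
        nn.init.normal_(self.hidden.weight, std=1.0 / math.sqrt(d_in))
        nn.init.normal_(self.readout.weight, std=1.0)
        self.width = width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.hidden(x))
        # Mean-field readout: divide by width, not sqrt(width).
        return self.readout(h) / self.width
```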

Depthwise $1/\sqrt{\text{depth}}$ Scaling

To address this limitation, the authors propose scaling each residual branch by a factor of $1/\sqrt{\text{depth}}$, coupled with the standard $\mu$P parameterization. This scaling aims to stabilize the learning dynamics and enable hyperparameter transfer across both width and depth. The experimental results show that this approach indeed facilitates learning rate transfer and yields consistent learning dynamics across varying network widths and depths.
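
As a concrete illustration, the following PyTorch sketch shows where the $1/\sqrt{\text{depth}}$ factor enters a residual block. The module and argument names are ours, the branch is a toy two-layer MLP rather than a convolutional or attention block, and the $\mu$P width scaling of initializations and learning rates is assumed to be handled separately (e.g., as in the sketch above).

```python
import math

import torch
import torch.nn as nn


class ScaledResidualBlock(nn.Module):
    """Residual block whose branch output is scaled by 1/sqrt(depth)."""

    def __init__(self, width: int, depth: int):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Linear(width, width),
            nn.ReLU(),
            nn.Linear(width, width),
        )
        # Depthwise scaling of the residual branch: 1/sqrt(L).
        self.branch_scale = 1.0 / math.sqrt(depth)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.branch_scale * self.branch(h)


class ScaledResNet(nn.Module):
    """Stack of `depth` blocks; doubling the depth halves each block's variance contribution."""

    def __init__(self, width: int, depth: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            [ScaledResidualBlock(width, depth) for _ in range(depth)]
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            h = block(h)
        return h
```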

Empirical and Theoretical Verification

Experiments with convolutional ResNets and Vision Transformers, trained on CIFAR-10, Tiny ImageNet, and ImageNet, validate the efficacy of the proposed parameterization. Importantly, this empirical verification is substantiated by theoretical analysis: using dynamical mean field theory (DMFT) to describe neural network learning dynamics, the authors show that the proposed parameterization admits a well-defined feature-learning joint infinite-width and infinite-depth limit, and that finite-size network dynamics converge toward this limit.
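
The practical protocol these experiments test can be summarized in a few lines: sweep the learning rate on a cheap proxy (narrow and shallow), then reuse the proxy-optimal value at larger width and depth. The toy loop below reuses the ScaledResNet sketch above and trains on random regression data purely for illustration; it is a stand-in for, not a reproduction of, the paper's CIFAR-10 and ImageNet experiments.

```python
import torch
import torch.nn.functional as F


def train_and_evaluate(width: int, depth: int, lr: float, steps: int = 50) -> float:
    """Toy proxy task: fit random regression data; returns negative final loss."""
    model = ScaledResNet(width=width, depth=depth)  # from the sketch above
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    x, y = torch.randn(256, width), torch.randn(256, width)
    for _ in range(steps):
        loss = F.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return -F.mse_loss(model(x), y).item()


lr_grid = [2.0 ** -k for k in range(2, 8)]

# 1) Tune on a small proxy: small width, small depth.
proxy_scores = {lr: train_and_evaluate(width=128, depth=4, lr=lr) for lr in lr_grid}
best_lr = max(proxy_scores, key=proxy_scores.get)

# 2) Reuse the proxy-optimal learning rate at the target scale. Under the
#    paper's parameterization the optimum is expected to be approximately
#    stable across both width and depth.
final_score = train_and_evaluate(width=512, depth=32, lr=best_lr)
```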

Implications and Future Directions

The ability to reliably transfer hyperparameters across network configurations has practical value: it reduces the overhead of manual tuning and makes training of very large models more efficient. The theoretical treatment of the scaling mechanism also provides a foundation for exploring other scaling strategies that may similarly stabilize neural network dynamics.

Looking forward, the paper opens avenues for investigating joint scaling limits that also involve dataset size and the number of optimization steps, viewed within a multiscale framework. The discussion also points to further exploration of how learning dynamics differ across layers, which could inform architectural choices and hyperparameter strategies.

Conclusion

In conclusion, the paper makes a significant contribution to the understanding of hyperparameter transferability in neural networks, especially regarding deep and wide models. By introducing and validating a new parameterization for residual networks that enables consistent feature learning and stable dynamics across both network width and depth, it sets a foundation for more efficient tuning processes. Through a blend of empirical and theoretical analyses, it paves the way for reducing computational barriers in hyperparameter optimization, enhancing the scalability and usability of large-scale deep learning models.