Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit (2309.16620v2)

Published 28 Sep 2023 in stat.ML, cond-mat.dis-nn, cs.AI, and cs.LG

Abstract: The cost of hyperparameter tuning in deep learning has been rising with model sizes, prompting practitioners to find new tuning methods using a proxy of smaller networks. One such proposal uses $\mu$P parameterized networks, where the optimal hyperparameters for small width networks transfer to networks with arbitrarily large width. However, in this scheme, hyperparameters do not transfer across depths. As a remedy, we study residual networks with a residual branch scale of $1/\sqrt{\text{depth}}$ in combination with the $\mu$P parameterization. We provide experiments demonstrating that residual architectures including convolutional ResNets and Vision Transformers trained with this parameterization exhibit transfer of optimal hyperparameters across width and depth on CIFAR-10 and ImageNet. Furthermore, our empirical findings are supported and motivated by theory. Using recent developments in the dynamical mean field theory (DMFT) description of neural network learning dynamics, we show that this parameterization of ResNets admits a well-defined feature learning joint infinite-width and infinite-depth limit and show convergence of finite-size network dynamics towards this limit.

Authors (5)
  1. Blake Bordelon
  2. Lorenzo Noci
  3. Mufan Bill Li
  4. Boris Hanin
  5. Cengiz Pehlevan
Citations (14)

Summary

Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit

The paper "Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit" addresses the ongoing challenge associated with hyperparameter tuning in deep learning models, particularly as model sizes increase. The authors focus on hyperparameters transferability across varying depths and widths of neural networks, which is a critical consideration given the computational costs associated with hyperparameter optimization in state-of-the-art (SOTA) models with vast numbers of parameters.

Residual networks (ResNets) are central to the paper, largely because tuning their hyperparameters at scale is computationally demanding. Both convolutional ResNets and Vision Transformers are highlighted as architectures whose optimal hyperparameters transfer across configurations of varying width and depth under the proposed parameterization.

Key Contributions

$\mu$P Parameterization and Its Limitations

The paper notes that the $\mu$P parameterization has been effective for transferring hyperparameters from narrower to wider models, but it does not by itself ensure transferability across network depths. This raises the question the paper sets out to answer: can hyperparameters be transferred simultaneously across both width and depth?
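
To make the width side concrete, here is a minimal PyTorch sketch of a $\mu$P / mean-field style readout scaling for a simple network. It is illustrative only: the class and argument names are ours, the example is reduced to a single hidden layer, and the optimizer-dependent learning-rate scalings that $\mu$P prescribes (Yang et al.) are omitted.

```python
import math

import torch
import torch.nn as nn


class MuPMLP(nn.Module):
    """Simplified sketch of muP / mean-field width scaling (illustrative only).

    Hidden weights use standard 1/sqrt(fan_in) initialization, while the
    readout is divided by the width (rather than sqrt(width)) so that
    internal features keep moving as the width grows. Learning-rate
    scaling rules are optimizer-dependent and not shown here.
    """

    def __init__(self, d_in: int, width: int, d_out: int):
        super().__init__()
        self.hidden = nn.Linear(d_in, width)
        self.readout = nn.Linear(width, d_out, bias=False)
        nn.init.normal_(self.hidden.weight, std=1.0 / math.sqrt(d_in))
        nn.init.normal_(self.readout.weight, std=1.0)
        self.width = width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.hidden(x))
        # Mean-field readout: divide by width, not sqrt(width).
        return self.readout(h) / self.width
```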

Depthwise $1/\sqrt{\text{depth}}$ Scaling

To address this limitation, the authors propose scaling each residual branch by a factor of $1/\sqrt{\text{depth}}$, coupled with the standard $\mu$P parameterization. This scaling aims to stabilize the learning dynamics and enable hyperparameter transfer across both width and depth. The experimental results show that this approach indeed facilitates learning rate transfer and yields consistent learning dynamics across varying network widths and depths.
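
As a concrete illustration, the following PyTorch sketch shows where the $1/\sqrt{\text{depth}}$ factor enters a residual block. The module and argument names are ours, the branch is a toy two-layer MLP rather than a convolutional or attention block, and the $\mu$P width scaling of initializations and learning rates is assumed to be handled separately (e.g., as in the sketch above).

```python
import math

import torch
import torch.nn as nn


class ScaledResidualBlock(nn.Module):
    """Residual block whose branch output is scaled by 1/sqrt(depth)."""

    def __init__(self, width: int, depth: int):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Linear(width, width),
            nn.ReLU(),
            nn.Linear(width, width),
        )
        # Depthwise scaling of the residual branch: 1/sqrt(L).
        self.branch_scale = 1.0 / math.sqrt(depth)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.branch_scale * self.branch(h)


class ScaledResNet(nn.Module):
    """Stack of `depth` blocks; doubling the depth halves each block's variance contribution."""

    def __init__(self, width: int, depth: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            [ScaledResidualBlock(width, depth) for _ in range(depth)]
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            h = block(h)
        return h
```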

Empirical and Theoretical Verification

Experiments with convolutional ResNets and Vision Transformers, trained on CIFAR-10, Tiny ImageNet, and ImageNet, validate the efficacy of the proposed parameterization. Importantly, this empirical verification is substantiated by theoretical analysis: using dynamical mean field theory (DMFT) to describe neural network learning dynamics, the authors show that the proposed parameterization admits a well-defined feature-learning joint infinite-width and infinite-depth limit, and that finite-size network dynamics converge toward this limit.
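
The practical protocol these experiments test can be summarized in a few lines: sweep the learning rate on a cheap proxy (narrow and shallow), then reuse the proxy-optimal value at larger width and depth. The toy loop below reuses the ScaledResNet sketch above and trains on random regression data purely for illustration; it is a stand-in for, not a reproduction of, the paper's CIFAR-10 and ImageNet experiments.

```python
import torch
import torch.nn.functional as F


def train_and_evaluate(width: int, depth: int, lr: float, steps: int = 50) -> float:
    """Toy proxy task: fit random regression data; returns negative final loss."""
    model = ScaledResNet(width=width, depth=depth)  # from the sketch above
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    x, y = torch.randn(256, width), torch.randn(256, width)
    for _ in range(steps):
        loss = F.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return -F.mse_loss(model(x), y).item()


lr_grid = [2.0 ** -k for k in range(2, 8)]

# 1) Tune on a small proxy: small width, small depth.
proxy_scores = {lr: train_and_evaluate(width=128, depth=4, lr=lr) for lr in lr_grid}
best_lr = max(proxy_scores, key=proxy_scores.get)

# 2) Reuse the proxy-optimal learning rate at the target scale. Under the
#    paper's parameterization the optimum is expected to be approximately
#    stable across both width and depth.
final_score = train_and_evaluate(width=512, depth=32, lr=best_lr)
```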

Implications and Future Directions

The ability to reliably transfer hyperparameters across network configurations has practical value: it reduces the overhead of manual tuning and makes training of very large models more efficient. The theoretical treatment of the scaling mechanism also provides a foundation for exploring other scaling strategies that may similarly stabilize neural network dynamics.

Looking forward, the paper opens avenues for investigating joint scaling limits that also involve dataset size and the number of optimization steps, viewed within a multiscale framework. The discussion also points to further exploration of how learning dynamics differ across layers, which could inform architectural choices and hyperparameter strategies.

Conclusion

In conclusion, the paper makes a significant contribution to the understanding of hyperparameter transferability in neural networks, especially regarding deep and wide models. By introducing and validating a new parameterization for residual networks that enables consistent feature learning and stable dynamics across both network width and depth, it sets a foundation for more efficient tuning processes. Through a blend of empirical and theoretical analyses, it paves the way for reducing computational barriers in hyperparameter optimization, enhancing the scalability and usability of large-scale deep learning models.