The Feature Speed Formula: a flexible approach to scale hyper-parameters of deep neural networks (2311.18718v3)

Published 30 Nov 2023 in cs.LG

Abstract: Deep learning succeeds by doing hierarchical feature learning, yet tuning hyper-parameters (HPs) such as initialization scales and learning rates only gives indirect control over this behavior. In this paper, we introduce a key notion to predict and control feature learning: the angle $\theta_\ell$ between the feature updates and the backward pass (at layer index $\ell$). We show that the magnitude of feature updates after one GD step, at any training time, can be expressed via a simple and general \emph{feature speed formula} in terms of this angle $\theta_\ell$, the loss decay, and the magnitude of the backward pass. This angle $\theta_\ell$ is controlled by the conditioning of the layer-to-layer Jacobians; at random initialization, it is determined by the spectrum of a certain kernel, which coincides with the Neural Tangent Kernel when $\ell=\text{depth}$. Given $\theta_\ell$, the feature speed formula provides us with rules to adjust HPs (scales and learning rates) so as to satisfy certain dynamical properties, such as feature learning and loss decay. We investigate the implications of our approach for ReLU MLPs and ResNets in the large width-then-depth limit. Relying on prior work, we show that in ReLU MLPs with iid initialization, the angle degenerates with depth as $\cos(\theta_\ell)=\Theta(1/\sqrt{\ell})$. In contrast, ResNets with branch scale $O(1/\sqrt{\text{depth}})$ maintain a non-degenerate angle $\cos(\theta_\ell)=\Theta(1)$. We use these insights to recover key properties of known HP scalings and also to introduce a new HP scaling for large-depth ReLU MLPs with favorable theoretical properties.
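
The central quantity $\theta_\ell$ is straightforward to probe numerically. Below is a minimal sketch (not the authors' code) that estimates $\cos(\theta_\ell)$ for a toy ReLU MLP: it records the backward pass $\partial\mathcal{L}/\partial h_\ell$ at each hidden layer, takes one plain gradient-descent step, and measures the cosine between the resulting feature update $\Delta h_\ell$ and the negative backward pass. The width, depth, learning rate, synthetic data, and the use of PyTorch are illustrative assumptions, and the sign convention may differ from the paper's.

```python
# Minimal sketch: estimate cos(theta_ell) = alignment between the one-step
# feature update Delta h_ell and the backward pass dL/dh_ell in a toy ReLU MLP.
# All sizes and the learning rate are illustrative, not the paper's settings.
import torch

torch.manual_seed(0)
width, depth, lr = 256, 8, 1e-2
x = torch.randn(32, width)
y = torch.randn(32, 1)

# iid-initialized ReLU MLP, kept as an explicit list of weight matrices
Ws = [(torch.randn(width, width) * (2.0 / width) ** 0.5).requires_grad_()
      for _ in range(depth)]
v = (torch.randn(width, 1) / width ** 0.5).requires_grad_()

def forward(x):
    feats, h = [], x
    for W in Ws:
        h = torch.relu(h @ W)
        feats.append(h)
    return feats, feats[-1] @ v

feats, out = forward(x)
for h in feats:
    h.retain_grad()                      # keep dL/dh_ell (the "backward pass")
loss = 0.5 * ((out - y) ** 2).mean()
loss.backward()
backward_pass = [h.grad.detach().clone() for h in feats]

# one plain gradient-descent step on all weights
with torch.no_grad():
    for W in Ws:
        W -= lr * W.grad
    v -= lr * v.grad

new_feats, _ = forward(x)
for ell, (h_old, h_new, g) in enumerate(zip(feats, new_feats, backward_pass), 1):
    dh = (h_new - h_old).detach().flatten()
    g = -g.flatten()                     # align with the descent direction
    cos = torch.dot(dh, g) / (dh.norm() * g.norm() + 1e-12)
    print(f"layer {ell}: cos(theta) = {cos.item():+.3f}")
```

Per the abstract, with this iid-initialized MLP the measured cosines should shrink with depth (the $\Theta(1/\sqrt{\ell})$ degeneracy), whereas replacing the plain layers with residual branches scaled by $1/\sqrt{\text{depth}}$ should keep them of order one.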
