
Unifying Low Dimensional Observations in Deep Learning Through the Deep Linear Unconstrained Feature Model (2404.06106v1)

Published 9 Apr 2024 in cs.LG

Abstract: Modern deep neural networks have achieved high performance across various tasks. Recently, researchers have noted occurrences of low-dimensional structure in the weights, Hessians, gradients, and feature vectors of these networks, spanning different datasets and architectures when trained to convergence. In this analysis, we theoretically demonstrate how these observations arise and show how they can be unified within a generalized unconstrained feature model that can be studied analytically. Specifically, we consider a previously described structure called Neural Collapse, and its multi-layer counterpart, Deep Neural Collapse, which emerges when the network approaches a global optimum. This phenomenon explains the other observed low-dimensional behaviours on a layer-wise level, such as the bulk and outlier structure seen in Hessian spectra, and the alignment of gradient descent with the outlier eigenspace of the Hessian. Empirical results in both the deep linear unconstrained feature model and its non-linear equivalent support these predicted observations.
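
For concreteness, the following is a minimal sketch (not the paper's code) of a deep linear unconstrained feature model as described in the abstract: the features H are treated as free optimization variables and trained jointly with a stack of linear layers under a regularized MSE loss, after which a simple within-class to between-class variability ratio is used as a neural-collapse check. All dimensions, hyperparameters, and variable names are illustrative assumptions.

```python
# A minimal sketch (not the paper's code) of a deep linear unconstrained feature
# model: the features H are free optimization variables trained jointly with a
# stack of linear layers under MSE loss with weight decay. All dimensions,
# hyperparameters, and variable names below are illustrative assumptions.
import torch

K, n, d, depth, lam = 4, 32, 16, 3, 5e-4        # classes, samples/class, width, linear layers, decay
Y = torch.eye(K).repeat_interleave(n, dim=1)    # one-hot targets, shape (K, K*n)

H = (torch.randn(d, K * n) / d**0.5).requires_grad_()          # unconstrained features
Ws = [(torch.randn(d, d) / d**0.5).requires_grad_() for _ in range(depth - 1)]
Ws.append((torch.randn(K, d) / d**0.5).requires_grad_())       # final classifier layer

opt = torch.optim.SGD([H] + Ws, lr=0.05)
for step in range(20000):
    out = H
    for W in Ws:
        out = W @ out                           # deep *linear* forward pass
    reg = lam * (H.square().sum() + sum(W.square().sum() for W in Ws))
    loss = 0.5 * (out - Y).square().mean() + reg
    opt.zero_grad()
    loss.backward()
    opt.step()

# Layer-wise neural-collapse check: within-class variability of the learned
# features should become negligible relative to between-class variability.
feats = H.detach().reshape(d, K, n)             # columns are grouped by class
class_means = feats.mean(dim=2)                 # (d, K)
within = (feats - class_means[:, :, None]).square().mean()
between = (class_means - class_means.mean(dim=1, keepdim=True)).square().mean()
print(f"within/between variability ratio: {(within / between).item():.3e}")
```

At convergence this ratio should approach zero, which is the layer-wise signature of (deep) neural collapse that the abstract connects to the bulk and outlier structure of the Hessian spectrum and to the alignment of gradient descent with the outlier eigenspace.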

