Symmetry Induces Structure and Constraint of Learning (2309.16932v2)

Published 29 Sep 2023 in cs.LG and stat.ML

Abstract: Due to common architecture designs, symmetries exist extensively in contemporary neural networks. In this work, we unveil the importance of the loss function symmetries in affecting, if not deciding, the learning behavior of machine learning models. We prove that every mirror-reflection symmetry, with reflection surface $O$, in the loss function leads to the emergence of a constraint on the model parameters $\theta$: $O^T\theta = 0$. This constrained solution becomes satisfied when either the weight decay or gradient noise is large. Common instances of mirror symmetries in deep learning include rescaling, rotation, and permutation symmetry. As direct corollaries, we show that rescaling symmetry leads to sparsity, rotation symmetry leads to low rankness, and permutation symmetry leads to homogeneous ensembling. Then, we show that the theoretical framework can explain intriguing phenomena, such as the loss of plasticity and various collapse phenomena in neural networks, and suggest how symmetries can be used to design an elegant algorithm to enforce hard constraints in a differentiable way.
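
As a rough illustration of one corollary (rescaling symmetry plus weight decay yields sparsity), the sketch below trains a toy linear regression with the redundant parametrization $w = u \odot v$. The data-fit term is invariant under the rescaling $(u, v) \to (cu, v/c)$, and adding $L_2$ decay drives the uninformative coordinates toward the constrained point $u = v = 0$, i.e. a sparse effective weight. This is a minimal NumPy sketch written for this page, not the paper's code; the variable names (`u`, `v`, `lam`, `lr`) and hyperparameters are illustrative assumptions.

```python
# Toy demonstration: rescaling symmetry + weight decay -> sparsity.
# The effective weight is w = u * v (elementwise), so (u, v) -> (c*u, v/c)
# leaves the data-fit term unchanged; L2 decay then favors u = v = 0
# on coordinates that do not help fit the data.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]             # only three informative features
y = X @ w_true + 0.1 * rng.normal(size=n)

u = 0.1 * rng.normal(size=d)
v = 0.1 * rng.normal(size=d)
lam, lr = 0.05, 0.01                      # weight decay strength and step size

for _ in range(20_000):
    w = u * v                             # effective (elementwise) weight
    g = X.T @ (X @ w - y) / n             # gradient of the data-fit term wrt w
    gu = g * v + lam * u                  # chain rule through the redundancy, plus L2
    gv = g * u + lam * v
    u -= lr * gu
    v -= lr * gv

print(np.round(u * v, 3))                 # uninformative coordinates collapse to ~0
```

For contrast, fitting the same data with a plain parametrization $w$ and the same $L_2$ decay shrinks all coordinates but does not set any of them to zero; the sparsity here comes from the symmetry of the redundant parametrization, consistent with the rescaling-symmetry corollary stated in the abstract.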
