
On the Principle of Least Symmetry Breaking in Shallow ReLU Models (1912.11939v3)

Published 26 Dec 2019 in cs.LG and stat.ML

Abstract: We consider the optimization problem associated with fitting two-layer ReLU networks with respect to the squared loss, where labels are assumed to be generated by a target network. Focusing first on standard Gaussian inputs, we show that the structure of spurious local minima detected by stochastic gradient descent (SGD) is, in a well-defined sense, the least loss of symmetry with respect to the target weights. A closer look at the analysis indicates that this principle of least symmetry breaking may apply to a broader range of settings. Motivated by this, we conduct a series of experiments which corroborate this hypothesis for different classes of non-isotropic, non-product distributions, smooth activation functions, and networks with a few layers.
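The setting described in the abstract can be made concrete with a small student-teacher experiment. Below is a minimal sketch, not taken from the paper's code: a two-layer ReLU network with unit second-layer weights is fit, via SGD on the squared loss over standard Gaussian inputs, to labels produced by a fixed target network of the same architecture. The width, batch size, learning rate, and step count are illustrative assumptions.

```python
# Minimal student-teacher sketch (illustrative, not the authors' implementation):
# fit sum_i relu(w_i^T x) to labels from a fixed target sum_j relu(v_j^T x),
# using SGD on the squared loss with standard Gaussian inputs.
import torch

d, k = 10, 6                      # input dimension, number of hidden units (assumed)
teacher = torch.randn(k, d)       # fixed target weights V (rows are neurons)
student = torch.randn(k, d, requires_grad=True)  # trainable weights W

opt = torch.optim.SGD([student], lr=1e-2)
for step in range(20_000):
    x = torch.randn(256, d)                       # standard Gaussian inputs
    y = torch.relu(x @ teacher.T).sum(dim=1)      # labels from the target network
    pred = torch.relu(x @ student.T).sum(dim=1)   # student prediction
    loss = ((pred - y) ** 2).mean()               # squared loss
    opt.zero_grad()
    loss.backward()
    opt.step()

# Depending on the random seed, SGD may reach zero loss or get stuck at a
# spurious local minimum whose structure is related to the target weights.
print(float(loss))
```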

Authors (2)
  1. Yossi Arjevani (24 papers)
  2. Michael Field (10 papers)
Citations (7)
