On the Principle of Least Symmetry Breaking in Shallow ReLU Models (1912.11939v3)
Authors: Yossi Arjevani, Michael Field
Abstract: We consider the optimization problem of fitting two-layer ReLU networks with respect to the squared loss, where labels are assumed to be generated by a target network. Focusing first on standard Gaussian inputs, we show that the spurious local minima detected by stochastic gradient descent (SGD) exhibit, in a well-defined sense, the *least loss of symmetry* with respect to the target weights. A closer look at the analysis suggests that this principle of least symmetry breaking may apply to a broader range of settings. Motivated by this, we conduct a series of experiments that corroborate the hypothesis for several classes of non-isotropic, non-product input distributions, for smooth activation functions, and for networks with a few layers.
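To make the setting concrete, below is a minimal NumPy sketch of the teacher-student problem the abstract describes: a student two-layer ReLU network trained by plain SGD on the squared loss, with labels produced by a fixed target network and standard Gaussian inputs. The dimension `d`, width `k`, step size, iteration count, and the all-ones output layer are illustrative assumptions, not the paper's exact experimental configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 6                      # input dimension, hidden width (assumed)

def relu(z):
    return np.maximum(z, 0.0)

def net(W, x):
    # Two-layer ReLU network with unit output weights: sum_i relu(<w_i, x>).
    return relu(W @ x).sum()

V = rng.standard_normal((k, d))   # target ("teacher") weights generate the labels
W = rng.standard_normal((k, d))   # student weights, random initialization

lr = 1e-3
for _ in range(100_000):
    x = rng.standard_normal(d)    # standard Gaussian input, x ~ N(0, I_d)
    err = net(W, x) - net(V, x)   # residual of the squared loss 0.5 * err**2
    # Gradient w.r.t. W: err * relu'(W @ x) outer x, where relu'(z) = 1[z > 0].
    W -= lr * err * (W @ x > 0.0)[:, None] * x

# Depending on the initialization, SGD may converge to a spurious minimum.
# The paper's claim is that such minima break as little of the permutation
# symmetry relative to the target weights V as possible.
```

In this symmetric setting the loss is invariant under permutations of the student's hidden units, which is the symmetry group the principle of least symmetry breaking refers to.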