Exact Solutions of a Deep Linear Network (2202.04777v7)
Published 10 Feb 2022 in stat.ML and cs.LG
Abstract: This work finds the analytical expression of the global minima of a deep linear network with weight decay and stochastic neurons, a fundamental model for understanding the landscape of neural networks. Our result implies that the origin is a special point in the deep neural network loss landscape, where highly nonlinear phenomena emerge. We show that weight decay interacts strongly with the model architecture and can create bad minima at zero in a network with more than $1$ hidden layer, which is qualitatively different from a network with only $1$ hidden layer. Practically, our result implies that common deep learning initialization methods are insufficient to ease the optimization of neural networks in general.
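To make the depth dependence concrete, here is a minimal numerical sketch, not taken from the paper: a toy scalar deep linear network on data y = beta * x with L2 weight decay, omitting the stochastic neurons for simplicity. The target slope beta, the decay strength lam, and the diagonal slice through the origin are all illustrative assumptions.

```python
# Toy illustration (an assumption-laden sketch, not the paper's construction):
# scalar deep linear network on y = beta * x with E[x^2] = 1, plus L2 weight
# decay of strength lam. We probe the loss along the diagonal slice
# w_1 = ... = w_D = t through the origin.
import numpy as np

def loss(weights, beta=1.0, lam=0.1):
    """Population loss (beta - prod_i w_i)^2 plus weight decay lam * sum_i w_i^2."""
    w = np.asarray(weights, dtype=float)
    return (beta - np.prod(w)) ** 2 + lam * np.sum(w ** 2)

ts = np.linspace(-0.05, 0.05, 201)  # small neighborhood of the origin
for depth in (2, 3):  # 2 weight layers = 1 hidden layer, 3 weight layers = 2 hidden layers
    vals = [loss([t] * depth) for t in ts]
    at_origin = loss([0.0] * depth)
    is_local_min = min(vals) >= at_origin - 1e-12
    print(f"{depth} weight layers: origin is a local min along the slice: {is_local_min}")
```

In this toy setting, with one hidden layer the product of weights is quadratic in the parameters, so once the signal beta exceeds the decay strength lam the diagonal is a descent direction and the origin is a saddle; with two hidden layers the product is cubic, its contribution to the curvature at zero vanishes, and weight decay alone makes the origin a local (but generally bad) minimum, which matches the qualitative difference the abstract describes.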