On the Overlooked Structure of Stochastic Gradients (2212.02083v3)
Abstract: Stochastic gradients are closely related to both the optimization and the generalization of deep neural networks (DNNs). Some works have attempted to explain the success of stochastic optimization for deep learning by the purportedly heavy-tailed properties of gradient noise, while others have presented theoretical and empirical evidence against the heavy-tail hypothesis. Unfortunately, formal statistical tests for analyzing the structure and heavy tails of stochastic gradients in deep learning remain under-explored. In this paper, we make two main contributions. First, we conduct formal statistical tests on the distributions of stochastic gradients and of gradient noise across both parameters and iterations. Our tests reveal that dimension-wise gradients usually exhibit power-law heavy tails, whereas iteration-wise gradients and the stochastic gradient noise caused by minibatch training usually do not. Second, we further discover that the covariance spectra of stochastic gradients have a power-law structure overlooked by previous studies, and we present its theoretical implications for the training of DNNs. While previous studies held that the anisotropic structure of stochastic gradients matters to deep learning, they did not anticipate that the gradient covariance could have such an elegant mathematical structure. Our work challenges the existing belief and provides novel insights into the structure of stochastic gradients in deep learning.
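The sketch below illustrates, under simplifying assumptions, the two kinds of analysis the abstract describes: fitting a power law to the tail of dimension-wise gradient magnitudes (here via the open-source `powerlaw` Python package, with a likelihood-ratio comparison against a lognormal alternative), and estimating the decay exponent of the gradient covariance spectrum. It is not the authors' exact protocol: the toy model, synthetic data, the choice to pool coordinate magnitudes from a single step as "dimension-wise" statistics, and the use of 200 minibatch gradients for the covariance estimate are all illustrative assumptions.

```python
# Minimal sketch (assumptions noted above), not the paper's exact protocol:
# (1) power-law tail test on dimension-wise gradient magnitudes,
# (2) log-log slope estimate for the stochastic-gradient covariance spectrum.
import numpy as np
import torch
import torch.nn as nn
import powerlaw  # pip install powerlaw

torch.manual_seed(0)

# Toy model and synthetic data stand in for a real DNN and dataset.
model = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 10))
loss_fn = nn.CrossEntropyLoss()
X = torch.randn(4096, 100)
y = torch.randint(0, 10, (4096,))

def minibatch_gradient(batch_size=128):
    """Return the flattened stochastic gradient for one random minibatch."""
    idx = torch.randint(0, X.shape[0], (batch_size,))
    model.zero_grad()
    loss_fn(model(X[idx]), y[idx]).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()]).numpy()

# (1) Dimension-wise tail test: pool |g_i| over parameter coordinates at one
# step, fit a power law to the tail, and compare against a lognormal fit.
g = minibatch_gradient()
fit = powerlaw.Fit(np.abs(g[g != 0]))
R, p = fit.distribution_compare('power_law', 'lognormal')
print(f"tail exponent alpha={fit.power_law.alpha:.2f}, xmin={fit.power_law.xmin:.2e}")
print(f"power law vs lognormal: log-likelihood ratio R={R:.2f}, p-value={p:.3f}")

# (2) Covariance spectrum: sample many minibatch gradients and check whether
# the covariance eigenvalues decay roughly as a power law, lambda_k ~ k^{-s}.
G = np.stack([minibatch_gradient() for _ in range(200)])   # (samples, dims)
Gc = G - G.mean(axis=0, keepdims=True)
# Gram trick: nonzero eigenvalues of Cov = Gc^T Gc / n equal those of Gc Gc^T / n,
# which keeps the eigendecomposition at sample size rather than parameter count.
eig = np.linalg.eigvalsh(Gc @ Gc.T / G.shape[0])[::-1]
eig = eig[eig > 1e-12]
k = np.arange(1, len(eig) + 1)
slope, _ = np.polyfit(np.log(k), np.log(eig), 1)
print(f"estimated spectral power-law exponent s={-slope:.2f}")
```

The likelihood-ratio comparison follows the standard Clauset-style methodology implemented in `powerlaw` (positive R with small p favors the power law); the Gram-matrix shortcut is a common device for spectra of low-rank empirical covariances and is only an implementation convenience here.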