On the Trajectories of SGD Without Replacement (2312.16143v2)
Abstract: This article examines the implicit regularization effect of Stochastic Gradient Descent (SGD). We consider SGD without replacement, the variant typically used to optimize large-scale neural networks. We analyze this algorithm in a more realistic regime than is typical of theoretical works on SGD: for example, we allow the product of the learning rate and the Hessian to be $O(1)$, and we do not specify any model architecture, learning task, or loss (objective) function. Our core theoretical result is that optimizing with SGD without replacement is locally equivalent to taking an additional step on a novel regularizer. This implies that the expected trajectories of SGD without replacement can be decoupled into (i) following SGD with replacement (in which batches are sampled i.i.d.) along the directions of high curvature, and (ii) regularizing the trace of the noise covariance along the flat ones. As a consequence, SGD without replacement traverses flat areas and may escape saddles significantly faster than SGD with replacement. On several vision tasks, the novel regularizer penalizes a weighted trace of the Fisher matrix, thus encouraging sparsity in the spectrum of the Hessian of the loss, in line with empirical observations from prior work. We also propose an explanation for why SGD, unlike GD, does not train at the edge of stability.
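To make the distinction between the two sampling schemes concrete, the following is a minimal sketch (not taken from the paper) contrasting SGD without replacement (random reshuffling, the variant analyzed here) with SGD with replacement (i.i.d. batches). The toy least-squares objective, the function names, and the hyperparameters are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: loss(w) = 0.5 * mean_i (x_i @ w - y_i)**2
n, d, batch_size = 256, 10, 32
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

def grad_batch(w, idx):
    """Mini-batch gradient of the least-squares loss over the rows in `idx`."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

def sgd_without_replacement(w, lr=0.05, epochs=20):
    """Random reshuffling: each epoch visits every sample exactly once."""
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch_size):
            w = w - lr * grad_batch(w, perm[start:start + batch_size])
    return w

def sgd_with_replacement(w, lr=0.05, steps=20 * (256 // 32)):
    """I.i.d. batches: each step draws a fresh batch with replacement.
    `steps` matches the number of updates in 20 reshuffled epochs."""
    for _ in range(steps):
        idx = rng.integers(0, n, size=batch_size)
        w = w - lr * grad_batch(w, idx)
    return w

w0 = np.zeros(d)
for name, w in [("without replacement", sgd_without_replacement(w0.copy())),
                ("with replacement", sgd_with_replacement(w0.copy()))]:
    print(name, "final loss:", 0.5 * np.mean((X @ w - y) ** 2))
```

The only difference between the two routines is the index stream: a per-epoch permutation versus i.i.d. draws. The paper's analysis concerns how this seemingly small difference changes the expected trajectory of the iterates.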