On the Trajectories of SGD Without Replacement (2312.16143v2)
Abstract: This article examines the implicit regularization effect of Stochastic Gradient Descent (SGD). We consider SGD without replacement, the variant typically used to optimize large-scale neural networks. We analyze this algorithm in a more realistic regime than is typical of theoretical works on SGD: for example, we allow the product of the learning rate and the Hessian to be $O(1)$, and we do not specify any model architecture, learning task, or loss (objective) function. Our core theoretical result is that optimizing with SGD without replacement is locally equivalent to taking an additional step on a novel regularizer. This implies that the expected trajectories of SGD without replacement can be decoupled into (i) following SGD with replacement (in which batches are sampled i.i.d.) along the directions of high curvature, and (ii) regularizing the trace of the noise covariance along the flat ones. As a consequence, SGD without replacement traverses flat areas and may escape saddles significantly faster than SGD with replacement. On several vision tasks, the novel regularizer penalizes a weighted trace of the Fisher matrix, thus encouraging sparsity in the spectrum of the Hessian of the loss, in line with empirical observations from prior work. We also propose an explanation for why SGD, unlike GD, does not train at the edge of stability.
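To make the distinction between the two sampling schemes concrete, the following is a minimal sketch (not taken from the paper) contrasting SGD without replacement (random reshuffling, the variant analyzed here) with SGD with replacement (i.i.d. batches). The toy least-squares objective, the function names, and the hyperparameters are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: loss(w) = 0.5 * mean_i (x_i @ w - y_i)**2
n, d, batch_size = 256, 10, 32
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

def grad_batch(w, idx):
    """Mini-batch gradient of the least-squares loss over the rows in `idx`."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

def sgd_without_replacement(w, lr=0.05, epochs=20):
    """Random reshuffling: each epoch visits every sample exactly once."""
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch_size):
            w = w - lr * grad_batch(w, perm[start:start + batch_size])
    return w

def sgd_with_replacement(w, lr=0.05, steps=20 * (256 // 32)):
    """I.i.d. batches: each step draws a fresh batch with replacement.
    `steps` matches the number of updates in 20 reshuffled epochs."""
    for _ in range(steps):
        idx = rng.integers(0, n, size=batch_size)
        w = w - lr * grad_batch(w, idx)
    return w

w0 = np.zeros(d)
for name, w in [("without replacement", sgd_without_replacement(w0.copy())),
                ("with replacement", sgd_with_replacement(w0.copy()))]:
    print(name, "final loss:", 0.5 * np.mean((X @ w - y) ** 2))
```

The only difference between the two routines is the index stream: a per-epoch permutation versus i.i.d. draws. The paper's analysis concerns how this seemingly small difference changes the expected trajectory of the iterates.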