Edge of Stochastic Stability: Revisiting the Edge of Stability for SGD (2412.20553v3)

Published 29 Dec 2024 in cs.LG, math.OC, and stat.ML

Abstract: Recent findings by Cohen et al., 2021, demonstrate that when training neural networks with full-batch gradient descent with a step size of $\eta$, the largest eigenvalue $\lambda_{\max}$ of the full-batch Hessian consistently stabilizes at $\lambda_{\max} = 2/\eta$. These results have significant implications for convergence and generalization. This, however, is not the case of mini-batch stochastic gradient descent (SGD), limiting the broader applicability of its consequences. We show that SGD trains in a different regime we term Edge of Stochastic Stability (EoSS). In this regime, what stabilizes at $2/\eta$ is Batch Sharpness: the expected directional curvature of mini-batch Hessians along their corresponding stochastic gradients. As a consequence $\lambda_{\max}$--which is generally smaller than Batch Sharpness--is suppressed, aligning with the long-standing empirical observation that smaller batches and larger step sizes favor flatter minima. We further discuss implications for mathematical modeling of SGD trajectories.

Summary

  • The paper introduces the 'Edge of Stochastic Stability' (EoSS) to describe SGD's behavior, contrasting it with full-batch GD's 'Edge of Stability' (EoS) regarding loss function sharpness.
  • Empirical results show Mini-Batch Sharpness (MiniBS) consistently stabilizes near 2/η in SGD, helping explain its implicit regularization and convergence to flatter minima.
  • The work provides insights for designing better optimization algorithms and highlights the need to understand directional noise characteristics in SGD, noting a discrepancy between theoretical and empirical scaling laws.

Edge of Stochastic Stability: Revisiting the Edge of Stability for SGD

The paper "Edge of Stochastic Stability: Revisiting the Edge of Stability for SGD" by Arseniy Andreyev and Pierfrancesco Beneventano brings forth a nuanced understanding of the behavior of stochastic gradient descent (SGD) in comparison to full-batch gradient descent (GD). Their investigation is centered on the concept of sharpness—defined as the largest eigenvalue of the Hessian of the loss function—and how it behaves under different learning paradigms. The key contribution of the paper is the introduction and exploration of the 'Edge of Stochastic Stability' (EoSS) as the regime under which SGD operates, contrasted with the previously characterized 'Edge of Stability' (EoS) for full-batch GD.

Key Insights and Contributions

The paper begins by revisiting the findings of Cohen et al., who established that sharpness stabilizes at $2/\eta$ under full-batch GD. This is a counterintuitive regime: although training hovers at the edge of the instability threshold, it continues to make progress, owing to the particular geometry of the loss landscape at that point. Andreyev and Beneventano extend this picture to SGD, proposing that while the sharpness itself does not stabilize at the threshold, an analogous threshold condition holds for mini-batch training, which they term EoSS.
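To make the threshold concrete: the tracked quantity, $\lambda_{\max}$ of the full-batch Hessian, is typically estimated with power iteration on Hessian-vector products rather than by forming the Hessian explicitly. The following is a minimal PyTorch sketch of such an estimator; the function name and numerical defaults are our own illustrative choices, not the authors' code.

```python
import torch


def lambda_max(loss, params, iters=50, tol=1e-4):
    """Estimate the top Hessian eigenvalue of `loss` w.r.t. `params`
    via power iteration on Hessian-vector products.

    Power iteration converges to the eigenvalue of largest magnitude,
    which for trained networks is typically the largest positive one.
    """
    grads = torch.autograd.grad(loss, params, create_graph=True)
    g = torch.cat([gr.reshape(-1) for gr in grads])
    v = torch.randn_like(g)
    v /= v.norm()
    eig = 0.0
    for _ in range(iters):
        # Hessian-vector product: differentiate (g . v) w.r.t. the parameters.
        hv = torch.autograd.grad(g @ v, params, retain_graph=True)
        hv = torch.cat([h.reshape(-1) for h in hv])
        new_eig = (v @ hv).item()
        v = hv / (hv.norm() + 1e-12)
        if abs(new_eig - eig) <= tol * max(abs(new_eig), 1.0):
            return new_eig
        eig = new_eig
    return eig
```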

The authors introduce Mini-Batch Sharpness (MiniBS, called Batch Sharpness in the abstract) as the central measure: the expected directional curvature of the mini-batch Hessian along the corresponding stochastic gradient, i.e., roughly the average over mini-batches of $\nabla L_B^\top H_B \nabla L_B / \|\nabla L_B\|^2$. They argue that it is MiniBS, rather than the full-batch sharpness itself, that stabilizes at $2/\eta$ under SGD, and that this is the key to understanding the training dynamics.
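Under this reading, a rough PyTorch sketch of the quantity is given below: it averages, over a few sampled mini-batches, the curvature of each mini-batch loss along that batch's own gradient. The helper name `batch_sharpness` and the number of sampled batches are assumptions made for illustration, not the authors' implementation.

```python
import torch


def batch_sharpness(model, loss_fn, loader, n_batches=16, device="cpu"):
    """Average over sampled mini-batches of g_B^T H_B g_B / ||g_B||^2."""
    params = [p for p in model.parameters() if p.requires_grad]
    vals = []
    for i, (x, y) in enumerate(loader):
        if i >= n_batches:
            break
        x, y = x.to(device), y.to(device)
        loss = loss_fn(model(x), y)
        # Mini-batch gradient, kept in the graph for the Hessian-vector product.
        grads = torch.autograd.grad(loss, params, create_graph=True)
        g = torch.cat([gr.reshape(-1) for gr in grads])
        # H_B g_B: differentiate g . stop_grad(g) with respect to the parameters.
        hv = torch.autograd.grad(g @ g.detach(), params)
        hvg = torch.cat([h.reshape(-1) for h in hv])
        g_det = g.detach()
        vals.append((g_det @ hvg / (g_det @ g_det)).item())
    return sum(vals) / len(vals)
```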

Theoretical and Empirical Validation

Through empirical validation across various datasets and deep learning architectures, the paper demonstrates that MiniBS consistently hovers near $2/\eta$, mirroring the behavior of full-batch sharpness (FullBS, i.e., $\lambda_{\max}$ of the full-batch Hessian) at the EoS, but adapted to the stochastic nature of SGD. This helps explain why SGD is often viewed as implicitly regularizing and promoting convergence towards flatter minima, a property that is highly desirable for generalization in deep learning.
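Combining the two sketches above, one illustrative way to watch this regime emerge is to log both measures against the $2/\eta$ threshold on a toy problem; everything in the snippet below (model, data, hyperparameters) is a placeholder of our choosing, not an experiment from the paper.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X, Y = torch.randn(512, 10), torch.randn(512, 1)
loader = DataLoader(TensorDataset(X, Y), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(10, 64), nn.Tanh(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()
eta = 0.05
opt = torch.optim.SGD(model.parameters(), lr=eta)
params = [p for p in model.parameters() if p.requires_grad]

for epoch in range(20):
    for x, y in loader:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    # Compare both sharpness measures against the 2/eta stability threshold.
    lam = lambda_max(loss_fn(model(X), Y), params)   # full-batch sharpness
    bs = batch_sharpness(model, loss_fn, loader)     # mini-batch sharpness
    print(f"epoch {epoch:2d}: 2/eta={2 / eta:.1f}  "
          f"batch sharpness={bs:.1f}  lambda_max={lam:.1f}")
```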

Notably, the authors examine the gap between MiniBS and FullBS and how it shrinks with the batch size $b$. While arguments based on random matrix theory predict a $1/b$ scaling, the empirical evidence points to a scaling closer to $b^{-0.7}$, an inconsistency that marks a rich area for further exploration.
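An exponent of this kind would typically be read off a least-squares fit on log-log axes of the gap against the batch size $b$; a generic sketch, with placeholder numbers rather than the paper's measurements, might look as follows.

```python
# Generic power-law fit: gap(b) ~ C * b^alpha, estimated by least squares
# on log-log axes. The numbers below are placeholders, not the paper's data.
import numpy as np

batch_sizes = np.array([8, 16, 32, 64, 128, 256])
gaps = np.array([4.1, 2.6, 1.6, 1.0, 0.62, 0.40])  # e.g. MiniBS - FullBS

alpha, log_c = np.polyfit(np.log(batch_sizes), np.log(gaps), deg=1)
print(f"fitted exponent alpha ~ {alpha:.2f}, prefactor C ~ {np.exp(log_c):.2f}")
```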

Implications and Future Directions

This work impacts both theoretical inquiry and practical optimization strategies. The introduction of EoSS broadens the understanding of stability in mini-batch training regimes, providing a scaffold for developing optimization algorithms that harness these insights for more robust and efficient learning.

Moreover, the findings prompt a reconsideration of the common simplification that models SGD as gradient descent perturbed by isotropic noise: the results underline the critical role of the directional structure of gradient noise in SGD's regularization. This invites further research into stochastic models of SGD dynamics that capture these anisotropic effects more faithfully.

While promising, the paper also highlights several open questions—particularly around understanding the divergence between theoretically and empirically observed scaling laws, and how stochastic frameworks might be refined to align with observed mini-batch behaviors.

In conclusion, Andreyev and Beneventano’s work significantly enhances our understanding of SGD dynamics at the edge of stability, presenting avenues for research into modeling SGD's behavior in high-dimensional non-convex landscapes, ultimately propelling forward the design of more transparent and theoretically grounded neural network training methodologies.
