Simplicity Bias via Global Convergence of Sharpness Minimization (2410.16401v1)

Published 21 Oct 2024 in cs.LG, math.ST, stat.ML, and stat.TH

Abstract: The remarkable generalization ability of neural networks is usually attributed to the implicit bias of SGD, which often yields models with lower complexity using simpler (e.g. linear) and low-rank features. Recent works have provided empirical and theoretical evidence for the bias of particular variants of SGD (such as label noise SGD) toward flatter regions of the loss landscape. Despite the folklore intuition that flat solutions are 'simple', the connection with the simplicity of the final trained model (e.g. low-rank) is not well understood. In this work, we take a step toward bridging this gap by studying the simplicity structure that arises from minimizers of the sharpness for a class of two-layer neural networks. We show that, for any high dimensional training data and certain activations, with small enough step size, label noise SGD always converges to a network that replicates a single linear feature across all neurons; thereby, implying a simple rank one feature matrix. To obtain this result, our main technical contribution is to show that label noise SGD always minimizes the sharpness on the manifold of models with zero loss for two-layer networks. Along the way, we discover a novel property -- a local geodesic convexity -- of the trace of Hessian of the loss at approximate stationary points on the manifold of zero loss, which links sharpness to the geometry of the manifold. This tool may be of independent interest.

Summary

  • The paper establishes that the implicit simplicity bias of (label noise) SGD emerges from minimizing the trace of the loss Hessian over the manifold of zero-loss solutions.
  • It proves that, under regularity and data conditions, the associated gradient flow converges exponentially fast, using a local geodesic convexity (g-convexity) property at approximate stationary points.
  • The framework ties sharpness reduction to simplicity bias in the form of a rank-one feature matrix, offering insight into how overparameterized networks settle on simple solutions.

Overview of "Simplicity Bias via Global Convergence of Sharpness Minimization"

The paper "Simplicity Bias via Global Convergence of Sharpness Minimization" by Khashayar Gatmiry examines the implicit bias of stochastic gradient descent (SGD) in overparameterized neural networks. Specifically, it investigates how this bias yields models that generalize even in the absence of explicit regularization and after the training loss has reached zero.

Key Contributions

The work focuses on the connection between sharpness minimization and simplicity bias under SGD. It seeks to provide a theoretical basis for how SGD's implicit bias leads to simpler solutions, characterized both by less complex features and by lower sharpness of the loss landscape. To connect these phenomena, the paper analyzes the behavior of SGD, specifically label noise SGD, through the lens of Hessian trace minimization; the sketch below illustrates the quantity being minimized.
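
To make the central quantity concrete, the following sketch estimates the implicit regularizer, the trace of the Hessian of the training loss, for a two-layer network using Hutchinson's estimator with Hessian-vector products. The architecture (tanh activation, frozen second layer) and the function names are illustrative assumptions made for this summary, not the exact setup analyzed in the paper.

    # Hedged sketch: Hutchinson estimate of Tr(H) for the loss of a two-layer
    # network. The tanh activation and frozen second layer are illustrative
    # assumptions, not the paper's exact model.
    import jax
    import jax.numpy as jnp

    def two_layer_loss(W, X, y):
        # W: (m, d) hidden-layer weights; the second layer is frozen at 1/m.
        preds = jnp.mean(jnp.tanh(X @ W.T), axis=1)
        return jnp.mean((preds - y) ** 2)

    def trace_of_hessian(W, X, y, key, num_probes=16):
        # Tr(H) is approximated by the average of v^T H v over Gaussian probes v.
        grad_fn = jax.grad(lambda w: two_layer_loss(w, X, y))

        def one_probe(k):
            v = jax.random.normal(k, W.shape, dtype=W.dtype)
            _, hvp = jax.jvp(grad_fn, (W,), (v,))  # forward-over-reverse HVP
            return jnp.vdot(v, hvp)

        keys = jax.random.split(key, num_probes)
        return jnp.mean(jax.vmap(one_probe)(keys))

According to the paper's analysis, once the training loss is (near) zero, this is the quantity that label noise SGD implicitly drives down on the zero-loss manifold.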

Theoretical Insights

  1. Stationary Points and Global Minima: The paper characterizes the stationary points of the trace of the Hessian on the manifold of zero loss. Under regularity assumptions on the activation and coherence conditions on the data, it shows that these stationary points are in fact global minimizers, and that at any global optimum every neuron replicates the same linear feature within the subspace spanned by the data, yielding a rank-one feature matrix consistent with the simplicity bias observed empirically for SGD (a numerical check of this structure is sketched after this list).
  2. Gradient Flow Convergence: The research establishes conditions under which the gradient flow of the Hessian trace on the zero-loss manifold converges exponentially fast to a global minimum. This gives a strong guarantee on the limiting behavior of label noise SGD in overparameterized networks despite the non-convexity of the loss landscape.
  3. Local g-Convexity and Convergence Rates: A key technical tool is a local geodesic convexity (g-convexity) of the trace of the Hessian at approximate stationary points on the manifold of zero loss. This property links sharpness to the geometry of the manifold and is what allows global convergence to be proved without additional assumptions.
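
As a concrete illustration of the rank-one prediction in item 1, the sketch below measures the effective rank of the first-layer weights after projecting each neuron onto the span of the data; a result of 1 matches the "single replicated linear feature" structure. The helper name and tolerances are hypothetical, introduced only for this summary.

    # Hedged sketch: check the rank-one feature structure predicted at
    # sharpness minimizers. Helper name and tolerances are illustrative.
    import jax.numpy as jnp

    def projected_weight_rank(W, X, tol=1e-3):
        # Orthonormal basis for the span of the data (rows of X) via thin SVD.
        _, S, Vt = jnp.linalg.svd(X, full_matrices=False)
        basis = Vt[S > 1e-8 * S[0]]          # rows span the data subspace
        W_proj = W @ basis.T @ basis         # project each neuron onto span(X)
        sv = jnp.linalg.svd(W_proj, compute_uv=False)
        # Number of singular values above tol * largest: a value of 1 means
        # every neuron replicates (up to scale) the same linear feature.
        return int(jnp.sum(sv > tol * sv[0]))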

Methodological Approach

The paper applies concepts from differential geometry to study the model on the manifold of zero loss. By analyzing the gradient flow of the Hessian trace on this manifold and leveraging the assumptions on the data and activation functions, it derives results that not only establish convergence but also give explicit convergence rates; a conceptual sketch of this manifold viewpoint follows.
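
The sketch below is a rough, non-authoritative rendering of that viewpoint: the gradient of a regularizer (for instance the Hessian-trace estimate above) is projected onto the tangent space of the zero-loss set, a small step is taken, and a few gradient steps on the loss act as a crude retraction back to the manifold. The function names and the retraction heuristic are assumptions for illustration; the paper's actual dynamics are driven by label noise SGD.

    # Hedged sketch of gradient flow for a regularizer over the zero-loss set.
    # residual_fn(W) returns per-example residuals; reg_fn(W) is the scalar
    # regularizer (e.g. an estimate of Tr(H)). Names and the retraction
    # heuristic are illustrative, not the paper's algorithm.
    import jax
    import jax.numpy as jnp

    def manifold_regularizer_step(W, residual_fn, reg_fn,
                                  lr=1e-3, retract_steps=5, retract_lr=1e-2):
        n = residual_fn(W).shape[0]
        J = jax.jacobian(residual_fn)(W).reshape(n, -1)   # residual Jacobian
        g = jax.grad(reg_fn)(W).reshape(-1)               # regularizer gradient
        # Project g onto the null space of J, i.e. the tangent space of
        # {W : residual_fn(W) = 0}, so the step stays at zero loss to first order.
        g_tan = g - jnp.linalg.pinv(J) @ (J @ g)
        W_new = W - lr * g_tan.reshape(W.shape)
        # Crude retraction: a few gradient-descent steps on the squared residuals.
        loss_fn = lambda w: jnp.mean(residual_fn(w) ** 2)
        for _ in range(retract_steps):
            W_new = W_new - retract_lr * jax.grad(loss_fn)(W_new)
        return W_new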

Implications and Future Work

The findings have both theoretical and practical implications. Theoretically, they enhance our understanding of SGD's implicit biases and the role of loss sharpness in achieving simplicity. Practically, these insights can inform better initialization strategies and optimization techniques for training overparameterized models.

As for future directions, extending the analysis to different network architectures, other loss functions, and broader classes of activation functions could yield further insights.

Conclusion

The paper offers significant advancements in comprehending the implicit behaviors of SGD. By linking the minimization of Hessian trace with simplicity biases and demonstrating robust convergence properties, it provides a foundational understanding that could influence future research in deep learning optimization and generalization.
