Stochastic gradient descent with noise of machine learning type. Part II: Continuous time analysis (2106.02588v2)

Published 4 Jun 2021 in cs.LG, math.AP, and stat.ML

Abstract: The representation of functions by artificial neural networks depends on a large number of parameters in a non-linear fashion. Suitable parameters of these are found by minimizing a 'loss functional', typically by stochastic gradient descent (SGD) or an advanced SGD-based algorithm. In a continuous time model for SGD with noise that follows the 'machine learning scaling', we show that in a certain noise regime, the optimization algorithm prefers 'flat' minima of the objective function in a sense which is different from the flat minimum selection of continuous time SGD with homogeneous noise.

Citations (32)

Summary

  • The paper introduces a modified SGD framework incorporating noise that scales with the objective function, demonstrating invariant measure properties.
  • It employs continuous-time analysis and the Poincaré-Hardy inequality to prove exponential convergence to the invariant distribution, which concentrates near flat minima.
  • The findings underscore the importance of noise characteristics in selecting flat minima, which influences SGD's generalizability in deep learning.

Continuous-Time Analysis of Stochastic Gradient Descent with Machine Learning Noise

Introduction

The paper "Stochastic Gradient Descent with Noise of Machine Learning Type. Part II: Continuous Time Analysis" (2106.02588) presents a comprehensive paper of a modified stochastic gradient descent (SGD) model considering machine learning-specific noise in continuous time. This noise model is pivotal since it mimics realistic conditions in overparameterized deep learning architectures where traditional isotropic and homogeneous noise assumptions fall short. The research explores invariant measures, convergence dynamics, and the implications of flat minimum selection due to the noise's scaling properties.

Noise Modeling and Invariant Measures

Traditional noise models in SGD often assume isotropic, homogeneous Gaussian perturbations, which do not reflect the conditions under which SGD is actually applied in machine learning. The paper instead explores a noise model in which the covariance $\Sigma(\theta)$ of the noise scales with the objective function $f$. This scaling is characterized by $\Sigma = \eta\sigma f\, I_{m\times m}$, leading to a stochastic differential equation (SDE) that governs the SGD dynamics.
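
As a concrete (and deliberately simplified) illustration of this noise model, the sketch below discretizes the resulting SDE with an Euler-Maruyama scheme for a one-dimensional quadratic objective. The parameter values and the choice of objective are assumptions made here for illustration only and are not taken from the paper.

```python
# Minimal sketch (assumed toy setup, not the paper's experiments): Euler-Maruyama
# discretization of  d(theta) = -grad f(theta) dt + sqrt(eta * sigma * f(theta)) dW,
# i.e. SGD-like dynamics whose noise covariance Sigma = eta * sigma * f * I scales with f.
import numpy as np

rng = np.random.default_rng(0)

def f(theta):          # toy objective: a single quadratic well
    return 0.5 * theta ** 2

def grad_f(theta):
    return theta

eta, sigma = 0.1, 1.0  # assumed learning-rate and noise-scale parameters
dt, n_steps = 1e-3, 100_000
theta = 2.0

for _ in range(n_steps):
    dW = rng.normal() * np.sqrt(dt)
    theta += -grad_f(theta) * dt + np.sqrt(eta * sigma * f(theta)) * dW

print("theta after simulation:", theta)
```

Note that the noise term vanishes wherever $f = 0$, which is the qualitative feature that distinguishes this scaling from homogeneous noise.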

The authors derive the invariant measure of this SDE and show that it is of the form $f^\alpha$ with $\alpha = -\frac{1 + \eta\sigma}{\eta\sigma}$. They note that for the measure to be integrable (a requirement for it to define a probability distribution), $\alpha$ must satisfy certain bounds related to the growth rate of $f$. These conditions ensure the existence of an invariant distribution that concentrates around 'flat' minima.
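
To make the role of the exponent concrete, the toy computation below evaluates $\alpha$ for a few assumed values of $\eta\sigma$ and numerically probes the mass of the unnormalized density $f^\alpha$ near the minimum and in the tail for a one-dimensional quadratic $f$. The paper's actual integrability conditions depend on the local and global growth of $f$, so this is only a heuristic check.

```python
# Heuristic sketch (toy 1-D quadratic, assumed parameters): the exponent of the
# unnormalized invariant density f^alpha and crude numerical mass estimates near the
# minimum and in the tail. Divergent mass near theta = 0 signals non-integrability,
# i.e. concentration of the invariant measure at the minimum.
import numpy as np

def alpha(eta, sigma):
    return -(1.0 + eta * sigma) / (eta * sigma)

def mass(a, lo, hi, n=2_000_000):
    # Riemann-sum estimate of the integral of (0.5 * theta^2)^a over [lo, hi]
    theta = np.linspace(lo, hi, n)
    return np.sum((0.5 * theta ** 2) ** a) * (hi - lo) / n

for eta_sigma in (0.5, 2.0):
    a = alpha(1.0, eta_sigma)
    print(f"eta*sigma = {eta_sigma}: alpha = {a:.2f}, "
          f"mass on [1e-6, 1] = {mass(a, 1e-6, 1.0):.3e}, "
          f"mass on [1, 1e3] = {mass(a, 1.0, 1e3):.3e}")
```

In this one-dimensional toy example the mass near the minimum diverges for both parameter choices; whether that happens in general depends on how $f$ grows near its minima and at infinity, which is exactly what the paper's integrability conditions control.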

Convergence and Minimum Selection

A focal point of the analysis is demonstrating that, under specific conditions, solutions of this model converge to the invariant distribution exponentially fast. A Poincaré-Hardy inequality is the key tool in establishing this rate, and the resulting convergence is sensitive to how the objective function scales both locally (near minima) and globally (at infinity).
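
The convergence statement itself is proved analytically via the Poincaré-Hardy inequality. As a purely numerical companion, the assumed toy experiment below tracks the ensemble mean of the objective under the same quadratic model used above and fits an exponential decay rate to it; it illustrates exponential relaxation in one special case only and is not a substitute for the functional-inequality argument.

```python
# Illustrative sketch (assumed toy model, not the paper's proof technique): simulate an
# ensemble of trajectories of  d(theta) = -theta dt + sqrt(eta * sigma * 0.5 * theta^2) dW
# and fit the exponential decay rate of the ensemble mean of f(theta) = 0.5 * theta^2.
import numpy as np

rng = np.random.default_rng(1)
eta, sigma = 0.1, 1.0                 # assumed parameters
dt, n_steps, n_paths = 1e-3, 3_000, 5_000
theta = np.full(n_paths, 2.0)

mean_f = np.empty(n_steps)
for k in range(n_steps):
    f = 0.5 * theta ** 2
    mean_f[k] = f.mean()
    theta += -theta * dt + np.sqrt(eta * sigma * f * dt) * rng.normal(size=n_paths)

t = np.arange(n_steps) * dt
rate = np.polyfit(t, np.log(mean_f), 1)[0]
# For this toy SDE, Ito's formula gives dE[f]/dt = -(2 - eta*sigma/2) E[f],
# so the fitted rate should be close to -(2 - 0.05) = -1.95.
print("fitted exponential decay rate of E[f]:", rate)
```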

The paper further highlights the role of 'flatness' in minimum selection. Specifically, as the noise amplitude (controlled through $\eta$ and $\sigma$) approaches a critical threshold, solutions are drawn to flat minima defined by the minimal eigenvalues of the Hessian orthogonal to the solution manifold. Interestingly, the notion of flatness here differs from that in models considering homogeneous noise. The paper identifies critical regimes where invariant distributions cease to be Lebesgue integrable, leading to concentration on the minima manifold.
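
This notion of flatness can be made concrete on a small overparameterized example. The sketch below uses an assumed toy model, $f(a,b) = \tfrac{1}{2}(ab - 1)^2$, whose global minima form the curve $ab = 1$, and computes the Hessian at several points of that minima manifold; the single nonzero eigenvalue, associated with the direction orthogonal to the manifold, is smallest at $(1, 1)$, which is the 'flattest' minimum in the sense used above. This is only an illustration of the flatness criterion, not the paper's construction.

```python
# Toy illustration (assumed example, not the paper's construction): for
# f(a, b) = 0.5 * (a*b - 1)^2 the global minima form the manifold {a*b = 1}.
# At a minimum the Hessian has one zero eigenvalue (tangent to the manifold) and one
# positive eigenvalue (orthogonal direction); the latter measures 'sharpness'.
import numpy as np

def hessian(a, b):
    # Hessian of f(a, b) = 0.5 * (a*b - 1)^2; at a minimum the cross term 2*a*b - 1 = 1.
    return np.array([[b * b, 2 * a * b - 1.0],
                     [2 * a * b - 1.0, a * a]])

for t in (0.5, 1.0, 2.0):
    a, b = t, 1.0 / t                       # a point on the minima manifold
    eigvals = np.linalg.eigvalsh(hessian(a, b))
    print(f"minimum (a, b) = ({a}, {b}): Hessian eigenvalues = {np.round(eigvals, 3)}")
# The orthogonal eigenvalue equals a^2 + b^2, minimized at (1, 1): the flattest minimum.
```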

Implications and Future Directions

This research has significant implications for understanding the behavior of SGD in overparameterized regimes typical in deep learning. It challenges the conventional wisdom by proposing that flatness, as defined by this model, is crucial for generalizability. The findings emphasize that SGD's success in deep models is intricately linked to the peculiarities of the noise it encounters.

Future research directions include extending this analysis to account for anisotropic noise, which more accurately models machine learning scenarios, particularly in mini-batch SGD. Additionally, the impact of heavy-tailed noise distributions could provide further insights into the robustness and convergence properties of practical learning algorithms.

Conclusion

The paper rigorously advances the theoretical understanding of SGD in the machine learning context by developing a continuous-time framework that accommodates noise scaling with the optimization landscape. It provides critical insights into how noise influences convergence and the propensity of SGD to select flat minima, emphasizing the nuanced interactions between noise characteristics and optimization geometry. These findings pave the way for future explorations into more complex noise models and their implications for deep learning efficacy.
