Power-law escape rate of SGD (2105.09557v2)
Abstract: Stochastic gradient descent (SGD) is subject to complicated multiplicative noise for the mean-square loss. We use this property of SGD noise to derive a stochastic differential equation (SDE) with simpler additive noise by performing a random time change. Using this formalism, we show that the log loss barrier $\Delta\log L=\log[L(\theta^s)/L(\theta^*)]$ between a local minimum $\theta^*$ and a saddle $\theta^s$ determines the escape rate of SGD from the local minimum. This is contrary to previous results borrowed from physics, in which the linear loss barrier $\Delta L=L(\theta^s)-L(\theta^*)$ determines the escape rate. Our escape-rate formula depends strongly on the typical magnitude $h^*$ and the number $n$ of the outlier eigenvalues of the Hessian. This result explains the empirical fact that SGD prefers flat minima with low effective dimensions, giving insight into the implicit biases of SGD.
- Takashi Mori
- Liu Ziyin
- Kangqiao Liu
- Masahito Ueda
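
The two barriers defined in the abstract can be compared numerically. The sketch below is only an illustration: the function name `loss_barriers` and the loss values are hypothetical and not taken from the paper, and it does not reproduce the paper's escape-rate formula. It shows one elementary difference between the two quantities: rescaling the loss by a constant factor changes the linear barrier $\Delta L$ but leaves the log barrier $\Delta\log L$ unchanged.

```python
import math


def loss_barriers(loss_at_minimum, loss_at_saddle):
    """Return (linear barrier, log barrier) for two strictly positive loss values.

    linear barrier: L(theta^s) - L(theta^*)
    log barrier:    log[L(theta^s) / L(theta^*)]
    """
    if loss_at_minimum <= 0 or loss_at_saddle <= 0:
        raise ValueError("the log barrier needs strictly positive losses")
    linear = loss_at_saddle - loss_at_minimum
    log = math.log(loss_at_saddle / loss_at_minimum)
    return linear, log


# Hypothetical loss values at the local minimum and at the saddle,
# chosen only for illustration.
L_min, L_saddle = 0.02, 0.08
linear, log_barrier = loss_barriers(L_min, L_saddle)

# Multiply the whole loss by a constant factor and recompute both barriers.
scale = 10.0
linear_scaled, log_scaled = loss_barriers(scale * L_min, scale * L_saddle)

print("original loss: linear =", linear, " log =", log_barrier)
print("rescaled loss: linear =", linear_scaled, " log =", log_scaled)
```

Here the log barrier stays at $\log 4$ in both cases, while the linear barrier grows with the scale factor.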