On the diffusion approximation of nonconvex stochastic gradient descent (1705.07562v2)

Published 22 May 2017 in stat.ML and cs.LG

Abstract: We study the Stochastic Gradient Descent (SGD) method in nonconvex optimization problems from the point of view of approximating diffusion processes. We prove rigorously that the diffusion process can approximate the SGD algorithm weakly using the weak form of the master equation for probability evolution. In the small step size regime and in the presence of omnidirectional noise, our weak approximating diffusion process suggests the following dynamics for the SGD iteration starting from a local minimizer (resp. saddle point): it escapes in a number of iterations exponentially (resp. almost linearly) dependent on the inverse step size. The results are obtained using the theory of random perturbations of dynamical systems (the theory of large deviations for local minimizers and the theory of exiting for unstable stationary points). In addition, we discuss the effects of batch size for deep neural networks, and we find that a small batch size helps SGD algorithms escape unstable stationary points and sharp minimizers. Our theory indicates that one should increase the batch size at a later stage so that SGD is trapped in flat minimizers for better generalization.
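The contrast the abstract draws between escaping a local minimizer and escaping an unstable stationary point can be illustrated with a small simulation. The following is a minimal sketch, not the paper's construction: the one-dimensional double-well objective, the additive Gaussian gradient noise, the exit intervals, and every parameter value (`eta`, `sigma`, `max_iters`, `trials`) are assumptions chosen for illustration.

```python
import random


def grad(x):
    # Gradient of the toy objective f(x) = (x**2 - 1)**2 / 4 (an assumption).
    # Stationary points: local minimizers at x = -1 and x = 1, unstable point at x = 0.
    return x * (x * x - 1.0)


def sgd_exit_time(x0, eta, sigma, left, right, max_iters, rng):
    """Run noisy SGD from x0 and return the first iteration at which the
    iterate leaves the open interval (left, right), capped at max_iters."""
    x = x0
    for k in range(1, max_iters + 1):
        # Stochastic gradient: true gradient plus additive Gaussian noise.
        noisy_grad = grad(x) + sigma * rng.gauss(0.0, 1.0)
        x -= eta * noisy_grad
        if not (left < x < right):
            return k
    return max_iters


def mean_exit_time(x0, left, right, eta=0.05, sigma=1.0,
                   max_iters=5000, trials=20, seed=0):
    """Average the (capped) exit time over several independent runs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += sgd_exit_time(x0, eta, sigma, left, right, max_iters, rng)
    return total / trials


if __name__ == "__main__":
    # Start at the unstable stationary point x = 0; exit once |x| >= 0.5.
    unstable = mean_exit_time(0.0, -0.5, 0.5)
    # Start at the local minimizer x = 1; exit once the barrier at x = 0 is crossed.
    minimizer = mean_exit_time(1.0, 0.0, float("inf"))
    print("mean exit time from unstable point:", unstable)
    print("mean exit time from local minimizer (capped):", minimizer)
```

Because the runs from the minimizer are capped at `max_iters`, the second number is only a lower bound on the true mean escape time. The sketch shows the qualitative gap between the two regimes, not the scaling rates stated in the abstract.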

Authors (4)
  1. Wenqing Hu (26 papers)
  2. Chris Junchi Li (25 papers)
  3. Lei Li (1293 papers)
  4. Jian-Guo Liu (152 papers)
Citations (24)
