On Large Batch Training and Sharp Minima: A Fokker-Planck Perspective (2112.00987v1)

Published 2 Dec 2021 in cs.LG, math.ST, and stat.TH

Abstract: We study the statistical properties of the dynamic trajectory of stochastic gradient descent (SGD). We approximate mini-batch SGD and momentum SGD as stochastic differential equations (SDEs), and exploit this continuous formulation together with the theory of Fokker-Planck equations to develop new results on the escaping phenomenon and its relationship to large batches and sharp minima. In particular, we find that the solution of the stochastic process tends to converge to flatter minima regardless of the batch size in the asymptotic regime, whereas the convergence rate is rigorously proven to depend on the batch size. These results are validated empirically on a range of datasets and models.
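
For background, here is a minimal sketch of the SDE-to-Fokker-Planck construction the abstract refers to, written under the simplifying assumption of isotropic gradient noise with scale sigma^2 (the paper's own formulation may be more general):

% Mini-batch SGD with learning rate \eta and batch size B:
%   \theta_{k+1} = \theta_k - \eta\,\hat\nabla L(\theta_k),
% where the mini-batch gradient noise has covariance \sigma^2 I / B.
% Its continuous-time approximation is the SDE
\[
  d\theta_t = -\nabla L(\theta_t)\,dt + \sqrt{\frac{\eta\sigma^2}{B}}\, dW_t ,
\]
% and the density p(\theta, t) of \theta_t evolves under the
% Fokker-Planck equation
\[
  \partial_t p = \nabla\!\cdot\!\big(p\,\nabla L\big) + \frac{\eta\sigma^2}{2B}\,\Delta p ,
\]
% whose stationary solution is the Gibbs density
\[
  p_\infty(\theta) \propto \exp\!\Big(-\frac{2B}{\eta\sigma^2}\,L(\theta)\Big).
\]

In this simplified picture, a Laplace approximation gives the stationary mass of a basin a factor of det(\nabla^2 L)^{-1/2} at its minimum, so among minima of equal loss the flatter ones are favored at any batch size; B instead sets the effective temperature \eta\sigma^2/(2B) and hence the speed of relaxation toward p_\infty, consistent with the abstract's claim that only the convergence rate depends on the batch size.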

Authors (2)
  1. Xiaowu Dai (28 papers)
  2. Yuhua Zhu (26 papers)
Citations (4)
