
A Bayesian Perspective on Generalization and Stochastic Gradient Descent (1710.06451v3)

Published 17 Oct 2017 in cs.LG, cs.AI, and stat.ML

Abstract: We consider two questions at the heart of machine learning; how can we predict if a minimum will generalize to the test set, and why does stochastic gradient descent find minima that generalize well? Our work responds to Zhang et al. (2016), who showed deep neural networks can easily memorize randomly labeled training data, despite generalizing well on real labels of the same inputs. We show that the same phenomenon occurs in small linear models. These observations are explained by the Bayesian evidence, which penalizes sharp minima but is invariant to model parameterization. We also demonstrate that, when one holds the learning rate fixed, there is an optimum batch size which maximizes the test set accuracy. We propose that the noise introduced by small mini-batches drives the parameters towards minima whose evidence is large. Interpreting stochastic gradient descent as a stochastic differential equation, we identify the "noise scale" $g = \epsilon (\frac{N}{B} - 1) \approx \epsilon N/B$, where $\epsilon$ is the learning rate, $N$ the training set size and $B$ the batch size. Consequently the optimum batch size is proportional to both the learning rate and the size of the training set, $B_{opt} \propto \epsilon N$. We verify these predictions empirically.

Authors (2)
  1. Samuel L. Smith (27 papers)
  2. Quoc V. Le (128 papers)
Citations (229)

Summary

The paper "A Bayesian Perspective on Generalization and Stochastic Gradient Descent" by Samuel L. Smith and Quoc V. Le offers an insightful exploration of key phenomena in machine learning, particularly focusing on the generalization capabilities of models trained using stochastic gradient descent (SGD). The authors approach this investigation through the lens of Bayesian statistics, offering a framework that reconciles empirical observations with theoretical insights.

The investigation is structured around two main questions: how to predict whether a minimum found during training will generalize well, and why SGD so often finds such minima. The work responds to Zhang et al. (2016), who demonstrated that deep neural networks can memorize data with random labels yet generalize well on real labels of the same inputs. Smith and Le extend these observations to small linear models, showing that the ability to memorize random labels is not unique to deep networks.
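
To make the phenomenon concrete, the following minimal sketch (not from the paper; the feature dimension, sample counts, and use of a minimum-norm least-squares fit are illustrative assumptions) shows an over-parameterized linear model interpolating random training labels while performing at chance on held-out points.

```python
# Minimal sketch: an over-parameterized linear model memorizing random labels.
# The dimensions and the least-squares fit are illustrative assumptions,
# not the exact setup used by Smith and Le.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 50, 50, 200   # more parameters than training points

X_train = rng.standard_normal((n_train, d))
X_test = rng.standard_normal((n_test, d))
y_train = rng.choice([-1.0, 1.0], size=n_train)   # random labels
y_test = rng.choice([-1.0, 1.0], size=n_test)

# The minimum-norm least-squares solution interpolates the random training labels.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_acc = np.mean(np.sign(X_train @ w) == y_train)
test_acc = np.mean(np.sign(X_test @ w) == y_test)
print(f"train accuracy: {train_acc:.2f}")   # 1.00: perfect memorization
print(f"test accuracy:  {test_acc:.2f}")    # ~0.50: chance level
```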

Key to their argument is the concept of Bayesian evidence, which penalizes sharp minima that exhibit high curvature. This framework accounts for existing observations that sharper minima tend to generalize worse than broader ones. Because the Bayesian evidence is invariant under model reparameterization, it also addresses the critique, raised by other researchers, that curvature-based generalization criteria can be manipulated by reparameterizing the model.
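
The curvature penalty can be seen in the Laplace approximation to the log evidence: for a regularized cost $C(w)$ with a Gaussian prior of precision $\lambda$, the log evidence is roughly $-C(w^*) - \frac{1}{2}\sum_i \ln(\lambda_i/\lambda)$, where $\lambda_i$ are the eigenvalues of the Hessian at the minimum, so large eigenvalues (sharp minima) reduce the evidence. The sketch below computes this quantity for a small L2-regularized logistic regression; the synthetic data, prior precision, and optimizer settings are assumptions for illustration, not the paper's experimental setup.

```python
# Sketch of the Laplace-approximated log evidence for a small L2-regularized
# logistic regression. Dataset and hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 1.0                      # samples, parameters, prior precision
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

def cost_grad(w):
    """Regularized negative log-likelihood C(w) and its gradient."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    nll = -np.sum(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    C = nll + 0.5 * lam * w @ w
    grad = X.T @ (p - y) + lam * w
    return C, grad

# Plain gradient descent to a minimum w* (step size chosen by hand).
w = np.zeros(d)
for _ in range(5000):
    _, g = cost_grad(w)
    w -= 0.01 * g
C_min, _ = cost_grad(w)

# Hessian of C at w*: X^T diag(p(1-p)) X + lam * I.
p = 1.0 / (1.0 + np.exp(-X @ w))
H = X.T @ (X * (p * (1 - p))[:, None]) + lam * np.eye(d)
eigs = np.linalg.eigvalsh(H)

# log evidence ≈ -C(w*) - 1/2 * sum_i ln(eig_i / lam): sharp minima are penalized.
log_evidence = -C_min - 0.5 * np.sum(np.log(eigs / lam))
print(f"Laplace log evidence ≈ {log_evidence:.2f}")
```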

Additionally, the paper explores the influence of SGD's inherent noise by interpreting SGD as a stochastic differential equation. The authors propose the notion of a "noise scale" in SGD, which they express as $g = \epsilon (\frac{N}{B} - 1) \approx \epsilon N/B$, where $\epsilon$ is the learning rate, $N$ is the size of the training set, and $B$ is the batch size. Their analysis leads to a critical insight: when the learning rate is held fixed, there exists an optimal batch size that maximizes test accuracy, and this optimum is directly proportional to both the learning rate and the dataset size.
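
A small numerical sketch makes the scaling argument concrete: holding the noise scale $g \approx \epsilon N/B$ fixed while changing $\epsilon$ or $N$ forces the batch size to grow in proportion, which is the $B_{opt} \propto \epsilon N$ rule. The specific values of $\epsilon$, $N$, and $B$ below are arbitrary illustrations, not taken from the paper.

```python
# Sketch of the noise-scale argument: g = eps * (N/B - 1) ≈ eps * N / B.
# The numerical values are arbitrary illustrations, not taken from the paper.

def noise_scale(eps: float, N: int, B: int) -> float:
    """SGD noise scale g = eps * (N/B - 1), approximately eps * N / B for B << N."""
    return eps * (N / B - 1)

def batch_for_fixed_noise(g: float, eps: float, N: int) -> float:
    """Batch size that keeps the approximate noise scale eps * N / B equal to g."""
    return eps * N / g

eps, N, B = 0.1, 50_000, 128
g = noise_scale(eps, N, B)
print(f"noise scale g ≈ {g:.1f}")                                            # ≈ 39.0

# Holding g fixed, doubling the learning rate or the dataset size roughly
# doubles the optimal batch size: B_opt ∝ eps * N.
print(f"batch size for 2*eps: {batch_for_fixed_noise(g, 2 * eps, N):.0f}")   # ≈ 257
print(f"batch size for 2*N:   {batch_for_fixed_noise(g, eps, 2 * N):.0f}")   # ≈ 257
```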

Empirical validation is provided for these theoretical insights. Through experiments with both linear models and neural networks, Smith and Le confirm that the Bayesian evidence predicts generalization performance well, and that the optimum batch size varies linearly with both the learning rate and the training set size. Their results show that small mini-batches, by increasing the scale of gradient noise, help drive the parameters towards broad minima, which the Bayesian evidence suggests carry large posterior mass and therefore generalize better.

The implications of these findings are significant, suggesting practical strategies for optimizing training dynamics in machine learning. By adjusting batch sizes in proportion to learning rates and dataset sizes, practitioners can exploit increased parallelism in distributed systems, potentially reducing training times without sacrificing performance.
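
In practice this amounts to a linear scaling rule for the batch size, sketched below under assumed baseline values; the base learning rate, base batch size, and scaling factors are hypothetical, not settings reported by Smith and Le.

```python
# Hypothetical illustration of scaling batch size with learning rate so that
# the SGD noise scale eps * N / B stays roughly constant on a fixed dataset.

def scale_batch_size(base_batch: int, base_lr: float, new_lr: float) -> int:
    """Keep eps * N / B fixed on a fixed dataset: B_new = B_base * (new_lr / base_lr)."""
    return round(base_batch * new_lr / base_lr)

base_lr, base_batch = 0.1, 256

# Raising the learning rate while scaling the batch size to match keeps the
# noise scale roughly constant, allowing larger batches across more workers.
for factor in (1, 2, 4):
    lr = base_lr * factor
    print(f"lr={lr:.1f}  ->  batch size {scale_batch_size(base_batch, base_lr, lr)}")
```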

Importantly, the exploration into Bayesian interpretations of training suggests further areas for research. Future studies could deepen the connection between the noise landscape induced by SGD and the probabilistic framework provided by Bayesian methods. Extended studies might also examine the potential for these insights to improve the initialization, tuning, and regularization strategies of large-scale models.

In conclusion, Smith and Le's work advances our understanding of generalization in machine learning through a Bayesian framework, offering a coherent explanation for empirical phenomena and guiding practical SGD optimization strategies. Their approach underscores the utility of Bayesian ideas in explaining and predicting the complex behaviors of machine learning models in training and deployment.