Is SGD a Bayesian sampler? Well, almost (2006.15191v2)

Published 26 Jun 2020 in cs.LG and stat.ML

Abstract: Overparameterised deep neural networks (DNNs) are highly expressive and so can, in principle, generate almost any function that fits a training dataset with zero error. The vast majority of these functions will perform poorly on unseen data, and yet in practice DNNs often generalise remarkably well. This success suggests that a trained DNN must have a strong inductive bias towards functions with low generalisation error. Here we empirically investigate this inductive bias by calculating, for a range of architectures and datasets, the probability $P_{SGD}(f\mid S)$ that an overparameterised DNN, trained with stochastic gradient descent (SGD) or one of its variants, converges on a function $f$ consistent with a training set $S$. We also use Gaussian processes to estimate the Bayesian posterior probability $P_B(f\mid S)$ that the DNN expresses $f$ upon random sampling of its parameters, conditioned on $S$. Our main findings are that $P_{SGD}(f\mid S)$ correlates remarkably well with $P_B(f\mid S)$ and that $P_B(f\mid S)$ is strongly biased towards low-error and low complexity functions. These results imply that strong inductive bias in the parameter-function map (which determines $P_B(f\mid S)$), rather than a special property of SGD, is the primary explanation for why DNNs generalise so well in the overparameterised regime. While our results suggest that the Bayesian posterior $P_B(f\mid S)$ is the first order determinant of $P_{SGD}(f\mid S)$, there remain second order differences that are sensitive to hyperparameter tuning. A function probability picture, based on $P_{SGD}(f\mid S)$ and/or $P_B(f\mid S)$, can shed new light on the way that variations in architecture or hyperparameter settings such as batch size, learning rate, and optimiser choice, affect DNN performance.

Citations (47)

Summary

  • The paper demonstrates that the functions SGD converges to closely track the Bayesian posterior, with roughly 90% of SGD's probability weight falling on functions also found by the Bayesian sampler.
  • It reveals that the DNN parameter-function mapping is biased towards simpler, low-error functions, reinforcing the principles of Occam's razor.
  • It shows that architectural choices and hyperparameter settings noticeably influence SGD's probabilistic behavior, impacting model generalization.

Analyzing the Probabilistic Behavior of SGD in Deep Learning: A Bayesian Perspective

The paper "Is SGD a Bayesian sampler? Well, almost." offers an empirical investigation into the behavior of stochastic gradient descent (SGD) and its variants in the context of deep neural networks (DNNs), particularly in the regime where these networks are overparameterized. The paper explores whether the observed generalization capabilities of DNNs can be explained by viewing SGD as a Bayesian sampler.

Empirical Setup and Methodology

The authors investigate the hypothesis that the probability of a DNN converging to a particular function under SGD reflects the Bayesian posterior probability of that function under random sampling of the network's parameters; this posterior is estimated using a Gaussian process approximation to the network. The analysis is conducted across a variety of network architectures, including fully connected networks (FCNs), convolutional networks (CNNs), and long short-term memory networks (LSTMs), applied to datasets such as MNIST, Fashion-MNIST, IMDb, and an ionosphere dataset.
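
To make the procedure concrete, the following minimal sketch estimates $P_{SGD}(f\mid S)$ by frequency counting over repeated training runs, following the paper's convention of identifying a function $f$ with the pattern of its predictions on a fixed test set. The helpers `train_fn` and `predict_fn` are hypothetical stand-ins for the actual training and inference code, not the authors' implementation.

```python
from collections import Counter

def estimate_p_sgd(train_fn, predict_fn, X_train, y_train, X_test, n_runs=1000):
    """Estimate P_SGD(f|S) by frequency counting over independent SGD runs.

    Each run trains a freshly initialised network to zero error on the training
    set S = (X_train, y_train); the learned function f is identified with the
    tuple of its binary predictions on a fixed test set.
    """
    counts = Counter()
    for _ in range(n_runs):
        params = train_fn(X_train, y_train)  # hypothetical: SGD until zero training error
        f = tuple(bool(p > 0.5) for p in predict_fn(params, X_test))  # prediction pattern
        counts[f] += 1
    return {f: c / n_runs for f, c in counts.items()}  # empirical distribution over functions
```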

Key Findings

  1. Correlation Between SGD and Bayesian Posterior: A pivotal finding is that the empirical probability distributions of functions learned through SGD match closely with those predicted by Bayesian inference. In specific experiments, 90% of the probabilistic weight of functions found by SGD falls on functions also predicted by the Bayesian posterior (a small sketch of this comparison follows the list).
  2. Bias Towards Simplicity and Low Error: The research underscores a significant bias of the DNN parameter-function map towards functions that generalize well, suggesting a mapping aligned with Occam's principle, where simpler (or lower complexity) functions are more probable.
  3. Effects of Architecture and Hyperparameters: Variations such as the inclusion of max pooling in CNNs or changing batch size notably affect the distribution of function probabilities. Such modifications can enhance the implicit bias towards functions with lower generalization error.
  4. Second-Order Deviations: While SGD behaves like a Bayesian sampler at a high level, second-order effects such as the optimization path, parameter initialization, and hyperparameter settings introduce deviations from this behavior.
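
A simple way to quantify the first finding is sketched below, under the assumption that `p_sgd` and `p_b` are dictionaries mapping functions (test-set prediction patterns) to probabilities, such as the output of the estimator above. The helper reports how much of SGD's probability mass falls on functions the Bayesian sampler also assigns probability to, and the correlation of the two log-probabilities on that overlap (the paper presents this comparison on log-log axes).

```python
import numpy as np

def compare_distributions(p_sgd, p_b):
    """Compare P_SGD(f|S) and P_B(f|S) over the functions found by SGD.

    Returns the fraction of SGD's probability mass on functions also found by
    the Bayesian sampler, and the correlation of log-probabilities on that overlap.
    """
    common = [f for f in p_sgd if f in p_b]
    mass_overlap = sum(p_sgd[f] for f in common)
    log_sgd = np.log10([p_sgd[f] for f in common])
    log_b = np.log10([p_b[f] for f in common])
    corr = float(np.corrcoef(log_sgd, log_b)[0, 1])
    return mass_overlap, corr
```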

Theoretical Implications

The paper contributes to ongoing discourse on the theoretical underpinnings of DNN generalization by suggesting that the inductive biases prominent in optimizer-trained DNNs are largely predetermined by their parameter-function mappings rather than being primarily conferred by the optimizer itself. This insight plays a critical role in understanding why DNNs generalize well in the overparameterized regime, a puzzling observation given classical learning theory's expectations of overfitting.

Practical Implications

In practice, this work suggests that the choice of parameter initialization, and of architectures whose parameter-function maps carry a favorable inductive bias, can be as crucial as the choice of optimization strategy itself. Moreover, the approximately Bayesian behavior of SGD can justify using ensembles of independently trained networks for uncertainty quantification and for improving model robustness in applied settings.
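
As a loose illustration of the ensemble point, and assuming the same hypothetical `train_fn`/`predict_fn` as in the earlier sketches, independent SGD runs can be treated as rough posterior samples whose average and spread provide a prediction and an uncertainty estimate:

```python
import numpy as np

def ensemble_predict(train_fn, predict_fn, X_train, y_train, X_new, n_members=10):
    """Treat independent SGD runs as approximate posterior samples.

    If P_SGD(f|S) tracks P_B(f|S) to first order, averaging the predictions of
    independently trained networks approximates the posterior predictive, and
    their spread gives a crude uncertainty estimate.
    """
    preds = np.stack([predict_fn(train_fn(X_train, y_train), X_new)
                      for _ in range(n_members)])
    return preds.mean(axis=0), preds.std(axis=0)
```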

Future Directions

Further exploration is warranted in characterizing how specific choices of architecture and optimizer impact these probabilistic distributions, particularly in larger-scale problems involving more intricate datasets. Additionally, the authors highlight the potential to study these observations through the lens of the neural tangent kernel (NTK), which offers broader insights into the convergence properties of DNNs under gradient descent dynamics.

This paper's exploration into the connection between SGD and Bayesian inference provides an empirical backing for the observed generalization robustness in deep learning, suggesting profound insights into both the theoretical and practical domains of AI research.
