SAM as an Optimal Relaxation of Bayes (2210.01620v3)

Published 4 Oct 2022 in cs.LG, cs.AI, math.OC, and stat.ML

Abstract: Sharpness-aware minimization (SAM) and related adversarial deep-learning methods can drastically improve generalization, but their underlying mechanisms are not yet fully understood. Here, we establish SAM as a relaxation of the Bayes objective where the expected negative-loss is replaced by the optimal convex lower bound, obtained by using the so-called Fenchel biconjugate. The connection enables a new Adam-like extension of SAM to automatically obtain reasonable uncertainty estimates, while sometimes also improving its accuracy. By connecting adversarial and Bayesian methods, our work opens a new path to robustness.
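For context, the baseline SAM update that the paper relates to Bayes is the two-step procedure of Foret et al. (2021): move to the first-order worst-case perturbation inside an L2 ball of radius rho, then take a descent step using the gradient evaluated at the perturbed weights. The sketch below is a minimal NumPy illustration of that standard update, assuming a user-supplied `loss_grad` callable; it is not the Adam-like Bayesian extension proposed in this paper.

```python
import numpy as np

def sam_step(w, loss_grad, lr=0.1, rho=0.05):
    """One sharpness-aware minimization (SAM) step, approximately solving
    min_w max_{||eps||_2 <= rho} loss(w + eps).
    `loss_grad(w)` returns the gradient of the training loss at w."""
    g = loss_grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # first-order worst-case perturbation
    g_adv = loss_grad(w + eps)                   # gradient at the perturbed point
    return w - lr * g_adv                        # descent step with the adversarial gradient
```

Relative to plain SGD, each step costs one extra gradient evaluation. The paper's contribution, per the abstract, is to interpret this adversarial inner step via an optimal convex lower bound (the Fenchel biconjugate) on the expected negative loss in the Bayes objective, which in turn motivates the Adam-like variant with uncertainty estimates.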
