
Enhancing Sharpness-Aware Optimization Through Variance Suppression (2309.15639v3)

Published 27 Sep 2023 in cs.LG

Abstract: Sharpness-aware minimization (SAM) has well documented merits in enhancing generalization of deep neural networks, even without sizable data augmentation. Embracing the geometry of the loss function, where neighborhoods of 'flat minima' heighten generalization ability, SAM seeks 'flat valleys' by minimizing the maximum loss caused by an adversary perturbing parameters within the neighborhood. Although critical to account for sharpness of the loss function, such an 'over-friendly adversary' can curtail the outmost level of generalization. The novel approach of this contribution fosters stabilization of adversaries through variance suppression (VaSSO) to avoid such friendliness. VaSSO's provable stability safeguards its numerical improvement over SAM in model-agnostic tasks, including image classification and machine translation. In addition, experiments confirm that VaSSO endows SAM with robustness against high levels of label noise.
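
To make the abstract's description concrete, below is a minimal, illustrative sketch of a SAM-style update whose adversarial perturbation direction is smoothed by an exponential moving average of normalized stochastic gradients, which is the spirit of variance suppression. The helper names (`sam_vasso_step`, `flat_grad`), the hyperparameters `rho`, `theta`, `lr`, and the exact moving-average rule are assumptions made for illustration, not the authors' reference implementation; consult the paper for the precise VaSSO update. Smoothing the perturbation direction keeps the inner adversary from chasing minibatch gradient noise, which is the "friendly adversary" failure mode the abstract points to.

```python
# Illustrative sketch only: a SAM step with a variance-suppressed
# (VaSSO-style) perturbation direction. Hyperparameter names and the
# moving-average rule are assumptions, not the paper's exact algorithm.
import torch

def flat_grad(params):
    """Concatenate the current gradients of `params` into one vector."""
    return torch.cat([p.grad.detach().reshape(-1) for p in params])

def sam_vasso_step(model, loss_fn, x, y, state, rho=0.05, theta=0.4, lr=0.1):
    params = [p for p in model.parameters() if p.requires_grad]

    # 1) Stochastic gradient at the current weights.
    model.zero_grad()
    loss_fn(model(x), y).backward()
    g = flat_grad(params)

    # 2) Variance suppression (assumed form): exponential moving average of
    #    normalized stochastic gradients stabilizes the adversarial direction.
    g_unit = g / (g.norm() + 1e-12)
    d = state.get("d")
    d = g_unit if d is None else (1 - theta) * d + theta * g_unit
    state["d"] = d

    # 3) SAM's inner "ascent": perturb weights along the smoothed direction.
    eps = rho * d / (d.norm() + 1e-12)
    offset = 0
    with torch.no_grad():
        for p in params:
            n = p.numel()
            p.add_(eps[offset:offset + n].view_as(p))
            offset += n

    # 4) Gradient at the perturbed weights, then undo the perturbation
    #    and take the actual descent step.
    model.zero_grad()
    loss_fn(model(x), y).backward()
    offset = 0
    with torch.no_grad():
        for p in params:
            n = p.numel()
            p.sub_(eps[offset:offset + n].view_as(p))  # restore weights
            p.sub_(lr * p.grad)                        # plain SGD update
            offset += n

# Usage on a toy regression problem.
model = torch.nn.Linear(10, 1)
state = {}
x, y = torch.randn(32, 10), torch.randn(32, 1)
for _ in range(5):
    sam_vasso_step(model, torch.nn.functional.mse_loss, x, y, state)
```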
