Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization (2303.03108v3)
Abstract: Recently, flat minima have been shown to be effective for improving generalization, and sharpness-aware minimization (SAM) achieves state-of-the-art performance. Yet the definition of flatness discussed in SAM and its follow-ups is limited to zeroth-order flatness (i.e., the worst-case loss within a perturbation radius). We show that zeroth-order flatness can be insufficient to discriminate minima with low generalization error from those with high generalization error, both when there is a single minimum and when there are multiple minima within the given perturbation radius. We therefore present first-order flatness, a stronger measure of flatness that focuses on the maximal gradient norm within a perturbation radius and bounds both the maximal eigenvalue of the Hessian at local minima and the regularization function of SAM. We also present a novel training procedure, Gradient norm Aware Minimization (GAM), to seek minima with uniformly small curvature across all directions. Experimental results show that GAM improves the generalization of models trained with current optimizers such as SGD and AdamW on various datasets and networks. Furthermore, we show that GAM can help SAM find flatter minima and achieve better generalization.
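To make the two flatness notions concrete, the following display is a minimal sketch reconstructed from the abstract alone; the symbols $L$, $\rho$, $\epsilon$, $R^{(0)}_{\rho}$, and $R^{(1)}_{\rho}$ are illustrative notation introduced here, not necessarily the paper's own.

% Hedged sketch of the two flatness measures described in the abstract:
% L is the training loss, theta the parameters, rho the perturbation radius.
\[
R^{(0)}_{\rho}(\theta) \;=\; \max_{\|\epsilon\| \le \rho} L(\theta + \epsilon) \;-\; L(\theta)
\qquad \text{(zeroth-order flatness: worst-case loss, the SAM regularizer)}
\]
\[
R^{(1)}_{\rho}(\theta) \;=\; \rho \cdot \max_{\|\epsilon\| \le \rho} \bigl\| \nabla L(\theta + \epsilon) \bigr\|
\qquad \text{(first-order flatness: maximal gradient norm in the ball)}
\]

Per the abstract, at a local minimum the first-order measure upper-bounds both the SAM regularizer and, after suitable $\rho$-dependent scaling, the largest Hessian eigenvalue; GAM is the training procedure that penalizes this worst-case gradient norm so that curvature stays uniformly small across all directions.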