- The paper introduces SAF, which uses the KL divergence between the model's current and past outputs to approximate SAM's sharpness penalty at essentially no extra cost.
- It employs a trajectory loss to steer training toward flat minima, reducing overfitting in deep neural networks.
- Experiments on benchmarks such as ImageNet show SAF matches or exceeds SAM's generalization while training at roughly half SAM's wall-clock cost, i.e., at the cost of the base optimizer.
Sharpness-Aware Training for Free
The paper "Sharpness-Aware Training for Free" introduces a novel approach to enhance the generalization performance of deep neural networks (DNNs) without incurring additional computational costs typically associated with Sharpness-Aware Minimization (SAM). The authors address the challenge of over-parameterization in modern DNNs which often leads to large generalization errors. This is primarily aimed at tackling the overfitting problem by converging to flat minima rather than sharp ones, which are associated with worse generalization performance.
Overview
The paper critiques existing methods such as SAM which, despite their effectiveness in reducing generalization error, require roughly twice the computation of standard optimizers like Stochastic Gradient Descent (SGD). SAM explicitly penalizes the sharpness of the loss landscape, and estimating that sharpness requires an extra forward-backward pass per iteration to compute a worst-case weight perturbation, which roughly doubles training cost; a sketch of this two-pass update follows.
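To make the overhead concrete, here is a minimal PyTorch-style sketch of one SAM update, assuming a classification loss and an arbitrary base optimizer; the function name `sam_step` and the radius `rho=0.05` are illustrative choices, not taken from the paper's code.

```python
import torch

def sam_step(model, loss_fn, x, y, base_opt, rho=0.05):
    """One SAM update: an ascent step to the (approximate) worst-case
    weights within an L2 ball of radius rho, then a descent step using
    the gradient taken at those perturbed weights."""
    base_opt.zero_grad()

    # First forward/backward pass: gradient at the current weights.
    loss = loss_fn(model(x), y)
    loss.backward()

    # Perturb each parameter along the normalized gradient direction.
    params = [p for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([p.grad.norm(p=2) for p in params]), p=2)
    eps = []
    with torch.no_grad():
        for p in params:
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)  # w -> w + e
            eps.append(e)

    # Second forward/backward pass: gradient at the perturbed weights.
    base_opt.zero_grad()
    loss_fn(model(x), y).backward()

    # Restore the original weights, then step with the second gradient.
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)
    base_opt.step()
    return loss.item()
```

The two forward/backward passes in this sketch are exactly why SAM costs roughly twice as much per iteration as its base optimizer.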
To remove this overhead, the authors propose Sharpness-Aware Training for Free (SAF). SAF replaces the explicit sharpness estimate with a trajectory loss: the KL divergence between the model's current outputs and its outputs recorded earlier in training. This trajectory loss serves as a proxy for sharpness, so SAF retains the benefits of SAM while running at a computational cost comparable to standard training.
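Below is a minimal PyTorch-style sketch of how such a trajectory term could be combined with the usual cross-entropy objective, assuming the logits recorded a few epochs earlier are available for each example; the name `saf_loss` and the values of `lam` and `tau` are illustrative, not the paper's exact hyperparameters.

```python
import torch.nn.functional as F

def saf_loss(logits_now, logits_past, targets, lam=0.3, tau=5.0):
    """Cross-entropy plus a trajectory term: the KL divergence between
    temperature-softened outputs recorded earlier in training and the
    current outputs. Only one forward/backward pass is needed, because
    logits_past are read from a buffer rather than recomputed."""
    ce = F.cross_entropy(logits_now, targets)
    log_p_now = F.log_softmax(logits_now / tau, dim=1)
    p_past = F.softmax(logits_past.detach() / tau, dim=1)
    # KL(past || current); the tau**2 factor keeps gradient magnitudes
    # comparable across temperatures, as in knowledge distillation.
    traj = F.kl_div(log_p_now, p_past, reduction="batchmean") * tau ** 2
    return ce + lam * traj
```

In such a setup the past logits would live in a per-example buffer of size (dataset size x number of classes) that is refreshed as training proceeds; that storage cost is what motivates the memory-efficient variant discussed below.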
Key Contributions
- Trajectory Loss: The paper introduces a trajectory loss based on the KL divergence between the network's outputs at its updated weights and at past weights, which tracks changes in sharpness along the optimization trajectory with negligible computational overhead.
- Empirical Validation: Extensive experiments show that SAF generalizes as well as or better than SAM and reduces sharpness to a comparable degree on benchmark datasets such as ImageNet, with no additional computational overhead relative to the base optimizer.
- Memory-Efficient Variant: SAF is extended to Memory-Efficient Sharpness-Aware Training (MESA), which addresses the storage cost of keeping past outputs on extremely large datasets, further contributing to its versatility across data scales and architectures (a minimal sketch follows this list).
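The paper specifies MESA only at the level summarized above; one way to realize memory-efficient trajectory tracking, sketched here as an assumption rather than the authors' exact recipe, is to let an exponential-moving-average (EMA) copy of the weights produce the "past" outputs on the fly, trading the per-example output buffer for one extra forward pass.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_ema(ema_model, model, decay=0.999):
    """Maintain an exponential moving average of the weights; its outputs
    stand in for the stored past outputs of the trajectory loss."""
    for e, p in zip(ema_model.parameters(), model.parameters()):
        e.mul_(decay).add_(p, alpha=1.0 - decay)

def mesa_loss(model, ema_model, x, targets, lam=0.3, tau=5.0):
    """Cross-entropy plus a KL term against the EMA model's outputs.
    Costs one extra forward pass instead of an O(dataset-size) buffer."""
    logits_now = model(x)
    with torch.no_grad():
        logits_ema = ema_model(x)
    ce = F.cross_entropy(logits_now, targets)
    log_p_now = F.log_softmax(logits_now / tau, dim=1)
    p_ema = F.softmax(logits_ema / tau, dim=1)
    kl = F.kl_div(log_p_now, p_ema, reduction="batchmean") * tau ** 2
    return ce + lam * kl
```

Here `ema_model` would be a deep copy of the model created before training (e.g. via `copy.deepcopy`), with `update_ema` called after every optimizer step; `lam`, `tau`, and the decay rate are placeholder values, not the paper's settings.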
Numerical Results
SAF outperforms SAM and its variants across multiple architectures, including ResNets and Vision Transformers, while significantly reducing training time. On ImageNet, for instance, SAF achieves near state-of-the-art results at roughly twice the training speed of SAM, and MESA offers a practical balance between memory usage and computational cost.
Implications and Future Directions
The introduction of SAF has both practical and theoretical implications. Practically, it lowers the barrier to deploying sharpness-aware training in resource-constrained environments. Theoretically, it raises the question of how the trajectory loss can be further refined or adapted to other training methodologies.
Future research can explore the automatic adaptation of SAF-like techniques, potentially integrating them dynamically based on training conditions and dataset characteristics. Additionally, applying SAF to domains beyond image classification, such as natural language processing and time-series forecasting, may yield further benefits.
In conclusion, this work advances the state of the art in generalization-focused training strategies by offering a computationally efficient, sharpness-aware optimization framework. SAF and its variant MESA promise to broaden the reach of sharpness-aware methods across diverse domains and data scales.