- The paper demonstrates that switching EMA parameters can effectively balance flatness and sharpness to enhance convergence and generalization in deep neural networks.
- The SEMA method combines conventional EMA with a periodic switch that copies the EMA weights back into the trained model, yielding notable top-1 accuracy gains on benchmarks like ImageNet-1K.
- Empirical and theoretical analyses reveal SEMA's stability benefits and its broad applicability across diverse tasks, from image classification to language modeling.
Analyzing "Switch EMA: A Free Lunch for Better Flatness and Sharpness"
The paper "Switch EMA: A Free Lunch for Better Flatness and Sharpness" explores how a simple modification to Exponential Moving Average (EMA) can serve as an effective regularizer to enhance convergence speed and model performance for deep neural networks (DNNs). The authors propose a technique named Switch EMA (SEMA), which entails switching the EMA parameters back to the original model after each epoch. This method seeks to reconcile the balance between flatness and sharpness in model optimization, thereby achieving better generalization capabilities.
Overview of EMA's Utility and Limitations
EMA is widely recognized as a weight averaging (WA) regularization technique that steers training toward flat optima, improving the generalization of DNNs without extra computational cost. However, while EMA achieves better flatness, existing WA methods either yield suboptimal final performance or require additional computation at test time. The paper presents SEMA as an enhancement of generic EMA: the pivotal step is to switch the model's parameters to the EMA parameters at regular intervals, which trades off flatness against sharpness and leaves the model better positioned to generalize.
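For reference, the underlying updates can be written as follows (the notation here is a common convention and may differ from the paper's):

$$
\theta^{\mathrm{EMA}}_{t} = \alpha\,\theta^{\mathrm{EMA}}_{t-1} + (1-\alpha)\,\theta_{t},
\qquad
\theta_{t} \leftarrow \theta^{\mathrm{EMA}}_{t} \ \text{at the end of each epoch (the SEMA switch)},
$$

where $\theta_t$ denotes the model weights after optimizer step $t$ and $\alpha$ is the EMA momentum. Plain EMA only uses $\theta^{\mathrm{EMA}}$ for evaluation; SEMA additionally feeds the averaged weights back into training.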
Key Numerical Results
The paper reports comprehensive empirical evaluations across discriminative, generative, and regression tasks on popular vision and language datasets such as CIFAR-100, ImageNet-1K, and COCO. Notable improvements are observed when SEMA is applied on top of baseline optimization or other WA techniques. For instance, SEMA delivers consistent top-1 accuracy gains on ImageNet-1K across various model architectures and optimizers, including SGD, SAM, and AdamW variants.
Implementation and Theoretical Contributions
From both theoretical and empirical perspectives, SEMA is shown to facilitate faster convergence by combining flatness and sharpness, as evidenced by visualizations of loss landscapes and decision boundaries. The authors support this with performance gains and convergence speedups relative to other WA methods and optimizers such as Adam and LARS across a spectrum of tasks, including image classification, self-supervised learning, object detection, image generation, video prediction, attribute regression, and language modeling.
SEMA's core contribution is its switch between the EMA and original model parameters, which lets training retain both the exploratory benefit of averaged weights (EMA) and a sharpened descent into deeper local minima. This behavior is backed by theoretical analysis suggesting reduced low-frequency oscillations, a stability benefit that other WA methods do not offer.
Future Directions and Implications
The multifaceted utility of SEMA highlights promising avenues for future research, particularly in how its principles might be adapted to other neural network architectures or extended to domains such as natural language understanding and robotics. One potential direction is exploring different switching-interval strategies or dynamic momentum-adaptation schemes for specific learning tasks. Furthermore, SEMA is optimizer-agnostic by design and integrates readily with existing and future training paradigms, potentially broadening its application to emerging areas such as neural architecture search and automated AI systems.
This paper contributes to the ongoing effort in AI research to improve model performance through simple, computationally cheap modifications to existing methods. SEMA offers practical benefits without additional computational burden, and its theoretical analysis adds a new dimension to our understanding of optimization dynamics in learning systems, a promising direction for continued advances in AI.