How to Fine-Tune Vision Models with SGD (2211.09359v2)

Published 17 Nov 2022 in cs.CV and cs.LG

Abstract: SGD and AdamW are the two most used optimizers for fine-tuning large neural networks in computer vision. When the two methods perform the same, SGD is preferable because it uses less memory (12 bytes/parameter with momentum and 8 bytes/parameter without) than AdamW (16 bytes/parameter). However, on a suite of downstream tasks, especially those with distribution shifts, we find that fine-tuning with AdamW performs substantially better than SGD on modern Vision Transformer and ConvNeXt models. We find that large gaps in performance between SGD and AdamW occur when the fine-tuning gradients in the first "embedding" layer are much larger than in the rest of the model. Our analysis suggests an easy fix that works consistently across datasets and models: freezing the embedding layer (less than 1% of the parameters) leads to SGD with or without momentum performing slightly better than AdamW while using less memory (e.g., on ViT-L, SGD uses 33% less GPU memory). Our insights result in state-of-the-art accuracies on five popular distribution shift benchmarks: WILDS-FMoW, WILDS-Camelyon, BREEDS-Living-17, Waterbirds, and DomainNet.


Summary

  • The paper demonstrates that freezing the embedding layer during SGD fine-tuning improves performance on tasks with distribution shifts.
  • It shows that SGD, despite its simplicity and lower memory footprint, can rival or beat AdamW once the embedding layer is frozen.
  • Empirical evaluations across seven architectures reveal that SGD can use up to 33% less GPU memory while maintaining competitive accuracy.

An Analysis of Fine-Tuning Vision Models with SGD

The paper "How to Fine-Tune Vision Models with SGD" presents an empirical and analytical comparison between stochastic gradient descent (SGD) and AdamW optimization techniques for fine-tuning modern computer vision models. The paper explores the nuances of model performance across a variety of distribution shift benchmarks and offers a compelling argument for the effectiveness of SGD, particularly when augmented by specific strategies.

Key Insights and Contributions

The paper begins with the observation that although AdamW has become the optimizer of choice for pretraining modern vision architectures, owing to its per-parameter adaptive learning rates, SGD is simpler and more memory-efficient and can perform comparably or even better when fine-tuning these models on new tasks, provided a small, strategically chosen set of parameters is frozen.

Methodology

  1. Optimizer Memory Footprint: SGD is markedly less memory-intensive than AdamW. In fp32 training, each parameter and its gradient occupy 4 bytes apiece, and every additional optimizer buffer adds another 4: plain SGD therefore needs 8 bytes per parameter, SGD with momentum 12 (one momentum buffer), and AdamW 16 (two moment estimates). A back-of-the-envelope version of this accounting appears after the list.
  2. Empirical Evaluation: The authors evaluate the performance of SGD and AdamW across seven state-of-the-art architectures, including Vision Transformers (ViT), ConvNeXt, and ResNets, using five distribution shift datasets: WILDS-FMoW, WILDS-Camelyon, Waterbirds, BREEDS-Living-17, and DomainNet.
  3. Distribution Shifts: The paper emphasizes performance under distribution shifts—settings where the model operates on data distributions that differ from training data, making the task inherently more challenging.
  4. Freezing Embedding Layers: A pivotal finding is that freezing the first "embedding" layer, which accounts for less than 1% of total parameters, during SGD fine-tuning brings performance on par with or above AdamW's while using considerably less GPU memory. A minimal code sketch of this recipe also appears after the list.
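
The byte accounting in point 1 is simple enough to check directly. The sketch below is back-of-the-envelope arithmetic under the assumption of fp32 parameters, gradients, and optimizer state; it is illustrative, not a measurement of the paper's actual training runs.

```python
def bytes_per_param(optimizer_state_buffers: int) -> int:
    """fp32 parameter (4 bytes) + fp32 gradient (4 bytes) + 4 bytes per optimizer buffer."""
    return 4 + 4 + 4 * optimizer_state_buffers

print("SGD, no momentum :", bytes_per_param(0))  # -> 8
print("SGD + momentum   :", bytes_per_param(1))  # -> 12 (one momentum buffer)
print("AdamW            :", bytes_per_param(2))  # -> 16 (first and second moments)
```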
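The freeze-embed recipe from point 4 amounts to a few lines in practice. The sketch below is a minimal illustration, assuming a timm Vision Transformer whose patch-embedding module is exposed as `model.patch_embed`; the synthetic batch, learning rate, and number of classes are placeholders rather than the authors' exact settings.

```python
import timm
import torch

# Load a pretrained ViT; `num_classes` is a placeholder for the downstream task.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=10)

# Freeze the first "embedding" layer (the patch projection, <1% of parameters).
for p in model.patch_embed.parameters():
    p.requires_grad = False

# Plain SGD with momentum over the remaining parameters
# (~12 bytes/param of parameter + gradient + optimizer state vs. ~16 for AdamW).
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-3,
    momentum=0.9,
)

# One fine-tuning step on a synthetic batch; swap in a real DataLoader loop.
images, labels = torch.randn(4, 3, 224, 224), torch.randint(0, 10, (4,))
model.train()
optimizer.zero_grad()
loss = torch.nn.functional.cross_entropy(model(images), labels)
loss.backward()
optimizer.step()
```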

Results and Observations

  1. AdamW vs. SGD Performance: While AdamW generally outperforms SGD when the latter is used naively, the gap largely disappears once the embedding layer is frozen, letting SGD retain its memory advantage without giving up accuracy. Notably, on ViT-L, SGD uses approximately 33% less GPU memory than AdamW under these conditions.
  2. Out-of-Distribution (OOD) Gains: The paper reports substantial improvements in OOD performance, with the modified SGD achieving state-of-the-art results on the distribution shift benchmarks; with the embedding layer frozen, SGD slightly outperforms AdamW on all five evaluated datasets.
  3. Gradient Analysis: The paper shows that large disparities in gradient magnitudes across layers predict when SGD underperforms: runs in which the gradients of the first embedding layer are much larger than those of the rest of the model are exactly the ones where SGD's accuracy drops off. This observation motivates the layer-freezing strategy; a small diagnostic sketch follows the list.
  4. Practical Implications: The findings are practically significant, as they offer a more memory-efficient alternative to AdamW while ensuring comparable or superior accuracy, facilitating the deployment of large models in resource-constrained environments.
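
The gradient analysis in point 3 suggests a cheap diagnostic: after a backward pass, compare the gradient norm of the embedding layer with the largest gradient norm elsewhere in the network. The sketch below is a rough illustration of that check, not the authors' exact instrumentation; the `patch_embed` name prefix and the tiny randomly initialized ViT are assumptions chosen to keep the example self-contained.

```python
import timm
import torch

def layerwise_grad_norms(model: torch.nn.Module) -> dict:
    """L2 norm of the gradient of every parameter that received one."""
    return {
        name: p.grad.norm().item()
        for name, p in model.named_parameters()
        if p.grad is not None
    }

# Dummy forward/backward pass just to populate gradients.
model = timm.create_model("vit_tiny_patch16_224", pretrained=False, num_classes=10)
loss = torch.nn.functional.cross_entropy(
    model(torch.randn(2, 3, 224, 224)), torch.randint(0, 10, (2,))
)
loss.backward()

# If the embedding-layer gradients dwarf those of the other layers, the paper's
# analysis predicts that naive SGD fine-tuning will lag behind AdamW.
norms = layerwise_grad_norms(model)
embed = max(v for k, v in norms.items() if k.startswith("patch_embed"))
rest = max(v for k, v in norms.items() if not k.startswith("patch_embed"))
print(f"embedding grad norm: {embed:.3g}  |  largest elsewhere: {rest:.3g}")
```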

Implications and Future Directions

The insights presented indicate that a simple modification to the fine-tuning process can lead to broad improvements in the efficiency and robustness of vision models. From a theoretical standpoint, the notion that careful parameter management can offset the lack of adaptive gradient methods during fine-tuning leads to potential new research into training algorithms that balance memory efficiency with adaptive capabilities.

Given the trends identified in this research, further exploration into optimizers that inherently adapt layer-wise learning rates without incurring high memory costs could prove beneficial. Additionally, the impact of these findings on other machine learning domains, such as natural language processing or reinforcement learning, remains an open and intriguing area for inquiry.

In conclusion, the paper makes a compelling case that with a nuanced understanding of model dynamics, SGD can rival more complex optimization strategies while offering significant computational savings, thus unlocking new opportunities for scalable and efficient deep model deployments.
