How to Fine-Tune Vision Models with SGD (2211.09359v2)

Published 17 Nov 2022 in cs.CV and cs.LG

Abstract: SGD and AdamW are the two most used optimizers for fine-tuning large neural networks in computer vision. When the two methods perform the same, SGD is preferable because it uses less memory (12 bytes/parameter with momentum and 8 bytes/parameter without) than AdamW (16 bytes/parameter). However, on a suite of downstream tasks, especially those with distribution shifts, we find that fine-tuning with AdamW performs substantially better than SGD on modern Vision Transformer and ConvNeXt models. We find that large gaps in performance between SGD and AdamW occur when the fine-tuning gradients in the first "embedding" layer are much larger than in the rest of the model. Our analysis suggests an easy fix that works consistently across datasets and models: freezing the embedding layer (less than 1% of the parameters) leads to SGD with or without momentum performing slightly better than AdamW while using less memory (e.g., on ViT-L, SGD uses 33% less GPU memory). Our insights result in state-of-the-art accuracies on five popular distribution shift benchmarks: WILDS-FMoW, WILDS-Camelyon, BREEDS-Living-17, Waterbirds, and DomainNet.


Summary

  • The paper demonstrates that freezing the embedding layer during SGD fine-tuning improves performance on tasks with distribution shifts.
  • It shows that SGD, despite its simplicity and lower memory footprint, can rival or beat AdamW once the embedding layer is frozen.
  • Empirical evaluations across seven architectures reveal that SGD can use up to 33% less GPU memory while maintaining competitive accuracy.

An Analysis of Fine-Tuning Vision Models with SGD

The paper "How to Fine-Tune Vision Models with SGD" presents an empirical and analytical comparison between stochastic gradient descent (SGD) and AdamW optimization techniques for fine-tuning modern computer vision models. The paper explores the nuances of model performance across a variety of distribution shift benchmarks and offers a compelling argument for the effectiveness of SGD, particularly when augmented by specific strategies.

Key Insights and Contributions

The paper begins with the observation that although AdamW has become the optimizer of choice for pretraining modern vision architectures, owing to its per-parameter adaptive learning rates, SGD is simpler and more memory-efficient and can perform comparably or even better when fine-tuning these models on new tasks, provided a small, strategically chosen set of parameters is frozen.

Methodology

  1. Optimizer Memory Footprint: SGD is markedly less memory-intensive than AdamW. In fp32 training, each parameter and its gradient occupy 4 bytes apiece, and every additional optimizer buffer adds another 4: plain SGD therefore needs 8 bytes per parameter, SGD with momentum 12 (one momentum buffer), and AdamW 16 (two moment estimates). A back-of-the-envelope version of this accounting appears after the list.
  2. Empirical Evaluation: The authors evaluate the performance of SGD and AdamW across seven state-of-the-art architectures, including Vision Transformers (ViT), ConvNeXt, and ResNets, using five distribution shift datasets: WILDS-FMoW, WILDS-Camelyon, Waterbirds, BREEDS-Living-17, and DomainNet.
  3. Distribution Shifts: The paper emphasizes performance under distribution shifts—settings where the model operates on data distributions that differ from training data, making the task inherently more challenging.
  4. Freezing Embedding Layers: A pivotal finding is that freezing the first "embedding" layer, which accounts for less than 1% of total parameters, during SGD fine-tuning brings performance on par with or above AdamW's while using considerably less GPU memory. A minimal code sketch of this recipe also appears after the list.
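
The byte accounting in point 1 is simple enough to check directly. The sketch below is back-of-the-envelope arithmetic under the assumption of fp32 parameters, gradients, and optimizer state; it is illustrative, not a measurement of the paper's actual training runs.

```python
def bytes_per_param(optimizer_state_buffers: int) -> int:
    """fp32 parameter (4 bytes) + fp32 gradient (4 bytes) + 4 bytes per optimizer buffer."""
    return 4 + 4 + 4 * optimizer_state_buffers

print("SGD, no momentum :", bytes_per_param(0))  # -> 8
print("SGD + momentum   :", bytes_per_param(1))  # -> 12 (one momentum buffer)
print("AdamW            :", bytes_per_param(2))  # -> 16 (first and second moments)
```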
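The freeze-embed recipe from point 4 amounts to a few lines in practice. The sketch below is a minimal illustration, assuming a timm Vision Transformer whose patch-embedding module is exposed as `model.patch_embed`; the synthetic batch, learning rate, and number of classes are placeholders rather than the authors' exact settings.

```python
import timm
import torch

# Load a pretrained ViT; `num_classes` is a placeholder for the downstream task.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=10)

# Freeze the first "embedding" layer (the patch projection, <1% of parameters).
for p in model.patch_embed.parameters():
    p.requires_grad = False

# Plain SGD with momentum over the remaining parameters
# (~12 bytes/param of parameter + gradient + optimizer state vs. ~16 for AdamW).
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-3,
    momentum=0.9,
)

# One fine-tuning step on a synthetic batch; swap in a real DataLoader loop.
images, labels = torch.randn(4, 3, 224, 224), torch.randint(0, 10, (4,))
model.train()
optimizer.zero_grad()
loss = torch.nn.functional.cross_entropy(model(images), labels)
loss.backward()
optimizer.step()
```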

Results and Observations

  1. AdamW vs. SGD Performance: While AdamW generally outperforms SGD when the latter is used naively, the gap largely disappears once the embedding layer is frozen, letting SGD retain its memory advantage without giving up accuracy. Notably, on ViT-L, SGD uses approximately 33% less GPU memory than AdamW under these conditions.
  2. Out-of-Distribution (OOD) Gains: The paper reports substantial improvements in OOD performance, with the modified SGD achieving state-of-the-art results on the distribution shift benchmarks; with the embedding layer frozen, SGD slightly outperforms AdamW on all five evaluated datasets.
  3. Gradient Analysis: The paper shows that large disparities in gradient magnitudes across layers predict when SGD underperforms: runs in which the gradients of the first embedding layer are much larger than those of the rest of the model are exactly the ones where SGD's accuracy drops off. This observation motivates the layer-freezing strategy; a small diagnostic sketch follows the list.
  4. Practical Implications: The findings are practically significant, as they offer a more memory-efficient alternative to AdamW while ensuring comparable or superior accuracy, facilitating the deployment of large models in resource-constrained environments.
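
The gradient analysis in point 3 suggests a cheap diagnostic: after a backward pass, compare the gradient norm of the embedding layer with the largest gradient norm elsewhere in the network. The sketch below is a rough illustration of that check, not the authors' exact instrumentation; the `patch_embed` name prefix and the tiny randomly initialized ViT are assumptions chosen to keep the example self-contained.

```python
import timm
import torch

def layerwise_grad_norms(model: torch.nn.Module) -> dict:
    """L2 norm of the gradient of every parameter that received one."""
    return {
        name: p.grad.norm().item()
        for name, p in model.named_parameters()
        if p.grad is not None
    }

# Dummy forward/backward pass just to populate gradients.
model = timm.create_model("vit_tiny_patch16_224", pretrained=False, num_classes=10)
loss = torch.nn.functional.cross_entropy(
    model(torch.randn(2, 3, 224, 224)), torch.randint(0, 10, (2,))
)
loss.backward()

# If the embedding-layer gradients dwarf those of the other layers, the paper's
# analysis predicts that naive SGD fine-tuning will lag behind AdamW.
norms = layerwise_grad_norms(model)
embed = max(v for k, v in norms.items() if k.startswith("patch_embed"))
rest = max(v for k, v in norms.items() if not k.startswith("patch_embed"))
print(f"embedding grad norm: {embed:.3g}  |  largest elsewhere: {rest:.3g}")
```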

Implications and Future Directions

The insights presented indicate that a simple modification to the fine-tuning process can lead to broad improvements in the efficiency and robustness of vision models. From a theoretical standpoint, the notion that careful parameter management can offset the lack of adaptive gradient methods during fine-tuning leads to potential new research into training algorithms that balance memory efficiency with adaptive capabilities.

Given the trends identified in this research, further exploration into optimizers that inherently adapt layer-wise learning rates without incurring high memory costs could prove beneficial. Additionally, the impact of these findings on other machine learning domains, such as natural language processing or reinforcement learning, remains an open and intriguing area for inquiry.

In conclusion, the paper makes a compelling case that with a nuanced understanding of model dynamics, SGD can rival more complex optimization strategies while offering significant computational savings, thus unlocking new opportunities for scalable and efficient deep model deployments.
