How to Fine-Tune Vision Models with SGD
Abstract: SGD and AdamW are the two most commonly used optimizers for fine-tuning large neural networks in computer vision. When the two methods perform equally well, SGD is preferable because it uses less memory (12 bytes/parameter with momentum and 8 bytes/parameter without) than AdamW (16 bytes/parameter). However, on a suite of downstream tasks, especially those with distribution shifts, we find that fine-tuning with AdamW performs substantially better than SGD on modern Vision Transformer and ConvNeXt models. We find that large gaps in performance between SGD and AdamW occur when the fine-tuning gradients in the first "embedding" layer are much larger than in the rest of the model. Our analysis suggests an easy fix that works consistently across datasets and models: freezing the embedding layer (less than 1% of the parameters) leads to SGD with or without momentum performing slightly better than AdamW while using less memory (e.g., on ViT-L, SGD uses 33% less GPU memory). Our insights result in state-of-the-art accuracies on five popular distribution shift benchmarks: WILDS-FMoW, WILDS-Camelyon, BREEDS-Living-17, Waterbirds, and DomainNet.
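A minimal sketch of the fix described in the abstract, not the authors' released code: freeze the embedding layer of a pretrained ViT and fine-tune the remaining parameters with SGD (with momentum). The use of the timm library, the specific model name, the parameter-name substrings ("patch_embed", "pos_embed", "cls_token"), and the hyperparameters are illustrative assumptions, not taken from the paper.

```python
# Sketch, assuming a timm Vision Transformer; attribute/parameter names may
# differ for other backbones (e.g., ConvNeXt uses a stem convolution instead).
import timm
import torch

model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=10)

# Freeze the "embedding" layer: patch projection, position embeddings, class token.
# These are <1% of the parameters but receive disproportionately large gradients.
embed_keys = ("patch_embed", "pos_embed", "cls_token")
for name, p in model.named_parameters():
    if any(k in name for k in embed_keys):
        p.requires_grad = False

# SGD with momentum over the remaining parameters. In fp32 this stores roughly
# 4 bytes (weight) + 4 (gradient) + 4 (momentum) = 12 bytes/parameter,
# vs. 16 for AdamW, which also keeps a second-moment estimate.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)

# One illustrative training step on dummy data.
x = torch.randn(8, 3, 224, 224)
y = torch.randint(0, 10, (8,))
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```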