How to Fine-Tune Vision Models with SGD (2211.09359v2)
Abstract: SGD and AdamW are the two most used optimizers for fine-tuning large neural networks in computer vision. When the two methods perform the same, SGD is preferable because it uses less memory (12 bytes/parameter with momentum and 8 bytes/parameter without) than AdamW (16 bytes/parameter). However, on a suite of downstream tasks, especially those with distribution shifts, we find that fine-tuning with AdamW performs substantially better than SGD on modern Vision Transformer and ConvNeXt models. We find that large gaps in performance between SGD and AdamW occur when the fine-tuning gradients in the first "embedding" layer are much larger than in the rest of the model. Our analysis suggests an easy fix that works consistently across datasets and models: freezing the embedding layer (less than 1% of the parameters) leads to SGD with or without momentum performing slightly better than AdamW while using less memory (e.g., on ViT-L, SGD uses 33% less GPU memory). Our insights result in state-of-the-art accuracies on five popular distribution shift benchmarks: WILDS-FMoW, WILDS-Camelyon, BREEDS-Living-17, Waterbirds, and DomainNet.
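The recipe in the abstract is simple to express in code. Below is a minimal, hedged sketch (not the authors' released implementation) of freezing the embedding layer of a ViT and fine-tuning the remaining parameters with plain SGD. It assumes the timm library, the model name `vit_large_patch16_224`, and timm's ViT parameter names (`patch_embed`, `pos_embed`, `cls_token`); the number of classes and learning rate are illustrative.

```python
# Minimal sketch (not the authors' code) of the fix described in the abstract:
# freeze the first "embedding" layer of a ViT, then fine-tune the rest with SGD.
# Assumptions: timm is installed, "vit_large_patch16_224" is a valid model name,
# and the "embedding layer" is taken to be patch_embed + pos_embed + cls_token.
import timm
import torch

# Hypothetical setup: ViT-L/16 with a fresh classification head (num_classes is illustrative).
model = timm.create_model("vit_large_patch16_224", pretrained=True, num_classes=62)

# Freeze the embedding parameters (less than 1% of the model);
# all other parameters remain trainable.
for name, param in model.named_parameters():
    if name.startswith("patch_embed") or name in ("pos_embed", "cls_token"):
        param.requires_grad = False

# Plain SGD (no momentum) keeps only the fp32 weight and gradient (~8 bytes/parameter);
# adding momentum costs ~12 bytes/parameter, while AdamW's two moment buffers bring it
# to ~16 bytes/parameter. Learning rate and weight decay below are placeholders.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.0, weight_decay=0.0)
```

In this sketch the frozen group includes the position embeddings and class token alongside the patch projection; that grouping, like the specific hyperparameters, is an assumption for illustration rather than the paper's exact configuration.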