Quantified Task Misalignment to Inform PEFT: An Exploration of Domain Generalization and Catastrophic Forgetting in CLIP (2402.09613v1)
Abstract: Foundation models are presented as generalists that perform well across a wide range of tasks. Fine-tuning these models, even on limited data, provides an additional boost in task-specific performance, but often at the cost of their wider generalization, an effect termed catastrophic forgetting. In this paper, we analyze the relation between task difficulty in the CLIP model and the performance of several simple parameter-efficient fine-tuning methods through the lens of domain generalization and catastrophic forgetting. We provide evidence that the silhouette score of the zero-shot image and text embeddings is a better measure of task difficulty than the average cosine similarity of correct image/label embeddings, and discuss observable relationships between task difficulty, fine-tuning method, domain generalization, and catastrophic forgetting. Additionally, the averaged results across tasks and performance measures demonstrate that a simplified method that trains only a subset of attention weights, which we call A-CLIP, yields a balance between domain generalization and catastrophic forgetting.
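The silhouette score used here as a task-difficulty proxy can be sketched in plain NumPy. The snippet below is a minimal illustration, not the paper's implementation: it assumes L2-normalized embeddings compared with cosine distance, and uses synthetic vectors in place of real CLIP zero-shot image embeddings. A well-separated ("easy") task should score near 1, while overlapping class clusters ("hard") score near 0.

```python
import numpy as np

def cosine_distances(X):
    """Pairwise cosine distance matrix (1 - cosine similarity)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - Xn @ Xn.T

def silhouette(X, labels):
    """Mean silhouette score over all samples, using cosine distance."""
    D = cosine_distances(X)
    n = len(labels)
    scores = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from its own cluster
        a = D[i, same].mean()  # mean intra-cluster distance
        # smallest mean distance to any other cluster
        b = min(D[i, labels == c].mean() for c in set(labels) - {labels[i]})
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

rng = np.random.default_rng(0)
dim, n_per_class = 8, 20
centers = np.eye(dim)[:2]  # two orthogonal class directions
labels = np.repeat([0, 1], n_per_class)

# "Easy" task: tight clusters around distinct directions.
easy = np.vstack([c + 0.05 * rng.standard_normal((n_per_class, dim))
                  for c in centers])
# "Hard" task: both classes drawn around the same direction.
hard = centers[0] + 0.3 * rng.standard_normal((2 * n_per_class, dim))

print("easy silhouette:", silhouette(easy, labels.copy()))
print("hard silhouette:", silhouette(hard, labels.copy()))
```

In practice one would replace the synthetic vectors with the zero-shot CLIP embeddings of a task's images, labeled by their ground-truth classes; a higher silhouette indicates the classes are already well separated in embedding space, i.e. an easier task.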
- Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 8748–8763. PMLR, July 2021.
- Don’t Stop Learning: Towards Continual Learning for the CLIP Model, July 2022. URL http://arxiv.org/abs/2207.09248.
- Continual Vision-Language Representation Learning with Off-Diagonal Information, June 2023. URL https://arxiv.org/abs/2305.07437.
- Three things everyone should know about vision transformers. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXIV, pages 497–515, Berlin, Heidelberg, 2022. Springer-Verlag. ISBN 978-3-031-20052-6. doi: 10.1007/978-3-031-20053-3_29. URL https://doi.org/10.1007/978-3-031-20053-3_29.
- Descriptor and word soups: Overcoming the parameter efficiency accuracy tradeoff for out-of-distribution few-shot learning, 2023. URL https://arxiv.org/pdf/2311.13612.pdf.
- Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier, 1989. ISBN 978-0-12-543324-2. doi: 10.1016/S0079-7421(08)60536-8.
- Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, March 2017. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.1611835114.
- Self-supervised class incremental learning, 2021. URL https://arxiv.org/abs/2111.11208.
- How well does self-supervised pre-training perform with streaming data?, 2022. URL https://arxiv.org/abs/2104.12081.
- Speciality vs Generality: An Empirical Study on Catastrophic Forgetting in Fine-tuning Foundation Models, October 2023. URL https://arxiv.org/abs/2309.06256.
- CLiMB: A continual learning benchmark for vision-and-language tasks. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 29440–29453. Curran Associates, Inc., 2022.
- A unified continuous learning framework for multi-modal knowledge discovery and pre-training, 2022. URL https://arxiv.org/pdf/2206.05555.pdf.
- Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pages 139–154, 2018.
- Synaptic metaplasticity in binarized neural networks. Nature Communications, 12(1):2549, May 2021. ISSN 2041-1723. doi: 10.1038/s41467-021-22768-y.
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters. International Journal of Computer Vision, 132(2):581–595, February 2023. ISSN 0920-5691, 1573-1405. doi: 10.1007/s11263-023-01891-x.
- LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.
- BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–9, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-short.1.
- Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2829, June 2023.
- Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848.
- Functional Map of the World. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6172–6180, Salt Lake City, UT, June 2018. IEEE. ISBN 978-1-5386-6420-9. doi: 10.1109/CVPR.2018.00646. URL https://ieeexplore.ieee.org/document/8578744/.
- Wilds: A benchmark of in-the-wild distribution shifts. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 5637–5664. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/koh21a.html.
- 3d object representations for fine-grained categorization. In 2013 IEEE International Conference on Computer Vision Workshops, pages 554–561, 2013. doi: 10.1109/ICCVW.2013.77.
- Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32:323–332, 2012. ISSN 0893-6080. doi: 10.1016/j.neunet.2012.02.016. URL https://www.sciencedirect.com/science/article/pii/S0893608012000457. Selected Papers from IJCNN 2011.
- Reading Digits in Natural Images with Unsupervised Feature Learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011. URL http://ufldl.stanford.edu/housenumbers/nips2011_housenumbers.pdf.
- Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
- Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
- Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223. JMLR Workshop and Conference Proceedings, 2011.
- Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3485–3492, June 2010. doi: 10.1109/CVPR.2010.5539970.