Zero-Shot Embeddings Inform Learning and Forgetting with Vision-Language Encoders (2407.15731v1)
Abstract: Despite the proliferation of large vision-language foundation models, estimation of the learning and forgetting outcomes following fine-tuning of these models remains largely unexplored. Inspired by work highlighting the significance of the modality gap in contrastive dual-encoders, we propose the Inter-Intra Modal Measure (IIMM). Combining terms quantifying the similarity between image embeddings and the similarity between incorrect image and label embedding pairs, the IIMM functions as a strong predictor of performance changes with fine-tuning. Our extensive empirical analysis across four state-of-the-art vision-language models (CLIP, SigLIP, CoCa, EVA-02-CLIP) and five fine-tuning techniques (full fine-tuning, BitFit, attention-weight tuning, LoRA, CLIP-Adapter) demonstrates a strong, statistically significant linear relationship: fine-tuning on tasks with higher IIMM scores produces greater in-domain performance gains but also induces more severe out-of-domain performance degradation, with some parameter-efficient fine-tuning (PEFT) methods showing extreme forgetting. We compare our measure against transfer scores from state-of-the-art model selection methods and show that the IIMM is significantly more predictive of accuracy gains. With only a single forward pass of the target data, practitioners can leverage this key insight to heuristically evaluate the degree to which a model can be expected to improve following fine-tuning. Given additional knowledge about the model's performance on a few diverse tasks, this heuristic further evolves into a strong predictor of expected performance changes when training for new tasks.
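As a rough illustration of the two terms described in the abstract, the sketch below computes an IIMM-style score from pre-computed, L2-normalized zero-shot embeddings. The function name `iimm_sketch`, the use of mean pairwise cosine similarity for the intra-modal term, and the simple averaging of the two terms are illustrative assumptions; the paper's exact formulation and weighting may differ.

```python
import numpy as np

def iimm_sketch(image_embs: np.ndarray, label_embs: np.ndarray, labels: np.ndarray) -> float:
    """Sketch of an IIMM-style score from zero-shot embeddings (assumed form).

    image_embs: (N, D) L2-normalized image embeddings from the vision encoder.
    label_embs: (C, D) L2-normalized text embeddings of the class prompts.
    labels:     (N,)   integer ground-truth class index for each image.
    """
    # Intra-modal term: mean pairwise cosine similarity among image embeddings,
    # excluding the self-similarity entries on the diagonal.
    sim_img = image_embs @ image_embs.T
    n = sim_img.shape[0]
    intra = (sim_img.sum() - np.trace(sim_img)) / (n * (n - 1))

    # Inter-modal term: mean cosine similarity between each image embedding and
    # the text embeddings of the *incorrect* labels.
    sim_it = image_embs @ label_embs.T            # (N, C)
    mask = np.ones_like(sim_it, dtype=bool)
    mask[np.arange(n), labels] = False            # drop the correct image-label pairs
    inter = sim_it[mask].mean()

    # Assumed combination: simple average of the two terms.
    return float(0.5 * (intra + inter))
```

Under this sketch, a single forward pass of the target data through the frozen encoders suffices to compute the score, which is the property the abstract highlights.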
- Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 8748–8763. PMLR, July 2021.
- Don’t Stop Learning: Towards Continual Learning for the CLIP Model. arXiv:2207.09248, July 2022.
- Continual Vision-Language Representation Learning with Off-Diagonal Information. arXiv:2305.07437, June 2023.
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 4904–4916. PMLR, July 2021.
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. In Advances in Neural Information Processing Systems, volume 34, pages 9694–9705. Curran Associates, Inc., 2021.
- VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6787–6800, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.
- Contrastive Learning of Medical Visual Representations from Paired Images and Text. In Proceedings of the 7th Machine Learning for Healthcare Conference, pages 2–25. PMLR, December 2022.
- Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning. In First Workshop of Pre-training: Perspectives, Pitfalls, and Paths Forward at ICML 2022, October 2022. arXiv:2203.02053.
- Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier, 1989. ISBN 978-0-12-543324-2.
- Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, March 2017. ISSN 0027-8424, 1091-6490.
- Self-supervised class incremental learning. arXiv:2111.11208, 2021. URL https://arxiv.org/abs/2111.11208.
- How well does self-supervised pre-training perform with streaming data? arXiv:2104.12081, 2022. URL https://arxiv.org/abs/2104.12081.
- Speciality vs Generality: An Empirical Study on Catastrophic Forgetting in Fine-tuning Foundation Models. arXiv:2309.06256, October 2023.
- CLiMB: A continual learning benchmark for vision-and-language tasks. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 29440–29453. Curran Associates, Inc., 2022.
- A unified continuous learning framework for multi-modal knowledge discovery and pre-training. arXiv:2206.05555, 2022. URL https://arxiv.org/pdf/2206.05555.pdf.
- Representation Similarity Analysis for Efficient Task Taxonomy & Transfer Learning. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12379–12388, Long Beach, CA, USA, June 2019. IEEE. ISBN 978-1-72813-293-8.
- Guided Recommendation for Model Fine-Tuning. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3633–3642, Vancouver, BC, Canada, June 2023. IEEE. ISBN 9798350301298.
- Large Scale Fine-Grained Categorization and Domain-Specific Transfer Learning. arXiv:1806.06193, June 2018.
- LEEP: A New Measure to Evaluate Transferability of Learned Representations. In Proceedings of the 37th International Conference on Machine Learning, pages 7294–7305. PMLR, November 2020.
- Transferability and Hardness of Supervised Classification Tasks. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 1395–1405, Seoul, Korea (South), October 2019. IEEE. ISBN 978-1-72814-803-8.
- An Information-Theoretic Approach to Transferability in Task Transfer Learning. In 2019 IEEE International Conference on Image Processing (ICIP), pages 2309–2313, Taipei, Taiwan, September 2019. IEEE. ISBN 978-1-5386-6249-6.
- LogME: Practical assessment of pre-trained models for transfer learning. In International Conference on Machine Learning, pages 12133–12143. PMLR, 2021.
- Frustratingly easy transferability estimation. In International Conference on Machine Learning, pages 9201–9225. PMLR, 2022.
- Foundation model is efficient multimodal multitask model selector. Advances in Neural Information Processing Systems, 36, 2024.
- LOVM: Language-Only Vision Model Selection. In Conference on Neural Information Processing Systems (NeurIPS), 2023.
- Bridge the Modality and Capacity Gaps in Vision-Language Model Selection. arXiv:2403.13797, March 2024.
- Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere. In Proceedings of the 37th International Conference on Machine Learning, pages 9929–9939. PMLR, November 2020.
- Michael Welle. Understanding the Modality Gap in CLIP. In ICLR 2023 Workshop on Multimodal Representation Learning: Perks and Pitfalls, 2023.
- 3D Object Representations for Fine-Grained Categorization. In 2013 IEEE International Conference on Computer Vision Workshops, pages 554–561, Sydney, Australia, December 2013. IEEE. ISBN 978-1-4799-3022-7.
- Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images, 2009.
- Describing Textures in the Wild. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 3606–3613, Columbus, OH, USA, June 2014. IEEE. ISBN 978-1-4799-5118-5.
- Introducing EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification. In IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium, pages 204–207, Valencia, July 2018. IEEE. ISBN 978-1-5386-7150-4.
- Remote Sensing Image Scene Classification: Benchmark and State of the Art. Proceedings of the IEEE, 105(10):1865–1883, October 2017. ISSN 0018-9219, 1558-2256.
- Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32:323–332, August 2012. ISSN 0893-6080.
- Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998. ISSN 0018-9219.
- SUN database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3485–3492, San Francisco, CA, USA, June 2010. IEEE. ISBN 978-1-4244-6984-0.
- Reading Digits in Natural Images with Unsupervised Feature Learning. NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
- EVA-02: A visual representation for neon genesis. arXiv preprint arXiv:2303.11331, 2023.
- Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11975–11986, October 2023.
- CoCa: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters. arXiv:2110.04544, October 2021.
- BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–9, Dublin, Ireland, May 2022. Association for Computational Linguistics.
- LoRA: Low-Rank Adaptation of Large Language Models. In ICLR, 2022. arXiv:2106.09685.
- Three things everyone should know about Vision Transformers. arXiv:2203.09795, March 2022.
- Not all models are equal: Predicting model transferability in a self-challenging fisher space. In European Conference on Computer Vision, pages 286–302. Springer, 2022.
- How far pre-trained models are from neural collapse on the target dataset informs their transferability. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5549–5558, October 2023.
- An extensive comparative study of cluster validity indices. Pattern Recognition, 46(1):243–256, January 2013. ISSN 0031-3203. URL https://doi.org/10.1016/j.patcog.2012.07.021.
- Peter J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, November 1987. ISSN 0377-0427.
- A Cluster Separation Measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2):224–227, April 1979. ISSN 0162-8828, 2160-9292.
- T. Calinski and J. Harabasz. A dendrite method for cluster analysis. Communications in Statistics - Theory and Methods, 3(1):1–27, 1974. ISSN 0361-0926.