It's Not a Modality Gap: Characterizing and Addressing the Contrastive Gap (2405.18570v3)

Published 28 May 2024 in cs.CV, cs.CL, cs.IR, and cs.LG

Abstract: Multi-modal contrastive models such as CLIP achieve state-of-the-art performance in zero-shot classification by embedding input images and texts in a joint representational space. Recently, a modality gap has been reported in two-encoder contrastive models like CLIP, meaning that the image and text embeddings reside in disjoint areas of the latent space. Previous studies suggest that this gap exists due to 1) the cone effect, 2) mismatched pairs in the dataset, and 3) insufficient training. We show that, even when accounting for all these factors, and even when using the same modality, the contrastive loss actually creates a gap during training. As a result, we propose that the modality gap is inherent to the two-encoder contrastive loss and rename it the contrastive gap. We present evidence that attributes this contrastive gap to low uniformity in CLIP space, resulting in embeddings that occupy only a small portion of the latent space. To close the gap, we adapt the uniformity and alignment properties of unimodal contrastive loss to the multi-modal setting and show that simply adding these terms to the CLIP loss distributes the embeddings more uniformly in the representational space, closing the gap. In our experiments, we show that the modified representational space achieves better performance than the default CLIP loss in downstream tasks such as zero-shot image classification and multi-modal arithmetic.
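
The fix described in the abstract maps onto the alignment and uniformity terms of Wang and Isola (2020), originally defined for unimodal contrastive learning. Below is a minimal PyTorch sketch of what adding those two terms to the standard CLIP objective could look like. The function names, the per-modality averaging of the uniformity term, the weights `lam_align`/`lam_unif`, and the defaults (alpha=2, t=2, temperature=0.07) are illustrative assumptions following Wang and Isola's conventions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def clip_loss(img, txt, temperature=0.07):
    """Standard symmetric InfoNCE (CLIP) loss on L2-normalized embeddings."""
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


def alignment(img, txt, alpha=2):
    """Alignment term: pulls each matched image-text pair together."""
    return (img - txt).norm(dim=1).pow(alpha).mean()


def uniformity(x, t=2):
    """Uniformity term: spreads embeddings apart on the unit hypersphere."""
    return torch.pdist(x).pow(2).mul(-t).exp().mean().log()


def modified_clip_loss(img, txt, lam_align=1.0, lam_unif=1.0):
    """CLIP loss plus cross-modal alignment and per-modality uniformity."""
    img = F.normalize(img, dim=1)
    txt = F.normalize(txt, dim=1)
    unif = 0.5 * (uniformity(img) + uniformity(txt))
    return clip_loss(img, txt) + lam_align * alignment(img, txt) + lam_unif * unif
```

Intuitively, the uniformity term penalizes batches whose embeddings cluster in a small cap of the hypersphere (the low-uniformity condition the paper identifies as the cause of the contrastive gap), while the alignment term keeps matched image-text pairs close as the embeddings spread out.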

Authors (3)
  1. Abrar Fahim (4 papers)
  2. Alex Murphy (8 papers)
  3. Alona Fyshe (23 papers)
Citations (2)