Do Vision and Language Encoders Represent the World Similarly? (2401.05224v2)
Abstract: Aligned text-image encoders such as CLIP have become the de facto models for vision-language tasks. Furthermore, modality-specific encoders achieve impressive performance in their respective domains. This raises a central question: does an alignment exist between uni-modal vision and language encoders, since they fundamentally represent the same physical world? Analyzing the latent space structure of vision and language models on image-caption benchmarks using Centered Kernel Alignment (CKA), we find that the representation spaces of unaligned and aligned encoders are semantically similar. Even in the absence of the statistical similarity found in aligned encoders like CLIP, we show that unaligned encoders can be matched without any training. We frame this as a seeded graph-matching problem exploiting the semantic similarity between graphs, and propose two methods: a fast Quadratic Assignment Problem (QAP) optimization, and a novel localized CKA metric-based matching/retrieval. We demonstrate the effectiveness of this approach on several downstream tasks, including cross-lingual and cross-domain caption matching and image classification. Code available at github.com/mayug/0-shot-LLM-vision.
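To make the CKA analysis concrete, here is a minimal sketch, not the authors' released code, of linear CKA between paired embeddings. The arrays `img` and `txt` are synthetic stand-ins for vision- and language-encoder outputs over the same n image-caption pairs; everything else follows the standard linear-CKA formula.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear Centered Kernel Alignment between paired representations.

    X is (n, d1) and Y is (n, d2); row i of each holds the embedding of
    the same image-caption pair from the two encoders.
    """
    # Center each feature dimension so CKA is invariant to mean shifts.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # HSIC-based formulation for linear kernels:
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
img = rng.normal(size=(512, 768))         # stand-in image embeddings
txt = img @ rng.normal(size=(768, 384))   # correlated "caption" embeddings
print(f"CKA: {linear_cka(img, txt):.3f}")  # high: the spaces are related
```

Note that CKA only compares similarity *structure*; it needs no shared dimensionality and no learned mapping between the two spaces, which is what makes it usable across unaligned encoders.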
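The seeded graph-matching step can likewise be sketched with SciPy's FAQ solver (`scipy.optimize.quadratic_assignment` with `method="faq"`, whose `partial_match` option supplies the seed correspondences). The toy graphs below are an illustrative assumption, one graph is a permuted copy of the other, and the paper's actual pipeline over encoder similarity graphs may differ.

```python
import numpy as np
from scipy.optimize import quadratic_assignment

rng = np.random.default_rng(1)
n, k = 60, 10                               # nodes per graph, seed count
G = rng.normal(size=(n, n))
A = G @ G.T                                 # toy intra-modal similarity graph
perm = np.concatenate([np.arange(k), k + rng.permutation(n - k)])
B = A[np.ix_(perm, perm)]                   # second graph: permuted copy of A

# The first k node pairs are assumed known (the "seeds").
seeds = np.column_stack([np.arange(k), np.arange(k)])
res = quadratic_assignment(
    A, B, method="faq",
    options={"maximize": True, "partial_match": seeds},
)
# res.col_ind[i] is the B-node matched to A-node i; the ground truth
# is the inverse permutation, i.e. np.argsort(perm).
acc = np.mean(res.col_ind == np.argsort(perm))
print(f"fraction of nodes matched correctly: {acc:.2f}")
```

A handful of seed pairs is typically enough to anchor the FAQ relaxation toward the semantically consistent matching, which mirrors how the paper uses a few known image-caption pairs to align the remaining, unpaired embeddings.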