Do Vision and Language Encoders Represent the World Similarly?
This presentation explores research into the inherent semantic alignment between independently trained vision and language models. By leveraging Centered Kernel Alignment (CKA) and graph matching techniques, the authors demonstrate that unaligned encoders can communicate across modalities without any joint training, achieving zero-shot image-caption retrieval and classification through structural similarity alone.

Script
If you train a vision model on images and a language model on text separately, do they end up seeing the world in the same way? The researchers explore a fascinating possibility: that semantic structures might emerge identically across different modalities even without explicit alignment training.
Building on this curiosity, the authors investigate whether vision and language encoders are fundamentally similar by examining their latent representation spaces. They use Centered Kernel Alignment, or CKA, to measure whether the inner structures of these different models actually mirror one another.
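CKA itself is simple to compute. Below is a minimal NumPy sketch of linear CKA between two embedding matrices; the study also considers other kernels, and the function name and shapes here are illustrative, not the authors' code.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between embeddings X (n x d1) and Y (n x d2).

    Rows must correspond to the same n inputs (e.g. image i and its
    caption i). Returns a similarity score; it is invariant to
    rotations and isotropic scaling of either embedding space.
    """
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    Kx = H @ (X @ X.T) @ H                # centered Gram (kernel) matrices
    Ky = H @ (Y @ Y.T) @ H
    hsic = np.sum(Kx * Ky)                # unnormalized HSIC estimate
    return hsic / (np.linalg.norm(Kx) * np.linalg.norm(Ky))
```

Because CKA compares only the Gram matrices, the two encoders may have different embedding dimensions, which is what makes comparing a vision model to a language model possible at all.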
To understand how this similarity is calculated, let's look at the core methodology of the study.
The researchers treat image and caption embeddings as inducing graphs that represent the world's structure. They discovered that the highest CKA score always occurs when images and text are correctly paired, providing a signal for alignment without any training.
This diagram illustrates the process of maximizing CKA through graph matching. By solving a Quadratic Assignment Problem, the authors can discover the optimal permutation to match a shuffled set of captions to their corresponding images based solely on geometric structure.
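As a toy illustration of this idea, the snippet below recovers a shuffled pairing by searching for the permutation of captions that maximizes CKA between the two kernel matrices. The paper solves this as a Quadratic Assignment Problem with an approximate solver; here, with only a handful of synthetic pairs, an exhaustive search over permutations stands in for that solver, and the "caption" embeddings are simply a rotated copy of the "image" embeddings.

```python
from itertools import permutations

import numpy as np

def cka(Kx, Ky):
    """CKA between two Gram matrices, centered on the fly."""
    n = Kx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kx, Ky = H @ Kx @ H, H @ Ky @ H
    return np.sum(Kx * Ky) / (np.linalg.norm(Kx) * np.linalg.norm(Ky))

rng = np.random.default_rng(0)
n, d = 6, 4
img = rng.normal(size=(n, d))                 # toy image embeddings
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # unknown rotation between spaces
cap = img @ Q                                 # toy caption embeddings, aligned
shuffle = rng.permutation(n)
cap_shuffled = cap[shuffle]                   # captions arrive out of order

K_img = img @ img.T
K_cap = cap_shuffled @ cap_shuffled.T

# Brute-force "QAP": choose the permutation p of captions maximizing CKA.
best = max(permutations(range(n)),
           key=lambda p: cka(K_img, K_cap[np.ix_(p, p)]))
# best[i] is the index, in the shuffled list, of the caption for image i.
```

Exhaustive search is only feasible for tiny sets (n! permutations); the appeal of the approximate QAP formulation is that it scales this same objective to realistically sized batches.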
Moving from theory to practice, they propose two methods for communication: QAP for global matching and Local CKA for retrieving individual pairs. They also introduce feature stretching to normalize the data, ensuring the kernel comparison is as accurate as possible.
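One plausible sketch of the Local CKA idea: given a small seed set of known image-caption pairs, score each candidate caption for a query image by appending the (query, candidate) pair to the seed set and measuring CKA, then retrieve the highest-scoring candidate. The helper names and seed-set construction below are illustrative assumptions, not the authors' exact implementation, and feature stretching is omitted for brevity.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between row-aligned embedding matrices X and Y."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kx, Ky = H @ X @ X.T @ H, H @ Y @ Y.T @ H
    return np.sum(Kx * Ky) / (np.linalg.norm(Kx) * np.linalg.norm(Ky))

def local_cka_retrieve(query_img, cand_caps, seed_imgs, seed_caps):
    """Index of the candidate caption whose addition to the seed set
    best preserves cross-modal structural similarity."""
    scores = [
        linear_cka(np.vstack([seed_imgs, query_img[None]]),
                   np.vstack([seed_caps, c[None]]))
        for c in cand_caps
    ]
    return int(np.argmax(scores))
```

Note that each retrieval recomputes a full kernel over the seed set per candidate, which is consistent with the computational-cost concerns raised later in the talk.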
Here we see that the quality and scale of training data directly correlate with both representation similarity and matching accuracy. Even self-supervised models like DINOv2 show high semantic similarity with language models, though language-supervised vision models tend to align even more closely.
In experimental tests, the methods showed remarkable performance, particularly in cross-lingual retrieval. Local CKA actually outperformed standard CLIP retrieval when dealing with non-Latin scripts, because these structural methods don't suffer from the same tokenization issues as joint models.
Despite these successes, the authors acknowledge that these algorithms are computationally heavy, with Local CKA requiring more processing time than simple linear baselines. They suggest that future work focusing on CUDA optimization could make this training-free communication more practical for real-time use.
In conclusion, this research shows that semantic structures emerge naturally across modalities: unaligned vision and language encoders can communicate effectively simply by maximizing the structural similarity of their representation spaces, finding common ground through pure geometry. For more deep dives into this and other research, visit EmergentMind.com.