Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion (2502.04263v1)

Published 6 Feb 2025 in cs.CV, cs.AI, and cs.LG

Abstract: Pre-trained multi-modal Vision-Language Models like CLIP are widely used off-the-shelf for a variety of applications. In this paper, we show that the common practice of individually exploiting the text or image encoders of these powerful multi-modal models is highly suboptimal for intra-modal tasks like image-to-image retrieval. We argue that this is inherently due to the CLIP-style inter-modal contrastive loss that does not enforce any intra-modal constraints, leading to what we call intra-modal misalignment. To demonstrate this, we leverage two optimization-based modality inversion techniques that map representations from their input modality to the complementary one without any need for auxiliary data or additional trained adapters. We empirically show that, in the intra-modal tasks of image-to-image and text-to-text retrieval, approaching these tasks inter-modally significantly improves performance with respect to intra-modal baselines on more than fifteen datasets. Additionally, we demonstrate that approaching a native inter-modal task (e.g. zero-shot image classification) intra-modally decreases performance, further validating our findings. Finally, we show that incorporating an intra-modal term in the pre-training objective or narrowing the modality gap between the text and image feature embedding spaces helps reduce the intra-modal misalignment. The code is publicly available at: https://github.com/miccunifi/Cross-the-Gap.

Summary

  • The paper identifies CLIP’s intra-modal misalignment caused by its inter-modal contrastive learning approach.
  • The paper introduces optimization-based textual and visual inversion methods to map native features across modalities.
  • The paper demonstrates that using inter-modal strategies for intra-modal tasks significantly boosts retrieval performance across varied datasets.

Insights into Intra-Modal Misalignment in CLIP Models

The paper "Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion" rigorously investigates a limitation inherent to widely-used Vision-LLMs (VLMs) such as CLIP, which is their intra-modal misalignment. The authors reveal that the common strategy of leveraging pre-trained VLM encoders for intra-modal tasks, like image-to-image or text-to-text retrieval, does not fully utilize the model’s potential due to misaligned intra-modal similarity.

CLIP is trained with an inter-modal contrastive loss whose primary aim is to maximize the similarity between paired image-text samples. This objective implicitly creates a modality gap: visual and textual embeddings cluster in separate regions of the shared embedding space. Consequently, the intra-modal similarity scores among images or among texts in CLIP do not accurately reflect the actual semantic similarities.
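
As a rough illustration, the snippet below contrasts the inter-modal similarities that the contrastive loss shapes with the intra-modal ones it leaves unconstrained. This is a minimal sketch using the open_clip package; the model tag and the random placeholder inputs are assumptions, not the paper's experimental setup.

```python
import torch
import open_clip

# Load a pre-trained CLIP model via open_clip (model tag is illustrative).
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Dummy inputs so the sketch runs end-to-end; replace the random tensors with
# real images passed through `preprocess` to obtain meaningful similarities.
image_batch = torch.randn(4, 3, 224, 224)
captions = ["a photo of a dog", "a photo of a cat",
            "a red sports car", "a bouquet of flowers"]
text_batch = tokenizer(captions)

with torch.no_grad():
    img_feats = model.encode_image(image_batch)
    txt_feats = model.encode_text(text_batch)

img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)

# The contrastive objective only shaped the inter-modal similarities;
# the intra-modal ones were never directly constrained during pre-training.
inter_modal = img_feats @ txt_feats.T   # image-to-text
intra_image = img_feats @ img_feats.T   # image-to-image
intra_text = txt_feats @ txt_feats.T    # text-to-text
```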

Modality Inversion Approach

To address this misalignment, the paper introduces and employs two modality inversion techniques: Optimization-based Textual Inversion (OTI) and Optimization-based Visual Inversion (OVI). These methods map a feature from its native modality to a representation in the complementary modality, operating on individual features and using only the frozen pre-trained encoders, without auxiliary data or additional trained adapters. This transformation makes it possible to exploit the inter-modal alignment that CLIP's training objective actually enforces.
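
The core idea behind OTI can be sketched as follows. This is an illustrative reconstruction rather than the authors' implementation: a toy frozen MLP stands in for CLIP's text encoder, and the initialization, learning rate, and iteration count are assumptions. A learnable pseudo-token embedding is optimized so that the text-side output matches a given image feature.

```python
# Illustrative sketch of optimization-based textual inversion (OTI).
# A toy frozen MLP stands in for CLIP's text encoder; the real method feeds a
# learnable pseudo-token through CLIP's actual text encoder within a prompt.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
embed_dim, token_dim = 512, 512

# Frozen stand-in for the pre-trained text encoder (toy, not CLIP).
text_encoder = nn.Sequential(
    nn.Linear(token_dim, 1024), nn.GELU(), nn.Linear(1024, embed_dim))
for p in text_encoder.parameters():
    p.requires_grad_(False)

# Target: a (normalized) image feature produced by the frozen image encoder.
image_feature = F.normalize(torch.randn(embed_dim), dim=-1)

# The learnable pseudo-token embedding is the only optimized parameter.
pseudo_token = nn.Parameter(torch.randn(token_dim) * 0.02)
optimizer = torch.optim.AdamW([pseudo_token], lr=2e-2)  # assumed hyperparameters

for step in range(500):
    optimizer.zero_grad()
    text_feature = F.normalize(text_encoder(pseudo_token), dim=-1)
    # Pull the text-side representation toward the target image feature.
    loss = 1.0 - F.cosine_similarity(text_feature, image_feature, dim=-1)
    loss.backward()
    optimizer.step()

# `text_feature` now lives on the text side of the embedding space and can be
# compared inter-modally against gallery image features.
```

In the paper, OTI operates on pseudo-tokens passed through CLIP's actual text encoder, and OVI performs the symmetric mapping from the textual to the visual modality.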

Empirical Analysis

The authors conduct comprehensive experiments across various datasets and VLM frameworks, including OpenAI CLIP, OpenCLIP, and SigLIP. The paper shows that handling intra-modal tasks (e.g., image-to-image retrieval) in an inter-modal manner via modality inversion not only aligns with CLIP’s original training objective but also surpasses intra-modal baselines in performance.

For instance, the paper reports that recasting image-to-image retrieval as an inter-modal task via OTI substantially boosts mean Average Precision (mAP) on datasets such as Stanford Cars and Oxford Flowers. The improvement generalizes across multiple architectures, CLIP variants, and backbone sizes.
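
Concretely, the inter-modal retrieval protocol behind these numbers can be summarized as follows. This is a schematic sketch; `oti_invert` is a hypothetical stand-in for the optimization loop sketched above, not a function from the released code.

```python
import torch.nn.functional as F

def retrieve_inter_modal(query_image_feature, gallery_image_features, oti_invert):
    """Rank gallery images for one query by crossing the modality gap.

    query_image_feature: (D,) CLIP image feature of the query.
    gallery_image_features: (N, D) CLIP image features of the gallery.
    oti_invert: callable mapping an image feature to a text-side feature
                (hypothetical helper standing in for OTI).
    """
    query_text_feature = F.normalize(oti_invert(query_image_feature), dim=-1)
    gallery = F.normalize(gallery_image_features, dim=-1)
    # Inter-modal comparison: text-side query against image-side gallery.
    scores = gallery @ query_text_feature
    return scores.argsort(descending=True)

def retrieve_intra_modal(query_image_feature, gallery_image_features):
    """Intra-modal baseline: compare image features directly."""
    q = F.normalize(query_image_feature, dim=-1)
    gallery = F.normalize(gallery_image_features, dim=-1)
    return (gallery @ q).argsort(descending=True)
```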

Conversely, recasting a natively inter-modal task such as zero-shot image classification as an intra-modal comparison decreases performance, further corroborating the central thesis that CLIP's representations are aligned across modalities rather than within them.
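
The reverse experiment can be sketched in the same style. Again this is illustrative only; `ovi_invert` is a hypothetical helper standing in for OVI.

```python
import torch
import torch.nn.functional as F

def classify_inter_modal(image_feature, class_text_features):
    """Standard CLIP zero-shot classification: image vs class-prompt text features."""
    sims = F.normalize(class_text_features, dim=-1) @ F.normalize(image_feature, dim=-1)
    return sims.argmax()

def classify_intra_modal(image_feature, class_text_features, ovi_invert):
    """Intra-modal variant: class prompts are first inverted into the visual
    modality (ovi_invert is a hypothetical stand-in for OVI), then compared
    image-to-image; the paper finds this degrades accuracy."""
    inverted = torch.stack([ovi_invert(t) for t in class_text_features])
    sims = F.normalize(inverted, dim=-1) @ F.normalize(image_feature, dim=-1)
    return sims.argmax()
```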

Theoretical and Practical Implications

Theoretically, these insights prompt a re-evaluation of how embedding spaces are shaped during pre-training. The authors suggest that incorporating intra-modal losses or narrowing the modality gap could alleviate the misalignment. In practice, their findings argue for framing downstream tasks so that they exploit the inter-modal alignment VLMs actually provide, as sketched below.
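
A rough sketch of what an additional intra-modal term could look like follows. This is an illustrative, SLIP-style formulation rather than the paper's exact objective; the weighting `lambda_intra` and the temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def inter_modal_loss(img_feats, txt_feats, temperature=0.07):
    """Standard symmetric InfoNCE between paired image and text features (CLIP-style)."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.T / temperature
    targets = torch.arange(len(img), device=img.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

def intra_modal_loss(feats_view1, feats_view2, temperature=0.07):
    """Contrastive term between two augmented views of the same images
    (illustrative; SLIP's actual SimCLR branch uses a separate projection head)."""
    z1 = F.normalize(feats_view1, dim=-1)
    z2 = F.normalize(feats_view2, dim=-1)
    logits = z1 @ z2.T / temperature
    targets = torch.arange(len(z1), device=z1.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

def total_loss(img_feats, txt_feats, img_view1, img_view2, lambda_intra=1.0):
    # lambda_intra is a hypothetical weighting, not a value from the paper.
    return (inter_modal_loss(img_feats, txt_feats) +
            lambda_intra * intra_modal_loss(img_view1, img_view2))
```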

Moreover, the inclusion of alternative models such as SLIP, whose pre-training adds a self-supervised intra-modal component, shows that addressing intra-modal misalignment at training time can noticeably reduce it and improve intra-modal performance.

Future Directions

The paper opens avenues for further work on efficient, data-independent modality inversion. The computational cost of the proposed per-sample optimization signals a need for approaches that retain its benefits while reducing overhead, ideally bridging modalities without costly iterative procedures.

In conclusion, "Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion" provides a compelling analysis of CLIP's intra-modal limitations and a substantive contribution toward more effective use of VLMs, particularly for intra-modal tasks. The work offers valuable guidance both for refining pre-training objectives and for adapting deployment strategies to exploit these models' inter-modal strengths.