An Examination of Modality Gaps in Image-Text Representation Learning
The challenge of effectively integrating image and text data into a shared representation space is a pivotal concern in the field of vision-language models (VLMs). VLMs, which produce joint embeddings for different modalities, have shown promise in tasks such as multimodal retrieval, clustering, and zero-shot classification. However, a significant obstacle is the "modality gap" phenomenon. This paper addresses the gap with a comprehensive study, introducing novel methodologies grounded in spectral analysis and optimal transport to both quantify and mitigate the issue.
Background and Motivation
The past decade has seen major advances in aligning multimodal embeddings, driven primarily by models like CLIP, which leverage contrastive learning to bring image and text representations into a common space. Nonetheless, the two modalities are often imperfectly aligned in the latent space, producing a distinct modality gap in which text and image embeddings occupy clearly separated regions. The paper argues that existing methodologies do not adequately quantify or reduce this gap, a shortcoming that limits the ability of VLMs to support robust multimodal interaction.
Contribution and Methodologies
The researchers propose two distinct methodologies aimed at reducing the modality gap: spectral techniques and optimal transport.
- Spectral Techniques: These graph-based methods exploit the spectral properties of the graph Laplacian built from an adjacency matrix over image and text embeddings. By projecting nodes (i.e., data points) onto the eigenvectors associated with the smallest nonzero eigenvalues, they map both modalities into a shared space in which similarities are emphasized, thereby potentially narrowing the modality gap (see the sketch after this list).
- Optimal Transport: A separate approach frames the problem as optimal transport, i.e., transforming one distribution into another at minimal cost under marginal constraints. The paper adds Laplacian regularization so that embeddings are transported from the source (image) distribution to the target (text) distribution while respecting local structural constraints (a simplified sketch also follows this list).
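To make the spectral step concrete, the following is a minimal sketch, assuming paired image and text embeddings are already L2-normalized NumPy arrays. The cosine-similarity kNN graph, the normalized Laplacian, and the output dimensionality are illustrative choices, not the paper's exact recipe.

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from scipy.linalg import eigh

def spectral_project(img_emb: np.ndarray, txt_emb: np.ndarray,
                     k: int = 10, dim: int = 32) -> tuple[np.ndarray, np.ndarray]:
    """Project image and text embeddings into a shared spectral space."""
    X = np.vstack([img_emb, txt_emb])            # joint node set: images then texts
    sim = X @ X.T                                # cosine similarity (rows are unit norm)
    np.fill_diagonal(sim, -np.inf)               # exclude self-similarity

    # Sparse symmetric kNN adjacency built from the similarity graph.
    A = np.zeros_like(sim)
    nn = np.argsort(-sim, axis=1)[:, :k]
    rows = np.repeat(np.arange(X.shape[0]), k)
    A[rows, nn.ravel()] = np.clip(sim[rows, nn.ravel()], 0.0, None)
    A = np.maximum(A, A.T)

    # Eigenvectors of the normalized graph Laplacian; skip the trivial
    # eigenvector(s) at eigenvalue ~0 and keep the next `dim` of them.
    L = laplacian(A, normed=True)
    vals, vecs = eigh(L)
    Z = vecs[:, vals > 1e-8][:, :dim]

    n_img = img_emb.shape[0]
    return Z[:n_img], Z[n_img:]                  # shared-space image / text coordinates
```

Because images and texts live on the same graph, nodes that are strongly connected (e.g., an image and its caption) land near each other in the spectral coordinates regardless of their original modality.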
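For the optimal transport step, the sketch below uses plain entropic (Sinkhorn) OT from the POT library with a barycentric projection; the paper's Laplacian-regularized formulation adds a graph-smoothness term on top of this basic recipe, which is omitted here for brevity.

```python
import numpy as np
import ot  # POT: pip install pot

def ot_align(img_emb: np.ndarray, txt_emb: np.ndarray, reg: float = 0.05) -> np.ndarray:
    """Transport image embeddings (source) toward text embeddings (target)."""
    n, m = img_emb.shape[0], txt_emb.shape[0]
    a = np.full(n, 1.0 / n)                   # uniform source weights
    b = np.full(m, 1.0 / m)                   # uniform target weights
    M = ot.dist(img_emb, txt_emb, metric="sqeuclidean")
    M /= M.max()                              # rescale costs for numerical stability
    G = ot.sinkhorn(a, b, M, reg)             # entropic transport plan (n x m)
    # Barycentric projection: each image embedding moves to the weighted mean
    # of the text embeddings it is coupled with.
    return (G @ txt_emb) / G.sum(axis=1, keepdims=True)
```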
Experimental Validation
The efficacy of these methodologies is examined through rigorous experimentation involving model families like CLIP, SigLIP, and LLM2CLIP, evaluated over datasets such as COCO and Conceptual Captions. Several metrics are introduced to measure improvements quantitatively:
- Heterogeneity Indices (ITR and TIR): These metrics gauge how often queries retrieve results from their own modality, a direct measure of the modality gap's impact on retrieval (an illustrative computation follows this list).
- Fréchet Inception Distance (FID): Commonly used to evaluate generative models, FID here assesses the statistical closeness between the distributions of image and text embeddings after intervention (see the sketch below).
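The paper's exact ITR/TIR definitions are not reproduced here; the following is one illustrative reading of a retrieval-bias score of this kind: for each image query, retrieve from a mixed pool of image and text embeddings and measure how often the top hit comes from the query's own modality. All inputs are assumed to be unit-norm arrays, with queries held out of the pool.

```python
import numpy as np

def same_modality_rate(queries: np.ndarray, img_pool: np.ndarray,
                       txt_pool: np.ndarray) -> float:
    """Fraction of image queries whose nearest neighbor in the mixed pool is also an image."""
    pool = np.vstack([img_pool, txt_pool])
    labels = np.array([0] * len(img_pool) + [1] * len(txt_pool))  # 0 = image, 1 = text
    sims = queries @ pool.T                   # cosine similarity (unit-norm inputs)
    top = np.argmax(sims, axis=1)             # index of the best hit per query
    return float(np.mean(labels[top] == 0))   # queries here are images (label 0)
```

A score near 1.0 indicates strongly modality-biased retrieval; a well-mixed embedding space drives it toward the image fraction of the pool.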
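For the FID-style comparison, the standard Fréchet distance between two Gaussians fitted to the embedding sets can be computed as below; this is the usual FID formula applied to VLM embeddings rather than Inception features.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    covmean = sqrtm(cov_x @ cov_y)
    if np.iscomplexobj(covmean):              # discard numerical imaginary parts
        covmean = covmean.real
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean))
```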
The results reveal a substantial reduction in modality bias when the spectral and optimal transport methods are applied. Spectral techniques prove particularly promising, delivering notably improved recall in retrieval tasks compared to the original embeddings and bringing image and text embeddings into closer, semantically coherent proximity.
Implications and Future Directions
This work advances the field by providing a set of generalizable, model-agnostic methods to measure and mitigate modality gaps. The introduction of new metrics and embedding transformation techniques paves the way for enhanced alignment in multimodal representation spaces. Future research could build on these foundations by improving computational efficiency and benchmarking against additional models and datasets. Moreover, a better theoretical understanding of why these methods succeed could point toward further improvements in alignment strategies.
Overall, this paper makes a significant contribution to multimodal learning, offering researchers effective tools and insights for overcoming one of the key challenges hindering the full potential of vision-language models.