Modality Gap in Multi-Modal Models

Updated 4 December 2025
  • Modality gap is a geometric and statistical phenomenon in multi-modal learning where embeddings from different modalities diverge in a shared latent space.
  • It produces intra-modal ranking bias, inter-modal fusion failures, and cross-modal transfer inefficiencies that degrade performance on diverse downstream tasks.
  • Mitigation strategies include temperature control, embedding standardization, and optimal transport mapping to effectively align modality-specific representations.

The modality gap is a geometric and statistical phenomenon observed in multi-modal representation learning, especially in contrastive vision-language models and speech LLMs. It denotes the systematic separation or misalignment between embeddings produced from distinct modalities (such as images and text, or speech and text) when projected into a putatively shared latent space. This separation manifests as clusters, offsets, or distinct cones in the embedding space, adversely impacting tasks that require accurate cross-modal comparison, fusion, or transfer. Despite advances in pretraining techniques such as CLIP, the modality gap persists as a dominant factor shaping ranking bias, fusion failures, and knowledge-transfer inefficiencies across a diverse spectrum of downstream tasks.

1. Formal Characterization of the Modality Gap

The modality gap is most commonly quantified by the offset between modality centers, intra/inter-modal similarity statistics, or divergence metrics in the latent space. For generalized contrastive models, let $x_i \in \mathbb{R}^d$ and $t_i \in \mathbb{R}^d$ be normalized embeddings of images and texts (or analogously, speech and text) for $N$ paired samples. Several canonical measurements include the following (a computational sketch appears below):

  • Centroid offset:

$$\Delta_{\mathrm{gap}} = \left\|\frac{1}{N} \sum_{i=1}^N x_i - \frac{1}{N} \sum_{i=1}^N t_i \right\|_2$$

as in (Liang et al., 2022, Fahim et al., 28 May 2024, An et al., 18 Dec 2024, Li et al., 25 Jul 2025).

  • Cosine similarity statistics (Huang et al., 12 Jul 2025):
    • Positive pairs: $\operatorname{pos} = \frac{1}{N} \sum_{i=1}^N \cos(x_i, t_{y_i})$
    • Negative pairs: $\operatorname{neg} = \frac{1}{N} \sum_{i=1}^N \frac{1}{K-1} \sum_{j \neq y_i} \cos(x_i, t_j)$, where $y_i$ is the label of sample $i$ and $K$ the number of classes
  • Wasserstein-2 distance for distribution alignment (Zhao et al., 3 Dec 2025):

$$W_2^2(p_{\mathrm{mod}^c}, p_{\mathrm{mod}'^c}) = \|\mu_{\mathrm{mod}^c} - \mu_{\mathrm{mod}'^c}\|_2^2 + \mathrm{Tr}\!\left(\Sigma_{\mathrm{mod}^c} + \Sigma_{\mathrm{mod}'^c} - 2 \left[\Sigma_{\mathrm{mod}^c}^{1/2} \Sigma_{\mathrm{mod}'^c} \Sigma_{\mathrm{mod}^c}^{1/2}\right]^{1/2}\right)$$

The gap is observed empirically as nonzero in almost all modern contrastive models regardless of encoder similarity, training corpus, or modality pair (Fahim et al., 28 May 2024, Schrodi et al., 11 Apr 2024, An et al., 18 Dec 2024).
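
For concreteness, all three diagnostics can be computed directly from paired embeddings. The following is a minimal NumPy/SciPy sketch (function and variable names are illustrative, not taken from the cited papers); it assumes L2-normalized, paired embeddings, averages the negative statistic over all mismatched pairs rather than the class-wise form above, and fits per-modality Gaussians for the Wasserstein-2 term.

```python
import numpy as np
from scipy.linalg import sqrtm

def modality_gap_stats(x, t):
    """Centroid offset and cosine statistics for paired, L2-normalized embeddings.

    x: (N, d) embeddings from one modality (e.g. images).
    t: (N, d) embeddings from the other modality; row i is paired with x[i].
    """
    # Centroid offset Delta_gap: distance between the two modality means.
    centroid_gap = np.linalg.norm(x.mean(axis=0) - t.mean(axis=0))

    sims = x @ t.T                        # cosine similarities (inputs are unit-norm)
    n = len(x)
    pos = np.diag(sims).mean()            # matched (positive) pairs
    neg = (sims.sum() - np.trace(sims)) / (n * (n - 1))   # all mismatched pairs
    return {"centroid_gap": float(centroid_gap), "pos": float(pos), "neg": float(neg)}

def gaussian_w2_squared(x, t):
    """Squared Wasserstein-2 distance between Gaussian fits of the two embedding clouds.

    Assumes enough samples for stable covariance estimates (N well above d).
    """
    mu_x, mu_t = x.mean(axis=0), t.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_t = np.cov(t, rowvar=False)
    root_x = sqrtm(cov_x)
    cross = sqrtm(root_x @ cov_t @ root_x)   # [Sigma_x^{1/2} Sigma_t Sigma_x^{1/2}]^{1/2}
    w2 = np.sum((mu_x - mu_t) ** 2) + np.trace(cov_x + cov_t - 2 * cross).real
    return float(w2)
```

A large centroid offset together with a positive-pair similarity only slightly above the negative-pair average is the typical signature of a pronounced modality gap.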

2. Origin and Dynamics of the Modality Gap

The modality gap emerges from both architectural and optimization choices:

  • Cone contraction at initialization: Independent deep encoders spontaneously map input data into tight, modality-specific cones in high-dimensional space (Liang et al., 2022).
  • Contrastive loss dynamics: InfoNCE and related contrastive objectives, especially at low temperature $\tau$, drive strong cross-modal separation to maximize hard-negative repulsion (Yaras et al., 10 Dec 2024, Fahim et al., 28 May 2024); a minimal sketch of this objective appears below. The gap persists even with matched architectures and data.
  • Information imbalance: If one modality (e.g., text) omits semantic attributes present in the other, the cross-modal loss cannot achieve tight alignment, and the model compensates by shifting the modality clouds apart in critical dimensions (Schrodi et al., 11 Apr 2024).
  • Gradient flow coupling: Joint learning of inverse temperature and encoders couples gap closure rate to temperature dynamics; for standard exponential parameterizations, the gap closes only logarithmically in training time (Yaras et al., 10 Dec 2024).

Empirical studies confirm that the gap is nearly inevitable and is exacerbated by hard negatives, mismatched data, low contrastive temperature, and incomplete cross-modal supervision. Fine-grained representations in speech LLMs show increasing alignment in direction but often divergence in magnitude (Xiang et al., 14 Oct 2025).
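
To make the loss-dynamics point concrete, the sketch below computes the value of a CLIP-style symmetric InfoNCE objective with an explicit temperature $\tau$ (NumPy, forward pass only; names are illustrative and this is not any specific paper's training code). Smaller $\tau$ sharpens the softmax over negatives and strengthens hard-negative repulsion, the mechanism cited above.

```python
import numpy as np

def symmetric_info_nce(img, txt, tau=0.07):
    """Value of the symmetric InfoNCE loss for L2-normalized paired embeddings.

    img, txt: (N, d) arrays; row i of img is paired with row i of txt.
    tau: contrastive temperature; lower values emphasize hard negatives.
    """
    logits = img @ txt.T / tau            # (N, N) scaled similarity matrix
    targets = np.arange(len(img))         # the matching pair sits on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)                        # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[targets, targets].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```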

3. Effects and Consequences in Downstream Tasks

The modality gap impacts a range of downstream tasks. Notable effects include:

  • Intra-modal ranking bias: Queries preferentially retrieve same-modality items because intra-modal similarities are higher, so relevant cross-modal items are suppressed (Li et al., 25 Jul 2025, Yamashita et al., 27 Nov 2025); see the sketch below.
  • Inter-modal fusion failure: Linear or nonlinear fusion of modalities is suboptimal; multimodal documents interpolate outside semantic regions unless the gap is removed (Li et al., 25 Jul 2025).
  • Transfer inefficiency: In cross-modality transfer, a larger modality gap correlates with defective knowledge reuse and increased error when source features are applied after adaptation (Ma et al., 27 Jun 2024).
  • Semantic segmentation: Pixel-level or region-level misalignment persists when prototypes are defined in the text space rather than vision space (Xu et al., 27 Dec 2024).
  • Few-shot learning and recommendation: Class prototypes derived from text are unreliable for image feature matching unless the modality gap is closed (Yang et al., 28 Dec 2024, Ganhör et al., 23 Sep 2025).
  • Cold-start and missing-modality settings: Separately trained multi-branch models are vulnerable to missing modality; single-branch weight sharing with contrastive loss narrows intra-item gaps and improves robustness (Ganhör et al., 23 Sep 2025).
  • Speech-language tasks: Exposure bias in inference aggravates the gap, causing hidden representations for speech and text to diverge as generation proceeds (Fang et al., 2023, Liu et al., 2020, Xiang et al., 14 Oct 2025).

Notably, excessive gap closure can induce overspecialization or loss of generalization if performed indiscriminately (Huang et al., 12 Jul 2025).
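
The intra-modal ranking bias above can be probed directly on a mixed-modality candidate pool. This is a hypothetical diagnostic, not a procedure from the cited papers: it reports how often the top-$k$ retrieved items share the query's modality, and a value far above that modality's share of the pool indicates the bias.

```python
import numpy as np

def intra_modal_hit_rate(queries, same_mod_pool, cross_mod_pool, k=10):
    """Fraction of top-k retrieved items that share the query's modality.

    All inputs are L2-normalized embeddings. With a pronounced modality gap,
    intra-modal similarities dominate, so this rate stays high even when the
    relevant items live in the other modality.
    """
    pool = np.vstack([same_mod_pool, cross_mod_pool])        # mixed corpus
    is_same_modality = np.arange(len(pool)) < len(same_mod_pool)
    sims = queries @ pool.T                                  # cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]                  # indices of top-k items
    return float(is_same_modality[topk].mean())
```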

4. Mitigation and Post-processing Strategies

Approaches to reduce or compensate for the modality gap are multifaceted and continue to evolve:

  • Temperature control and scheduling: Fixing, increasing, or scheduling the contrastive temperature leads to faster gap closure and better alignment (Yaras et al., 10 Dec 2024).
  • Modality-mixing: Hard or soft swapping of feature coordinates across modalities at training time breaks parallel-plane separation (Yaras et al., 10 Dec 2024).
  • Post-hoc embedding standardization: Subtracting modality-specific means and renormalizing aligns centroids in the latent space, directly minimizing the gap (An et al., 18 Dec 2024, Li et al., 25 Jul 2025, Role et al., 6 May 2025); a minimal sketch follows below.
  • Spectral and optimal transport mapping: Spectral graph embedding or Laplacian-regularized optimal transport yield cross-modal embeddings with minimized modality separation, dramatically improving recall and balance in retrieval (Role et al., 6 May 2025).
  • Modality-gap-adaptive continual learning: MG-CLIP restricts fine-tuning epochs to preserve inter-modal geometry within a small tolerance, and adds intra-modal classifiers to recover plasticity (Huang et al., 12 Jul 2025).
  • Similarity standardization: Calibrating raw similarity scores across modalities by z-scoring with learned mean/variance or using pseudo-positive samples for zero-shot score scaling (Yamashita et al., 27 Nov 2025).
  • Orthogonal feature decoupling and coupled knowledge distillation: In multi-modal tracking and segmentation, separating style (global stats) from content (instance-normalized structure) and distilling only content leads to gap elimination (Lu et al., 15 Oct 2024, Xu et al., 27 Dec 2024).
  • Diffusion models for modality bridging: Generative mapping (Diffusion-Link) from audio to text embedding clouds yields semantic alignment and preserves text topology (Nam et al., 13 Oct 2025, Zhao et al., 3 Dec 2025).
  • Modality-adaptive ensembling and separation: Instance-level divergence estimates or discrepancy metrics can inform routing, annotation, adaptive ensemble weighting, and active data selection in UDA/ADA frameworks (Li et al., 7 Aug 2025).

Combinations of these strategies—such as trainable batch normalization layers, linear cross-modal mapping, KL-divergence regularizers, or region-level contrastive losses—have shown robust and scalable gap reduction effects across diverse domains.
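
As a concrete instance of post-hoc embedding standardization, the sketch below (a minimal version of mean-centering, not the exact procedure of any single cited paper) estimates a per-modality mean on held-out samples, subtracts it, and re-projects embeddings to the unit sphere so that the two clouds share a common centroid before retrieval or fusion.

```python
import numpy as np

def fit_modality_mean(embeddings):
    """Estimate the modality-specific mean on a held-out sample of embeddings."""
    return embeddings.mean(axis=0)

def center_and_renormalize(embeddings, modality_mean):
    """Subtract the modality mean and re-project rows onto the unit sphere."""
    centered = embeddings - modality_mean
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)

# Usage: estimate one mean per modality, then apply to both sides of retrieval.
# img_mean, txt_mean = fit_modality_mean(img_dev), fit_modality_mean(txt_dev)
# img_aligned = center_and_renormalize(img_emb, img_mean)
# txt_aligned = center_and_renormalize(txt_emb, txt_mean)
```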

5. Empirical Validation and Quantitative Impact

Large-scale experiments across vision, language, speech, audio, SAR-optical, and recommendation domains confirm the significance of the modality gap and the benefits of principled bridging:

| Application/Model | Modality Gap Impact (Metric) | Gap Closure Method | Empirical Gains | Reference |
|---|---|---|---|---|
| CLIP retrieval | Recall@20 ≈ 0 for image→text | Spectral/OT embedding | Recall@20 > 80% | (Role et al., 6 May 2025) |
| Mixed search | NDCG@10 up to +26pp over baseline | Mean-centering | >4pp NDCG gain, 75× compute ↓ | (Li et al., 25 Jul 2025) |
| Class-incremental | Negative pair similarity drops | MG-CLIP (MGP+MGC) | +5.64pp last acc, +1.33pp zero-shot | (Huang et al., 12 Jul 2025) |
| Region segmentation | mIoU gain of +4.8 | VPL + region contrast | Gap metric shrinks from 0.76→0.51 | (Xu et al., 27 Dec 2024) |
| Speech translation | Representation gap G(s ‖ x) ↑ | Scheduled sampling + KL | +1.7 BLEU avg (MuST-C 8-way) | (Fang et al., 2023) |
| Audio captioning | Cosine sim up from 0.486→0.688 | Diffusion-Link | +52.5% CIDEr, +7.5% supervised | (Nam et al., 13 Oct 2025) |
| Recommendation | Intra-item CS ↑, prediction acc ↓ | Single-branch, InfoNCE | +10-50% NDCG cold/missing-modality | (Ganhör et al., 23 Sep 2025) |
| ReID (Optical-SAR) | R1 accuracy +16.4pp S→O | MCRL + fusion/diffusion | Diag-W2 loss, BBDM fusion | (Zhao et al., 3 Dec 2025) |

A plausible implication is that future multi-modal models—whether for transfer, retrieval, translation, or adaptation—should robustly monitor and correct for the modality gap during both pretraining and fine-tuning, leveraging the growing suite of calibration and bridging methods now available from recent literature.

6. Open Problems and Future Research Directions

Gap closure is not always unconditionally beneficial. Overspecialization, loss of pre-trained knowledge, and generalization degradation can result from naive alignment (Huang et al., 12 Jul 2025, Schrodi et al., 11 Apr 2024). Current limitations include:

  • Residual imbalance in dynamic systems: Modality gaps may re-emerge as retrieval databases or input domains drift (Yamashita et al., 27 Nov 2025).
  • Complex multimodal entanglement: Full covariance structure and higher-order moments are often neglected; future work may explore non-linear realignment and large-scale OT methods for covariance matching (Role et al., 6 May 2025, Zhao et al., 3 Dec 2025).
  • Marginal distribution shift: Instance-level divergence and adaptive weighting remain open in highly heterogeneous or low-resource datasets (Li et al., 7 Aug 2025, Ma et al., 27 Jun 2024).
  • Unsupervised and cross-domain adaptation: Extending current gap metrics and bridging methods to multi-language, audio-visual, or multi-sensor environments is an active research target.
  • Efficient online calibration: Lightweight, incremental estimators for modal means/variances in streaming or evolving collections remain underexplored (Yamashita et al., 27 Nov 2025).
  • Theoretical limits on gap closure: The optimal trade-off between modality gap magnitude and downstream performance, bias minimization, and fairness is not yet fully characterized (Liang et al., 2022, Schrodi et al., 11 Apr 2024, Fahim et al., 28 May 2024).

7. Conceptual Significance and Broader Connections

The modality gap is not merely a technical obstacle but encodes deeper properties of contrastive and multi-modal learning architectures:

  • It reflects the retained semantic knowledge and memory of the source modality post-adaptation (Huang et al., 12 Jul 2025, Ma et al., 27 Jun 2024).
  • It mediates generalization, fairness, and transfer efficiency; both insufficient and excessive gap can directly impact bias or discrimination (Liang et al., 2022, Schrodi et al., 11 Apr 2024).
  • Its ablation, quantification, and calibration force the reexamination of fusion and alignment strategies, template and prompt selection, and the design of future cross-modal backbone architectures.

Continued theoretical, algorithmic, and empirical inquiry into the modality gap is essential for robust, generalizable, and cross-domain machine learning.
