
Modality Gap Reduction in Multimodal Systems

Updated 7 December 2025
  • Modality gap reduction is the process of minimizing representational differences between modality-specific embeddings in a shared latent space.
  • Techniques such as per-dimension normalization, batch normalization adjustments, temperature scheduling, and optimal transport alignment systematically close the modality gap.
  • These strategies enhance cross-modal retrieval, transferability, and robustness, leading to improved metrics like CIDEr, NDCG, and FID in multimodal systems.

Modality gap reduction refers to the suite of methodologies designed to minimize or eliminate the systematic representational discrepancies between different modalities—such as image, text, audio, and others—in a shared latent or embedding space. In contrastive and multimodal learning architectures, the modality gap manifests as a persistent separation between the distributions or centroids of embeddings produced by modality-specific encoders, even after contrastive alignment training. This separation impedes cross-modal retrieval, transfer, generalization, and robustness, and is increasingly recognized as a critical factor in the efficacy and safety of multimodal systems.

1. Formalization and Quantification of the Modality Gap

The modality gap is most commonly defined as the (typically Euclidean) distance between the mean (centroid) vectors of the embeddings for each modality in a shared space. For CLIP-style architectures, with $v_n$ and $t_n$ denoting the visual and textual embeddings for item $n$ in $\mathbb{R}^d$, standard metrics include:

  • Centroid gap: $\Delta = \|\overline{v} - \overline{t}\|_2$, with $\overline{v} = \frac{1}{N}\sum_n v_n$ and $\overline{t}$ defined likewise for text
  • Cosine gap: $\Delta_{\mathrm{cos}} = \mathbb{E}_n[1 - \cos(v_n, t_n)]$
  • Variance gap: Differences in per-dimension standard deviations across modalities (Fonseca et al., 3 Dec 2025)
  • Downstream metrics: Modality-induced bias in retrieval, such as the image–text retrieval bias ratio, mean-rank disparities, or the Fréchet Inception Distance (FID) between modality-wise Gaussians (Role et al., 6 May 2025)
  • Task-dependent proxies: Unsafe output rates in VLMs have been shown to correlate strongly ($r \approx 0.7$–$0.9$) with the modality gap as measured by the Modality Integration Rate (MIR) or $\ell_2$ distances at the input to the VLM transformer (Yang et al., 30 May 2025).
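
The first two metrics are straightforward to compute; the following numpy sketch (variable names are ours, not drawn from any cited paper) illustrates both:

```python
import numpy as np

def modality_gap_metrics(V, T):
    """Centroid gap and cosine gap for paired embeddings.

    V, T: (N, d) arrays of visual and textual embeddings, row n of each
    corresponding to the same item n.
    """
    # Centroid gap: Euclidean distance between the two modality means.
    centroid_gap = float(np.linalg.norm(V.mean(axis=0) - T.mean(axis=0)))
    # Cosine gap: mean of 1 - cos(v_n, t_n) over matched pairs.
    cos = np.sum(V * T, axis=1) / (
        np.linalg.norm(V, axis=1) * np.linalg.norm(T, axis=1))
    cosine_gap = float(np.mean(1.0 - cos))
    return centroid_gap, cosine_gap
```

Both metrics vanish when the two embedding sets coincide and grow as the modality clouds drift apart.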

The persistence of this gap is both a geometric consequence of initialization (differently aligned random output cones) and an optimization artifact, especially under contrastive learning with low temperature (Liang et al., 2022, Yaras et al., 10 Dec 2024).

2. Architectural and Statistical Correction Methods

Several approaches perform gap reduction at the embedding or architectural level, focusing on closed-form or learnable normalization or bias-shifting techniques.

Per-dimension Mean and Variance Correction

TOMCap (Fonseca et al., 3 Dec 2025) computes per-dimension means ($\mu^T$, $\mu^I$) and standard deviations ($\sigma^T$, $\sigma^I$) for each modality, then applies the affine transformation

$t'_n[d] = (t_{n,d} - \mu^T_d)\,\frac{\sigma^I_d}{\sigma^T_d} + \mu^I_d$

to align text embeddings to the image embedding distribution, and symmetrically for the reverse. This mapping is applied as a preprocessing step, followed by injection of small Gaussian noise to ensure robustness to any residual gap. Empirical ablations demonstrate that mean+std correction yields substantial improvements in downstream CIDEr compared to mean-only or no correction.
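
The affine correction above amounts to per-dimension moment matching. A minimal numpy sketch (ours, not the authors' code; the `eps` guard is an implementation assumption):

```python
import numpy as np

def match_moments(T_emb, I_emb, eps=1e-8):
    """Map text embeddings onto the image distribution per dimension:
    t' = (t - mu_T) * sigma_I / sigma_T + mu_I.

    T_emb, I_emb: (N, d) text and image embedding matrices.
    """
    mu_T, mu_I = T_emb.mean(axis=0), I_emb.mean(axis=0)
    sd_T = T_emb.std(axis=0) + eps  # eps avoids division by zero
    sd_I = I_emb.std(axis=0) + eps
    return (T_emb - mu_T) * (sd_I / sd_T) + mu_I
```

After the transformation, the per-dimension means and standard deviations of the corrected text embeddings match those of the image embeddings.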

Post-hoc Centering and BatchNorm Schemes

I0T (An et al., 18 Dec 2024) removes the per-dimension modality mean and rescales each embedding to unit norm:

$x'_i = \mathrm{Normalize}(x_i - \overline{x}), \quad y'_i = \mathrm{Normalize}(y_i - \overline{y})$

where $x_i$, $y_i$ are L2-normalized image and text embeddings and $\overline{x}$, $\overline{y}$ are modality-wise means. Alternatively, plugging two BatchNorm layers at the end of each encoder and updating only the BN parameters (not the encoder weights) achieves a moderate but robust reduction in the modality gap while preserving semantic geometry.
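
The centering-plus-renormalization variant fits in a few lines of numpy (a sketch of the idea, not the released implementation):

```python
import numpy as np

def center_and_renormalize(X, Y):
    """Post-hoc correction: subtract each modality's mean embedding,
    then rescale every vector back to unit L2 norm."""
    def step(Z):
        Zc = Z - Z.mean(axis=0)          # remove the modality-wise mean
        return Zc / np.linalg.norm(Zc, axis=1, keepdims=True)
    return step(X), step(Y)
```

Because the dominant component of the gap is a shared offset per modality, removing it typically collapses most of the centroid distance while leaving pairwise angles (and hence retrieval rankings) largely intact.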

Modality-Specific Batch Normalization

In visible–infrared person re-identification, Modality Batch Normalization (MBN) (Li et al., 2021) computes normalization statistics on modality-specific sub-batches instead of the whole batch, abolishing the intra- and inter-batch distribution gaps otherwise induced by global BN. This enables the network to align activation distributions across modalities with negligible computational burden.
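
The core of MBN is simply computing normalization statistics per modality sub-batch. A minimal sketch (learnable affine parameters and running statistics omitted for brevity; this is an illustration, not the paper's code):

```python
import numpy as np

def modality_batch_norm(feats, modality_ids, eps=1e-5):
    """Normalize each modality's sub-batch with its own statistics
    rather than global batch statistics.

    feats: (N, d) activations; modality_ids: (N,) integer modality labels.
    """
    out = np.empty_like(feats, dtype=float)
    for m in np.unique(modality_ids):
        idx = modality_ids == m
        sub = feats[idx]
        out[idx] = (sub - sub.mean(axis=0)) / np.sqrt(sub.var(axis=0) + eps)
    return out
```

Each modality's activations end up zero-mean and unit-variance, so a distributional offset between modalities cannot survive the normalization step.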

3. Optimization and Training-Time Strategies

Certain methods manipulate the training dynamics or explicit regularization to drive modalities closer.

Temperature Scheduling and Loss Design

Low temperature in InfoNCE-type losses creates repulsive minima between modality clusters (Liang et al., 2022, Yaras et al., 10 Dec 2024). Scheduling or fixing a moderately high temperature ($\tau = 0.02$–$0.05$) or slowing the learning rate of the temperature parameter results in a much smaller modality gap and, empirically, improved retrieval metrics.

Temperature control can be further combined with “modality swapping,” where matched pairs are randomly swapped between modalities (hard or soft), so as to force the model to treat modalities as interchangeable and push their embeddings to overlap (Yaras et al., 10 Dec 2024).
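
To make the temperature's role concrete, here is a minimal numpy sketch of the symmetric InfoNCE objective (names are ours; a real implementation would use an autodiff framework). Smaller `tau` sharpens the softmax, which is the mechanism behind the repulsion discussed above; modality swapping amounts to exchanging matched $v_n$ and $t_n$ before computing this loss:

```python
import numpy as np

def symmetric_info_nce(V, T, tau=0.03):
    """Symmetric InfoNCE over matched (v_n, t_n) pairs.

    V, T: (N, d) embedding matrices; tau: softmax temperature.
    """
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    logits = V @ T.T / tau                     # (N, N) cosine similarities / tau

    def xent(L):                               # cross-entropy, targets on the diagonal
        L = L - L.max(axis=1, keepdims=True)   # stabilize the softmax
        logp = L - np.log(np.exp(L).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

The loss is low when each $v_n$ is most similar to its own $t_n$ and high when the pairing is scrambled.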

Cross-Modal Mapping and Alignment Losses

Global linear mapping with residual connections, as in Cross-Modal Mapping (CMM) (Yang et al., 28 Dec 2024), learns a transformation $W$ that aligns the distribution of image features to that of text features:

$v' = \hat{v} W + \hat{v}$

with a triplet loss to locally cluster image features around their textual prototypes. Unlike adapter- or cache-based methods, this approach is lightweight and achieves state-of-the-art few-shot and OOD performance while directly shrinking matching-pair and non-matching-pair distances.
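
The forward mapping and a generic hinge triplet loss can be sketched as follows (an illustrative rendering under our own naming; the margin value and loss form are assumptions, not taken from the paper):

```python
import numpy as np

def cmm_forward(v_hat, W):
    """Residual linear mapping: v' = v_hat @ W + v_hat.
    With W = 0 this reduces to the identity, so training starts from
    the original features."""
    return v_hat @ W + v_hat

def triplet_loss(v, t_pos, t_neg, margin=0.2):
    """Hinge triplet loss pulling each image feature toward its textual
    prototype t_pos and away from a negative prototype t_neg."""
    d_pos = np.linalg.norm(v - t_pos, axis=1)
    d_neg = np.linalg.norm(v - t_neg, axis=1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))
```

The residual form is what keeps the method lightweight: only the single matrix $W$ is learned, and the identity path preserves the pretrained feature geometry.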

Regularization terms (e.g., ReGap in VLM pretraining (Yang et al., 30 May 2025)) penalize the $\ell_2$ distance between projected image and text token embeddings at the first layer, with a coefficient $\alpha$ tuned to the initialization scale, yielding significant improvements in safety alignment without utility loss.

4. Post-Hoc Embedding-Space Alignment Methods

Recent advances have highlighted post-hoc spectral and optimal transport techniques for strong, model-agnostic reduction of the modality gap.

Spectral Alignment

Given image and text embedding matrices $X$, $Y$, the spectral method forms a bipartite similarity graph and re-embeds all points as the lowest-$k$ eigenvectors of the random-walk Laplacian. This process tightly aligns image and text vectors on a shared manifold, collapsing heterogeneity indices and dramatically improving retrieval recall (Role et al., 6 May 2025).
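
The pipeline can be sketched in dense numpy as follows. This is a schematic rendering under our own assumptions: a Gaussian kernel for the cross-modal affinities and illustrative values of `sigma` and `k`, neither taken from the paper (which may, e.g., sparsify the graph):

```python
import numpy as np

def spectral_align(X, Y, k=2, sigma=1.0):
    """Re-embed image (X) and text (Y) points via the lowest nontrivial
    eigenvectors of the random-walk Laplacian of a bipartite graph."""
    Nx, Ny = len(X), len(Y)
    # Cross-modal affinities only: the graph is bipartite.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    S = np.exp(-d2 / (2 * sigma ** 2))
    W = np.block([[np.zeros((Nx, Nx)), S],
                  [S.T, np.zeros((Ny, Ny))]])
    # Random-walk Laplacian L_rw = I - D^{-1} W.
    L_rw = np.eye(Nx + Ny) - W / W.sum(axis=1, keepdims=True)
    vals, vecs = np.linalg.eig(L_rw)
    order = np.argsort(vals.real)
    emb = vecs.real[:, order[1:k + 1]]  # drop the constant eigenvector
    return emb[:Nx], emb[Nx:]
```

Both modalities land in the same $k$-dimensional spectral coordinates, so there is no residual modality offset by construction.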

Optimal Transport Alignment

Laplacian-regularized optimal transport computes an assignment matrix $\gamma^*$ that couples the image and text distributions, regularized by intra-modality similarity. New embeddings $\widetilde{X} = \gamma^* X$, $\widetilde{Y} = (\gamma^*)^T Y$ are then used for downstream tasks. This yields a large decrease in FID and mean squared gap, with competitive (though not always maximal) downstream improvements.
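
A plain entropic-OT sketch of the coupling step (the Laplacian intra-modality regularizer is omitted for brevity; `reg` and `n_iter` are illustrative, inputs are assumed L2-normalized, and the barycentric projection used here is a standard choice that may differ in detail from the paper's formulation):

```python
import numpy as np

def sinkhorn_coupling(X, Y, reg=1.0, n_iter=1000):
    """Entropy-regularized OT between two embedding clouds via plain
    Sinkhorn iterations. Returns a coupling with uniform marginals."""
    Nx, Ny = len(X), len(Y)
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # squared-distance cost
    K = np.exp(-C / reg)
    a, b = np.full(Nx, 1.0 / Nx), np.full(Ny, 1.0 / Ny)
    u, v = np.ones(Nx), np.ones(Ny)
    for _ in range(n_iter):          # alternate marginal projections
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def ot_align(X, Y, **kw):
    """Barycentric projection through the coupling: each image embedding
    becomes a gamma-weighted average of text embeddings, and vice versa."""
    g = sinkhorn_coupling(X, Y, **kw)
    X_new = (g / g.sum(axis=1, keepdims=True)) @ Y
    Y_new = (g / g.sum(axis=0, keepdims=True)).T @ X
    return X_new, Y_new
```

Production use would typically rely on a dedicated OT library with log-domain stabilization rather than this dense sketch.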

Mean-Shift Bias Removal

GR-CLIP (Li et al., 25 Jul 2025) observes that the primary separation between modality clusters in CLIP is a constant mean shift. Subtracting the mean vector from each modality yields immediate collapse of the gap, with negligible computational cost and superior mixed-modality retrieval.

5. Domain and Application-Specific Gap Mitigation

Numerous adaptations address the modality gap in specialized cross-modal scenarios.

  • Cross-modality transfer (e.g., audio–vision, synthetic-to-real): Meta-learning approaches such as MoNA (Ma et al., 27 Jun 2024) train transformation modules for the target modality that explicitly preserve knowledge from the source by minimizing conditional $P(Y|X)$ misalignment.
  • Unsupervised medical image segmentation: Global–local union alignment (Su et al., 2022) first harmonizes global content via style-transfer, and then applies local, mask-weighted feature alignment to focus adaptation on spatially hard-to-transfer features.
  • Person re-identification: Multi-level cross-modality joint alignment (MCJA) (Liang et al., 2023) employs modality alignment augmentations at the image level (randomized grayscale, CutMix, and spectrum jitter) and a rank-list-level retrieval loss to coalesce feature distributions from visible and infrared inputs.
  • Multimodal recommendation and continual learning: Weight sharing, online modality sampling, and contrastive regularization in single-branch architectures (Ganhör et al., 23 Sep 2025) lead to tightly coupled latent representations. In continual learning, controlled “gap preservation” and task-specific compensation yield superior knowledge retention and adaptability (Huang et al., 12 Jul 2025).
  • Audio–text bridging for LLM conditioning: Generative diffusion modules (Nam et al., 13 Oct 2025) map audio embeddings into the text manifold via a denoising process, regressing to text space (including topology preservation), which eliminates the audio–text gap and enables state-of-the-art zero-shot audio captioning.

6. Practical Implications, Benchmarks, and Impact on Downstream Performance

Experiments consistently demonstrate that reducing the modality gap yields improvements in retrieval, classification, safety alignment, fairness, and out-of-distribution robustness.

| Method/Class | Gap Reduction (%) | Retrieval or Safety Gain | Example Models |
|---|---|---|---|
| Post-hoc mean shift | ≈99 | NDCG@10 +0.26 (MixBench) | GR-CLIP (Li et al., 25 Jul 2025) |
| Per-dim std+mean | ≈99 | CIDEr +24 (MSCOCO) | TOMCap (Fonseca et al., 3 Dec 2025) |
| BN/centering (I0T_post) | ≈99 | T2I +9.2 pts | I0T (An et al., 18 Dec 2024) |
| Spectral alignment | ≈100 | Recall@20: 0→68% | Fill-the-Gap (Role et al., 6 May 2025) |
| OT alignment | ≈85 | FID: 287→35 | Fill-the-Gap (Role et al., 6 May 2025) |
| Gap reg. (ReGap) | n/a | Unsafe rate −16.3 pp | (Yang et al., 30 May 2025) |
| MCJA (MAA + CMR) | strong | r1/mAP +8.8/+8.3 pts | (Liang et al., 2023) |
| Diffusion-link | strong | CIDEr +52.5% (zero-shot) | (Nam et al., 13 Oct 2025) |

Proper normalization, tuning of temperature, incorporation of intra-modal augmentations, and post-hoc re-embedding all systematically shrink the modality gap, with minimal risk of semantic collapse or performance regression when configured correctly. Overly aggressive fine-tuning that overcollapses the gap, however, can harm generalization and zero-shot transfer (Huang et al., 12 Jul 2025).

7. Theoretical Insights and Ongoing Research Directions

Theory has elucidated the gradient-flow dynamics responsible for gap persistence or slow closure, implicating both initial mismatches and self-reinforcing optimization at low temperature (Liang et al., 2022, Yaras et al., 10 Dec 2024). Key insights include:

  • The cone effect: independent initializations yield tight modality-specific embedding cones;
  • Contrastive loss at low τ\tau induces repulsion between modalities, maintained through optimization;
  • Temperature scheduling and swapping serve as robust “decouplers” of this effect, promoting shared alignment.

Open questions remain regarding the extension of current methods to more heterogeneous modalities (e.g., audio–vision–text–video), theoretical bounds for embedding distortion under normalization, and the emergence of higher-order manifold gaps (e.g., curvature misalignment). There is also active investigation into the implications of gap reduction for safety-critical VLM deployments, continual learning, and cross-domain transfer.


In sum, reducing the modality gap is essential for realizing the full semantic and operational potential of multimodal machine learning systems. Both preprocessing and training-time interventions—ranging from normalization to deep architectural innovations—enable alignment of representations, thereby improving retrieval, safety, transfer, and generalization performance while preserving the essential discriminative structure of multimodal data (Fonseca et al., 3 Dec 2025, An et al., 18 Dec 2024, Role et al., 6 May 2025, Liang et al., 2022, Liang et al., 2023, Li et al., 25 Jul 2025, Yang et al., 30 May 2025, Yang et al., 28 Dec 2024, Yaras et al., 10 Dec 2024, Lu et al., 15 Oct 2024, Ganhör et al., 23 Sep 2025, Nam et al., 13 Oct 2025, Huang et al., 12 Jul 2025, Ma et al., 27 Jun 2024, Su et al., 2022, Shen et al., 12 Dec 2024, Li et al., 2021, Mai et al., 2019).
