
Modality Gap Reduction in Multimodal Systems

Updated 7 December 2025
  • Modality gap reduction is the process of minimizing representational differences between modality-specific embeddings in a shared latent space.
  • Techniques such as per-dimension normalization, batch normalization adjustments, temperature scheduling, and optimal transport alignment systematically close the modality gap.
  • These strategies enhance cross-modal retrieval, transferability, and robustness, leading to improved metrics like CIDEr, NDCG, and FID in multimodal systems.

Modality gap reduction refers to the suite of methodologies designed to minimize or eliminate the systematic representational discrepancies between different modalities—such as image, text, audio, and others—in a shared latent or embedding space. In contrastive and multimodal learning architectures, the modality gap manifests as a persistent separation between the distributions or centroids of embeddings produced by modality-specific encoders, even after contrastive alignment training. This separation impedes cross-modal retrieval, transfer, generalization, and robustness, and is increasingly recognized as a critical factor in the efficacy and safety of multimodal systems.

1. Formalization and Quantification of the Modality Gap

The modality gap is most commonly defined as the (typically Euclidean) distance between the mean (centroid) vectors of the embeddings for each modality in a shared space. For CLIP-style architectures, with $v_n$ and $t_n$ denoting the visual and textual embeddings for item $n$ in $\mathbb{R}^d$, standard metrics include:

  • Centroid gap: $\Delta = \|\overline{v} - \overline{t}\|_2$, with $\overline{v} = \frac{1}{N}\sum_n v_n$ and $\overline{t}$ defined likewise for text
  • Cosine gap: $\Delta_{\mathrm{cos}} = \mathbb{E}_n[1 - \cos(v_n, t_n)]$
  • Variance gap: Differences in per-dimension standard deviations across modalities (Fonseca et al., 3 Dec 2025)
  • Downstream metrics: Modality-induced bias in retrieval, such as the image–text retrieval bias ratio, mean-rank disparities, or the Fréchet Inception Distance (FID) between modality-wise Gaussians (Role et al., 6 May 2025)
  • Task-dependent proxies: Unsafe output rates in VLMs have been shown to correlate strongly ($r \approx 0.7$–$0.9$) with the modality gap as measured by the Modality Integration Rate (MIR) or $\ell_2$ distances at the input to the VLM transformer (Yang et al., 30 May 2025).
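
The first two metrics are straightforward to compute; the following numpy sketch (variable names are ours, not drawn from any cited paper) illustrates both:

```python
import numpy as np

def modality_gap_metrics(V, T):
    """Centroid gap and cosine gap for paired embeddings.

    V, T: (N, d) arrays of visual and textual embeddings, row n of each
    corresponding to the same item n.
    """
    # Centroid gap: Euclidean distance between the two modality means.
    centroid_gap = float(np.linalg.norm(V.mean(axis=0) - T.mean(axis=0)))
    # Cosine gap: mean of 1 - cos(v_n, t_n) over matched pairs.
    cos = np.sum(V * T, axis=1) / (
        np.linalg.norm(V, axis=1) * np.linalg.norm(T, axis=1))
    cosine_gap = float(np.mean(1.0 - cos))
    return centroid_gap, cosine_gap
```

Both metrics vanish when the two embedding sets coincide and grow as the modality clouds drift apart.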

The persistence of this gap is both a geometric consequence of initialization (differently aligned random output cones) and an optimization artifact, especially under contrastive learning with low temperature (Liang et al., 2022, Yaras et al., 10 Dec 2024).

2. Architectural and Statistical Correction Methods

Several approaches perform gap reduction at the embedding or architectural level, focusing on closed-form or learnable normalization or bias-shifting techniques.

Per-dimension Mean and Variance Correction

TOMCap (Fonseca et al., 3 Dec 2025) computes per-dimension means ($\mu^T$, $\mu^I$) and standard deviations ($\sigma^T$, $\sigma^I$) for each modality, then applies the affine transformation

$t'_n[d] = (t_{n,d} - \mu^T_d)\,\frac{\sigma^I_d}{\sigma^T_d} + \mu^I_d$

to align text embeddings to the image embedding distribution, and symmetrically for the reverse. This mapping is applied as a preprocessing step, followed by injection of small Gaussian noise to ensure robustness to any residual gap. Empirical ablations demonstrate that mean+std correction yields substantial improvements in downstream CIDEr compared to mean-only or no correction.
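
The affine correction above amounts to per-dimension moment matching. A minimal numpy sketch (ours, not the authors' code; the `eps` guard is an implementation assumption):

```python
import numpy as np

def match_moments(T_emb, I_emb, eps=1e-8):
    """Map text embeddings onto the image distribution per dimension:
    t' = (t - mu_T) * sigma_I / sigma_T + mu_I.

    T_emb, I_emb: (N, d) text and image embedding matrices.
    """
    mu_T, mu_I = T_emb.mean(axis=0), I_emb.mean(axis=0)
    sd_T = T_emb.std(axis=0) + eps  # eps avoids division by zero
    sd_I = I_emb.std(axis=0) + eps
    return (T_emb - mu_T) * (sd_I / sd_T) + mu_I
```

After the transformation, the per-dimension means and standard deviations of the corrected text embeddings match those of the image embeddings.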

Post-hoc Centering and BatchNorm Schemes

I0T (An et al., 18 Dec 2024) removes the per-dimension modality mean and rescales each embedding to unit norm:

$x'_i = \mathrm{Normalize}(x_i - \overline{x}), \quad y'_i = \mathrm{Normalize}(y_i - \overline{y})$

where $x_i$, $y_i$ are L2-normalized image and text embeddings and $\overline{x}$, $\overline{y}$ are modality-wise means. Alternatively, plugging two BatchNorm layers at the end of each encoder and updating only the BN parameters (not the encoder weights) achieves a moderate but robust reduction in the modality gap while preserving semantic geometry.
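
The centering-plus-renormalization variant fits in a few lines of numpy (a sketch of the idea, not the released implementation):

```python
import numpy as np

def center_and_renormalize(X, Y):
    """Post-hoc correction: subtract each modality's mean embedding,
    then rescale every vector back to unit L2 norm."""
    def step(Z):
        Zc = Z - Z.mean(axis=0)          # remove the modality-wise mean
        return Zc / np.linalg.norm(Zc, axis=1, keepdims=True)
    return step(X), step(Y)
```

Because the dominant component of the gap is a shared offset per modality, removing it typically collapses most of the centroid distance while leaving pairwise angles (and hence retrieval rankings) largely intact.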

Modality-Specific Batch Normalization

In visible–infrared person re-identification, Modality Batch Normalization (MBN) (Li et al., 2021) computes normalization statistics on modality-specific sub-batches instead of the whole batch, abolishing the intra- and inter-batch distribution gaps otherwise induced by global BN. This enables the network to align activation distributions across modalities with negligible computational burden.
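
The core of MBN is simply computing normalization statistics per modality sub-batch. A minimal sketch (learnable affine parameters and running statistics omitted for brevity; this is an illustration, not the paper's code):

```python
import numpy as np

def modality_batch_norm(feats, modality_ids, eps=1e-5):
    """Normalize each modality's sub-batch with its own statistics
    rather than global batch statistics.

    feats: (N, d) activations; modality_ids: (N,) integer modality labels.
    """
    out = np.empty_like(feats, dtype=float)
    for m in np.unique(modality_ids):
        idx = modality_ids == m
        sub = feats[idx]
        out[idx] = (sub - sub.mean(axis=0)) / np.sqrt(sub.var(axis=0) + eps)
    return out
```

Each modality's activations end up zero-mean and unit-variance, so a distributional offset between modalities cannot survive the normalization step.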

3. Optimization and Training-Time Strategies

Certain methods manipulate the training dynamics or explicit regularization to drive modalities closer.

Temperature Scheduling and Loss Design

Low temperature in InfoNCE-type losses creates repulsive minima between modality clusters (Liang et al., 2022, Yaras et al., 10 Dec 2024). Scheduling or fixing a moderately high temperature ($\tau = 0.02$–$0.05$) or slowing the learning rate of the temperature parameter results in a much smaller modality gap and, empirically, improved retrieval metrics.

Temperature control can be further combined with “modality swapping,” where matched pairs are randomly swapped between modalities (hard or soft), so as to force the model to treat modalities as interchangeable and push their embeddings to overlap (Yaras et al., 10 Dec 2024).
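
To make the temperature's role concrete, here is a minimal numpy sketch of the symmetric InfoNCE objective (names are ours; a real implementation would use an autodiff framework). Smaller `tau` sharpens the softmax, which is the mechanism behind the repulsion discussed above; modality swapping amounts to exchanging matched $v_n$ and $t_n$ before computing this loss:

```python
import numpy as np

def symmetric_info_nce(V, T, tau=0.03):
    """Symmetric InfoNCE over matched (v_n, t_n) pairs.

    V, T: (N, d) embedding matrices; tau: softmax temperature.
    """
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    logits = V @ T.T / tau                     # (N, N) cosine similarities / tau

    def xent(L):                               # cross-entropy, targets on the diagonal
        L = L - L.max(axis=1, keepdims=True)   # stabilize the softmax
        logp = L - np.log(np.exp(L).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

The loss is low when each $v_n$ is most similar to its own $t_n$ and high when the pairing is scrambled.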

Cross-Modal Mapping and Alignment Losses

Global linear mapping with residual connections, as in Cross-Modal Mapping (CMM) (Yang et al., 28 Dec 2024), learns a transformation $W$ that aligns the distribution of image features to that of text features:

$v' = \hat{v} W + \hat{v}$

with a triplet loss to locally cluster image features around their textual prototypes. Unlike adapter- or cache-based methods, this approach is lightweight and achieves state-of-the-art few-shot and OOD performance while directly shrinking matching-pair and non-matching-pair distances.
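
The forward mapping and a generic hinge triplet loss can be sketched as follows (an illustrative rendering under our own naming; the margin value and loss form are assumptions, not taken from the paper):

```python
import numpy as np

def cmm_forward(v_hat, W):
    """Residual linear mapping: v' = v_hat @ W + v_hat.
    With W = 0 this reduces to the identity, so training starts from
    the original features."""
    return v_hat @ W + v_hat

def triplet_loss(v, t_pos, t_neg, margin=0.2):
    """Hinge triplet loss pulling each image feature toward its textual
    prototype t_pos and away from a negative prototype t_neg."""
    d_pos = np.linalg.norm(v - t_pos, axis=1)
    d_neg = np.linalg.norm(v - t_neg, axis=1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))
```

The residual form is what keeps the method lightweight: only the single matrix $W$ is learned, and the identity path preserves the pretrained feature geometry.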

Regularization terms (e.g., ReGap in VLM pretraining (Yang et al., 30 May 2025)) penalize the $\ell_2$ distance between projected image and text token embeddings at the first layer, with a coefficient $\alpha$ tuned to the initialization scale, yielding significant improvements in safety alignment without utility loss.

4. Post-Hoc Embedding-Space Alignment Methods

Recent advances have highlighted post-hoc spectral and optimal transport techniques for strong, model-agnostic reduction of the modality gap.

Spectral Alignment

Given image and text embedding matrices $X$, $Y$, the spectral method forms a bipartite similarity graph and re-embeds all points as the lowest-$k$ eigenvectors of the random-walk Laplacian. This process tightly aligns image and text vectors on a shared manifold, collapsing heterogeneity indices and dramatically improving retrieval recall (Role et al., 6 May 2025).
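
The pipeline can be sketched in dense numpy as follows. This is a schematic rendering under our own assumptions: a Gaussian kernel for the cross-modal affinities and illustrative values of `sigma` and `k`, neither taken from the paper (which may, e.g., sparsify the graph):

```python
import numpy as np

def spectral_align(X, Y, k=2, sigma=1.0):
    """Re-embed image (X) and text (Y) points via the lowest nontrivial
    eigenvectors of the random-walk Laplacian of a bipartite graph."""
    Nx, Ny = len(X), len(Y)
    # Cross-modal affinities only: the graph is bipartite.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    S = np.exp(-d2 / (2 * sigma ** 2))
    W = np.block([[np.zeros((Nx, Nx)), S],
                  [S.T, np.zeros((Ny, Ny))]])
    # Random-walk Laplacian L_rw = I - D^{-1} W.
    L_rw = np.eye(Nx + Ny) - W / W.sum(axis=1, keepdims=True)
    vals, vecs = np.linalg.eig(L_rw)
    order = np.argsort(vals.real)
    emb = vecs.real[:, order[1:k + 1]]  # drop the constant eigenvector
    return emb[:Nx], emb[Nx:]
```

Both modalities land in the same $k$-dimensional spectral coordinates, so there is no residual modality offset by construction.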

Optimal Transport Alignment

Laplacian-regularized optimal transport computes an assignment matrix $\gamma^*$ that couples the image and text distributions, regularized by intra-modality similarity. New embeddings $\widetilde{X} = \gamma^* X$, $\widetilde{Y} = (\gamma^*)^T Y$ are then used for downstream tasks. This yields a large decrease in FID and mean squared gap, with competitive (though not always maximal) downstream improvements.
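
A plain entropic-OT sketch of the coupling step (the Laplacian intra-modality regularizer is omitted for brevity; `reg` and `n_iter` are illustrative, inputs are assumed L2-normalized, and the barycentric projection used here is a standard choice that may differ in detail from the paper's formulation):

```python
import numpy as np

def sinkhorn_coupling(X, Y, reg=1.0, n_iter=1000):
    """Entropy-regularized OT between two embedding clouds via plain
    Sinkhorn iterations. Returns a coupling with uniform marginals."""
    Nx, Ny = len(X), len(Y)
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # squared-distance cost
    K = np.exp(-C / reg)
    a, b = np.full(Nx, 1.0 / Nx), np.full(Ny, 1.0 / Ny)
    u, v = np.ones(Nx), np.ones(Ny)
    for _ in range(n_iter):          # alternate marginal projections
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def ot_align(X, Y, **kw):
    """Barycentric projection through the coupling: each image embedding
    becomes a gamma-weighted average of text embeddings, and vice versa."""
    g = sinkhorn_coupling(X, Y, **kw)
    X_new = (g / g.sum(axis=1, keepdims=True)) @ Y
    Y_new = (g / g.sum(axis=0, keepdims=True)).T @ X
    return X_new, Y_new
```

Production use would typically rely on a dedicated OT library with log-domain stabilization rather than this dense sketch.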

Mean-Shift Bias Removal

GR-CLIP (Li et al., 25 Jul 2025) observes that the primary separation between modality clusters in CLIP is a constant mean shift. Subtracting the mean vector from each modality yields immediate collapse of the gap, with negligible computational cost and superior mixed-modality retrieval.

5. Domain and Application-Specific Gap Mitigation

Numerous adaptations address the modality gap in specialized cross-modal scenarios.

  • Cross-modality transfer (e.g., audio–vision, synthetic-to-real): Meta-learning approaches such as MoNA (Ma et al., 27 Jun 2024) train transformation modules for the target modality that explicitly preserve knowledge from the source by minimizing conditional $P(Y|X)$ misalignment.
  • Unsupervised medical image segmentation: Global–local union alignment (Su et al., 2022) first harmonizes global content via style-transfer, and then applies local, mask-weighted feature alignment to focus adaptation on spatially hard-to-transfer features.
  • Person re-identification: Multi-level cross-modality joint alignment (MCJA) (Liang et al., 2023) employs modality alignment augmentations at the image level (randomized grayscale, CutMix, and spectrum jitter) and a rank-list-level retrieval loss to coalesce feature distributions from visible and infrared inputs.
  • Multimodal recommendation and continual learning: Weight sharing, online modality sampling, and contrastive regularization in single-branch architectures (Ganhör et al., 23 Sep 2025) lead to tightly coupled latent representations. In continual learning, controlled “gap preservation” and task-specific compensation yield superior knowledge retention and adaptability (Huang et al., 12 Jul 2025).
  • Audio–text bridging for LLM conditioning: Generative diffusion modules (Nam et al., 13 Oct 2025) map audio embeddings into the text manifold via a denoising process, regressing to text space (including topology preservation), which eliminates the audio–text gap and enables state-of-the-art zero-shot audio captioning.

6. Practical Implications, Benchmarks, and Impact on Downstream Performance

Experiments consistently demonstrate that reducing the modality gap yields improvements in retrieval, classification, safety alignment, fairness, and out-of-distribution robustness.

| Method/Class | Gap Reduction (%) | Retrieval or Safety Gain | Example Models |
|---|---|---|---|
| Post-hoc mean shift | ≈99 | NDCG@10 +0.26 (MixBench) | GR-CLIP (Li et al., 25 Jul 2025) |
| Per-dim std+mean | ≈99 | CIDEr +24 (MSCOCO) | TOMCap (Fonseca et al., 3 Dec 2025) |
| BN/centering (I0T_post) | ≈99 | T2I +9.2 pts | I0T (An et al., 18 Dec 2024) |
| Spectral alignment | ≈100 | Recall@20: 0→68% | Fill-the-Gap (Role et al., 6 May 2025) |
| OT alignment | ≈85 | FID: 287→35 | Fill-the-Gap (Role et al., 6 May 2025) |
| Gap reg. (ReGap) | n/a | Unsafe rate −16.3 pp | (Yang et al., 30 May 2025) |
| MCJA (MAA + CMR) | strong | r1/mAP +8.8/+8.3 pts | (Liang et al., 2023) |
| Diffusion-link | strong | CIDEr +52.5% (zero-shot) | (Nam et al., 13 Oct 2025) |

Proper normalization, tuning of temperature, incorporation of intra-modal augmentations, and post-hoc re-embedding all systematically shrink the modality gap, with minimal risk of semantic collapse or performance regression when configured correctly. Overly aggressive fine-tuning that overcollapses the gap, however, can harm generalization and zero-shot transfer (Huang et al., 12 Jul 2025).

7. Theoretical Insights and Ongoing Research Directions

Theory has elucidated the gradient-flow dynamics responsible for gap persistence or slow closure, implicating both initial mismatches and self-reinforcing optimization at low temperature (Liang et al., 2022, Yaras et al., 10 Dec 2024). Key insights include:

  • The cone effect: independent initializations yield tight modality-specific embedding cones;
  • Contrastive loss at low τ\tau induces repulsion between modalities, maintained through optimization;
  • Temperature scheduling and swapping serve as robust “decouplers” of this effect, promoting shared alignment.

Open questions remain regarding the extension of current methods to more heterogeneous modalities (e.g., audio–vision–text–video), theoretical bounds for embedding distortion under normalization, and the emergence of higher-order manifold gaps (e.g., curvature misalignment). There is also active investigation into the implications of gap reduction for safety-critical VLM deployments, continual learning, and cross-domain transfer.


In sum, reducing the modality gap is essential for realizing the full semantic and operational potential of multimodal machine learning systems. Both preprocessing and training-time interventions—ranging from normalization to deep architectural innovations—enable alignment of representations, thereby improving retrieval, safety, transfer, and generalization performance while preserving the essential discriminative structure of multimodal data (Fonseca et al., 3 Dec 2025, An et al., 18 Dec 2024, Role et al., 6 May 2025, Liang et al., 2022, Liang et al., 2023, Li et al., 25 Jul 2025, Yang et al., 30 May 2025, Yang et al., 28 Dec 2024, Yaras et al., 10 Dec 2024, Lu et al., 15 Oct 2024, Ganhör et al., 23 Sep 2025, Nam et al., 13 Oct 2025, Huang et al., 12 Jul 2025, Ma et al., 27 Jun 2024, Su et al., 2022, Shen et al., 12 Dec 2024, Li et al., 2021, Mai et al., 2019).
