Fixed-Frame Modality Gap Theory

Updated 4 July 2026

Fixed-frame Modality Gap Theory is a framework that quantifies modality mismatches in fixed encoder or latent spaces by measuring discrepancies like centroid offsets and spatiotemporal divergences.
It formalizes these gaps with mathematical metrics and calibration techniques across vision–language, audio–text, and speech domains to standardize modality comparisons.
Empirical studies indicate that controlling the modality gap can enhance retrieval accuracy, robustness, and dense prediction performance in multimodal systems.

Fixed-frame Modality Gap Theory denotes a family of formalizations in which modality mismatch is analyzed relative to a fixed reference frame: a frozen encoder output space, a shared latent coordinate system, or a task instance whose semantics, instructions, and evaluation remain unchanged while only the input modality varies. In this setting, the modality gap becomes a measurable discrepancy between modality-conditional distributions, centroids, spatiotemporal statistics, or internal trajectories. The concept appears in contrastive vision–language learning, audio–text alignment, multimodal LLMs, speech LLMs, speech translation, mixed-modality retrieval, and frame–event optical flow, where it is used to explain why semantically matched inputs can remain systematically offset even after joint training, and why closing or controlling that offset can alter robustness, grouping behavior, retrieval, reasoning, or dense prediction quality (Yu et al., 2 Feb 2026, Nam et al., 13 Oct 2025, Sun et al., 10 Mar 2026, Zhou et al., 10 Mar 2025).

1. Conceptual scope and fixed-frame conditions

The central premise is that modality mismatch is most intelligible when the comparison frame is held fixed. In frozen-encoder audio–text work, the output space of a multimodal encoder is treated as an immutable representational frame, and the gap is the discrepancy between the empirical audio and text embedding distributions in that space (Nam et al., 13 Oct 2025). In text-as-image studies, the fixed-frame condition means that semantic content, instructions, and expected outputs are held constant while only the content modality changes from text tokens to rendered pixels, so the gap is defined directly as the performance difference between text mode and image mode (Sun et al., 10 Mar 2026). In multimodal contrastive learning, the shared latent coordinate system learned by CLIP-like models is taken as a common reference frame in which modality-conditional centroids and pairwise similarities can be compared (Grassucci et al., 26 Jan 2026, Liang et al., 2022). In event-based optical flow, the frame is a common spatiotemporal gradient representation derived from fixed-frame images and accumulated event signals, allowing heterogeneous measurements to be related through physically meaningful equalizations (Zhou et al., 10 Mar 2025).

This fixed-frame viewpoint is not confined to static embeddings. In speech LLMs, the relevant frame can be a layer-by-layer hidden-state trajectory inside a frozen text-native decoder, so the gap is expressed as dynamic divergence between speech-conditioned and text-conditioned internal states rather than as a purely static geometric offset (Hsu et al., 2 Mar 2026, Wang et al., 9 Jan 2026). In continual learning with CLIP, the pre-trained modality gap itself functions as a fixed geometric reference whose preservation is treated as a proxy for preserving pre-trained knowledge (Huang et al., 12 Jul 2025). In medical vision–language models, the fixed frame is the frozen pair of encoders together with the empirical centroid displacement between modalities; the gap then becomes a tunable geometric property rather than a quantity assumed to be universally minimized (Restrepo et al., 18 Mar 2026).

A common implication is that “gap” is not a single object. Depending on the task, it may denote a distributional discrepancy, a centroid offset, an angular separation, a similarity-scale mismatch, a difference in spatiotemporal correlation structure, or a divergence in reasoning trajectories. The unifying element is not the metric itself, but the decision to evaluate modality mismatch relative to a frame whose coordinates, semantics, or dynamics are fixed for analysis.

2. Mathematical formalizations

Several recurring formalizations capture the fixed-frame perspective.

Setting	Fixed frame	Gap quantity
Frozen audio–text embeddings	Encoder output space	$D(P_A, P_T)$, cosine separation, $L_{topo}$ (Nam et al., 13 Oct 2025)
Text tokens vs rendered pixels	Same task instance	$g_M(D)=Acc_M(text(D))-Acc_M(image(D))$ (Sun et al., 10 Mar 2026)
CLIP-like shared latent space	Shared normalized embedding space	$\|c^m-c^n\|_2$, $CosTP_{m,n}$, RMG (Grassucci et al., 26 Jan 2026)
Contrastive robustness analysis	Orthogonal modality axis in shared space	$g=\mu_y-\mu_x$, $\Delta=\|g\|$ (Chowers et al., 30 Mar 2026)
Bias–residual decomposition	Fixed subspaces $U$ and $V$	$\Delta(t)=\beta(t)+\gamma(t)+\delta(t)+\zeta(t)$ (Yu et al., 2 Feb 2026)
Frame–event optical flow	Common spatiotemporal gradient map	Discrepancy between frame and event spatiotemporal gradient and correlation distributions (Zhou et al., 10 Mar 2025)

In frozen audio–text alignment, the discrepancy functional $D(P_A, P_T)$ measures the gap between the modality-conditional distributions of audio and text embeddings in the encoder output space, while cosine similarity for matched and unmatched pairs acts as a geometric probe. The topology loss $L_{topo}$ matches within-batch cosine structure between generated text-like embeddings and ground-truth text embeddings (Nam et al., 13 Oct 2025). In text-as-image evaluation, the fixed-frame definition is explicitly operational: for a model $M$ and dataset $D$, the modality gap

$g_M(D)=Acc_M(text(D))-Acc_M(image(D))$

is the difference between text-mode and image-mode accuracy under identical semantics and evaluation (Sun et al., 10 Mar 2026).
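The operational definition of $g_M(D)$ can be illustrated with a minimal sketch. The function names, the exact-match scoring rule, and the toy answers below are illustrative assumptions, not the evaluation protocol of the cited study.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("evaluation set is empty")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def modality_gap(text_preds: Sequence[str],
                 image_preds: Sequence[str],
                 references: Sequence[str]) -> float:
    """g_M(D) = Acc_M(text(D)) - Acc_M(image(D)).

    Positive values mean the model does better when the same content
    is supplied as text tokens than as rendered pixels.
    """
    return accuracy(text_preds, references) - accuracy(image_preds, references)


# Toy data (hypothetical): the same four questions answered in both modes.
refs = ["4", "7", "9", "12"]
text_mode = ["4", "7", "9", "12"]   # all four correct
image_mode = ["4", "1", "9", "12"]  # one reading error

gap = modality_gap(text_mode, image_mode, refs)
print(f"modality gap = {gap:.2f}")  # 1.00 - 0.75 = 0.25
```

Because the references and scoring rule are shared by both modes, any nonzero value is attributable to the change of input modality alone, which is the point of the fixed-frame condition.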

In CLIP-style latent spaces, centroid distance and true-pair cosine are standard summary statistics. One formulation uses the centroid distance

$\|c^m-c^n\|_2$

where $c^m$ and $c^n$ are the centroids (mean embeddings) of modalities $m$ and $n$, along with the true-pair cosine $CosTP_{m,n}$, to characterize how much of the shared space is actually shared (Grassucci et al., 26 Jan 2026). A related robustness analysis defines a global gap vector

$g=\mu_y-\mu_x, \quad \Delta=\|g\|$

where $\mu_x$ and $\mu_y$ denote the two modality means, and studies the case where this vector is approximately orthogonal to the content subspace, so that modalities differ by a translation along a fixed axis (Chowers et al., 30 Mar 2026). A more refined geometric account decomposes the instantaneous gap in a fixed reference frame as $\Delta(t)=\beta(t)+\gamma(t)+\delta(t)+\zeta(t)$, comprising a principal modality bias, a constant orthogonal bias, and two anisotropic residuals, separating first-order offsets from second-order structure (Yu et al., 2 Feb 2026).
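The first-order statistics used in these formulations are simple to compute once paired embeddings are available. The sketch below uses only the standard library; the helper names and toy embeddings are hypothetical, and the true-pair cosine is assumed here to be the mean cosine similarity over matched pairs, which may differ in detail from the cited definition of $CosTP_{m,n}$.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Coordinate-wise mean of a set of embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def l2_norm(v: Vector) -> float:
    return math.sqrt(sum(c * c for c in v))


def gap_vector(xs: Sequence[Vector], ys: Sequence[Vector]) -> Tuple[List[float], float]:
    """Return g = mu_y - mu_x and its length Delta = ||g||."""
    mu_x, mu_y = centroid(xs), centroid(ys)
    g = [b - a for a, b in zip(mu_x, mu_y)]
    return g, l2_norm(g)


def true_pair_cosine(xs: Sequence[Vector], ys: Sequence[Vector]) -> float:
    """Mean cosine similarity over matched pairs (x_i, y_i) (assumed definition)."""
    total = 0.0
    for x, y in zip(xs, ys):
        dot = sum(a * b for a, b in zip(x, y))
        total += dot / (l2_norm(x) * l2_norm(y))
    return total / len(xs)


# Toy paired embeddings (hypothetical): the second modality shares content
# directions with the first but is offset along the third axis.
image_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text_emb = [[0.6, 0.0, 0.8], [0.0, 0.6, 0.8]]

g, delta = gap_vector(image_emb, text_emb)
print("gap vector:", [round(c, 3) for c in g])
print("centroid distance:", round(delta, 4))
print("true-pair cosine:", round(true_pair_cosine(image_emb, text_emb), 4))
```

In this toy case most of the gap vector lies along the third axis, which neither image embedding uses, giving a small-scale picture of a modality offset that is largely separate from content directions.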

Outside contrastive embeddings, fixed-frame theory also appears in signal domains. In high-dynamic optical flow, frames and events are each mapped into a common spatiotemporal gradient space, enabling similarity distributions in the two modalities to be compared in a shared physically grounded representation (Zhou et al., 10 Mar 2025). In speech translation, the gap is measured on target-side decoder states and regularized at the distribution level by a symmetric KL between speech-conditioned and text-conditioned token distributions (Fang et al., 2023).
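A symmetric KL term between two token distributions can be sketched as follows. This is a generic illustration for a single decoding position; the 0.5 averaging factor, the epsilon smoothing, and the toy probabilities are assumptions and are not taken from the cited speech translation work.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary.

    eps guards against log(0) when q assigns zero mass to a token.
    """
    if len(p) != len(q):
        raise ValueError("distributions must share a vocabulary")
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def symmetric_kl(p: Sequence[float], q: Sequence[float]) -> float:
    """Average of KL(p || q) and KL(q || p), so neither modality is the reference."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


# Toy next-token distributions (hypothetical) at one target position.
speech_conditioned = [0.7, 0.2, 0.1]
text_conditioned = [0.5, 0.3, 0.2]

print("symmetric KL:", round(symmetric_kl(speech_conditioned, text_conditioned), 4))
print("identical inputs:", symmetric_kl(text_conditioned, text_conditioned))
```

The symmetric form penalizes disagreement in both directions, so neither the speech-conditioned nor the text-conditioned distribution is treated as the fixed target.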

3. Origins and mechanisms

One major line of explanation attributes the gap to geometry induced by contrastive learning and initialization. “Mind the Gap” showed that deep networks exhibit a cone effect at initialization, so independently initialized image and text encoders begin by mapping inputs into distinct narrow cones; low temperature in the contrastive objective then preserves a nonzero separation between modalities (Liang et al., 2022). “Explaining and Mitigating the Modality Gap in Contrastive Multimodal Learning” refined this by analyzing gradient flow under a learnable inverse temperature: random initialization and early mismatches enlarge the modality gap, while later growth of the inverse temperature causes the modality-separating direction to decay very slowly under the stated conditions (Yaras et al., 2024). A sharper critique came from “It’s Not a Modality Gap: Characterizing and Addressing the Contrastive Gap,” which showed that even when both towers process the same modality, share the same initialization-aligned cone, and use identical images as positives, training to zero contrastive loss still produced a large separation, leading to the claim that the gap is inherent to the two-encoder contrastive objective rather than strictly to modality itself (Fahim et al., 2024).

A second line of work emphasizes heterogeneous sensing, sampling, and tokenization. In high-dynamic optical flow, fixed-frame imaging introduces spatial blur from long exposure and temporal discontinuity from low frame rate, whereas event sensing provides asynchronous, boundary-focused, temporally continuous but spatially sparse measurements; the gap is therefore a measurable discrepancy in spatiotemporal gradient and correlation distributions rather than merely a difference in encoder outputs (Zhou et al., 10 Mar 2025). In speech LLMs, the mismatch is one of temporal granularity: speech tokenizers commonly operate at a higher token rate than the corresponding text on LibriSpeech, so speech sequences are much longer for matched semantics and have lower per-token semantic density; under fixed information rate this produces an inverted-U relation between frame rate and reasoning performance, with the best regime observed at an intermediate frame rate (Ye et al., 10 Jun 2026). “Anatomy of the Modality Gap” complements this by showing that speech representations form a broad cross-layer alignment band and fail to condense redundant acoustic content into stable late-layer decisions, so the bottleneck is not a mere static distribution shift (Hsu et al., 2 Mar 2026).

A third mechanism is input-side perceptual perturbation under otherwise fixed semantics. When text is rendered as pixels for multimodal LLMs, the gap is task- and data-dependent, but grounded-theory analysis showed that image mode selectively amplifies reading errors—especially calculation and formatting—while knowledge/recall and reasoning errors remain largely unchanged. Some models also exhibit chain-of-thought collapse under visual input, so the dominant perturbation is in reading and formatting rather than in knowledge or abstract reasoning capacity (Sun et al., 10 Mar 2026).

4. Empirical manifestations across domains

The theory is empirically supported across heterogeneous benchmarks. In multimodal LLMs reading text as images, seven models were evaluated on seven benchmarks in five input modes. On synthetic renderings, GSM8K exhibited the largest gaps: the accuracy of Qwen3-VL-8B dropped between text mode and image mode, and font choice alone could swing accuracy. By contrast, natural document images often narrowed or reversed the gap: on QASPER, GPT-5.2 scored higher in pure-image mode than in text mode (Sun et al., 10 Mar 2026).

In frozen audio–text embeddings, Diffusion-Link reported the matched cosine similarity for audio-to-text pairs alongside near-zero unmatched similarity, together with downstream AudioCaps CIDEr gains in both zero-shot and fully supervised captioning on the same multimodal LLM baseline (Nam et al., 13 Oct 2025). In COMET’s concept-space analysis of CLAP, the shared head comprised only a limited number of top axes, yet truncating to that head preserved or improved retrieval and captioning; on Clotho, text-to-audio mAP@10 rose after truncation, and zero-shot captioning with PLSHead was compared in SPIDEr against a fully supervised audio-conditioned baseline (Zhu et al., 28 May 2026).

Dense prediction results exhibit the same pattern. In high-dynamic optical flow, ComST-Flow outperformed unimodal and direct-fusion baselines on both synthetic and real data, including BFlow in EPE/F1-all on both Slow-DSEC and Fast-DSEC. Ablations on Fast-DSEC showed progressive improvement from the configuration with no losses to the full objective, indicating that common-gradient alignment and boundary-guided motion fusion were the critical mechanisms (Zhou et al., 10 Mar 2025).

Retrieval studies reveal especially clear ranking pathologies. In mixed-modality search, replacing text documents with screenshots while preserving semantics produced a U-shaped NDCG@10 curve: performance first fell and then rose again as the replacement proceeded, demonstrating intra-modal ranking bias and inter-modal fusion failure in CLIP’s fixed frame. GR-CLIP, a post-hoc mean-shift calibration, improved NDCG@10 over CLIP and reported a corresponding Recall@20 change for CLIP ViT-L/14 on MMQA ImageQ (Li et al., 25 Jul 2025).

5. Bridging and control strategies

The literature now contains several distinct classes of fixed-frame interventions. A first class uses statistical calibration or centering. I0T applies post-hoc embedding standardization to frozen CLIP embeddings; on ViT-B/32 this reduced both centroid distance and linear separability while improving Flickr30k I2T/T2I retrieval R@1 (An et al., 2024). Similarity standardization for mixed text–image retrieval instead calibrates similarity scores with modality-specific statistics estimated from top-1 pseudo-positives, yielding average Recall@20 gains on MMQA and WebQA for cross-modality cases without manual labels (Yamashita et al., 27 Nov 2025). In medical vision–language embeddings, a single hyperparameter controls a symmetric shift of the two modalities along the empirical centroid gap vector, allowing gap modulation with frozen encoders; moderate gap reduction improved AUC consistently, but full collapse was not always optimal (Restrepo et al., 18 Mar 2026).
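A symmetric shift along the empirical centroid gap vector can be sketched in a few lines. The parameter name `alpha`, its range, the equal split of the shift between the two modalities, and the toy embeddings are assumptions made for illustration; the cited work may parameterize the shift differently and may renormalize the embeddings afterwards.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Coordinate-wise mean of a set of embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def centroid_distance(xs: Sequence[Vector], ys: Sequence[Vector]) -> float:
    diff = [b - a for a, b in zip(centroid(xs), centroid(ys))]
    return math.sqrt(sum(c * c for c in diff))


def shift_along_gap(xs: Sequence[Vector], ys: Sequence[Vector],
                    alpha: float) -> Tuple[List[List[float]], List[List[float]]]:
    """Move each modality toward the other by alpha/2 of the gap vector.

    alpha = 0 leaves the embeddings unchanged; alpha = 1 makes the two
    centroids coincide. Encoders stay frozen: only outputs are translated.
    """
    mu_x, mu_y = centroid(xs), centroid(ys)
    half_step = [0.5 * alpha * (b - a) for a, b in zip(mu_x, mu_y)]
    xs_new = [[a + h for a, h in zip(x, half_step)] for x in xs]
    ys_new = [[b - h for b, h in zip(y, half_step)] for y in ys]
    return xs_new, ys_new


# Toy embeddings (hypothetical) for two modalities.
image_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text_emb = [[0.6, 0.0, 0.8], [0.0, 0.6, 0.8]]

for alpha in (0.0, 0.5, 1.0):
    xs_new, ys_new = shift_along_gap(image_emb, text_emb, alpha)
    print(alpha, round(centroid_distance(xs_new, ys_new), 4))
```

Under these assumptions the centroid distance scales as (1 - alpha) times its original value, so intermediate settings leave a partial gap rather than collapsing it, which matches the observation that full collapse is not always optimal.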

A second class learns or approximates transport between modality distributions. Diffusion-Link treats the encoder output space as fixed and learns a reverse diffusion trajectory that maps audio embeddings into the text-embedding distribution using a sample-prediction loss and a topology-preserving loss, implemented by a lightweight denoiser with three residual MLP blocks (Nam et al., 13 Oct 2025). ReAlign instead is training-free: it estimates modality-wise means and traces from massive unpaired data, then performs Anchor Alignment, Trace Alignment, and Centroid Alignment on the sphere. ReVision integrates this transport into multimodal LLM pretraining so that unpaired text can be statistically aligned to the image distribution before visual instruction tuning (Yu et al., 2 Feb 2026).

A third class modifies learning dynamics or task structure. “Closing the Modality Gap Aligns Group-Wise Semantics” adds Align True Pairs and Centroid Uniformity to standard contrastive learning, explicitly minimizing per-sample cross-modal offsets while spreading multimodal centroids on the hypersphere (Grassucci et al., 26 Jan 2026). The “contrastive gap” work adds alignment and both in-modal and cross-modal uniformity terms, producing CUA and CUAXU, which reduce centroid distance and linear separability while maintaining retrieval and improving average zero-shot classification (Fahim et al., 2024). Domain-specific variants follow the same fixed-frame logic. In speech translation, Cress combines scheduled sampling with bidirectional KL regularization between speech-conditioned and text-conditioned token distributions and adds token-level adaptive weighting for positions with large cross-modal discrepancy (Fang et al., 2023). In speech reasoning, TARS uses reinforcement learning with representation alignment and behavior alignment rewards so that speech-conditioned trajectories approach contemporaneous text-conditioned trajectories inside a frozen decoder (Wang et al., 9 Jan 2026). For text rendered as pixels, self-distillation trains the model on its own text-mode chain-of-thought traces paired with image inputs, improving GSM8K image-mode accuracy for Qwen3-VL-8B without catastrophic forgetting (Sun et al., 10 Mar 2026). In speech token design, frame-rate selection and intermediate-layer representation alignment reduce temporal-granularity mismatch rather than only post hoc geometry, yielding the best speech QA regime with middle-layer alignment under a frozen Qwen3 backbone (Ye et al., 10 Jun 2026).

6. Debates, limitations, and implications

A major debate concerns whether the modality gap is a defect to eliminate or a structural property to control. The most direct challenge to universal gap-closing is the “contrastive gap” argument: even same-modality twin encoders trained with a two-tower contrastive objective produced increases in centroid distance and linear separability, suggesting that at least part of the phenomenon is objective-induced rather than modality-induced (Fahim et al., 2024). A second challenge comes from robustness analysis, which proved under explicit assumptions that a global gap vector can be orthogonal to the shared content subspace and that reducing this gap can monotonically improve robustness without changing clean nearest-neighbor accuracy; empirically, the robustness gains came with negligible clean-accuracy cost (Chowers et al., 30 Mar 2026). At the same time, work on group-wise semantics argued that gap reduction matters strongly for clustering and multimodal fusion but only marginally or inconsistently for instance-wise retrieval: on MSCOCO the centroid gap dropped and V-Measure rose even as cross-modal R@1 decreased (Grassucci et al., 26 Jan 2026).

The literature therefore does not support a universal prescription of “zero gap.” Medical vision–language results show that intermediate, task-dependent separation can be optimal, especially when modality-specific cues are complementary (Restrepo et al., 18 Mar 2026). Diffusion-based transport warns that over-noising erodes informative content and that over-alignment may reduce modality-specific diversity (Nam et al., 13 Oct 2025). Dynamic speech analyses show that simple feature-level statistical calibration can be insufficient or harmful at the input layer, indicating that some gaps are rooted in temporal redundancy and late-layer decision instability rather than in centroid mismatch alone (Hsu et al., 2 Mar 2026). Event-based optical flow retains edge cases such as radial motion along the camera’s $z$-axis, which the paper suggests may require LiDAR assistance in future work (Zhou et al., 10 Mar 2025). Across speech, video, and text-as-image studies, a consistent implication is that many failures arise at token, frame, or trajectory granularity, so effective interventions may need to operate on temporal condensation, semantic topology, or reasoning dynamics rather than on first-order statistics alone (Ye et al., 10 Jun 2026, Sun et al., 10 Mar 2026).

Taken together, these works present Fixed-frame Modality Gap Theory not as a single doctrine but as a technically specific research program. It studies modality mismatch relative to an explicitly fixed frame, characterizes the gap with domain-appropriate geometry or dynamics, and treats closing, preserving, or tuning that gap as a controllable design choice whose value depends on the downstream objective. In some settings the gap is a nuisance that impairs retrieval, reasoning, or dense prediction; in others it encodes robustness, preserves transferable structure, or carries modality-specific information that should not be collapsed indiscriminately.