Cross-Modal Compensation (CMC) Overview

Updated 4 July 2026

Cross-Modal Compensation (CMC) is a multimodal strategy where one modality supplements or substitutes another to mitigate missing, noisy, or mismatched data.
CMC leverages explicit recovery, attention regularization, and representation reconciliation to enhance system robustness in applications like emotion recognition and action detection.
By employing fusion, residual adaptation, and mediator-based transfer, CMC techniques enable models to recover lost information and improve overall prediction accuracy.

Cross-Modal Compensation (CMC) denotes a family of multimodal learning strategies in which one modality supplements, calibrates, regularizes, or substitutes for another when information is missing, noisy, weak, ambiguous, or distributionally mismatched. In its strongest form, CMC detects corrupted inputs or recovers absent modalities; in weaker but widely used forms, it enforces cross-modal correspondence, corrects attention allocation, or transfers structure across representation spaces. The literature is technically heterogeneous: explicit compensation appears in robotics, action recognition, incomplete multimodal emotion recognition, and multimodal post-training quantization, while closely related work frames the same underlying idea as attention consistency, correspondence loss, bidirectional interaction, or modality adaptation (Lee et al., 2020, Song et al., 2020, He et al., 12 Dec 2025, Hu et al., 5 Mar 2026, Min et al., 2021, Makishima et al., 2021).

1. Conceptual scope and acronym ambiguity

The term is not used uniformly across arXiv. In smart-data analysis, multimodal AI is explicitly described as collecting heterogeneous data to “compensate for complementary information,” while crossmodal AI is described as utilizing one modality to predict another by discovering “common attention sharing” between them; in that formulation, compensation covers both fusion and substitution (Dao, 2022). Elsewhere, however, the acronym CMC names technically distinct objects: Cross Model Compatibility in visual search (Wang et al., 2020), Cross Modal Compression for semantic compression of visual data (Li et al., 2022), cross-modal CutMix in unpaired vision-language pre-training (Wang et al., 2022), and Cross-Modal Calibration in HOI detection (Yuan et al., 2022).

This suggests that CMC is best treated as an umbrella concept rather than a single architecture. Under that broader reading, the central invariant is not the acronym itself but the operation: one modality provides information that another modality lacks, whether by direct recovery, reliability-aware rejection, feature-space adaptation, attention refinement, or semantic prior injection. A narrower reading reserves the term for explicit handling of missing or corrupted modalities, as in the Crossmodal Compensation Model for sensor corruption and the Modality Compensation Network for learning RGB/flow representations that compensate for missing skeletons at test time (Lee et al., 2020, Song et al., 2020).

2. Recurrent technical regimes

A useful synthesis is to distinguish several recurring CMC regimes. This taxonomy is inferential, but it closely matches the mechanisms named in the underlying papers.

Regime	Compensation target	Representative papers
Explicit modality recovery or substitution	Missing or corrupted modality	CMCGAN (Hao et al., 2017), MCN (Song et al., 2020), CCM (Lee et al., 2020), ComP (He et al., 12 Dec 2025)
Attention or correspondence regularization	Attentional allocation or latent alignment	CMAC (Min et al., 2021), audio-visual CMC loss (Makishima et al., 2021), CroBIM (Dong et al., 2024), OCN (Yuan et al., 2022)
Representation-space reconciliation	Heterogeneous embeddings or quantized weights	Cross Model Compatibility (Wang et al., 2020), MASQuant (Hu et al., 5 Mar 2026)
Mediator-based transfer under missing links	Absent pairwise supervision	Continual Cross-Modal Generalization (Xia et al., 1 Apr 2025)

In the first regime, compensation is operationally explicit. CMCGAN formulates cross-modal visual-audio mutual generation with four subnetworks—audio-to-visual, visual-to-audio, audio-to-audio, and visual-to-visual—organized in a cycle architecture, with a joint corresponding adversarial loss and a Gaussian latent vector to handle modality asymmetry; it is explicitly motivated by the case where one modality is abandoned or missing (Hao et al., 2017). MCN uses skeletons as an auxiliary modality during training only, so RGB and optical flow learn source features that “compensate for the loss of skeletons at test time and even at training time” (Song et al., 2020). CCM detects corrupted sensor modalities, discards them, and compensates with the remaining sensors (Lee et al., 2020). ComP transfers concise semantic cues across audio, text, and video streams and then reweights their outputs to address incomplete multimodal emotion recognition (He et al., 12 Dec 2025).

In the second regime, the model does not necessarily reconstruct a missing modality. Instead, one modality sharpens the internal computations of another. CMAC aligns vision-only attention with audio-guided visual attention and audio-only attention with visual-guided audio attention (Min et al., 2021). The audio-visual speech separation paper uses a Cross-Modal Correspondence loss so that separated speech features match the target speaker’s visual stream and not other speakers’ streams (Makishima et al., 2021). CroBIM adds an “attention deficit compensation mechanism” to repair cross-scale visual inconsistencies under language guidance (Dong et al., 2024). OCN frames the same broad idea as calibration: semantic features compensate for weak visual verb prediction, while visual evidence corrects static semantic priors (Yuan et al., 2022).

The remaining regimes generalize compensation away from raw sensing. Cross Model Compatibility compensates for distribution shift between embedding models by adapting both query and gallery representations into a unified space (Wang et al., 2020). MASQuant compensates for cross-modal quantization mismatch by adding modality-specific low-rank corrections on top of one shared quantized weight (Hu et al., 5 Mar 2026). Continual Cross-Modal Generalization uses a mediator modality, a shared discrete codebook, and pseudo-modality replay to compensate for the absence of direct pairwise supervision across newly added modalities (Xia et al., 1 Apr 2025).

3. Objective functions and algorithmic structure

Despite their diversity, CMC methods repeatedly optimize one of three objects: reconstruction fidelity, representation consistency, or cross-modal correspondence. In MCN, the generic objective augments supervised classification with a modality adaptation term,

$\mathcal{L} = -\sum_{i=1}^n \sum_{c=1}^C l_{i,c}\log p(c|\mathbf{V}_i) + \lambda d,$

where $d$ can be a domain-level MMD term $d_D$, a category-level alignment term $d_C$, or a sample-level feature-matching term

$d_S = \frac{1}{n}\sum_{i=1}^n\|\hat{\mathbf{a}}_i-\hat{\mathbf{r}}_i\|^2.$

This makes compensation a feature-space adaptation problem: source representations are pushed toward auxiliary skeleton representations while retaining source-specific information through a residual path (Song et al., 2020).
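As a rough illustration of the objective above, the following NumPy sketch evaluates the cross-entropy term plus $\lambda d_S$ on a toy batch. The function name `mcn_style_loss`, the array shapes, and the default value of `lam` are assumptions made for illustration and are not taken from the MCN paper.

```python
import numpy as np

def mcn_style_loss(probs, labels_onehot, a_hat, r_hat, lam=0.1):
    """Cross-entropy plus lam * d_S (sample-level feature matching).

    probs:         (n, C) predicted class probabilities p(c | V_i)
    labels_onehot: (n, C) one-hot labels l_{i,c}
    a_hat, r_hat:  (n, D) the two feature sets matched by d_S
    lam:           weight on the adaptation term (hypothetical default)
    """
    eps = 1e-12  # guards against log(0)
    ce = -np.sum(labels_onehot * np.log(probs + eps))
    # d_S: mean over samples of the squared L2 distance between features
    d_s = np.mean(np.sum((a_hat - r_hat) ** 2, axis=1))
    return ce + lam * d_s, ce, d_s
```

Setting `lam` to zero recovers plain supervised classification, so the adaptation term can be read as an additive regularizer on the source features.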

In CCM, compensation is trained directly in latent space. The model learns a multimodal latent state $z_{mult}=f(o_1,\dots,o_n)$ with a variational objective and then forces the latent produced with one dropped modality, $z'_{mult}$, to remain close to the full latent through

$ELBO(o_i, y_i, a_i) + \|z_{mult} - z'_{mult}\|_2^2.$

The same reconstruction machinery later becomes the corruption detector: unimodal reconstruction error is used as the anomaly score for deciding which modality to reject (Lee et al., 2020).
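The latent-consistency term can be sketched in a few lines. The fusion function below simply averages per-modality latent vectors, which is an assumption chosen for brevity. The source only states that $z_{mult}$ is some function $f$ of the observations, so `fuse` and `latent_consistency` are hypothetical names.

```python
import numpy as np

def fuse(latents, keep=None):
    """Toy fusion f: average the per-modality latent vectors that are kept."""
    keep = range(len(latents)) if keep is None else keep
    return np.mean([latents[i] for i in keep], axis=0)

def latent_consistency(latents, dropped):
    """Squared L2 distance between the full latent and the latent
    computed with modality index `dropped` removed."""
    z_full = fuse(latents)
    z_drop = fuse(latents, keep=[i for i in range(len(latents)) if i != dropped])
    return float(np.sum((z_full - z_drop) ** 2))
```

With this toy fusion, dropping a modality that agrees with the others yields a smaller penalty than dropping one that carries distinct information.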

In CMAC, compensation is formulated as bidirectional local correspondence. A modality-specific filter is extracted by global average pooling,

$\boldsymbol{\kappa}_{n}^{v} = pool(g_{v}(\boldsymbol{v}_{n})), \quad \boldsymbol{\kappa}_{n}^{a} = pool(g_{a}(\boldsymbol{a}_{n})),$

then used to generate cross-modal target attentions

$\boldsymbol{s}_{n}^{v} = norm(\boldsymbol{\kappa}_{n}^{a}*g_{v}(\boldsymbol{v}_{n})),\quad \boldsymbol{s}_{n}^{a} = norm(\boldsymbol{\kappa}_{n}^{v}*g_{a}(\boldsymbol{a}_{n})).$

Those targets supervise within-modal attention predictors through an attention-consistency loss, combined with a remoulded contrastive objective that adds within-modal negatives (Min et al., 2021).
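The construction of a cross-modal target attention can be sketched as follows. Two choices here are assumptions rather than details confirmed by the source: the pooled filter is applied as a 1x1 kernel (a channel-wise inner product at every location), and `norm` is taken to be a softmax over locations. The function names are hypothetical.

```python
import numpy as np

def pooled_filter(feat):
    """Global average pooling over all non-channel axes: (C, ...) -> (C,)."""
    return feat.reshape(feat.shape[0], -1).mean(axis=1)

def target_attention(kappa_other, feat):
    """Correlate the other modality's pooled filter with this modality's
    feature map, then normalize the scores into an attention map.

    kappa_other: (C,) pooled filter from the other modality
    feat:        (C, ...) feature map of this modality
    """
    # Inner product over the channel axis gives one score per location.
    scores = np.tensordot(kappa_other, feat, axes=([0], [0]))
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp / exp.sum()
```

For example, `target_attention(pooled_filter(g_a), g_v)` plays the role of $\boldsymbol{s}_{n}^{v}$, and swapping the arguments' roles gives $\boldsymbol{s}_{n}^{a}$.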

In MASQuant, compensation appears in weight space rather than feature extraction. For a non-text modality, the model stores only the text-smoothed quantized weight and compensates the resulting residual with a low-rank term obtained after SVD whitening. Inference for that modality then applies the shared quantized weight together with its modality-specific low-rank correction.

Here compensation resolves the conflict between modality-specific smoothing and the desire to keep one shared quantized backbone (Hu et al., 5 Mar 2026).
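The idea of keeping one shared quantized weight and adding a low-rank, modality-specific correction can be illustrated with a toy example. This sketch makes several simplifying assumptions that are not taken from the MASQuant paper: the quantizer is a plain symmetric per-tensor one, the residual is defined as the modality weight minus the shared quantized weight, and the low-rank factors come from an ordinary truncated SVD with no whitening step.

```python
import numpy as np

def quantize(w, bits=4):
    """Symmetric uniform per-tensor quantization (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

def low_rank_correction(residual, rank):
    """Best rank-`rank` approximation of the residual via truncated SVD."""
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # (out, rank), singular values folded in
    b = vt[:rank]               # (rank, in)
    return a, b

rng = np.random.default_rng(0)
w_text = rng.normal(size=(16, 16))
# Hypothetical modality-specific weight: the text weight plus a rank-2 shift.
w_mod = w_text + rng.normal(size=(16, 2)) @ rng.normal(size=(2, 16))

w_shared = quantize(w_text)  # the only full matrix that is stored
a, b = low_rank_correction(w_mod - w_shared, rank=2)

x = rng.normal(size=16)
y_comp = w_shared @ x + a @ (b @ x)  # shared path plus low-rank correction
```

Storing `a` and `b` costs `2 * 16 * rank` values per modality in this toy setting, compared with `16 * 16` for a second full weight matrix, which is the storage argument for a shared backbone.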

4. Explicit handling of missing or corrupted modalities

The clearest CMC formulations arise when a modality is absent, unreliable, or only available during training. CMCGAN is explicitly motivated by videos in which “only one modality exists while the other is abandoned or missing.” It proposes a cross-modal cycle GAN with four subnetworks and reports that the generated modality achieves “comparable effects with those of original modality,” with a downstream dynamic multimodal classification network for the modality-missing problem (Hao et al., 2017).

MCN addresses a different missing-modality setting: skeletons are auxiliary at training time but unavailable at test time. Rather than reconstructing skeletons, MCN adapts RGB and optical-flow features toward the auxiliary skeleton space. Sample-level adaptation improves both the RGB and the flow streams on the NTU RGB+D cross-subject split (NTU-CS) and on MSR 3D Daily Activity, illustrating that compensation can remain useful even when the compensated-for modality never appears at inference (Song et al., 2020).

CCM makes the missingness decision online. It follows a strict “Detect, Reject, Correct” pipeline: detect corruption with reconstruction error, reject the corrupted modality, and correct by recomputing the latent from remaining sensors. On peg insertion with corrupted inputs, CCM reports success rates for corrupted image, depth, and force compensation that markedly outperform baselines that either do not align full and dropped-modality latents or do not use reconstruction for detection (Lee et al., 2020).
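The three pipeline steps can be sketched as a single function. The fixed `threshold`, the dictionary-based interface, and the averaging fusion are all assumptions made for illustration. The source states only that reconstruction error serves as the anomaly score and that the latent is recomputed from the remaining sensors.

```python
import numpy as np

def detect_reject_correct(latents, recon_errors, threshold):
    """Drop modalities whose reconstruction error exceeds `threshold`,
    then recompute the fused latent from the remaining ones.

    latents:      dict name -> per-modality latent vector
    recon_errors: dict name -> unimodal reconstruction error (anomaly score)
    threshold:    hypothetical cut-off; the source does not specify one
    """
    # Detect + reject: keep only modalities with an acceptable score.
    kept = [m for m in latents if recon_errors[m] <= threshold]
    if not kept:
        raise ValueError("all modalities rejected; nothing to fuse")
    # Correct: toy fusion by averaging the surviving latents.
    fused = np.mean([latents[m] for m in kept], axis=0)
    return fused, kept
```

In this sketch a corrupted force reading with a high reconstruction error would be excluded, and the fused latent would be computed from image and depth alone.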

ComP treats incomplete multimodal emotion recognition as both a missingness and modality-imbalance problem. Its progressive prompt generation and cross-modal knowledge propagation modules perform representation-level compensation, while the coordinator performs decision-level compensation by reweighting modality outputs. The paper reports consistent gains across CMU-MOSI, CMU-MOSEI, IEMOCAPFour, and IEMOCAPSix over a range of missing rates, and its ablation identifies knowledge propagation as the most foundational compensation component (He et al., 12 Dec 2025).
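Decision-level reweighting under missing modalities can be illustrated generically. The sketch below is not ComP's coordinator: it assumes each modality has a scalar reliability score, masks out missing modalities, and takes a softmax-weighted average of the per-modality logits. All names are hypothetical.

```python
import numpy as np

def reweight_outputs(logits, scores, present):
    """Weighted average of per-modality logits over available modalities.

    logits:  (M, C) per-modality class logits
    scores:  (M,) unnormalized reliability scores (hypothetical)
    present: (M,) boolean mask, False where the modality is missing
             (at least one entry must be True)
    """
    # Missing modalities get a score of -inf, hence a weight of zero.
    masked = np.where(present, scores, -np.inf)
    w = np.exp(masked - masked[present].max())
    w = w / w.sum()
    return w @ logits, w
```

Because the weights are renormalized over the available modalities, the fused prediction stays well defined for any missing pattern that leaves at least one modality present.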

A broader generalization appears in Continual Cross-Modal Generalization. There the missing object is not a modality input but direct pairwise supervision: staged bimodal datasets are used to support unseen transfers between modalities that were never directly paired. The mediator modality, dynamic codebook, pseudo-modality replay, and EWC-regularized adapters collectively compensate for absent all-to-all pairing (Xia et al., 1 Apr 2025).
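One ingredient named above, the shared discrete codebook, can be illustrated with a nearest-codeword lookup. This is a generic vector-quantization sketch under the assumption of Euclidean distance. It does not reproduce the paper's dynamic codebook, replay, or adapter mechanisms, and `nearest_code` is a hypothetical name.

```python
import numpy as np

def nearest_code(codebook, z):
    """Index of the codebook row closest to z in Euclidean distance.

    codebook: (K, D) shared code vectors
    z:        (D,) an encoder output from any modality
    """
    return int(np.argmin(np.sum((codebook - z) ** 2, axis=1)))
```

If encoders for two modalities map semantically matched inputs near the same code, the shared index acts as a bridge between them even when the two modalities were never paired directly.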

5. Attention, correspondence, and calibration as weaker forms of compensation

A common misconception is that compensation must mean full modality imputation. Several influential systems instead use one modality to sharpen another without ever reconstructing it. CMAC is explicit on this point: it is “a bidirectional cross-modal attention regularization framework with compensation-like effects,” not a missing-modality system. Audio guides visual attention toward sounding regions, and vision guides audio attention toward object-relevant frequencies; adding attention consistency improves UCF101 results over the variant without it, and the full model is evaluated on both UCF101 and ESC50 (Min et al., 2021).

In audio-visual speech separation, the CMC mechanism is a training-time correspondence constraint rather than test-time substitution. The loss forces separated audio embeddings to align with the target speaker’s visual stream and become orthogonal to other speakers’ visual streams. The proposed model improves the audio-visual baseline on SDR, PESQ, and STOI, supporting the claim that visual information compensates for what plain audio reconstruction loss fails to constrain (Makishima et al., 2021).
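The alignment and orthogonality behaviour described above can be sketched with a generic cosine-based penalty. This is not the paper's loss. It is one simple form with the same qualitative effect, assuming nonzero embedding vectors of equal dimension, and the function names are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two nonzero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def correspondence_loss(audio_emb, target_visual, other_visuals):
    """Generic alignment/orthogonality penalty (not the paper's exact loss).

    The first term is zero when the audio embedding points in the same
    direction as the target speaker's visual embedding. The second term is
    zero when it is orthogonal to every other speaker's visual embedding.
    """
    align = 1.0 - cosine(audio_emb, target_visual)
    if other_visuals:
        ortho = float(np.mean([cosine(audio_emb, v) ** 2 for v in other_visuals]))
    else:
        ortho = 0.0
    return align + ortho
```

The penalty applies only during training, which matches the description of the mechanism as a training-time constraint rather than a test-time substitution.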

In CroBIM for referring remote sensing image segmentation, compensation is distributed across three modules. CAPM lets image context compensate insufficient language grounding, LGFA lets language compensate weak or cluttered visual features, and the attention deficit compensation mechanism explicitly identifies where adjacent scales disagree and repairs those regions by cross-scale self-attention. On RISBench, the full configuration shows clear gains in precision and mIoU over partial compensation configurations (Dong et al., 2024).

In OCN for HOI detection, the authors speak of calibration rather than compensation, but the effect is similar. Object-guided semantic aggregation creates per-query verb semantics, and InterC plus IntraEC then produce semantic-aware visual features and vision-aware semantic features. On HICO-DET, full OCN improves on the base vision model in both Full and Rare mAP; the strongest gains on Rare suggest that cross-modal calibration is especially useful when the verb predictor is weak or the prior is sparse (Yuan et al., 2022).

A second misconception is that all work labeled CMC belongs to the same family. It does not. Cross Model Compatibility addresses incompatibility between old and new embedding models in visual search, learning a shared space through similarity, classification, and KL consistency losses; the compensation target is representation shift across models, not across sensory modalities (Wang et al., 2020). Cross Modal Compression uses modality transformation for semantic compression, for example image $\to$ text $\to$ image, and is only indirectly relevant to compensation as a form of semantic substitution (Li et al., 2022). Cross-modal CutMix creates “multi-modal sentences” by replacing grounded words with semantically matched image patches, functioning primarily as augmentation, denoising, and implicit alignment in unpaired VLP (Wang et al., 2022).

The literature also reveals several stable limitations. Pseudo-attention can be noisy: CMAC notes that overly strong attention-consistency weighting hurts because audio-guided localized regions are not always accurate (Min et al., 2021). Auxiliary quality matters: MCN reports that noisy extracted poses can weaken or even harm compensation (Song et al., 2020). Prior-aware calibration can be brittle under distribution shift: OCN explicitly notes trouble for zero-shot HOI detection because its semantic structure depends on dataset priors (Yuan et al., 2022). Mediator-based continual transfer assumes a mediator modality with partial semantic overlap across stages, so its compensation is strongest for missing-pair supervision rather than arbitrary missing-modality reasoning (Xia et al., 1 Apr 2025). MASQuant, finally, requires calibration data and depends on the empirical fact that whitening makes cross-modal residuals effectively low-rank (Hu et al., 5 Mar 2026).

Taken together, these results support a precise but plural definition. Cross-Modal Compensation is not a single algorithmic primitive. It is a design principle stating that multimodal systems should exploit redundancy, complementarity, and cross-modal structure so that one modality can repair, replace, or refine another when direct evidence is insufficient. In some systems that principle is realized as explicit detect-reject-correct logic; in others it appears as residual feature adaptation, mediator-based transfer, semantic calibration, or low-rank computational correction. The unifying technical question is always the same: how much of the information lost, corrupted, or under-modeled in one modality can be recovered from another without destroying modality-specific structure.