
Modality-Invariant Representation Learning

Updated 18 November 2025
  • The paper demonstrates a method for aligning modality-specific embeddings by minimizing Central Moment Discrepancy to reduce modality gaps.
  • It employs an autoencoder-based imagination module to reconstruct missing features, ensuring robust performance even with incomplete data.
  • Empirical results on IEMOCAP show improved emotion recognition accuracy under missing-modality conditions when the invariance penalty is combined with the cascaded imagination module.

A modality-invariant representation learning setup is a paradigm in multimodal machine learning that aims to enforce a shared, aligned feature space across heterogeneous input modalities such as audio, text, and vision. The primary goal is to minimize the modality gap by constructing representations whose distributions are aligned across different input types while maintaining sufficient expressivity for downstream tasks. This approach underpins robust multimodal fusion, cross-modal retrieval, zero-shot learning, and resilience to missing or incomplete modalities. The following sections synthesize principles, methodologies, and empirical findings from leading research including "Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities" (Zuo et al., 2022) and related work.

1. Formal Definition of the Modality-Invariant Feature Space

Let $M$ denote the number of modalities, with each modality $m$ having a raw input $x^m \in \mathbb{R}^{d_m}$. Modality-specific encoders $\mathrm{Enc}_m$ map the input to an embedding $h^m = \mathrm{Enc}_m(x^m) \in \mathbb{R}^{h}$. A shared encoder $\mathrm{Enc}'$ then projects $h^m$ into a modality-invariant subspace: $H^m = \mathrm{Enc}'(h^m) \in \mathbb{R}^{d}$. The global modality-invariant representation is formed by concatenation: $H = [H^1; \ldots; H^M] \in \mathbb{R}^{Md}$.
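
A minimal PyTorch sketch of this encoder structure is given below. The class name, layer sizes, and MLP architectures are illustrative assumptions, not the configuration used by Zuo et al. (2022).

```python
import torch
import torch.nn as nn

class ModalityInvariantEncoder(nn.Module):
    """Modality-specific encoders Enc_m followed by a shared encoder Enc'."""
    def __init__(self, input_dims, hidden_dim=128, invariant_dim=64):
        super().__init__()
        # One modality-specific encoder Enc_m per modality (illustrative MLPs).
        self.specific = nn.ModuleList(
            nn.Sequential(nn.Linear(d_m, hidden_dim), nn.ReLU())
            for d_m in input_dims
        )
        # Shared encoder Enc' projecting every h^m into the invariant subspace.
        self.shared = nn.Sequential(nn.Linear(hidden_dim, invariant_dim), nn.ReLU())

    def forward(self, inputs):
        # inputs: list of tensors x^m, each of shape (batch, d_m)
        h = [enc(x) for enc, x in zip(self.specific, inputs)]  # h^m
        H = [self.shared(hm) for hm in h]                      # H^m
        return h, H, torch.cat(H, dim=-1)                      # H = [H^1; ...; H^M]

# Example with three modalities (e.g., audio, text, vision); dims are arbitrary.
model = ModalityInvariantEncoder(input_dims=[74, 300, 35])
xs = [torch.randn(8, d) for d in (74, 300, 35)]
h, H_list, H_joint = model(xs)  # H_joint has shape (8, 3 * 64)
```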

The invariant space $\mathcal{H}$ is characterized by:

  • Alignment: The distributions of $\{H^1, \ldots, H^M\}$ must coincide; that is, for any pair of modalities, the embedding statistics (e.g., moments) are matched.
  • Expressivity: Each $H^m$ must preserve sufficient information for discriminating target task labels, such as emotion categories.

This dual objective ensures that modality-invariant features capture shared semantics while suppressing modality-specific noise (Zuo et al., 2022).

2. Core Learning Principles and Objective Functions

Central Moment Discrepancy (CMD) Regularization

Alignment of the invariant feature space can be enforced by minimizing the Central Moment Discrepancy (CMD) between modalities. For each pair $(m_1, m_2)$, CMD penalizes discrepancies between their distributions up to order $K$:

$$\mathcal{L}_\text{cmd} = \frac{1}{|P|} \sum_{(m_1, m_2) \in P} \left( \left\| \mu(H^{m_1}) - \mu(H^{m_2}) \right\|_2 + \sum_{k=2}^{K} \left\| C_k(H^{m_1}) - C_k(H^{m_2}) \right\|_2 \right)$$

where $P$ is the set of modality pairs, $\mu(\cdot)$ denotes the mean, and $C_k(\cdot)$ is the $k$-th central moment (Zuo et al., 2022).
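
A minimal sketch of this penalty in PyTorch follows, treating per-dimension batch statistics as the distribution. The function name and the default order $K = 5$ are assumptions; the original CMD formulation also normalizes features to a bounded interval, which is omitted here for brevity.

```python
import itertools
import torch

def cmd_loss(features, K=5):
    """features: list of tensors H^m, each (batch, d). Returns scalar CMD."""
    def moments(H):
        mu = H.mean(dim=0)
        centered = H - mu
        # k-th central moments for k = 2..K, each a vector of length d.
        return mu, [(centered ** k).mean(dim=0) for k in range(2, K + 1)]

    total, pairs = 0.0, 0
    for H1, H2 in itertools.combinations(features, 2):
        mu1, cm1 = moments(H1)
        mu2, cm2 = moments(H2)
        total = total + torch.norm(mu1 - mu2, p=2)       # first-moment term
        for c1, c2 in zip(cm1, cm2):
            total = total + torch.norm(c1 - c2, p=2)     # higher-order terms
        pairs += 1
    return total / pairs                                 # average over pairs P
```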

Expressivity and Downstream Task Loss

To ensure that invariance does not obliterate discriminative content, a classification loss (e.g., cross-entropy) is combined with the alignment penalty:

$$\mathcal{L}_\text{total-invar} = \mathcal{L}_\text{cls} + \alpha\, \mathcal{L}_\text{cmd}$$

where $\alpha$ is a hyperparameter weighting the alignment penalty (Zuo et al., 2022).

Invariant Feature Imagination for Missing Modalities

Missing-modality scenarios are addressed by a cascade of autoencoders (the "IF-IM" module), which reconstructs or "imagines" the missing $h^m$ using the available modalities and the joint invariant feature $H'$. Supplementary losses include:

  • Imagination loss: Penalizes reconstruction error on the missing features (e.g., via RMSE).
  • Invariance loss: Aligns the predicted invariant feature $H'$ with the true $H$.

The fine-tuning objective becomes:

$$\mathcal{L}_\text{total} = \mathcal{L}_\text{cls} + \lambda_1\, \mathcal{L}_\text{img} + \lambda_2\, \mathcal{L}_\text{inv}$$

where $\lambda_1$ and $\lambda_2$ are weights on the auxiliary losses (Zuo et al., 2022).
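
A minimal sketch of an autoencoder cascade in this spirit, together with the two auxiliary losses, is shown below. The block count, layer sizes, zero-filling of missing features, the MSE form of the invariance loss, and the way $H'$ is re-injected at every block are assumptions made for illustration, not the exact IF-IM architecture.

```python
import torch
import torch.nn as nn

class ImaginationCascade(nn.Module):
    """Cascade of autoencoder blocks that 'imagines' missing features."""
    def __init__(self, feat_dim, invariant_dim, n_blocks=3):
        super().__init__()
        # Each block sees its input concatenated with the joint invariant
        # feature H', mirroring the cascaded conditioning described above.
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Linear(feat_dim + invariant_dim, feat_dim // 2),  # encode
                nn.ReLU(),
                nn.Linear(feat_dim // 2, feat_dim),                  # decode
            )
            for _ in range(n_blocks)
        )

    def forward(self, h_partial, H_prime):
        # h_partial: concatenated h^m with missing modalities zero-filled.
        out = h_partial
        for block in self.blocks:
            out = block(torch.cat([out, H_prime], dim=-1))  # re-inject H'
        return out  # imagined reconstruction of the full feature vector

def imagination_loss(h_pred, h_true):
    # RMSE between imagined and true modality-specific features.
    return torch.sqrt(torch.mean((h_pred - h_true) ** 2))

def invariance_loss(H_pred, H_true):
    # Aligns the predicted invariant feature H' with the true H (MSE here).
    return torch.mean((H_pred - H_true) ** 2)
```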

3. Training Procedures and Data Flow

The modality-invariant learning setup generally proceeds in two stages:

Stage 1: Full-Modality Pre-Training

  1. For an input $(x^1, \ldots, x^M)$, encode all modalities.
  2. Compute invariant features and joint representation.
  3. Use both discrimination and CMD alignment losses to update model parameters.
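
A minimal sketch of this pre-training step, reusing the `ModalityInvariantEncoder` and `cmd_loss` sketches from earlier sections; the classifier, optimizer, and the value of $\alpha$ are illustrative assumptions.

```python
import torch.nn.functional as F

def pretrain_step(model, classifier, optimizer, xs, labels, alpha=0.1):
    """One full-modality training step: classification + CMD alignment.
    classifier: e.g., nn.Linear(M * invariant_dim, num_classes)."""
    optimizer.zero_grad()
    h, H_list, H_joint = model(xs)          # encode all modalities
    logits = classifier(H_joint)            # discriminate on the joint feature
    # L_total-invar = L_cls + alpha * L_cmd
    loss = F.cross_entropy(logits, labels) + alpha * cmd_loss(H_list)
    loss.backward()
    optimizer.step()
    return loss.item()
```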

Stage 2: Missing-Modality Fine-Tuning

  1. Remove one or more modalities; only supply the available $x^m$.
  2. Predict $H'$, then pass through the imagination cascade to reconstruct the missing $h^m$.
  3. Form the final joint feature for classification.
  4. Optimize the sum of classification, imagination, and invariance losses, updating only the imagination and classifier modules (Zuo et al., 2022).
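
A sketch of this fine-tuning step, again building on the earlier sketches. Only the imagination module and classifier parameters are assumed to be registered with the optimizer, so the pre-trained encoders propagate gradients without being updated; the targets `h_full` and `H_full` are assumed precomputed with the frozen model on the complete inputs, and re-encoding the imagined features through the shared encoder to obtain $H'$ is a simplifying assumption about the wiring.

```python
import torch
import torch.nn.functional as F

def finetune_step(model, imagination, classifier, optimizer,
                  xs_masked, labels, h_full, H_full, lam1=1.0, lam2=1.0):
    """One missing-modality step; `optimizer` holds only the imagination
    module and classifier parameters, so the encoders stay frozen."""
    optimizer.zero_grad()
    h, _, H_avail = model(xs_masked)                    # encode available x^m
    h_imagined = imagination(torch.cat(h, dim=-1), H_avail)
    # Re-encode imagined modality features through the (frozen) shared
    # encoder to obtain the predicted invariant feature H'.
    H_prime = torch.cat(
        [model.shared(hm) for hm in h_imagined.chunk(len(h), dim=-1)], dim=-1)
    logits = classifier(H_prime)                        # classify joint feature
    loss = (F.cross_entropy(logits, labels)
            + lam1 * imagination_loss(h_imagined, h_full)  # reconstruct h
            + lam2 * invariance_loss(H_prime, H_full))     # align H' to H
    loss.backward()
    optimizer.step()
    return loss.item()
```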

This structure ensures that the learned representations are robust to arbitrary missing-modality conditions during inference.

4. Relationship to Other Modality-Invariant Learning Approaches

The approach in (Zuo et al., 2022) is closely related to several other strategies:

| Study | Invariance Mechanism | Missing Modalities | Main Losses/Constraints |
|---|---|---|---|
| (Zuo et al., 2022) | CMD + imagination module | Explicitly handled | Classification, CMD, imagination, invariance |
| (Hazarika et al., 2020) | CMD + subspace factorization | Not primary focus | Task, similarity, difference, reconstruction |
| (Saito et al., 2016) | Adversarial alignment (GAN) | Not primary focus | GAN, pair-matching |
| (Saeed et al., 2024) | Single-branch weight-sharing | Explicitly handled | Cross-entropy |
| (He et al., 2025; Koutsouvelis et al., 2025) | Symmetric KL, InfoNCE | Explicitly handled | Classification, KL, contrastive, MIM |
| (Shi et al., 2023) | Information Bottleneck | Not primary focus | Classification, IB |

These frameworks share the principle that networks should be penalized when predictions depend on modality-specific idiosyncrasies not relevant to the downstream task.

5. Empirical Evidence and Ablation Analysis

The IF-MMIN setup achieves higher emotion recognition performance under missing-modality conditions than non-invariant baselines. For example, average weighted accuracy (WA) and unweighted accuracy (UA) on IEMOCAP under arbitrary modality dropout are 0.6454 and 0.6538, respectively, with both the invariance loss and the cascading of $H'$ into every autoencoder layer critical to the final results (Zuo et al., 2022).

Ablation results (decrease in WA/UA) demonstrate:

  • Removing the invariance loss: $-0.007$ WA / $-0.0048$ UA
  • Removing the cascaded $H'$: $-0.005$ WA / $-0.0024$ UA

These results confirm each mechanism's necessity for robust cross-modal performance in both fully observed and incomplete settings.

6. Theoretical Considerations and Practical Significance

Modality-invariant representation learning realizes a principled trade-off: maximizing semantic alignment across modalities while avoiding collapse of useful information. The CMD loss specifically targets higher-order central moments, addressing nontrivial distributional discrepancies beyond mean and variance. Combining this with autoencoder-based imagination or adversarial alignment enables generalization to scenarios where some input modalities are absent at test time.

The practical impact is pronounced in real-world applications, such as emotion and sentiment recognition, where sensor failure or data incompleteness is routine. By enforcing invariance in the feature space, the model can gracefully degrade and recover missing inputs, avoiding the sharp performance deterioration characteristic of traditional multimodal networks (Zuo et al., 2022).

7. Limitations and Open Challenges

While modality-invariant setups provide robustness to missing modalities and reduce reliance on all-modality presence, two central challenges persist:

  • Expressivity-Alignment Tension: Excessive penalization for alignment can prune away informative, yet modality-unique, features, potentially degrading task performance.
  • Choice of Invariance Metric: CMD captures moment discrepancies, but selecting the appropriate invariance penalty and its order $K$ is still empirical.

Moreover, imagination or hallucination of missing modalities is only as reliable as the alignment achieved during full-modality training; performance drops if alignment or reconstruction is imperfect. Further theoretical work is needed to formalize trade-offs and optimality, particularly in highly heterogeneous or dynamically changing modality landscapes.


Key References:

  • "Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities" (Zuo et al., 2022)
  • "MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis" (Hazarika et al., 2020)
  • "Deep Modality Invariant Adversarial Network" (Saito et al., 2016)
  • "Modality Invariant Multimodal Learning to Handle Missing Modalities: A Single-Branch Approach" (Saeed et al., 14 Aug 2024)