Modality-Invariant Representation Learning
- The paper demonstrates a method for aligning modality-specific embeddings by minimizing Central Moment Discrepancy to reduce modality gaps.
- It employs an autoencoder-based imagination module to reconstruct missing features, ensuring robust performance even with incomplete data.
- Empirical results show improved emotion recognition accuracy under missing-modality conditions when the invariance penalty is combined with cascaded imagination of missing features.
A modality-invariant representation learning setup is a paradigm in multimodal machine learning that aims to enforce a shared, aligned feature space across heterogeneous input modalities such as audio, text, and vision. The primary goal is to minimize the modality gap by constructing representations whose distributions are aligned across different input types while maintaining sufficient expressivity for downstream tasks. This approach underpins robust multimodal fusion, cross-modal retrieval, zero-shot learning, and resilience to missing or incomplete modalities. The following sections synthesize principles, methodologies, and empirical findings from leading research including "Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities" (Zuo et al., 2022) and related work.
1. Formal Definition of the Modality-Invariant Feature Space
Let $M$ denote the number of modalities, with each modality $m \in \{1, \dots, M\}$ having a raw input $x_m$. Modality-specific encoders map each input to an embedding $h_m$. A shared encoder then projects each $h_m$ into a modality-invariant subspace, yielding $z_m$. The global modality-invariant representation is formed by concatenation: $H = [z_1; z_2; \dots; z_M]$.
The invariant space is characterized by:
- Alignment: The distributions of the $z_m$ must coincide; that is, for any pair of modalities, the embedding statistics (e.g., moments) are matched.
- Expressivity: Each $z_m$ must preserve sufficient information for discriminating target task labels, such as emotion categories.
This dual objective ensures that modality-invariant features capture shared semantics while suppressing modality-specific noise (Zuo et al., 2022).
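As a concrete illustration, a minimal sketch of this encoder layout, assuming simple MLP encoders and illustrative dimensions (the exact architecture of Zuo et al. (2022) differs), is:

```python
# Minimal sketch of the encoder layout described above. MLP encoders, hidden sizes,
# and module names are illustrative assumptions, not the architecture of Zuo et al. (2022).
import torch
import torch.nn as nn

class ModalityInvariantEncoder(nn.Module):
    def __init__(self, input_dims, hidden_dim=128, invariant_dim=64):
        super().__init__()
        # One modality-specific encoder per modality (e.g., audio, text, vision).
        self.specific = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, hidden_dim), nn.ReLU()) for d in input_dims]
        )
        # Shared encoder projecting every modality embedding h_m into the invariant subspace.
        self.shared = nn.Sequential(nn.Linear(hidden_dim, invariant_dim), nn.ReLU())

    def forward(self, inputs):
        # inputs: list of per-modality tensors x_m, each of shape (batch, d_m)
        z = [self.shared(enc(x)) for enc, x in zip(self.specific, inputs)]
        # Global modality-invariant representation H = [z_1; ...; z_M]
        return z, torch.cat(z, dim=-1)
```

With, for example, `input_dims=[74, 768, 512]` as hypothetical audio, text, and vision feature sizes, `forward` returns the per-modality invariant features $z_m$ and the concatenated representation $H$.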
2. Core Learning Principles and Objective Functions
Central Moment Discrepancy (CMD) Regularization
Alignment of the invariant feature space can be enforced by minimizing the Central Moment Discrepancy (CMD) between modalities. For each pair of modalities $(i, j)$, CMD penalizes discrepancies between their invariant-feature distributions up to order $K$:

$$\mathcal{L}_{\mathrm{CMD}} = \sum_{(i,j) \in \mathcal{P}} \left[ \frac{1}{|b-a|}\,\big\| \mathbb{E}(z_i) - \mathbb{E}(z_j) \big\|_2 + \sum_{k=2}^{K} \frac{1}{|b-a|^{k}}\,\big\| C_k(z_i) - C_k(z_j) \big\|_2 \right]$$

where $\mathcal{P}$ is the set of modality pairs, $\mathbb{E}(\cdot)$ denotes the mean, $C_k(z) = \mathbb{E}\big((z - \mathbb{E}(z))^k\big)$ is the $k$-th central moment, and $[a, b]$ bounds the feature values (Zuo et al., 2022).
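A hedged sketch of this penalty, following the standard CMD formulation; the order $K$, the bounds $[a, b]$, and the pairwise summation are illustrative choices rather than the exact configuration of Zuo et al. (2022):

```python
# Sketch of a CMD penalty between batches of invariant features z_i, z_j.
import torch

def cmd(z_i, z_j, K=5, a=0.0, b=1.0):
    """Central Moment Discrepancy between two (batch, dim) feature matrices."""
    scale = abs(b - a)
    mean_i, mean_j = z_i.mean(dim=0), z_j.mean(dim=0)
    # First-order term: distance between the means.
    loss = torch.norm(mean_i - mean_j, p=2) / scale
    # Higher-order terms: distances between the k-th central moments, k = 2..K.
    c_i, c_j = z_i - mean_i, z_j - mean_j
    for k in range(2, K + 1):
        m_i = (c_i ** k).mean(dim=0)
        m_j = (c_j ** k).mean(dim=0)
        loss = loss + torch.norm(m_i - m_j, p=2) / (scale ** k)
    return loss

def cmd_alignment_loss(z_list, K=5):
    # Sum CMD over all modality pairs (i, j) in the pair set P.
    total = 0.0
    for i in range(len(z_list)):
        for j in range(i + 1, len(z_list)):
            total = total + cmd(z_list[i], z_list[j], K=K)
    return total
```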
Expressivity and Downstream Task Loss
To ensure that invariance does not obliterate discriminative content, a classification loss (e.g., cross-entropy) is combined with the alignment penalty:
$$\mathcal{L}_{\text{pretrain}} = \mathcal{L}_{\text{cls}} + \lambda\,\mathcal{L}_{\mathrm{CMD}}$$

where $\lambda$ is a trade-off hyperparameter (Zuo et al., 2022).
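A minimal sketch of this combined pre-training objective, reusing `cmd_alignment_loss` from the sketch above; the value of $\lambda$ (`lam`) is an illustrative assumption:

```python
# Combined pre-training objective: task discrimination plus weighted CMD alignment.
import torch.nn.functional as F

def pretrain_loss(logits, labels, z_list, lam=0.1, K=5):
    cls_loss = F.cross_entropy(logits, labels)      # expressivity: task discrimination
    align_loss = cmd_alignment_loss(z_list, K=K)    # alignment: CMD across modality pairs
    return cls_loss + lam * align_loss
```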
Invariant Feature Imagination for Missing Modalities
Missing-modality scenarios are addressed by a cascade of autoencoders (the "IF-IM" module) which reconstructs or "imagines" the missing modality features using the available modalities and the joint invariant feature $H$. Supplementary losses include:
- Imagination loss $\mathcal{L}_{\text{im}}$: penalizes reconstruction error on the missing features (e.g., RMSE).
- Invariance loss $\mathcal{L}_{\text{inv}}$: aligns the predicted invariant feature $\hat{H}$ to the true $H$.
The fine-tuning objective becomes:

$$\mathcal{L}_{\text{finetune}} = \mathcal{L}_{\text{cls}} + \lambda_1\,\mathcal{L}_{\text{im}} + \lambda_2\,\mathcal{L}_{\text{inv}}$$

where $\lambda_1$ and $\lambda_2$ are weights on the auxiliary losses (Zuo et al., 2022).
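An illustrative sketch of an imagination-style cascade and the fine-tuning objective follows; the depth, layer sizes, initial-guess layer, and loss weights are assumptions, not the published IF-IM configuration:

```python
# Cascade of small autoencoder-like stages that imagines the missing modality's
# invariant feature, with the available joint feature re-injected at every stage.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImaginationCascade(nn.Module):
    def __init__(self, joint_dim, missing_dim, depth=3, hidden=128):
        super().__init__()
        self.init_guess = nn.Linear(joint_dim, missing_dim)   # first estimate from H
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.Linear(missing_dim + joint_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, missing_dim),
            ) for _ in range(depth)
        ])

    def forward(self, joint_available):
        est = self.init_guess(joint_available)
        for stage in self.stages:
            # Cascade the available joint feature into every refinement stage.
            est = stage(torch.cat([est, joint_available], dim=-1))
        return est

def finetune_loss(logits, labels, imagined, true_missing, inv_pred, inv_true,
                  lam1=1.0, lam2=1.0):
    cls = F.cross_entropy(logits, labels)
    imagination = torch.sqrt(F.mse_loss(imagined, true_missing))  # RMSE on imagined features
    invariance = F.mse_loss(inv_pred, inv_true)                   # predicted vs. true invariant feature
    return cls + lam1 * imagination + lam2 * invariance
```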
3. Training Procedures and Data Flow
The modality-invariant learning setup generally proceeds in two stages:
Stage 1: Full-Modality Pre-Training
- For each full-modality input $(x_1, \dots, x_M)$, encode all modalities.
- Compute invariant features and joint representation.
- Use both discrimination and CMD alignment losses to update model parameters.
Stage 2: Missing-Modality Fine-Tuning
- Remove one or more modalities; only supply the available inputs $x_m$.
- Predict the invariant feature from the available modalities, then pass it through the imagination cascade to reconstruct the missing features.
- Form the final joint feature for classification.
- Optimize the sum of classification, imagination, and invariance losses, updating only the imagination and classifier modules (Zuo et al., 2022).
This structure ensures that the learned representations are robust to arbitrary missing-modality conditions during inference.
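A high-level sketch of this two-stage data flow, reusing the earlier `ModalityInvariantEncoder`, `ImaginationCascade`, `pretrain_loss`, and `finetune_loss` sketches; the dataset interface, the zero-masking of a single modality, and the freezing strategy are assumptions made for illustration:

```python
# Two-stage training loop: full-modality pre-training, then missing-modality fine-tuning.
import torch

def stage1_pretrain(encoder, classifier, loader, optimizer, lam=0.1):
    # Stage 1: all modalities present; discrimination + CMD alignment update all parameters.
    for inputs, labels in loader:
        z_list, joint = encoder(inputs)
        loss = pretrain_loss(classifier(joint), labels, z_list, lam=lam)
        optimizer.zero_grad(); loss.backward(); optimizer.step()

def stage2_finetune(encoder, imagine, classifier, loader, optimizer,
                    missing=2, lam1=1.0, lam2=1.0):
    # Stage 2: the pretrained encoder is frozen; only the imagination cascade and the
    # classifier are updated. Modality index `missing` is dropped for illustration.
    for inputs, labels in loader:
        with torch.no_grad():
            z_full, joint_full = encoder(inputs)                # full-modality reference
            avail = [x if i != missing else torch.zeros_like(x)
                     for i, x in enumerate(inputs)]
            z_avail, joint_avail = encoder(avail)               # available modalities only
        imagined = imagine(joint_avail)                         # imagine the missing z_m
        z_filled = list(z_avail)
        z_filled[missing] = imagined                            # fill the missing slot
        joint_pred = torch.cat(z_filled, dim=-1)                # predicted joint invariant feature
        loss = finetune_loss(classifier(joint_pred), labels,
                             imagined, z_full[missing],         # imagination targets
                             joint_pred, joint_full,            # invariance targets
                             lam1, lam2)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
```

In this sketch, passing `optimizer` only the imagination-cascade and classifier parameters realizes the restriction that the encoders stay fixed during fine-tuning.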
4. Relationship to Other Modality-Invariant Learning Approaches
The approach in (Zuo et al., 2022) is closely related to several other strategies:
| Study | Invariance Mechanism | Missing Modalities | Main Losses/Constraints |
|---|---|---|---|
| (Zuo et al., 2022) | CMD + Imagination module | Explicitly handled | Classification, CMD, Imagination, Invariant |
| (Hazarika et al., 2020) | CMD + subspace factorization | Not primary focus | Task, similarity, difference, reconstruction |
| (Saito et al., 2016) | Adversarial alignment (GAN) | Not primary focus | GAN, pair-matching |
| (Saeed et al., 14 Aug 2024) | Single-branch weight-sharing | Explicit | Cross-entropy |
| (He et al., 1 Jun 2025, Koutsouvelis et al., 14 Nov 2025) | Symmetric KL, InfoNCE | Yes | Classification, KL, contrastive, MIM |
| (Shi et al., 2023) | Information Bottleneck | Not primary focus | Classification, IB |
These frameworks share the principle that networks should be penalized when predictions depend on modality-specific idiosyncrasies not relevant to the downstream task.
5. Empirical Evidence and Ablation Analysis
The IF-MMIN setup achieves higher emotion recognition performance under missing-modality conditions than non-invariant baselines. For example, the average weighted accuracy (WA) and unweighted accuracy (UA) on IEMOCAP under arbitrary modality dropout are $0.6454$ and $0.6538$, respectively, with both the invariance loss and the cascading of the invariant feature into every autoencoder layer critical to the final results (Zuo et al., 2022).
Ablation analysis (measured as the decrease in WA/UA) demonstrates:
- Removing the invariance loss degrades both WA and UA.
- Removing the cascaded invariant feature likewise degrades both metrics.
These results confirm each mechanism's necessity for robust cross-modal performance in both fully observed and incomplete settings.
6. Theoretical Considerations and Practical Significance
Modality-invariant representation learning realizes a principled trade-off: maximizing semantic alignment across modalities while avoiding collapse of useful information. The CMD loss specifically targets higher-order central moments, addressing nontrivial distributional discrepancies beyond mean and variance. Combining this with autoencoder-based imagination or adversarial alignment enables generalization to scenarios where some input modalities are absent at test time.
The practical impact is pronounced in real-world applications, such as emotion and sentiment recognition, where sensor failure or data incompleteness is routine. By enforcing invariance in the feature space, the model can gracefully degrade and recover missing inputs, avoiding the sharp performance deterioration characteristic of traditional multimodal networks (Zuo et al., 2022).
7. Limitations and Open Challenges
While modality-invariant setups provide robustness to missing modalities and reduce reliance on all-modality presence, two central challenges persist:
- Expressivity-Alignment Tension: Excessive penalization for alignment can prune away informative, yet modality-unique, features, potentially degrading task performance.
- Choice of Invariance Metric: CMD captures moment discrepancies, but selecting the appropriate invariance penalty and its order is still empirical.
Moreover, imagination or hallucination of missing modalities is only as reliable as the alignment achieved during full-modality training; performance drops if alignment or reconstruction is imperfect. Further theoretical work is needed to formalize trade-offs and optimality, particularly in highly heterogeneous or dynamically changing modality landscapes.
Key References:
- "Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities" (Zuo et al., 2022)
- "MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis" (Hazarika et al., 2020)
- "Deep Modality Invariant Adversarial Network" (Saito et al., 2016)
- "Modality Invariant Multimodal Learning to Handle Missing Modalities: A Single-Branch Approach" (Saeed et al., 14 Aug 2024)