MS-Mix: Emotion-Aware Multimodal Augmentation

Updated 19 October 2025

MS-Mix is an adaptive, emotion-aware method for multimodal sentiment analysis that integrates text, video, and audio data.
It employs Sentiment-Aware Sample Selection (SASS) and a Sentiment Intensity Guided (SIG) module to ensure compatibility and dynamic mixing ratios based on emotional intensities.
The approach incorporates a Sentiment Alignment Loss to align cross-modal predictions, yielding superior accuracy and robustness in noisy, heterogeneous data settings.

MS-Mix refers to a set of adaptive, emotion-aware data augmentation mechanisms for multimodal sentiment analysis (MSA), specifically designed to overcome semantic ambiguity and label noise that arise when mixing heterogeneous data sources (text, video, audio) with conventional Mixup-based strategies. The principal innovation is the integration of sentiment-guided sample selection and mixing ratios, along with an explicit regularization mechanism (Sentiment Alignment Loss) that ensures cross-modal prediction alignment. MS-Mix has established new standards for generalization performance and robustness in MSA tasks.

1. Motivation and Challenges of Mixup in Multimodal Sentiment Analysis

Previous applications of Mixup in unimodal tasks have demonstrated improved generalization by linearly interpolating input samples and corresponding labels. However, direct application to multimodal sentiment analysis introduces critical challenges. Mixing samples with opposing emotions (e.g., a strongly positive and a strongly negative utterance) may lead to semantic confusion and ambiguous emotion labels. Furthermore, each modality can exhibit different emotional intensities, compounding the difficulty of coherent fusion.

MS-Mix was conceived to address these limitations by enforcing sentiment compatibility during sample selection, adapting mixing ratios to reflect emotion salience per modality, and aligning prediction distributions across modalities throughout training.

2. Sentiment-Aware Sample Selection (SASS)

The SASS strategy serves to prevent semantic confusion stemming from the random mixing of samples by restricting mixup operations to sentimentally similar samples. Latent features for candidate samples are L₂ normalized, and similarity is computed via cosine similarity: $Z^m_{norm} = Z^m / \|Z^m\|_2,$

$S = \frac{1}{3} \sum_{m \in \{ t, v, a \}} Z^m_{norm} (Z^m_{norm})^T,$

where $m$ indexes modality (text, visual, audio). Sample pairs for mixup are required to exceed a threshold $\delta$ (default $\delta = 0.2$ ). This constraint ensures mixed pairs share analogous emotional content, thereby preserving label integrity.

3. Sentiment Intensity Guided Module (SIG)

The SIG module dynamically determines mixing ratios for each modality based on computed emotional intensities. Each modality’s features $Z^m$ are processed using multi-head self-attention: $\text{head}_i = \text{Softmax}\left( \frac{Z^m W^Q_i (Z^m W^K_i)^T}{\sqrt{d_k}} \right) Z^m W^V_i,$

$\operatorname{MHA}(Z^m) = \operatorname{LN}(\operatorname{Concat}(\text{head}_1,\dots,\text{head}_h) W^O + Z^m),$

$I^m = f_{\varphi}^m(Z^m) = \tanh(\operatorname{GlobalPool}(\operatorname{MHA}(Z^m))),$

where $I^m$ is the emotional intensity for modality $m$ . Weights for mixing are then normalized: $\omega^m_i = \frac{|\ I^m_i| - \min(I^m)}{\max(I^m) - \min(I^m) + \epsilon},$

$\lambda^m_{ij} = \frac{\omega^m_i}{\omega^m_i + \omega^m_j},$

A Beta-distributed base ratio (parameter $\alpha$ , default $\alpha=2.0$ ) is averaged with $\lambda^m_{ij}$ to produce the final mixing weight per modality.

4. Sentiment Alignment Loss (SAL)

To further enforce semantic consistency, MS-Mix introduces the Sentiment Alignment Loss, which penalizes divergence between predicted and true emotion intensity distributions across modalities. Probabilities are extracted using a softmax-like normalization: $P^m = \frac{\exp(I^m)}{\sum \exp(I^m_j)}, \quad P^L = \frac{\exp(Y)}{\sum \exp(Y_j)},$

$\text{KL}(P^L \,\|\, P^m) = \frac{1}{B} \sum_{i=1}^B P^L_i ( \log P^L_i - \log P^m_i ),$

where $B$ is batch size and $Y$ are ground truth intensity labels. $L_{SAL} = \beta \sum_{m \in \{t, v, a\}} \text{KL}(P^L \,\|\, P^m)$ is incorporated into the global loss.

5. Algorithmic Formulation

The MS-Mix mechanism can be summarized as follows for a sample pair $(x_i^m, x_j^m)$ in modality $m$ :

Obtain emotionally compatible sample pairs via SASS.
Compute emotional intensity via SIG.
Interpolate features per modality: $\hat{z}^m = \lambda^m f_\varphi(x_i^m) + (1 - \lambda^m) f_\varphi(x_j^m)$
Labels are similarly combined: $\hat{y} = \lambda^L y_i + (1 - \lambda^L) y_j, \quad \lambda^L = \frac{1}{3}(\lambda_{ij}^t + \lambda_{ij}^v + \lambda_{ij}^a)$
Optimize with joint loss: $L_{total} = L_{task} + \xi_1 L_{mixMSE} + \xi_2 L_{SAL}$ where $L_{mixMSE}$ is the mean square error on mixed labels and $\xi_1$ , $\xi_2$ control loss weighting.

6. Performance Evaluation

MS-Mix was empirically validated on CMU-MOSI, CMU-MOSEI, and CH-SIMS using six leading fusion architectures (e.g., TFN, LMF, MuIT, MISA, ALMT, GLoMo). Across all datasets and backbones, MS-Mix exhibited superior accuracy, F1-score, and lower mean absolute errors compared to prior methods (such as Manifold Mixup, MultiMix, PowMix). Visualizations of latent representations showed improved separability of sentiment classes, indicating enhanced robustness to both label noise and deficient data regimes.

7. Broader Implications and Future Directions

MS-Mix’s integration of SASS and SIG addresses two major limitations in multimodal mixup: semantic incompatibility and lack of emotion-aware interpolation. This suggests a general paradigm for data augmentation in tasks requiring semantic fidelity across modalities. A plausible implication is its applicability to broader multimodal tasks, such as affective computing or mental health monitoring. The authors propose future directions including self-/semi-supervised extensions and optimization for resource-constrained, real-time deployment environments.

MS-Mix represents a significant advancement in multimodal data augmentation, establishing empirically validated procedures for robust fusion in settings characterized by scarce, noisy, or highly heterogeneous annotated data (Zhu et al., 13 Oct 2025).

PDF Markdown Chat (Pro)

References (1)

MS-Mix: Unveiling the Power of Mixup for Multimodal Sentiment Analysis (2025)

Follow Topic

Get notified by email when new papers are published related to MS-Mix.