Triple Disentangled Representation Learning for Multimodal Affective Analysis (2401.16119v2)
Abstract: Multimodal learning has exhibited a significant advantage in affective analysis tasks owing to the comprehensive, and particularly the complementary, information provided by different modalities. Thus, many emerging studies focus on disentangling modality-invariant and modality-specific representations from input data and then fusing them for prediction. However, our study shows that modality-specific representations may contain information that is irrelevant to or conflicting with the task, which degrades the effectiveness of the learned multimodal representations. We revisit the disentanglement issue and propose a novel triple disentanglement approach, TriDiRA, which disentangles modality-invariant, effective modality-specific, and ineffective modality-specific representations from input data. By fusing only the modality-invariant and effective modality-specific representations, TriDiRA significantly alleviates the impact of irrelevant and conflicting information across modalities during model training. Extensive experiments conducted on four benchmark datasets demonstrate the effectiveness and generalization ability of our triple disentanglement, which outperforms SOTA methods.
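To make the triple-disentanglement idea concrete, below is a minimal PyTorch sketch of the layout described in the abstract: each modality is split into modality-invariant, effective modality-specific, and ineffective modality-specific representations, and only the first two are fused for prediction. This is not the authors' implementation; all module names, the simple linear encoders, and the feature dimensions (e.g. 768/74/35 for text/audio/vision) are illustrative assumptions, and the paper's actual disentanglement objectives (e.g. independence or adversarial regularizers on the ineffective branch) are only indicated in comments.

```python
# Minimal sketch of a triple-disentanglement layout (not the authors' code).
# All names and dimensions below are assumptions for illustration only.
import torch
import torch.nn as nn


class TripleDisentangler(nn.Module):
    """Splits one modality's features into modality-invariant, effective
    modality-specific, and ineffective modality-specific representations."""

    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.invariant = nn.Sequential(nn.Linear(in_dim, latent_dim), nn.ReLU())
        self.effective_specific = nn.Sequential(nn.Linear(in_dim, latent_dim), nn.ReLU())
        self.ineffective_specific = nn.Sequential(nn.Linear(in_dim, latent_dim), nn.ReLU())

    def forward(self, x):
        # In the paper the ineffective branch would be trained with extra
        # regularization objectives; here it is simply produced and set aside.
        return self.invariant(x), self.effective_specific(x), self.ineffective_specific(x)


class TriDiRASketch(nn.Module):
    """Fuses only the invariant and effective-specific parts of each modality."""

    def __init__(self, dims: dict, latent_dim: int = 64, num_outputs: int = 1):
        super().__init__()
        self.branches = nn.ModuleDict(
            {m: TripleDisentangler(d, latent_dim) for m, d in dims.items()}
        )
        # Two kept representations (invariant + effective-specific) per modality.
        self.head = nn.Linear(2 * latent_dim * len(dims), num_outputs)

    def forward(self, inputs: dict):
        kept, discarded = [], []
        for m, x in inputs.items():
            inv, eff, ineff = self.branches[m](x)
            kept.extend([inv, eff])   # fused for prediction
            discarded.append(ineff)   # excluded from fusion
        return self.head(torch.cat(kept, dim=-1)), discarded


if __name__ == "__main__":
    # Hypothetical per-modality feature sizes for a sentiment-regression setup.
    model = TriDiRASketch({"text": 768, "audio": 74, "vision": 35})
    batch = {"text": torch.randn(4, 768),
             "audio": torch.randn(4, 74),
             "vision": torch.randn(4, 35)}
    prediction, _ = model(batch)
    print(prediction.shape)  # torch.Size([4, 1])
```

The design point illustrated here is simply that the ineffective modality-specific representation is still produced (so it can absorb task-irrelevant or conflicting information during training) but never enters the fused representation used by the prediction head.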