Fine-grained Disentangled Representation Learning for Multimodal Emotion Recognition (2312.13567v1)
Abstract: Multimodal emotion recognition (MMER) is an active research field that aims to accurately recognize human emotions by fusing multiple perceptual modalities. However, the inherent heterogeneity across modalities introduces distribution gaps and information redundancy, posing significant challenges for MMER. In this paper, we propose a novel fine-grained disentangled representation learning (FDRL) framework to address these challenges. Specifically, we design modality-shared and modality-private encoders to project each modality into modality-shared and modality-private subspaces, respectively. In the shared subspace, we introduce a fine-grained alignment component to learn modality-shared representations and thus capture modal consistency. We then tailor a fine-grained disparity component to constrain the private subspaces, thereby learning modality-private representations and enhancing their diversity. Lastly, we introduce a fine-grained predictor component to ensure that the labels of the representations output by the encoders remain unchanged. Experimental results on the IEMOCAP dataset show that FDRL outperforms state-of-the-art methods, achieving 78.34% weighted average recall (WAR) and 79.44% unweighted average recall (UAR).
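The abstract describes the FDRL architecture only at a high level. Below is a minimal PyTorch sketch of the general shared/private disentanglement pattern it outlines: per-modality shared and private encoders, an alignment term on the shared subspace, a disparity term separating shared from private features, and an auxiliary predictor that keeps every representation label-consistent. All module names, dimensions, and loss choices here (MSE alignment, soft-orthogonality disparity, cross-entropy prediction) are illustrative assumptions, not the paper's exact fine-grained components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedPrivateDisentangler(nn.Module):
    """Toy shared/private encoder pair for two modalities (audio, text)."""

    def __init__(self, dim_a: int, dim_t: int, dim_h: int, num_classes: int):
        super().__init__()
        # Modality-shared encoders: project both modalities into a common subspace.
        self.shared_a = nn.Sequential(nn.Linear(dim_a, dim_h), nn.ReLU())
        self.shared_t = nn.Sequential(nn.Linear(dim_t, dim_h), nn.ReLU())
        # Modality-private encoders: retain modality-specific information.
        self.private_a = nn.Sequential(nn.Linear(dim_a, dim_h), nn.ReLU())
        self.private_t = nn.Sequential(nn.Linear(dim_t, dim_h), nn.ReLU())
        # Auxiliary predictor: each learned representation must stay label-consistent.
        self.predictor = nn.Linear(dim_h, num_classes)

    @staticmethod
    def _soft_ortho(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # Squared cross-correlation between two batches of features; pushing it
        # toward zero encourages the shared and private subspaces to differ.
        return torch.mm(u.t(), v).pow(2).mean()

    def forward(self, x_a, x_t, labels):
        s_a, s_t = self.shared_a(x_a), self.shared_t(x_t)
        p_a, p_t = self.private_a(x_a), self.private_t(x_t)

        # Alignment: pull the two modalities' shared representations together.
        loss_align = F.mse_loss(s_a, s_t)
        # Disparity: push each modality's private features away from its shared ones.
        loss_disp = self._soft_ortho(s_a, p_a) + self._soft_ortho(s_t, p_t)
        # Prediction consistency: every representation should still predict the emotion.
        loss_pred = sum(
            F.cross_entropy(self.predictor(z), labels) for z in (s_a, s_t, p_a, p_t)
        )

        fused = torch.cat([s_a, s_t, p_a, p_t], dim=-1)
        return fused, loss_align + loss_disp + loss_pred


# Usage with random features standing in for wav2vec 2.0 / BERT embeddings.
model = SharedPrivateDisentangler(dim_a=768, dim_t=768, dim_h=128, num_classes=4)
x_a, x_t = torch.randn(8, 768), torch.randn(8, 768)
labels = torch.randint(0, 4, (8,))
fused, aux_loss = model(x_a, x_t, labels)
print(fused.shape, aux_loss.item())  # torch.Size([8, 512]) and a scalar loss
```

In a full system the fused representation would feed a downstream emotion classifier, with the auxiliary losses weighted against the main classification loss; the fine-grained (e.g., segment-level) alignment, disparity, and predictor components described in the paper would replace these utterance-level stand-ins.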
Authors: Haoqin Sun, Shiwan Zhao, Xuechen Wang, Wenjia Zeng, Yong Chen, Yong Qin