AM^2-EmoJE: Adaptive Missing-Modality Emotion Recognition in Conversation via Joint Embedding Learning (2402.10921v1)
Abstract: Human emotion can be expressed through different modes, i.e., audio, video, and text. However, the contribution of each mode in exhibiting a given emotion is not uniform, and the availability of complete mode-specific details cannot always be guaranteed at test time. In this work, we propose AM2-EmoJE, a model for Adaptive Missing-Modality Emotion Recognition in Conversation via Joint Embedding Learning, grounded on two contributions. First, a query-adaptive fusion module that automatically learns the relative importance of its mode-specific representations in a query-specific manner; in this way, the model prioritizes the mode-invariant spatial query details of the emotion patterns while retaining their mode-exclusive aspects within the learned multimodal query descriptor. Second, a multimodal joint embedding learning module that explicitly addresses various missing-modality scenarios at test time; through it, the model learns to emphasize correlated patterns across modalities, aligning the cross-attended mode-specific descriptors pairwise within a joint embedding space and thereby compensating for missing modalities during inference. By leveraging spatio-temporal details at the dialogue level, the proposed AM2-EmoJE not only outperforms the best-performing state-of-the-art multimodal methods but also offers an enhanced privacy feature by effectively using body language in place of facial expression. The proposed multimodal joint embedding module delivers an improvement of around 2-5% in weighted F1 score across a variety of missing-modality query scenarios at test time.
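The abstract does not include implementation details, so the following is a minimal PyTorch-style sketch of the two ideas it describes: query-adaptive fusion weights over mode-specific descriptors, and a pairwise alignment loss in a joint embedding space that could compensate for missing modalities. The names `QueryAdaptiveFusion` and `pairwise_alignment_loss`, and the InfoNCE-style formulation of the alignment objective, are assumptions for illustration and not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryAdaptiveFusion(nn.Module):
    """Sketch: per-query softmax weights over mode-specific descriptors
    (e.g., text/audio/video), prioritizing modes adaptively for each query."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # scores each mode-specific descriptor
        self.proj = nn.Linear(dim, dim)  # projects the fused multimodal descriptor

    def forward(self, mode_feats: torch.Tensor, mode_mask: torch.Tensor) -> torch.Tensor:
        # mode_feats: (batch, num_modes, dim); mode_mask: (batch, num_modes), 1 = mode present.
        # Assumes at least one mode is present per query.
        scores = self.score(mode_feats).squeeze(-1)                 # (batch, num_modes)
        scores = scores.masked_fill(mode_mask == 0, float('-inf'))  # ignore missing modes
        weights = torch.softmax(scores, dim=-1)                     # query-specific mode importance
        fused = torch.einsum('bm,bmd->bd', weights, mode_feats)     # weighted sum of mode descriptors
        return self.proj(fused)

def pairwise_alignment_loss(desc_a: torch.Tensor, desc_b: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """Sketch: InfoNCE-style loss pulling two modalities' cross-attended descriptors
    of the same query together in a joint embedding space (hypothetical formulation)."""
    a = F.normalize(desc_a, dim=-1)
    b = F.normalize(desc_b, dim=-1)
    logits = a @ b.t() / temperature                                # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)              # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

Under this reading, the fusion module would be applied per query utterance, while the alignment loss would be summed over modality pairs (text-audio, text-video, audio-video) so that any available subset can stand in for a missing mode at inference.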
- Naresh Kumar Devulapally
- Sidharth Anand
- Sreyasee Das Bhattacharjee
- Junsong Yuan