AMuSE: Adaptive Multimodal Analysis for Speaker Emotion Recognition in Group Conversations (2401.15164v1)

Published 26 Jan 2024 in cs.SD, cs.CV, cs.LG, cs.MM, and eess.AS

Abstract: Analyzing individual emotions during group conversation is crucial for developing intelligent agents capable of natural human-machine interaction. While reliable emotion recognition techniques depend on multiple modalities (text, audio, video), the inherent heterogeneity between these modalities and the dynamic cross-modal interactions influenced by an individual's unique behavioral patterns make emotion recognition very challenging. The difficulty is compounded in group settings, where an emotion and its temporal evolution are influenced not only by the individual but also by external factors such as audience reaction and the context of the ongoing conversation. To meet this challenge, we propose a Multimodal Attention Network (MAN) that captures cross-modal interactions at various levels of spatial abstraction by jointly learning an interactive set of mode-specific Peripheral and Central networks. The proposed MAN injects cross-modal attention via Peripheral key-value pairs within each layer of a mode-specific Central query network. The resulting cross-attended mode-specific descriptors are then combined using an Adaptive Fusion technique that enables the model to integrate the discriminative and complementary mode-specific data patterns into an instance-specific multimodal descriptor. Given a dialogue represented as a sequence of utterances, the proposed AMuSE model condenses both spatial and temporal features into two dense descriptors: speaker-level and utterance-level. This not only delivers better classification performance (3-5% improvement in Weighted-F1 and 5-7% improvement in Accuracy) on large-scale public datasets but also helps users understand the reasoning behind each emotion prediction via the model's Multimodal Explainability Visualization module.
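
The sketch below illustrates the general idea described in the abstract: a Central query network for one modality attends over key-value pairs produced by Peripheral networks of other modalities, and the resulting mode-specific descriptors are merged by an instance-specific gated fusion. This is a minimal illustration assuming a standard transformer-style cross-attention layer; the class names (PeripheralBlock, CentralLayer, AdaptiveFusion), dimensions, and gating scheme are hypothetical and not taken from the paper's implementation.

```python
# Minimal sketch (not the authors' implementation) of cross-modal attention
# between a Central query stream and Peripheral key/value streams, followed
# by an adaptive (gated) fusion of the mode-specific descriptors.
import torch
import torch.nn as nn


class PeripheralBlock(nn.Module):
    """Projects a peripheral modality's features into key/value pairs."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_kv = nn.Linear(dim, 2 * dim)

    def forward(self, x: torch.Tensor):
        k, v = self.to_kv(x).chunk(2, dim=-1)
        return k, v


class CentralLayer(nn.Module):
    """One layer of a mode-specific Central query network with cross-modal
    attention injected from Peripheral key/value pairs."""

    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
        attended, _ = self.attn(q, k, v)  # cross-modal attention
        q = self.norm1(q + attended)
        return self.norm2(q + self.ff(q))


class AdaptiveFusion(nn.Module):
    """Instance-specific weighted fusion of cross-attended mode-specific
    descriptors via a softmax gate over modalities (illustrative choice)."""

    def __init__(self, dim: int, n_modes: int):
        super().__init__()
        self.gate = nn.Linear(n_modes * dim, n_modes)

    def forward(self, descriptors):
        stacked = torch.stack(descriptors, dim=1)                 # (B, M, D)
        weights = torch.softmax(self.gate(stacked.flatten(1)), -1)  # (B, M)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)       # (B, D)


if __name__ == "__main__":
    B, T, D = 2, 8, 64  # batch size, utterance length, feature dim
    text, audio, video = (torch.randn(B, T, D) for _ in range(3))

    periph_audio = PeripheralBlock(D)
    central_text = CentralLayer(D)

    # Text acts as the Central query stream; audio supplies keys/values here.
    k, v = periph_audio(audio)
    text_descriptor = central_text(text, k, v).mean(dim=1)  # pooled (B, D)

    fused = AdaptiveFusion(D, n_modes=2)([text_descriptor, video.mean(dim=1)])
    print(fused.shape)  # torch.Size([2, 64])
```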

Authors (5)
  1. Naresh Kumar Devulapally (4 papers)
  2. Sidharth Anand (3 papers)
  3. Sreyasee Das Bhattacharjee (2 papers)
  4. Junsong Yuan (92 papers)
  5. Yu-Ping Chang (1 paper)
Citations (3)
