HCAM -- Hierarchical Cross Attention Model for Multi-modal Emotion Recognition (2304.06910v2)

Published 14 Apr 2023 in eess.AS, cs.CL, and cs.SD

Abstract: Emotion recognition in conversations is challenging due to the multi-modal nature of emotion expression. We propose a hierarchical cross-attention model (HCAM) for multi-modal emotion recognition using a combination of recurrent and co-attention neural network models. The input to the model consists of two modalities: i) audio data, processed through a learnable wav2vec approach, and ii) text data, represented using a bidirectional encoder representations from transformers (BERT) model. The audio and text representations are processed by a set of bi-directional recurrent neural network layers with self-attention that convert each utterance in a given conversation into a fixed-dimensional embedding. To incorporate contextual knowledge and the information across the two modalities, the audio and text embeddings are combined using a co-attention layer that weighs the utterance-level embeddings relevant to the task of emotion recognition. The neural network parameters in the audio layers, text layers, and the multi-modal co-attention layers are hierarchically trained for the emotion classification task. We perform experiments on three established datasets, namely IEMOCAP, MELD and CMU-MOSI, where we show that the proposed model improves significantly over other benchmarks and achieves state-of-the-art results on all these datasets.
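The abstract describes a two-stage architecture: modality-specific recurrent encoders with self-attention pooling turn each utterance into a fixed-dimensional embedding, and a co-attention layer then fuses the audio and text utterance embeddings for emotion classification. The sketch below illustrates this flow in PyTorch; the module names, layer sizes, and the use of nn.MultiheadAttention as the co-attention mechanism are illustrative assumptions rather than the authors' implementation, and the hierarchical training schedule from the paper is omitted.

```python
# Minimal sketch of an HCAM-style pipeline, assuming PyTorch and pre-extracted
# wav2vec-style audio features and BERT-style text features per utterance.
# All names and dimensions are hypothetical; this is not the authors' released code.
import torch
import torch.nn as nn


class UtteranceEncoder(nn.Module):
    """Bi-directional GRU with additive self-attention pooling:
    a sequence of frame/token features -> one fixed-dimensional utterance embedding."""

    def __init__(self, in_dim: int, hid_dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hid_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hid_dim, 1)

    def forward(self, x):                       # x: (batch, time, in_dim)
        h, _ = self.rnn(x)                      # (batch, time, 2*hid_dim)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over time
        return (w * h).sum(dim=1)               # (batch, 2*hid_dim)


class CoAttentionFusion(nn.Module):
    """Cross-attends the utterance embeddings of one modality over the other,
    then classifies each utterance in the conversation."""

    def __init__(self, dim: int = 256, num_classes: int = 4):
        super().__init__()
        self.audio_to_text = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, audio_emb, text_emb):     # each: (batch, n_utt, dim)
        a, _ = self.audio_to_text(audio_emb, text_emb, text_emb)
        t, _ = self.text_to_audio(text_emb, audio_emb, audio_emb)
        return self.classifier(torch.cat([a, t], dim=-1))  # (batch, n_utt, num_classes)


# Toy usage: 2 conversations of 5 utterances, 768-d audio frames and text tokens.
audio_enc, text_enc = UtteranceEncoder(768), UtteranceEncoder(768)
fusion = CoAttentionFusion(dim=256, num_classes=4)

audio_feats = torch.randn(2 * 5, 120, 768)      # 120 audio frames per utterance
text_feats = torch.randn(2 * 5, 30, 768)        # 30 text tokens per utterance
audio_emb = audio_enc(audio_feats).view(2, 5, 256)
text_emb = text_enc(text_feats).view(2, 5, 256)
logits = fusion(audio_emb, text_emb)            # (2, 5, 4) emotion logits
```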

Authors (2)
  1. Soumya Dutta (20 papers)
  2. Sriram Ganapathy (72 papers)
Citations (10)
