Exploring Attention Mechanisms for Multimodal Emotion Recognition in an Emergency Call Center Corpus

Published 12 Jun 2023 in cs.CL, cs.SD, and eess.AS (arXiv:2306.07115v1)

Abstract: Emotion detection technology that enhances human decision-making is an important research issue for real-world applications, but real-life emotion datasets are relatively rare and small. The experiments in this paper use the CEMO corpus, collected in a French emergency call center. Two pre-trained models, one for speech and one for text, were fine-tuned for speech emotion recognition; using pre-trained Transformer encoders mitigates the limited size and sparsity of our data. This paper explores different fusion strategies for these modality-specific models. In particular, fusions with and without cross-attention mechanisms were tested to gather the most relevant information from both the speech and text encoders. We show that multimodal fusion brings an absolute gain of 4-9% over either single modality, and that the symmetric multi-headed cross-attention mechanism performs better than classical late-fusion approaches. Our experiments also suggest that for the real-life CEMO corpus, the audio component encodes more emotive information than the textual one.
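The symmetric cross-attention fusion described above can be sketched as follows. This is a minimal single-head illustration, not the paper's implementation: the actual model is multi-headed with learned query/key/value projections on top of the wav2vec 2.0 and FlauBERT encoders, all of which are omitted here for clarity. Embedding dimensions and sequence lengths are arbitrary assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_seq, kv_seq):
    """Scaled dot-product attention where queries come from one
    modality and keys/values from the other (single head, no
    learned projections -- an illustrative simplification)."""
    d = q_seq.shape[-1]
    scores = q_seq @ kv_seq.T / np.sqrt(d)    # (Tq, Tkv)
    return softmax(scores, axis=-1) @ kv_seq  # (Tq, d)

def symmetric_fusion(audio, text):
    """Symmetric fusion: audio attends to text AND text attends to
    audio; both attended sequences are mean-pooled and concatenated
    into one joint vector for the emotion classifier."""
    a2t = cross_attention(audio, text)  # audio queries, text keys/values
    t2a = cross_attention(text, audio)  # text queries, audio keys/values
    return np.concatenate([a2t.mean(axis=0), t2a.mean(axis=0)])  # (2d,)

# Hypothetical encoder outputs: 50 speech frames, 12 text tokens, dim 64.
audio = np.random.randn(50, 64)  # e.g. wav2vec 2.0 frame embeddings
text = np.random.randn(12, 64)   # e.g. FlauBERT token embeddings
fused = symmetric_fusion(audio, text)
print(fused.shape)  # (128,)
```

The symmetry matters: each modality gets to query the other, so neither is reduced to a fixed context for the other, unlike asymmetric cross-attention variants.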
